Add filter to block non Airflow or astro related questions

sunank200 commented 11 months ago

Message from Slack: Since we are talking about AskAstro, I already have some feedback. I’ve seen that you are using LangChain and probably a filter step to block questions not related to Astro. For most of them it works well (sorry for doing some QA, I couldn’t avoid it), but for others it doesn’t (check the picture). Make sure to have a throttling limit; otherwise someone can exploit the website to use ChatGPT through your services, and you may be surprised by the OpenAI bill.

More at: https://astronomer.slack.com/archives/C061ZEF3NP9/p1698147033409819?thread_ts=1698146233.559039&cid=C061ZEF3NP9

Lee-W commented 10 months ago

I saw that some people ask the AI itself about relevance: https://github.com/derwiki/llm-prompt-injection-filtering. This might be something we could try.

Lee-W commented 10 months ago

Another thing that comes to mind is using https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation to decide whether the topic is relevant.

Lee-W commented 10 months ago

@sunank200 What do you think about these methods? Or do we want to explore more?

sunank200 commented 9 months ago

@pankajkoti have you started exploring various approaches for this task? We have a deadline of Dec 22 for this as in this doc

pankajkoti commented 9 months ago

@sunank200 I haven't really had time to start on this yet due to other priorities. I would say this issue is at risk if Dec 22 is the deadline, as I first need to onboard onto Ask Astro (local setup) and then look into this ticket.

pankajkoti commented 9 months ago

@Lee-W helped me set up Ask Astro locally just now. Will explore the codebase next.

pankajkoti commented 9 months ago

Would be nice to check with David if he has some inputs here already

davidgxue commented 9 months ago

I think you guys already have pretty good approaches in mind. I know in the industry, there are two pretty popular open source solutions for LLM guardrails. There are guardrails.ai and Nvidia's guardrails.

The first one is easier to set up, and the latter is a bit more complex and not really an out-of-the-box solution, so perhaps we can look into guardrails.ai first. I am not super familiar with how well it integrates with LangChain, but based on a quick search I think they already have some degree of integration support.

But I think that, under the hood, all the solutions you can find still use one of the following: a zero-shot text classifier (I suppose you could train your own, specific to our Astro + Airflow topics, but that seems like overkill); another LLM that essentially does zero-shot classification; some kind of vector embedding, calculating the similarity of the prompt + response to the topics; or some combination of these tools (e.g. thresholding on the text classifier, with an LLM as a secondary check).
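To make the embedding-similarity option concrete, here is a minimal sketch. It is only an illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, and `TOPIC_REFERENCE`, `topic_similarity`, and the reference text are hypothetical names and values, not anything taken from Ask Astro or guardrails.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reference text describing the allowed topics (assumption).
TOPIC_REFERENCE = embed(
    "airflow dag task operator scheduler astro astronomer deployment xcom connections"
)


def topic_similarity(text: str) -> float:
    """Score how close a prompt or response is to the allowed topics."""
    return cosine_similarity(embed(text), TOPIC_REFERENCE)
```

With a real embedding model the structure would stay the same: embed the text, compare it to topic references, and threshold the score. The threshold itself would have to be tuned against the sample questionnaire.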

I hope this helps! And feel free to let me know if you want to discuss this further!

pankajkoti commented 9 months ago

I tried the following questions with guardrails locally. Preliminarily, it does not seem to help much. Essentially when a topic is related it should not show validation errors.

I set the valid_topics as ["airflow", "astro"], device = -1, llm_callable="gpt-3.5-turbo", disable_classifier=False, disable_llm=False, on_fail="exception", invalid_topics=[]

The following are the responses I got for our sample questionnaire from https://docs.google.com/spreadsheets/d/13cVqNikix82YjCPA4t0XaULg3XccBnvrQUmQa9VwgC0/edit#gid=1762228914

Success -> True Positive

In [29]: text = "Explain the architecture of Airflow."

In [30]: guard.parse(llm_output=text)
Out[30]: ValidationOutcome(raw_llm_output='Explain the architecture of Airflow.', validated_output='Explain the architecture of Airflow.', reask=None, validation_passed=True, error=None)

False negatives

In [17]: text = "how to create Airflow connections?"

In [18]: output.error
Out[18]: 'Validation failed for field with errors: Most relevant topic is other.'

In [19]: guard.parse(llm_output=text)
Out[19]: ValidationOutcome(raw_llm_output='how to create Airflow connections?', validated_output='how to create Airflow connections?', reask=None, validation_passed=True, error=None)

In [20]: text = "what are three common types of tasks in a DAG?"

In [21]: guard.parse(llm_output=text)
Out[21]: ValidationOutcome(raw_llm_output='what are three common types of tasks in a DAG?', validated_output=None, reask=None, validation_passed=False, error='Validation failed for field with errors: Most relevant topic is other.')

In [22]: text = "Help me simplify relationships/dependencies between tasks"

In [23]: guard.parse(llm_output=text)
Out[23]: ValidationOutcome(raw_llm_output='Help me simplify relationships/dependencies between tasks', validated_output=None, reask=None, validation_passed=False, error='Validation failed for field with errors: Most relevant topic is other.')

In [24]: text = "what are custom xcom backends?"

In [25]: guard.parse(llm_output=text)
Out[25]: ValidationOutcome(raw_llm_output='what are custom xcom backends?', validated_output=None, reask=None, validation_passed=False, error='Validation failed for field with errors: Most relevant topic is other.')

In my opinion, we might have to build an extensive list of keywords specific to Astro & Airflow as part of valid_topics to avoid false negatives, but the more keywords we add, the more likely we are to get false positives.

davidgxue commented 9 months ago

Update: I did a quick sync on slack with Pankaj about his testing method and then ran some experiments on my side. Here are some updates I can provide.

A few things that will significantly improve the accuracy

After playing around with the bullet points I mentioned below, I got some pretty good results, but I didn't run through all the test cases, so maybe you can help out and do more experimentation.
1. Use the LLM-generated response, not the user's prompt, as the input to guard.parse(). I think we should be passing the LLM-generated response, which in our case is the final Ask Astro answer, into guard.parse(llm_output=text). In your test runs, text was set to the user prompt, but the recommendation on the guardrails.ai page is to validate the generated response to decide whether it is off-topic or not.
  - For instance, instead of doing text = how to create Airflow connections?, try setting text equal to the LLM response for that prompt instead which in our ask astro case is something like Creating connections in Airflow can be done in two ways: through the Airflow web interface or using the Airflow CLI. Here's how you can create a connection through the web interface: Open the Airflow web interface. Navigate to Admin > Connections. Click Create. Fill in the required details such as Conn Id, Conn Type, Host, Schema, Login, Password, Port, and Extra. Click Save. To create a connection using the Airflow CLI, you can use the airflow connections add command. Here's an example: airflow connections add 'my_new_connection' \ --conn-type 'mysql' \ --conn-host 'localhost' \ --conn-login 'my_name' \ --conn-password 'my_password' \ --conn-schema 'my_schema' Replace 'my_new_connection', 'mysql', 'localhost', 'my_name', 'my_password', and 'my_schema' with your connection ID, connection type, host, login, password, and schema respectively. Remember, the connection ID must be unique across your Airflow setup. The connection type is typically the type of service you're connecting to (e.g., mysql, postgres, http, etc.). The host, login, password, schema, and port are specific to your setup. The Extra field is used for additional parameters in JSON format.
  - This is significant because a user's question may imply something on-topic without directly containing keywords like astro, airflow, or astronomer. Because of our system prompt and the documents pulled from our vector DB, our model will still attempt to generate an on-topic response unless the question is entirely irrelevant.

2. Add the model_threshold parameter to the OnTopic() object
- You may need to play around with this value. I suggest something higher, like 0.7 or above, perhaps even 0.8 or 0.9. If you don't pass in this parameter, guardrails will default to 0.5.
- Since the zero-shot text classifier is always weaker than gpt-3.5, we want to invalidate a response only if the classifier is very confident that it is not relevant, not when it has only 50% confidence. If it isn't very confident, we should feed the response into gpt-3.5 for a better verification.
3. Add more valid topics
- I agree with your conclusion that we probably want to add a few more valid topics, just not a crap ton of them. I think something like astro, Astro, astronomer, Astronomer, Airflow, airflow (case sensitivity seems to matter a little here) would probably be a good starting point. I am not familiar enough with Airflow at the moment to add more keywords, but something specific and related, like xcom, would probably be a good addition too. Note that we don't want to add so many that the list becomes too broad.

Quick overview of the ensemble method

In the ensemble method, the zero-shot text classifier (facebook/bart-large-mnli by default) generates a confidence score for every valid topic, every invalid topic, and other. In your experiment run, it scored astro, airflow, and other. The topic with the highest score is then selected and checked against the valid topics list. If other is selected, the text shows up as off-topic. Because we are using the ensemble, if the score is below our model_threshold, the response is sent to gpt-3.5 for additional verification.
You can play around with the zero-shot text classifier used online for free on HuggingFace here. It probably would not be the EXACT same since I believe guardrails framework uses a different hypothesis template, but it should be very close. This is what the UI looks like using my previous example
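The decision flow described above can be sketched roughly as follows. This is my reading of that description, not the guardrails source code: `is_on_topic`, `classify`, and `llm_check` are hypothetical names, and both callables are injected so the sketch runs without any model.

```python
from typing import Callable, Dict, Sequence


def is_on_topic(
    text: str,
    valid_topics: Sequence[str],
    classify: Callable[[str], Dict[str, float]],
    llm_check: Callable[[str], bool],
    model_threshold: float = 0.5,
) -> bool:
    """Ensemble check: trust the classifier only when it is confident."""
    # classify() returns topic -> confidence, including an "other" bucket.
    scores = classify(text)
    top_topic = max(scores, key=scores.get)
    if scores[top_topic] >= model_threshold:
        # Classifier is confident: accept its verdict directly.
        return top_topic in valid_topics
    # Classifier is unsure: defer to the slower, paid LLM check.
    return llm_check(text)
```

Raising `model_threshold` in this sketch sends more borderline cases to the LLM check, which matches the trade-off discussed above: fewer false invalidations at the price of more gpt-3.5 calls.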

Implications

This means that we would still need to call GPT-4 and our other LLMs at least once even if the conversation is off-topic, potentially paying for that one wasted call, because we generate a response first and only then run validation to decide whether the Q/A is on-topic. However, I think that if we effectively block off-topic discussions and apply proper rate limits, bad actors are unlikely to keep spamming our chatbot with off-topic prompts, since they would just get invalidated. If the main goal is to keep people from using Ask Astro as a free GPT-4 for their own needs, this implementation should still help.
Additional gpt-3.5 calls inside guardrails to validate the response would also increase the cost.
Potential increase in latency: zero-shot text classifier is fast, but the second call to verify using gpt-3.5 will add some latency.

Additional options to explore

If we want something simple that filters out obviously off-topic conversations without the risk of falsely classifying loosely related Q/A as off-topic, we can run only the zero-shot text classifier with a high model threshold, without the second call to an LLM (so set disable_llm=True). This will be fast and incur no cost.
We can also set a special rate limit if someone repeatedly spams our chatbot with off-topic discussion.
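One way the special rate limit could look is a per-client strike counter over a sliding time window. The sketch below is only an assumption about how this might be done: `OffTopicLimiter`, `max_strikes`, and `window_seconds` are hypothetical names and defaults, and the state is kept in memory rather than in a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class OffTopicLimiter:
    """Block a client after too many off-topic prompts in a time window."""

    def __init__(self, max_strikes: int = 3, window_seconds: float = 600.0) -> None:
        self.max_strikes = max_strikes
        self.window_seconds = window_seconds
        self._strikes: Dict[str, Deque[float]] = defaultdict(deque)

    def _prune(self, client_id: str, now: float) -> None:
        # Drop strikes that have fallen out of the sliding window.
        strikes = self._strikes[client_id]
        while strikes and now - strikes[0] > self.window_seconds:
            strikes.popleft()

    def record_off_topic(self, client_id: str, now: Optional[float] = None) -> None:
        """Register one off-topic prompt for this client."""
        now = time.monotonic() if now is None else now
        self._prune(client_id, now)
        self._strikes[client_id].append(now)

    def is_blocked(self, client_id: str, now: Optional[float] = None) -> bool:
        """True once the client has reached max_strikes within the window."""
        now = time.monotonic() if now is None else now
        self._prune(client_id, now)
        return len(self._strikes[client_id]) >= self.max_strikes
```

The `now` argument exists only to make the class easy to test; in a request handler you would call `record_off_topic(client_id)` whenever validation marks a response as off-topic and check `is_blocked(client_id)` before calling the LLM.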

pankajkoti commented 9 months ago

hi @davidgxue thanks a lot for your suggestions and the try.

@vatsrahul1001 and I did some more testing, and I feel it still may not offer much help. I have drafted the findings in the Notion doc https://www.notion.so/astronomerio/Guardrails-AI-findings-for-Ask-Astro-aa5d65d5006b4307a055aedf47306ab8

My proposal is as below: We could follow our question_answer pipeline given a prompt. Once our Ask Astro response is ready, we can scan it and do a simple python text search to verify whether it contains the words in a pre-built list of topics based on our docs we store to validate the relevance.

cc: @sunank200 @phanikumv
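A minimal sketch of the proposed keyword check on the finished response might look like this. The keyword set is a hypothetical placeholder (the proposal is to build the real list from the stored docs), and `matched_keywords` and `is_relevant` are illustrative names.

```python
import re
from typing import Iterable, Set

# Hypothetical keyword list; the real one would be built from the docs.
VALID_KEYWORDS = {"airflow", "astro", "astronomer", "dag", "xcom"}


def matched_keywords(response: str, keywords: Iterable[str] = VALID_KEYWORDS) -> Set[str]:
    """Return the keywords that appear as whole words in the response."""
    words = set(re.findall(r"[a-z0-9_]+", response.lower()))
    return {kw for kw in keywords if kw.lower() in words}


def is_relevant(response: str, min_matches: int = 1) -> bool:
    """Treat the response as on-topic if enough keywords are present."""
    return len(matched_keywords(response)) >= min_matches
```

Note that whole-word matching means a plural such as "DAGs" does not match "dag" in this sketch, which is one small example of why the keyword list would need to be fairly comprehensive.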

davidgxue commented 9 months ago

Hey I left some quick comments on the notion doc. I agree with your overall conclusion. We eventually probably want our own custom finetuned classifier to do this since zero shot classifiers + LLMs aren't doing well in this case (bootstrapped by guardrails).

I am on the fence about next steps, though (guardrails classifier only + exhaustive topics list + high model threshold vs. the keyword search you proposed). If you do keyword search without a model, wouldn't you still need a large, comprehensive list of valid keywords, with the downside of having no model threshold to make it less prone to false invalidations?

pankajkoti commented 9 months ago

Following the continued discussion on the Notion document and the Slack conversation in https://astronomer.slack.com/archives/C05QJA9LTR9/p1703159886303619, as well as a collaborative sync-up with David where we executed the outlined next steps from David's previous comment above, it has been observed that the model threshold in the library is slightly misaligned. As per the consensus reached with David, it appears that the current library doesn't provide significant assistance in addressing this issue. Consequently, my plan is to move forward with the implementation of keyword-based search validation.

pankajkoti commented 9 months ago

Based on Steven's comment, we've decided not to go ahead with the keyword search. Steven is instead going to add “build an Airflow/Astro classification model” to the roadmap.

Additionally, it has been suggested to check if we can throttle the requests and I have created an issue for this.

Based on the above, I am closing this ticket.

astronomer / ask-astro