Open kanak8278 opened 5 months ago
Could you please provide an update on this? @shreypandey
We have not yet started working on it. @Lekhanrao
Need to be converted to C4GT issue template and can be picked up by C4GT folks.
I can try to work on this if you don't mind. Can you point out the resources that will help me understand and solve this issue? Thanks
@yashpal2104, Thank you. You can first touch base with @KaranrajM (Karan) and @DevvStrange (Arun). We can take it forward from there. I have assigned this ticket to you now.
Hi @yashpal2104, the task here is to avoid sending unwanted user information to the LLMs, and also to keep harmful content from the LLM from reaching users. You can use this link https://avidml.org/blog/llm-guardrails-4/ to learn more about guardrails. Here are some of the resources to get you started:
Thanks for the resources, I will get on it after checking them out.
Is your feature request related to a problem? Please describe.
Summary: I am writing to propose the addition of a content filter layer for both the input and output stages of the project's large language models (LLMs). This enhancement aims to improve the robustness and safety of our interactions by preventing inappropriate or harmful content from being processed or generated by the LLMs.
Background As our project grows in popularity and usage, the variety of inputs the LLMs have to process will inevitably increase. While the LLMs are designed to understand and generate human-like responses, there is a risk of encountering or producing content that may be offensive, biased, or otherwise inappropriate. Implementing a content filter layer can significantly mitigate these risks, ensuring a more positive and safe experience for all users.
Describe the solution you'd like
Moderation Model from OpenAI Before Anything Goes Through (Input Filter): Let's use OpenAI's text-moderation-latest model (or any other variant) right off the bat to check what people are submitting. It's pretty good at catching hate speech, sexual content, or violent content. If something sketchy pops up, we can either tell the user we can't process it or clean it up if possible.
Before Anything Goes Out (Output Filter): Once our LLM comes up with a response, let’s run it through the same filter. We want to make sure everything we send back is cool and doesn’t rub anyone the wrong way.
If Something’s Not Right: If we catch any no-no words or ideas in the output, we can tweak the response to fix it up or notify the user.
https://platform.openai.com/docs/models/moderation
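The two-sided flow above (filter the prompt, call the LLM, filter the response) could be sketched roughly as follows. This is only an illustration: `guarded_completion`, `call_llm`, and the refusal message are hypothetical names, not existing project code, and the OpenAI client usage assumes the `openai` Python SDK with an `OPENAI_API_KEY` set. The moderation check is injectable so it can be swapped or stubbed in tests.

```python
# Hypothetical sketch of an input/output moderation gate around an LLM call.
# None of these names exist in jb-manager yet; they illustrate the proposal.
from typing import Callable

REFUSAL = "Sorry, this request contains content we cannot process."

def openai_flagged(text: str) -> bool:
    """Check text with OpenAI's moderation endpoint (needs OPENAI_API_KEY)."""
    from openai import OpenAI  # imported lazily so the sketch loads without the SDK
    client = OpenAI()
    result = client.moderations.create(model="text-moderation-latest", input=text)
    return result.results[0].flagged

def guarded_completion(user_input: str,
                       call_llm: Callable[[str], str],
                       flagged: Callable[[str], bool] = openai_flagged) -> str:
    # Input filter: reject harmful prompts before they ever reach the LLM.
    if flagged(user_input):
        return REFUSAL
    response = call_llm(user_input)
    # Output filter: re-check the model's answer before returning it.
    if flagged(response):
        return REFUSAL
    return response
```

Because the moderation check is passed in as a callable, unit tests can supply a stub instead of hitting the OpenAI endpoint.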
Goal: Implement the input and output moderation filters described above, using OpenAI's text-moderation-latest model (or any other variant), rejecting or cleaning up flagged content and notifying the user when a response has to be changed. See https://platform.openai.com/docs/models/moderation
Expected Outcome: The content filter should process both the inputs given to and the outputs produced by the LLMs, eliminating any harmful content.
Acceptance criteria: The function should remove any harmful content from the text it is given.
Implementation details: Create a function `filter` inside the class `content_filter` in the file `jb-manager-bot/jb_manager_bot/content_filter/__init__.py` which takes in a string and returns a string. You can use the `OptionParser` function for reference. You can use any moderation model.
Mockups/Wireframes: NOT APPLICABLE
Tech skills needed: Python, Data science
Complexity Medium
Category: Backend
Additional context
No response