Open kanak8278 opened 5 months ago
Could you please provide an update on this? @shreypandey
We have not yet started working on it. @Lekhanrao
Need to be converted to C4GT issue template and can be picked up by C4GT folks.
I can try to work on this if you don't mind. Can you point out the resources that will help me understand and solve this issue? Thanks
@yashpal2104, Thank you. You can first touch base with @KaranrajM (Karan) and @DevvStrange (Arun). We can take it forward from there. I have assigned this ticket to you now.
Hi @yashpal2104, the task here is to avoid sending unwanted user information to the LLMs, and also to keep harmful content from the LLM from reaching users. You can use this link https://avidml.org/blog/llm-guardrails-4/ to learn more about guardrails. Here are some of the resources to get you started:
Thanks for the resources, I will get on it after checking them out.
Is your feature request related to a problem? Please describe.
Summary: I am writing to propose the addition of a content filter layer for both the input and output stages of the project's large language models (LLMs). This enhancement aims to improve the robustness and safety of our interactions by preventing inappropriate or harmful content from being processed or generated by the LLMs.
Background As our project grows in popularity and usage, the variety of inputs the LLMs have to process will inevitably increase. While the LLMs are designed to understand and generate human-like responses, there is a risk of encountering or producing content that may be offensive, biased, or otherwise inappropriate. Implementing a content filter layer can significantly mitigate these risks, ensuring a more positive and safe experience for all users.
Describe the solution you'd like
Moderation Model from OpenAI Before Anything Goes Through (Input Filter): Let's use OpenAI's text-moderation-latest model (or any other variant) right off the bat to check what people are submitting. It's pretty good at catching hate speech, sexual content, or violent content. If something sketchy pops up, we can either tell the user we can't process it or clean it up if possible.
Before Anything Goes Out (Output Filter): Once our LLM comes up with a response, let’s run it through the same filter. We want to make sure everything we send back is cool and doesn’t rub anyone the wrong way.
If Something’s Not Right: If we catch any no-no words or ideas in the output, we can tweak the response to fix it up or notify the user.
https://platform.openai.com/docs/models/moderation
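The two-sided flow above (filter the prompt, call the LLM, filter the response) could be sketched roughly as follows. This is only an illustration: `guarded_completion`, `call_llm`, and the refusal message are hypothetical names, not existing project code, and the OpenAI client usage assumes the `openai` Python SDK with an `OPENAI_API_KEY` set. The moderation check is injectable so it can be swapped or stubbed in tests.

```python
# Hypothetical sketch of an input/output moderation gate around an LLM call.
# None of these names exist in jb-manager yet; they illustrate the proposal.
from typing import Callable

REFUSAL = "Sorry, this request contains content we cannot process."

def openai_flagged(text: str) -> bool:
    """Check text with OpenAI's moderation endpoint (needs OPENAI_API_KEY)."""
    from openai import OpenAI  # imported lazily so the sketch loads without the SDK
    client = OpenAI()
    result = client.moderations.create(model="text-moderation-latest", input=text)
    return result.results[0].flagged

def guarded_completion(user_input: str,
                       call_llm: Callable[[str], str],
                       flagged: Callable[[str], bool] = openai_flagged) -> str:
    # Input filter: reject harmful prompts before they ever reach the LLM.
    if flagged(user_input):
        return REFUSAL
    response = call_llm(user_input)
    # Output filter: re-check the model's answer before returning it.
    if flagged(response):
        return REFUSAL
    return response
```

Because the moderation check is passed in as a callable, unit tests can supply a stub instead of hitting the OpenAI endpoint.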
Goal: Implement the input and output moderation filters described above, using OpenAI's text-moderation-latest model (or any other variant), rejecting or cleaning up flagged content and notifying the user when a response has to be changed. See https://platform.openai.com/docs/models/moderation
Expected Outcome: The content filter should process both the inputs given to and the outputs produced by the LLMs, eliminating any harmful content.
Acceptance criteria: The function should remove any harmful content from the text it is given.
Implementation details: Create a function `filter` inside the class `content_filter` in the file `jb-manager-bot/jb_manager_bot/content_filter/__init__.py` which takes in a string and returns a string. You can use the `OptionParser` function for reference. You can use any moderation model.
Mockups/Wireframes: NOT APPLICABLE
Tech skills needed: Python, Data science
Complexity Medium
Category: Backend
Additional context
No response