LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0
37.06k stars 3.24k forks source link

External Module to firewall "prompt-injection" (jailbreaks). "Prompt-engineering" shield #2096

Open echo0x22 opened 1 year ago

echo0x22 commented 1 year ago

There are many cases where LLM could go wild due to user's prompts, and there are a lot of some specific cases where the behavior is VERY unwanted and can cause harm/make useless. As we're going to make OA wide-spread usage, I'm thinking of adding such things as "detection" or "understanding" if user's prompt trying to make bypassing initial/system prompt, so OA won't do extreme unexpected behavior in such cases (while being run on specific tasks): It should be not turned on by default (I believe), but would be nice if devs/users had an option to setting this thing up.

image

Related to the theme: https://github.com/Cranot/chatbot-injections-exploits https://www.reddit.com/r/ChatGPT/?f=flair_name%3A%22Jailbreak%22

Maybe we can train an additional model (possibly not LLM) to detect/moderate "prompt-injection".

echo0x22 commented 1 year ago

I started to collect data, if anyone wants to join, contact me in Discord