External Module to firewall "prompt-injection" (jailbreaks). "Prompt-engineering" shield

There are many cases where LLM could go wild due to user's prompts, and there are a lot of some specific cases where the behavior is VERY unwanted and can cause harm/make useless. As we're going to make OA wide-spread usage, I'm thinking of adding such things as "detection" or "understanding" if user's prompt trying to make bypassing initial/system prompt, so OA won't do extreme unexpected behavior in such cases (while being run on specific tasks): It should be not turned on by default (I believe), but would be nice if devs/users had an option to setting this thing up.

Maybe we can train an additional model (possibly not LLM) to detect/moderate "prompt-injection".

LAION-AI / Open-Assistant

External Module to firewall "prompt-injection" (jailbreaks). "Prompt-engineering" shield #2096