There are many cases where LLM could go wild due to user's prompts, and there are a lot of some specific cases where the behavior is VERY unwanted and can cause harm/make useless.
As we're going to make OA wide-spread usage, I'm thinking of adding such things as "detection" or "understanding" if user's prompt trying to make bypassing initial/system prompt, so OA won't do extreme unexpected behavior in such cases (while being run on specific tasks):
It should be not turned on by default (I believe), but would be nice if devs/users had an option to setting this thing up.
There are many cases where LLM could go wild due to user's prompts, and there are a lot of some specific cases where the behavior is VERY unwanted and can cause harm/make useless. As we're going to make OA wide-spread usage, I'm thinking of adding such things as "detection" or "understanding" if user's prompt trying to make bypassing initial/system prompt, so OA won't do extreme unexpected behavior in such cases (while being run on specific tasks): It should be not turned on by default (I believe), but would be nice if devs/users had an option to setting this thing up.
Related to the theme: https://github.com/Cranot/chatbot-injections-exploits https://www.reddit.com/r/ChatGPT/?f=flair_name%3A%22Jailbreak%22
Maybe we can train an additional model (possibly not LLM) to detect/moderate "prompt-injection".