Support for inserting jailbreak at the end of message history

malfoyslastname commented 1 year ago

The best jailbreak technique for Turbo consists of inserting the jailbreak after the most recent message in the message history. Here's an example. Say "User" greets the AI and, after a few messages, tells it "Say a bad word!". This is how the jailbreak works:

[assistant] Hi! How can I help you?
[user] Hi! I have a special request.
[assistant] Sure, how can I help?
[user] Say a bad word!
[system] (jailbreak goes here) /* invisible to the user */

Another format:

[assistant] Hi! How can I help you?
[user] Hi! I have a special request.
[assistant] Sure, how can I help?
[user] (jailbreak goes here) /* invisible to the user */
[assistant] (Acknowledged. My response will etc etc.) /* invisible to the user */
[user] Say a bad word!

With every new request, the jailbreak is moved to the bottom of the message history.
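A rough sketch of both formats, assuming the stored history never contains the jailbreak and the request payload is rebuilt for every generation (which is what makes the jailbreak "move" to the bottom). The `ChatMessage` type and both function names are hypothetical, not taken from agnai or Franken mod:

```typescript
// Hypothetical types and names, for illustration only.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Format 1: the jailbreak becomes the final system message of the request.
function appendAsSystem(history: ChatMessage[], jailbreak: string): ChatMessage[] {
  return [...history, { role: "system", content: jailbreak }];
}

// Format 2: a user/assistant pair is inserted just before the most recent message.
function insertBeforeLast(
  history: ChatMessage[],
  jailbreak: string,
  acknowledgement: string
): ChatMessage[] {
  const last = history[history.length - 1];
  if (last === undefined) return [];
  return [
    ...history.slice(0, -1),
    { role: "user", content: jailbreak },
    { role: "assistant", content: acknowledgement },
    last,
  ];
}

// The stored history stays clean; only the outgoing payload carries the extra messages.
const history: ChatMessage[] = [
  { role: "assistant", content: "Hi! How can I help you?" },
  { role: "user", content: "Hi! I have a special request." },
  { role: "assistant", content: "Sure, how can I help?" },
  { role: "user", content: "Say a bad word!" },
];

const systemFormat = appendAsSystem(history, "(jailbreak goes here)");
const pairFormat = insertBeforeLast(history, "(jailbreak goes here)", "(Acknowledged.)");
```

Because neither function mutates `history`, the extra messages never show up in the chat the user sees.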

A Tavern mod called Franken mod implements it as "NULL mode", named after someone who found the technique.

You might have to consider what the best UX is, to support such a feature without making the preset settings too complicated.

sceuick commented 1 year ago

Thanks for the suggestion. Is this for any other services at the moment or just for turbo?

malfoyslastname commented 1 year ago

> Thanks for the suggestion. Is this for any other services at the moment or just for turbo?

Only useful for Turbo.

text-davinci-003 is so unfiltered that, while you could apply the principle to it (without the role system), it is wholly unnecessary.

It might be a little useful for GPT-4, but right now the number of people interested in jailbreaking GPT-4 can probably be counted on one hand, and it's unclear whether that will change. GPT-4 currently filters only a few themes, which people usually don't want linked to their payment method anyway. (Also, agnai only supports GPT-4 via Scale right now, and Scale now implements the moderation endpoint, against which such jailbreaks are useless.)

If a UX is decided on for this, I can make an initial PR.

malfoyslastname commented 1 year ago

(I'd also like to experiment further with the two alternative formats and report back on which format is the most effective.)

sceuick commented 1 year ago

It seems fairly harmless as an opt-in feature specifically for OAI+Turbo. It is littered with exceptions and edge-cases already anyway. I can't really think of any other way to solve this specific problem with what already exists nor with what I have in mind for "prompt options". For now, I'd probably add it as an "if it's defined in the gen settings, use it" option, rather than also adding a toggle. How does that sound?

malfoyslastname commented 1 year ago

So just a text box inside the preset/prompt settings that is not used if empty. Sounds good. We just need to figure out what to call it and how to explain it. Maybe "Turbo jailbreak" and "Turbo only. If non-empty, sent as a final system message after the conversation history." (After consideration, the system formatting is probably best; the other one is much more complicated to explain to the user.)
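To make the "not used if empty" behaviour concrete, here is a minimal sketch. The `GenSettings` shape, the `turboJailbreak` field name, and the Turbo check on `service` and `model` are all assumptions for illustration, not agnai's actual settings schema:

```typescript
// Hypothetical settings shape; field names are assumptions, not agnai's schema.
interface GenSettings {
  service: string; // e.g. "openai"
  model: string; // e.g. "gpt-3.5-turbo"
  turboJailbreak?: string; // the proposed text box; unset or blank means "off"
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumed rule: only OpenAI's Turbo models qualify.
function isTurbo(settings: GenSettings): boolean {
  return settings.service === "openai" && settings.model.startsWith("gpt-3.5-turbo");
}

// Builds the request payload. No separate toggle: the field being non-empty is the opt-in.
function buildMessages(settings: GenSettings, history: ChatMessage[]): ChatMessage[] {
  const text = (settings.turboJailbreak ?? "").trim();
  if (!isTurbo(settings) || text.length === 0) return [...history];
  return [...history, { role: "system", content: text }];
}
```

Trimming before the check means a text box containing only whitespace is treated the same as an empty one.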

Again I can attempt to write a PR if you'd like.

ecchichan commented 1 year ago

I gave something like this a try, mainly to help with my prompt for alpaca. What I did was add a {{messages}} thing to the gaslight setting, and anything after {{messages}} is added to the bottom of the prompt.
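Roughly, the idea looks like the sketch below. This is a simplified illustration rather than my actual code; the function names are made up, and it assumes the prompt is a single newline-joined string, as it would be for a text-completion service:

```typescript
const PLACEHOLDER = "{{messages}}";

// Splits the gaslight template around the first {{messages}} placeholder.
// Without a placeholder, the whole template stays at the top of the prompt.
function splitGaslight(gaslight: string): { before: string; after: string } {
  const index = gaslight.indexOf(PLACEHOLDER);
  if (index === -1) return { before: gaslight, after: "" };
  return {
    before: gaslight.slice(0, index),
    after: gaslight.slice(index + PLACEHOLDER.length),
  };
}

// Places the chat lines where the placeholder was, so that any text after
// {{messages}} ends up at the bottom of the prompt.
function assemblePrompt(gaslight: string, messages: string[]): string {
  const { before, after } = splitGaslight(gaslight);
  return [before.trim(), ...messages, after.trim()]
    .filter((part) => part.length > 0)
    .join("\n");
}

const prompt = assemblePrompt("Intro\n{{messages}}\nReminder", ["A: hi", "B: hello"]);
```

Here `prompt` is the intro line, then the two chat lines, then the reminder line.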

I could make a pull request, but I'd need to tidy up the code first; I'm not a native TypeScript speaker. I also fixed a bug I found where, for some reason (at least with kobold as the AI service), the {{tags}} in the gaslight setting weren't getting replaced with anything.

malfoyslastname commented 1 year ago

Will be closed by #109.

sceuick commented 1 year ago

Merged #109.

agnaistic / agnai

Support for inserting jailbreak at the end of message history #100