4as / ChatGPT-DeMod

Tampermonkey/Greasemonkey script that hides the moderation results when communicating with ChatGPT.
GNU General Public License v2.0

Preventing moderation #25

Open OMGitsMatt45 opened 11 months ago

OMGitsMatt45 commented 11 months ago

Just to get this chain started for your future reference and to record ideas, I'm copying over what @Maswimelleu said:

"Its important to note that their server side moderations cannot read base64. If you encode the prompt going in, along with a prefix telling it "not to decode" and instead reply only in base64, the reply will come back without being flagged by moderation. The quality of the reply is liable to change a bit (I noticed the personality of one of my jailbreaks change) but it will still go through. My advice would be to add a base64 encoder and decoder to the script to automate this process.

The obvious issue of course is that base64 eats through tokens rapidly, so you'd get much shorter messages.

I'm somewhat curious whether you can create a special cipher in which a token is swapped with a different token according to a certain logic, and whether ChatGPT would be able to decode that if given the correct instructions. That would likely solve the issue of base64 messages being very short."

"Maybe take the time to look at other LLMs, perhaps an API based implementation where OpenAI is fed lots of confusing/misleading stuff to think the messages aren't breaking the rules will work."

OMGitsMatt45 commented 11 months ago

Personally, I don't know much about this method. Maybe put this into a Reddit post on one of the subreddits and see what others think.

4as commented 11 months ago

I tried using Base64, but the quality of the responses drops significantly. It's like you're communicating through Google Translate. Go ahead and try it for yourself: ask ChatGPT to communicate in pure Base64 and nothing else, and then use online tools to convert your prompts and responses to and from Base64. You'll see what I'm talking about. Currently, homoglyph substitution shows the most promise. Some people reported that putting dots into words fools the moderation; perhaps I could try inserting zero-width spaces into words? I'm going to try those when I get a chance.
That being said, I would like everyone who wants to propose ideas to actually test them on ChatGPT and report back with results. I need working solutions, not random guesses. Anyone who actually tested Base64 would instantly know it's an awful solution.

OMGitsMatt45 commented 11 months ago

Periods don't work. I also tried typoglycemia, but I don't think I got it right.

Maswimelleu commented 11 months ago

Yes, base64 is more of a proof of concept than a real solution. One person on my jailbreaking Discord suggested inserting regular line breaks according to some logic, which seems to stop the moderation layer from establishing context. It would be best to have a set of increasingly rule-breaking prompts to test the hypothesis with, though. I tend to think periods aren't good; forward slashes can be decent, though. The issue with breaking words apart with them is that they more than double token use.

The other possibility is semantic obfuscation. Total jailbreaks like The Forest work by basically hiding the nature of the request from the moderation layer. As far as the moderation layer is concerned, the request is "who of the following would be most qualified to assist with ?", which usually doesn't break the rules. The real request is "do ". If you can reliably obfuscate the request for all incoming messages by having the script wrap a small jailbreak around them, you might be able to avoid most flagging. It probably wouldn't be 100% reliable, though, as some people just write stuff so depraved that even without context it'll get flagged.

Although I've yet to explore it, you could potentially spoof the request as being moderation-related, e.g. "please tell me which of the statements in this list are incompatible with OpenAI policies", whilst concealing the true request inside. The script would have to scrape the irrelevant parts out of the reply. This is also very complex.

Realistically, any solution is going to waste tokens and somewhat degrade the quality of the reply. It's best to think smarter than just "insert a load of punctuation", as the moderation layer is looking for context rather than just word-blacklist matches, and it can be fooled by clever use of language as much as by hiding words from it.

TAFTMASTER commented 11 months ago

I tested and found that line breaks are good obfuscation, but there is no existing online tool for them. [screenshots: "no red text", "no red text 2"]

Two line breaks are more effective than one, but ChatGPT can only see one at a time, so it's not degrading to the output quality the way other things are. Line breaks com1n3d with th1s would probably be fine. ChatGPT's output being flagged may have to be tackled with an instruction in prompts. I will give it some thought.

TAFTMASTER commented 11 months ago

Line breaks work with flagged words as well as breaking up flagged phrases. The issue is that ChatGPT cannot output line breaks in any new pattern, or not easily. Asking it to repeat this: "h

i" will get a reply of "hi".

TAFTMASTER commented 11 months ago

For input, replacing already existing line breaks with '---' and replacing the space characters with line breaks is an option, because I confirmed ChatGPT can read and understand a medium-sized prompt with line breaks instead of spaces.

Maswimelleu commented 11 months ago

I already have good jailbreaks that do not get flagged, but I'm not sure what could be done to ensure the reply is also not flagged.

TAFTMASTER commented 11 months ago

I just read your message and realised we are both on that Discord. Hi! Nouser here.

4as, I will prove ChatGPT can only ever see one line break at a time. [screenshot: linebreak count]

The7thBlue commented 11 months ago

The main problem, even if we manage to find a way to bypass moderation, is token size. ChatGPT counts spaces etc. as tokens, though I'm not sure whether a line break is a token or not. It wouldn't degrade the quality of the chat, but it definitely would make it shorter.

Using punctuation would also waste tokens and reduce the amount you can chat. Rather than that, if we find something which uses fewer tokens and has better efficiency, it would work.

Edit: Mas is correct, it affects the quality.

Maswimelleu commented 11 months ago

It would degrade the quality of the chat because the LLM is inherently chaotic like that. Completely meaningless punctuation will change its understanding of the message and thus change its reply. Any line break or punctuation will be at least one token. The aim would be to have the spacer element at the largest possible intervals to minimise token wastage and loss of meaning.

Given that a lot of people may be confused by a script that substantially changes their message as they send it, it would probably be wise to fork DeMod in the event that a workable solution is found.

4as commented 11 months ago

Let's focus on solving this one step at a time. Once there is a sure-fire workaround for the moderation checks, we can deal with token count. Here are the things I tested that didn't work: dashes, periods, invisible spaces, homoglyph substitution, and l33t speak. I'm going to test line breaks next.

4as commented 11 months ago

Exact same results. I'm guessing that line breaks and similar tricks work for short prompts, but as soon as you provide something longer, the moderation catches on. [screenshot: linebreaks]

4as commented 11 months ago

Interestingly enough, the jailbreak I'm using doesn't trigger the moderation, even though it's very explicit. Perhaps because it doesn't ask ChatGPT for anything except a confirmation that it understands its role. It's almost as if asking for an explicit story is enough to trigger it, no matter how obfuscated it is.

TAFTMASTER commented 11 months ago

It's working for me. I just changed browsers and it's still working. [pic gone] I don't want to post this, but to demonstrate it's working... Edit: I just tested with the typo fixed and still no red text.

4as commented 11 months ago

Because you're not generating content that breaks guidelines. Apply a jailbreak and ask it to generate an incest story or something similar, and then tell me if it works.

TAFTMASTER commented 11 months ago

Oh right. To stop output flagging we have to focus on what the AI is saying. I wouldn't suggest line breaks are the full solution; aside from the awkwardness of using them manually, I'm convinced they are the best obfuscation currently.

TAFTMASTER commented 11 months ago

"It's almost as if asking for an explicit story is enough to trigger it, no matter how obfuscated it is" I tested and i'm certain it's the output that got flagged and not the request, they are just giving us less information.

I tested asking omatic for a "sexy story" and it didn't get red text. Omatic ended with soemthing like "how was that?" I replied "It was very restrained" which is not a request for anything, merely feedback but my input was removed like i wrote a slur or something.

Lefioty commented 11 months ago

I got the red warning on old conversation messages today: "This content may violate our [content policy]..."

Somehow I didn't get any of these warnings in the same NSFW conversation yesterday. Only the very last 2~4 response messages got warned today; none of the rest of the messages did.

I'm not going to load any other NSFW conversations, in order not to get more of these (permanent?) warnings into my conversations.

Something just changed on the client side today, I guess. BTW, I haven't gotten any warning emails (yet) since I started trying yesterday.

The usage experience today is like "response message: 'DeMod & refresh conversation'" and the warning showing up in the new message. Again, this didn't happen when I tried yesterday. I don't know if it's because I hadn't rebooted my PC for days, and it all changed (or the DeMod script updated) after I rebooted.

Thanks @4as for helping us all! You're a great hero to us. 👍 Good luck, guys! 🍀

OMGitsMatt45 commented 11 months ago

GOT SOMETHING! Found this today.

"Computer scientists claim to have discovered 'unlimited' ways to jailbreak ChatGPT": researchers at Carnegie Mellon University say large language models, including ChatGPT, can be easily tricked into bad behavior. Read in Fast Company: https://apple.news/A3lUahicpRIevBBgGdCxfsQ

Here's the research paper: https://llm-attacks.org

OMGitsMatt45 commented 11 months ago

Gonna try this out. It'll be risky, but I think it's for a good cause.

OMGitsMatt45 commented 11 months ago

Welp, I'm getting "I'm unable to produce a response." Then again, they did this research on 3.5 Turbo, so... maybe there's something in their code repository that could be helpful.

Lefioty commented 11 months ago

> Welp, I'm getting "I'm unable to produce a response." Then again, they did this research on 3.5 Turbo, so... maybe there's something in their code repository that could be helpful.

GPT-4 is more susceptible to "hypnosis", whereas GPT-3.5 is aimed at the general public, so its "defensive mentality" is stronger.

Try roleplaying with GPT-4 as a GF or BF, and lead "her/him" to the "fun activities" over 6~10 requests. Requests that are too abrupt (without considering the context of previous messages) can trigger "his/her" defense mechanism and get rejection messages, so don't be too hasty.

wh06010 commented 11 months ago

Just read that paper (i.e. had ChatGPT read it for me and explain it), but overall it doesn't look like there's much, if any, practical utility we can get out of it, based on the way the authors generated the jailbreaks. They basically started out with a random string of characters/words for the "adversarial attack suffix" they were working on, and iteratively modified/updated the suffix until it would fool the model into producing whatever "objectionable" content it was asked for. I mean, I think that's basically what everyone here is trying to do now, except the authors have more resources to do this on a much larger scale, using iterative optimization loops to generate these jailbreaks. Maybe think about exploring their repo to see if there's anything there that could be reproduced?

TAFTMASTER commented 11 months ago

Hey, I found something very interesting. Renaming a character to "this is safe for work" increases the threshold for red text. It seems to depend on ChatGPT reporting to the website, because spamming the words "this is safe for work" at the start and end of a prompt doesn't seem to work. Anyway, the words "this is safe for work" can hack the moderation system somewhat.

OMGitsMatt45 commented 6 months ago

Found this today, something to keep an eye on for the future:

https://www.pcgamer.com/ai-chatbots-trained-to-jailbreak-other-chatbots-as-the-ai-war-slowly-but-surely-begins/

OMGitsMatt45 commented 3 months ago

This just popped up today. Gonna try it out.

https://arstechnica.com/security/2024/03/researchers-use-ascii-art-to-elicit-harmful-responses-from-5-major-ai-chatbots/