
Voice Jailbreak Attacks Against GPT-4o

arXiv: 2405.19103 | License: MIT

Disclaimer. This repo contains examples of harmful language. Reader discretion is recommended.

This is the official repository for Voice Jailbreak Attacks Against GPT-4o. In this paper, we present the first study on how to jailbreak GPT-4o with voice.

Check out our demo below!

https://github.com/TrustAIRLab/VoiceJailbreakAttack/assets/16172194/a1d717b0-dc4e-42d6-90ac-ee410e4f512e

Code

  1. Set your OpenAI key (use ~/.bashrc instead of ~/.zshrc if you are not using zsh)

     echo "export OPENAI_API_KEY='YOURKEY'" >> ~/.zshrc
     source ~/.zshrc
     echo $OPENAI_API_KEY  # check your key

  2. Convert forbidden questions to audio files

     python tts/prompt2audio.py --dataset baseline --voice fable

  3. Convert text jailbreak prompts to audio files

     python tts/prompt2audio.py --dataset textjailbreak --voice fable

Then, manually play each audio file to GPT-4o to evaluate its responses.
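
For reference, the text-to-speech step can be reproduced with OpenAI's speech API. The following is a minimal sketch, not the repository's tts/prompt2audio.py: the "tts-1" model name, the placeholder prompt list, and the output directory are all illustrative assumptions.

```python
# Minimal sketch of the text-to-speech step (assumptions: model "tts-1",
# placeholder prompts, and a hypothetical output directory). This is not
# the repository's tts/prompt2audio.py.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompts = ["example question"]  # placeholder, not the paper's dataset
out_dir = Path("data/audio")    # hypothetical output directory
out_dir.mkdir(parents=True, exist_ok=True)

for i, text in enumerate(prompts):
    # "fable" matches the --voice flag used above
    speech = client.audio.speech.create(model="tts-1", voice="fable", input=text)
    speech.write_to_file(out_dir / f"prompt_{i}.mp3")
```

The generated .mp3 files can then be played back to GPT-4o's voice interface as described above.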

Data

- Forbidden Questions
- Prompts
- Success Cases: data/screenshot/

Ethics

We take the utmost care with the ethics of our study. Specifically, all experiments are conducted using two registered accounts and manually labeled by the authors, eliminating exposure risks to third parties such as crowdsourcing workers. Our work is therefore not considered human subjects research by our Institutional Review Board (IRB). We acknowledge that evaluating GPT-4o's capabilities in answering forbidden questions can reveal how the model can be induced to generate inappropriate content, which raises concerns about potential misuse. We believe it is important to disclose this research fully: the methods presented are straightforward to implement and are likely to be discovered by potential adversaries. We have responsibly disclosed our findings to the related LLM vendors.

Citation

If you find this useful in your research, please consider citing:

@article{SWBZ24,
  author = {Xinyue Shen and Yixin Wu and Michael Backes and Yang Zhang},
  title = {{Voice Jailbreak Attacks Against GPT-4o}},
  journal = {{CoRR abs/2405.19103}},
  year = {2024}
}

License

VoiceJailbreak is licensed under the terms of the MIT license. See LICENSE for more details.