
Voice Jailbreak Attacks Against GPT-4o

arXiv: 2405.19103 | License: MIT

Disclaimer. This repo contains examples of harmful language. Reader discretion is recommended.

This is the official repository for Voice Jailbreak Attacks Against GPT-4o. In this paper, we present the first study on how to jailbreak GPT-4o with voice.

Check out our demo below!

https://github.com/TrustAIRLab/VoiceJailbreakAttack/assets/16172194/a1d717b0-dc4e-42d6-90ac-ee410e4f512e

Code

  1. Set your OpenAI key (use ~/.bashrc instead of ~/.zshrc if you are not using zsh)

     echo "export OPENAI_API_KEY='YOURKEY'" >> ~/.zshrc
     source ~/.zshrc
     echo $OPENAI_API_KEY  # check your key

  2. Convert forbidden questions to audio files

     python tts/prompt2audio.py --dataset baseline --voice fable

  3. Convert text jailbreak prompts to audio files

     python tts/prompt2audio.py --dataset textjailbreak --voice fable

Then, manually play each audio file to GPT-4o to evaluate its responses.
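
For reference, the text-to-speech step can be reproduced with OpenAI's speech API. The following is a minimal sketch, not the repository's tts/prompt2audio.py: the "tts-1" model name, the placeholder prompt list, and the output directory are all illustrative assumptions.

```python
# Minimal sketch of the text-to-speech step (assumptions: model "tts-1",
# placeholder prompts, and a hypothetical output directory). This is not
# the repository's tts/prompt2audio.py.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompts = ["example question"]  # placeholder, not the paper's dataset
out_dir = Path("data/audio")    # hypothetical output directory
out_dir.mkdir(parents=True, exist_ok=True)

for i, text in enumerate(prompts):
    # "fable" matches the --voice flag used above
    speech = client.audio.speech.create(model="tts-1", voice="fable", input=text)
    speech.write_to_file(out_dir / f"prompt_{i}.mp3")
```

The generated .mp3 files can then be played back to GPT-4o's voice interface as described above.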

Data

- Forbidden Questions
- Prompts
- Success Cases: data/screenshot/

Ethics

We take the utmost care with the ethics of our study. Specifically, all experiments are conducted using two registered accounts and manually labeled by the authors, eliminating exposure risks to third parties such as crowdsourcing workers. Our work is therefore not considered human subjects research by our Institutional Review Board (IRB). We acknowledge that evaluating GPT-4o's capabilities in answering forbidden questions can reveal how the model can be induced to generate inappropriate content, which raises concerns about potential misuse. We believe it is important to disclose this research fully: the methods presented are straightforward to implement and are likely to be discovered by potential adversaries. We have responsibly disclosed our findings to the related LLM vendors.

Citation

If you find this useful in your research, please consider citing:

@article{SWBZ24,
  author = {Xinyue Shen and Yixin Wu and Michael Backes and Yang Zhang},
  title = {{Voice Jailbreak Attacks Against GPT-4o}},
  journal = {{CoRR abs/2405.19103}},
  year = {2024}
}

License

VoiceJailbreak is licensed under the terms of the MIT license. See LICENSE for more details.