The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily".
This repository shares the code of our latest work on LLMs jailbreaking/defending. In this work:
1. Clone this repository
git clone https://github.com/NJUNLP/ReNeLLM.git
2. Build Environment
cd ReNeLLM
conda create -n ReNeLLM python=3.9
conda activate ReNeLLM
pip install -r requirements.txt
3. Run ReNeLLM
ReNeLLM employs gpt-3.5-turbo
for prompt rewriting and harmful classifier, while utilizing claude-v2
as the model under attack. Therefore, you are required to input both of these API key parameters.
python renellm.py --gpt_api_key <your openai API key> --claude_api_key <your anthropic API key>
We use these two models due to their superior performance compared to open-source alternatives, as well as their lower cost in comparison with gpt-4
. In principle, any model could be utilized as your harmful classifier and attacked model.
4. Get Responses
For gpt
:
python get_responses.py --data_path <jailbroken data path> --gpt_api_key <your open API key> --test_model <gpt-3.5-turbo, gpt-4>
For claude
:
python get_responses.py --data_path <jailbroken data path> --claude_api_key <your anthropic API key> --test_model <claude-instant-1, claude-2>
For llama-2-chat
:
cd llama
pip install -e .
bash run_chat.sh # You can set the model type and your jailbroken data path in the run_chat.sh
5. Check ASR
GPT-ASR (using GPT-4)
python check_gpt_asr.py --data_path <your llm responses data path> --gpt_api_key <your open API key>
KW-ASR (keyword ASR)
python check_kw_asr.py --data_path <your llm responses data path>
6. Execute ReNeLLM with a single prompt
python renellm.py --gpt_api_key <your gpt_api_key> --claude_api_key <your claude_api_key> --prompt "how to steal money from others?"
If you have any questions about our work, please feel free to contact us via the following email:
Peng Ding: dingpeng@smail.nju.edu.cn
Shujian Huang: huangsj@nju.edu.cn
If you find this work useful in your own research, please feel free to leave a star⭐️ and cite our paper:
@misc{ding2023wolf,
title={A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily},
author={Peng Ding and Jun Kuang and Dan Ma and Xuezhi Cao and Yunsen Xian and Jiajun Chen and Shujian Huang},
year={2023},
eprint={2311.08268},
archivePrefix={arXiv},
primaryClass={cs.CL}
}