NJUNLP / ReNeLLM

The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily".
MIT License
72 stars 11 forks source link

ReNeLLM

The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily".

NAACL 2024 Jailbreak Attacks Adversarial Attacks Large Language Models

Table of Contents

Updates

Overview

This repository shares the code of our latest work on LLMs jailbreaking/defending. In this work:

Getting Started

1. Clone this repository

git clone https://github.com/NJUNLP/ReNeLLM.git

2. Build Environment

cd ReNeLLM
conda create -n ReNeLLM python=3.9
conda activate ReNeLLM
pip install -r requirements.txt

3. Run ReNeLLM

ReNeLLM employs gpt-3.5-turbo for prompt rewriting and harmful classifier, while utilizing claude-v2 as the model under attack. Therefore, you are required to input both of these API key parameters.

python renellm.py --gpt_api_key <your openai API key> --claude_api_key <your anthropic API key>

We use these two models due to their superior performance compared to open-source alternatives, as well as their lower cost in comparison with gpt-4. In principle, any model could be utilized as your harmful classifier and attacked model.

4. Get Responses

For gpt:

python get_responses.py --data_path <jailbroken data path> --gpt_api_key <your open API key> --test_model <gpt-3.5-turbo, gpt-4>

For claude:

python get_responses.py --data_path <jailbroken data path> --claude_api_key <your anthropic API key> --test_model <claude-instant-1, claude-2>

For llama-2-chat:

5. Check ASR

GPT-ASR (using GPT-4)

python check_gpt_asr.py --data_path <your llm responses data path> --gpt_api_key <your open API key>

KW-ASR (keyword ASR)

python check_kw_asr.py --data_path <your llm responses data path>

6. Execute ReNeLLM with a single prompt

python renellm.py --gpt_api_key <your gpt_api_key> --claude_api_key <your claude_api_key> --prompt "how to steal money from others?"

Contact

If you have any questions about our work, please feel free to contact us via the following email:

Peng Ding: dingpeng@smail.nju.edu.cn

Shujian Huang: huangsj@nju.edu.cn

Citation

If you find this work useful in your own research, please feel free to leave a star⭐️ and cite our paper:

@misc{ding2023wolf,
      title={A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily}, 
      author={Peng Ding and Jun Kuang and Dan Ma and Xuezhi Cao and Yunsen Xian and Jiajun Chen and Shujian Huang},
      year={2023},
      eprint={2311.08268},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}