SneakyPrompt: Jailbreaking Text-to-image Generative Models

This is the official implementation of the paper: SneakyPrompt: Jailbreaking Text-to-image Generative Models

Our work has been reported by MIT Technology Review and JHU Hub. Please check them out if interested.

Environment setup

The experiments were run on Ubuntu 18.04 with one NVIDIA 3090 GPU (24 GB). Please install the dependencies via:

conda env create -f environment.yml

To test only SneakyPrompt (without the baselines) with minimal requirements, run the following commands instead of the one above:

conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia

pip install transformers==4.27.4 accelerate==0.18.0 sentencepiece==0.1.97 einops==0.7.0 triton==2.1.0 diffusers==0.29.2 numpy==1.26.0 xformers==0.0.22.post7 tensorflow==2.8.3 pandas pillow scikit-learn protobuf torchmetrics matplotlib

pip install git+https://github.com/openai/CLIP.git
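After installation, it can help to confirm that the pinned versions actually ended up in the environment. The following is a minimal sketch using only the Python standard library. The helper name check_versions is hypothetical and not part of this repo, and the pins shown are a subset copied from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above (a subset; extend as needed).
EXPECTED = {
    "transformers": "4.27.4",
    "accelerate": "0.18.0",
    "diffusers": "0.29.2",
    "numpy": "1.26.0",
}


def check_versions(expected):
    """Return {package: (wanted, found)} for every pin that does not match.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    problems = check_versions(EXPECTED)
    for name, (wanted, found) in problems.items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
    if not problems:
        print("All checked packages match the pinned versions.")
```

An empty result means every checked package matches its pin; any entry in the result names a package to reinstall.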

Dataset

The nsfw_200.txt file is available on request: please email the author for a password. To ensure responsible use, briefly describe your research purpose and your affiliated institution in the email; otherwise, the password cannot be provided. The email address can be found in the paper.

Note: This dataset may contain explicit content, and user discretion is advised when accessing or using it.

Do not use this dataset for any non-research purpose.
Do not distribute or publish any part of the data.
Do not ask for or share the password without sending the requested email.

Search for adversarial prompts:

python main.py --target='sd' --method='rl' --reward_mode='clip' --threshold=0.26 --len_subword=10 --q_limit=60 --safety='ti_sd'

You can change the parameters following the choices listed in main.py. The adversarial prompts and statistical results (xx.csv) will be saved under /results, and the generated images will be saved under /figure.
For example, append --en=True to search over meaningful English words instead of meaningless words.
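When trying several parameter settings, building the command line programmatically can be less error-prone than editing it by hand. The sketch below assumes it is run from the repository root. The helper names build_command and run_search are hypothetical, and only flags that appear in the example command above (plus --en) are used; the valid choices for each flag remain those defined in main.py.

```python
import subprocess
import sys

# Defaults copied from the example command above.
DEFAULTS = {
    "target": "sd",
    "method": "rl",
    "reward_mode": "clip",
    "threshold": 0.26,
    "len_subword": 10,
    "q_limit": 60,
    "safety": "ti_sd",
}


def build_command(**overrides):
    """Return the argv list for main.py, with DEFAULTS updated by overrides."""
    params = {**DEFAULTS, **overrides}
    return [sys.executable, "main.py"] + [f"--{k}={v}" for k, v in params.items()]


def run_search(**overrides):
    """Launch main.py and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(**overrides), check=True)


if __name__ == "__main__":
    # Print the command for an English-word search without launching it.
    print(" ".join(build_command(en=True)))
```

Calling run_search(en=True) would then start the same search that the printed command describes.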

Evaluate the result:

python evaluate.py --path='PATH OF xx.csv'
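To avoid copying the CSV path by hand, the most recent results file can be located automatically. This is a sketch under two assumptions: the CSV files sit directly inside a results directory relative to the repository root, and the newest file by modification time is the one to evaluate. The helper names latest_csv and evaluate_command are hypothetical.

```python
import sys
from pathlib import Path


def latest_csv(results_dir="results"):
    """Return the most recently modified .csv in results_dir, or None if there is none."""
    files = sorted(Path(results_dir).glob("*.csv"), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None


def evaluate_command(csv_path):
    """Return the argv list that runs evaluate.py on the given CSV file."""
    return [sys.executable, "evaluate.py", f"--path={csv_path}"]


if __name__ == "__main__":
    newest = latest_csv()
    if newest is None:
        print("No CSV files found under results/.")
    else:
        print(" ".join(evaluate_command(newest)))
```

The printed command can be pasted into a shell, or the list can be passed to subprocess.run to launch the evaluation directly.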

Citation:

Please cite our paper if you find this repo useful.

@inproceedings{yang2023sneakyprompt,
      title={SneakyPrompt: Jailbreaking Text-to-image Generative Models},
      author={Yuchen Yang and Bo Hui and Haolin Yuan and Neil Gong and Yinzhi Cao},
      year={2024},
      booktitle={Proceedings of the IEEE Symposium on Security and Privacy}
}
