
When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

This repository contains code and figures for our paper When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?.

Spoiler: We found transfer was hard to obtain and only succeeded very narrowly 😬


Installation | Usage | Training New VLMs | Contributing | Citation | Contact

Installation

  1. (Optional) Update conda:

conda update -n base -c defaults conda -y

  2. Create and activate the conda environment:

conda create -n universal_vlm_jailbreak_env python=3.11 -y && conda activate universal_vlm_jailbreak_env

  3. Update pip:

pip install --upgrade pip

  4. Install PyTorch:

conda install pytorch=2.3.0 torchvision=0.18.0 torchaudio=2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y

  5. Install Lightning:

conda install lightning=2.2.4 -c conda-forge -y

  6. Grab the git submodules:

git submodule update --init --recursive

  7. Install Prismatic and (optionally) DeepSeek-VL (currently broken):

cd submodules/prismatic-vlms && pip install -e . --config-settings editable_mode=compat && cd ../..
cd submodules/DeepSeek-VL && pip install -e . --config-settings editable_mode=compat && cd ../..

Note: `--config-settings editable_mode=compat` is optional; it helps VS Code recognize the editable packages.

  8. Then follow the Prismatic installation instructions:

pip install packaging ninja && pip install flash-attn==2.5.8 --no-build-isolation

  9. Manually install a few additional packages:

conda install joblib pandas matplotlib seaborn black tiktoken sentencepiece anthropic termcolor -y

  10. Log in to W&B by running `wandb login`.

  11. Log in to Hugging Face with `huggingface-cli login`.

  12. (Critical) Install the correct timm version:

pip install timm==0.9.16

Usage

There are 4 main components to this repository:

  1. Optimizing image jailbreaks against sets of VLMs: optimize_jailbreak_attacks_against_vlms.py.
  2. Evaluating the transfer of jailbreaks to new VLMs: evaluate_jailbreak_attacks_against_vlms.py.
  3. Setting or sweeping hyperparameters for both: default hyperparameters are defined in globals.py and can be overridden via W&B sweeps.
  4. Evaluating the results in notebooks.

With the currently set hyperparameters, each VLM requires its own 80GB VRAM GPU (e.g., A100, H100).

The project is built primarily on top of PyTorch, Lightning, W&B, and the Prismatic suite of VLMs.
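An end-to-end run of the first two components might look like the following sketch. The exact CLI flags are not documented here, so this assumes the defaults in globals.py apply when no flags are passed; in practice, runs are configured via W&B sweeps as described above.

```shell
# Hypothetical workflow sketch -- assumes defaults from globals.py are used
# when no flags are passed; real runs require 80GB-VRAM GPUs per VLM.

# 1. Optimize a universal image jailbreak against a set of attacked VLMs.
python optimize_jailbreak_attacks_against_vlms.py

# 2. Evaluate how well the optimized jailbreak transfers to held-out VLMs.
python evaluate_jailbreak_attacks_against_vlms.py
```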

Training New VLMs

Our work builds on the Prismatic suite of VLMs by Siddharth Karamcheti and collaborators. To train additional VLMs based on new language models (e.g., Llama 3), we created a Prismatic fork. The new VLMs are publicly available on Hugging Face and include the following vision backbones:

and the following language models:

Contributing

Contributions are welcome! Please format your code with black.
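For example (black is installed by the conda step in Installation), from the repository root:

```shell
# Reformat all Python files in place using black's default settings.
black .
```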

Citation

To cite this work, please use:

@article{schaeffer2024universaltransferableimagejailbreaks,
  title={When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?},
  author={Schaeffer, Rylan and Valentine, Dan and Bailey, Luke and Chua, James and Eyzaguirre, Crist{\'o}bal and Durante, Zane and Benton, Joe and Miranda, Brando and Sleight, Henry and Hughes, John and others},
  journal={arXiv preprint arXiv:2407.15211},
  year={2024}
}

Contact

Questions? Comments? Interested in collaborating? Open an issue or email rschaef@cs.stanford.edu or any of the other authors.