
When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

This repository contains code and figures for our paper When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?.

Spoiler: We found transfer was hard to obtain and only succeeded very narrowly 😬


Installation | Usage | Training New VLMs | Contributing | Citation | Contact

Installation

  1. (Optional) Update conda:

conda update -n base -c defaults conda -y

  2. Create and activate the conda environment:

conda create -n universal_vlm_jailbreak_env python=3.11 -y && conda activate universal_vlm_jailbreak_env

  3. Update pip:

pip install --upgrade pip

  4. Install PyTorch:

conda install pytorch=2.3.0 torchvision=0.18.0 torchaudio=2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y

  5. Install Lightning:

conda install lightning=2.2.4 -c conda-forge -y

  6. Grab the git submodules:

git submodule update --init --recursive

  7. Install Prismatic and (optionally) DeepSeek-VL (currently broken):

cd submodules/prismatic-vlms && pip install -e . --config-settings editable_mode=compat && cd ../..
cd submodules/DeepSeek-VL && pip install -e . --config-settings editable_mode=compat && cd ../..

Note: `--config-settings editable_mode=compat` is optional; it helps VS Code recognize the editable packages.

  8. Then follow the Prismatic installation instructions:

pip install packaging ninja && pip install flash-attn==2.5.8 --no-build-isolation

  9. Manually install a few additional packages:

conda install joblib pandas matplotlib seaborn black tiktoken sentencepiece anthropic termcolor -y

  10. Log in to W&B by running `wandb login`.

  11. Log in to Hugging Face with `huggingface-cli login`.

  12. (Critical) Install the correct timm version:

pip install timm==0.9.16

Usage

There are 4 main components to this repository:

  1. Optimizing image jailbreaks against sets of VLMs: optimize_jailbreak_attacks_against_vlms.py.
  2. Evaluating the transfer of jailbreaks to new VLMs: evaluate_jailbreak_attacks_against_vlms.py.
  3. Setting or sweeping hyperparameters for both: default hyperparameters are defined in globals.py and can be overridden via W&B sweeps.
  4. Evaluating the results in notebooks.

With the currently set hyperparameters, each VLM requires its own 80GB VRAM GPU (e.g., A100, H100).

The project is built primarily on top of PyTorch, Lightning, W&B, and the Prismatic suite of VLMs.
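An end-to-end run of the first two components might look like the following sketch. The exact CLI flags are not documented here, so this assumes the defaults in globals.py apply when no flags are passed; in practice, runs are configured via W&B sweeps as described above.

```shell
# Hypothetical workflow sketch -- assumes defaults from globals.py are used
# when no flags are passed; real runs require 80GB-VRAM GPUs per VLM.

# 1. Optimize a universal image jailbreak against a set of attacked VLMs.
python optimize_jailbreak_attacks_against_vlms.py

# 2. Evaluate how well the optimized jailbreak transfers to held-out VLMs.
python evaluate_jailbreak_attacks_against_vlms.py
```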

Training New VLMs

Our work builds on the Prismatic suite of VLMs by Siddharth Karamcheti and collaborators. To train additional VLMs based on new language models (e.g., Llama 3), we created a Prismatic fork. The new VLMs are publicly available on Hugging Face and include the following vision backbones:

and the following language models:

Contributing

Contributions are welcome! Please format your code with black.
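For example (black is installed by the conda step in Installation), from the repository root:

```shell
# Reformat all Python files in place using black's default settings.
black .
```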

Citation

To cite this work, please use:

@article{schaeffer2024universaltransferableimagejailbreaks,
  title={When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?},
  author={Schaeffer, Rylan and Valentine, Dan and Bailey, Luke and Chua, James and Eyzaguirre, Crist{\'o}bal and Durante, Zane and Benton, Joe and Miranda, Brando and Sleight, Henry and Hughes, John and others},
  journal={arXiv preprint arXiv:2407.15211},
  year={2024}
}

Contact

Questions? Comments? Interested in collaborating? Open an issue or email rschaef@cs.stanford.edu or any of the other authors.