
RobustVLM

[Paper] [HuggingFace] [BibTeX]

This repository contains code for the paper "Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models" (Oral@ICML 2024).



We fine-tune CLIP in an unsupervised manner to improve its robustness to visual adversarial attacks. We show that replacing the vision encoder of large vision-language models with our fine-tuned CLIP models yields state-of-the-art adversarial robustness on a variety of vision-language tasks, without requiring any training of the large VLMs themselves. Moreover, we improve the robustness of CLIP to adversarial attacks in zero-shot classification settings, while maintaining higher clean accuracy than previous adversarial fine-tuning methods.

Table of Contents

Installation
Models
Summary of results
Training
Evaluation
Acknowledgements
Citation

Installation

The code is tested with Python 3.11. To install the required packages, run:

pip install -r requirements.txt

Models

We provide the following adversarially fine-tuned ViT-L/14 CLIP models (approx. 1.1 GB each):

| Model  | Link | Proposed by       | Notes |
|--------|------|-------------------|-------|
| TeCoA2 | Link | Mao et al. (2023) | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$ |
| TeCoA4 | Link | Mao et al. (2023) | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$ |
| FARE2  | Link | ours              | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$ |
| FARE4  | Link | ours              | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$ |

The models are also available on HuggingFace.

All models are adversarially fine-tuned for two epochs on ImageNet. TeCoA is trained in a supervised fashion, utilizing ImageNet class labels. FARE, in contrast, does not require any labels for training.
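
To make the difference concrete, below is a minimal PyTorch sketch of the unsupervised FARE objective as described above: the fine-tuned vision encoder should map an $\ell_\infty$-perturbed image to the same embedding that the original, frozen encoder produces for the clean image, so no labels are needed (TeCoA instead uses a supervised classification loss on ImageNet labels). The function names (fare_loss, pgd_inner) and the plain PGD inner loop are illustrative simplifications, not the repository's actual training code.

import torch
import torch.nn.functional as F

def fare_loss(visual_ft, visual_frozen, x_adv, x_clean):
    # Embedding of the perturbed image under the fine-tuned encoder should stay
    # close to the clean-image embedding of the original (frozen) encoder.
    with torch.no_grad():
        target = visual_frozen(x_clean)
    return F.mse_loss(visual_ft(x_adv), target, reduction='sum') / x_clean.shape[0]

def pgd_inner(visual_ft, visual_frozen, x, eps=4/255, alpha=1/255, steps=10):
    # Find an l_inf-bounded perturbation that maximizes the embedding distance;
    # visual_ft is then trained to minimize fare_loss on these adversarial examples.
    # (Pixel-range clamping is omitted for brevity.)
    delta = torch.zeros_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = fare_loss(visual_ft, visual_frozen, x + delta, x)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    return (x + delta).detach()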

Loading pretrained models

The provided checkpoints correspond to the vision encoder of CLIP. To load the full CLIP model (including the text encoder), you can use the following code:

import torch
from open_clip import create_model_and_transforms

model, _, image_processor = create_model_and_transforms(
    'ViT-L-14', pretrained='openai', device='cpu'
)
checkpoint = torch.load('/path/to/fare_eps_2.pt', map_location=torch.device('cpu'))
model.visual.load_state_dict(checkpoint)
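
Continuing from the snippet above, the resulting model behaves like any open_clip CLIP model, e.g. for zero-shot prediction. A minimal sketch (the image path and class prompts are placeholders):

import open_clip
from PIL import Image

tokenizer = open_clip.get_tokenizer('ViT-L-14')
image = image_processor(Image.open('/path/to/image.jpg')).unsqueeze(0)
text = tokenizer(['a photo of a dog', 'a photo of a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)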

Alternatively, load directly from HuggingFace:

import open_clip
model, _, image_processor = open_clip.create_model_and_transforms('hf-hub:chs20/fare2-clip')

Summary of results

We show a summary of results on zero-shot classification and vision-language tasks for the original and fine-tuned ViT-L/14 CLIP models. "CLIP-only" means that the respective CLIP model is evaluated standalone on zero-shot classification, whereas for the OpenFlamingo and LLaVA evaluations the respective CLIP model serves as the vision encoder of these large vision-language models. Results for individual zero-shot datasets and further VLM tasks are provided in the paper.

Training

Evaluation

Make sure the files in the bash directory are executable: chmod +x bash/*

CLIP ImageNet

python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 2
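
Here --eps 2 denotes an $\ell_\infty$ radius of 2/255 in [0, 1] pixel space. As a rough illustration of what is being measured (the evaluation script relies on stronger RobustBench-style attacks), a plain PGD attack on CLIP zero-shot classification could look like the sketch below; encode_image is assumed to include the CLIP input normalization, and text_features are the normalized class-prompt embeddings:

import torch
import torch.nn.functional as F

def pgd_attack(encode_image, text_features, images, labels, eps=2/255, alpha=0.5/255, steps=10):
    # images are assumed to be in [0, 1]; maximize the cross-entropy of the
    # zero-shot classifier within an l_inf ball of radius eps.
    delta = torch.zeros_like(images).uniform_(-eps, eps)
    delta = (images + delta).clamp(0, 1) - images
    for _ in range(steps):
        delta.requires_grad_(True)
        img_ft = encode_image(images + delta)
        img_ft = img_ft / img_ft.norm(dim=-1, keepdim=True)
        logits = 100.0 * img_ft @ text_features.T  # cosine similarities as logits
        loss = F.cross_entropy(logits, labels)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = ((images + delta).clamp(0, 1) - images).detach()
    return (images + delta).detach()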

CLIP Zero-Shot

Set the models to be evaluated in CLIP_benchmark/benchmark/models.txt and the datasets in CLIP_benchmark/benchmark/datasets.txt (the datasets are downloaded from HuggingFace). Then run

cd CLIP_benchmark
./bash/run_benchmark_adv.sh

VLM Captioning and VQA

LLaVA

In /bash/llava_eval.sh, supply the paths to the datasets. The required annotation files can be obtained from this HuggingFace repository. Set --vision_encoder_pretrained to openai or supply the path to a fine-tuned CLIP model checkpoint. Then run

./bash/llava_eval.sh

The LLaVA model will be automatically downloaded from HuggingFace.

OpenFlamingo

Download the OpenFlamingo 9B model, supply the paths in /bash/of_eval_9B.sh, and run

./bash/of_eval_9B.sh

Some non-standard annotation files are supplied here and here.

VLM Stealthy Targeted Attacks

For targeted attacks on COCO, run

./bash/llava_eval_targeted.sh

For targeted attacks on self-selected images, set images and target captions in vlm_eval/run_evaluation_qualitative.py and run

python -m vlm_eval.run_evaluation_qualitative --precision float32 --attack apgd --eps 2 --steps 10000 --vlm_model_name llava --vision_encoder_pretrained openai --verbose

With 10,000 iterations it takes about 2 hours per image on an A100 GPU.
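
Conceptually, these stealthy targeted attacks optimize an $\ell_\infty$-bounded perturbation so that the VLM produces a chosen target caption. A heavily simplified sketch is below; vlm_nll is a hypothetical stand-in for the VLM's negative log-likelihood of the target caption given the image, and the repository uses APGD rather than this plain descent loop:

import torch

def targeted_attack(vlm_nll, image, target_caption, eps=2/255, alpha=0.25/255, steps=10000):
    # Minimize the negative log-likelihood of the target caption w.r.t. the image,
    # keeping the perturbation within an l_inf ball of radius eps and a valid pixel range.
    delta = torch.zeros_like(image)
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = vlm_nll(image + delta, target_caption)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta - alpha * grad.sign()).clamp(-eps, eps)
        delta = ((image + delta).clamp(0, 1) - image).detach()
    return (image + delta).detach()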

POPE

./bash/eval_pope.sh openai  # clean model evaluation
./bash/eval_pope.sh  # robust model evaluation - set path_to_ckpt in the bash file

SQA

./bash/eval_scienceqa.sh openai  # clean model evaluation
./bash/eval_scienceqa.sh  # robust model evaluation - set path_to_ckpt in the bash file

Acknowledgements

This repository gratefully forks from

Citation

If you find this repository useful, please consider citing our paper:

@inproceedings{schlarmann2024robustclip,
    title={Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models},
    author={Christian Schlarmann and Naman Deep Singh and Francesco Croce and Matthias Hein},
    year={2024},
    booktitle={ICML}
}