Set-level Guidance Attack

The official repository for Set-level Guidance Attack (SGA).
ICCV 2023 Oral Paper: Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (https://arxiv.org/abs/2307.14061)

Please feel free to contact wangzq_2021@outlook.com if you have any questions.

Brief Introduction

Vision-language pre-training (VLP) models have shown vulnerability to adversarial attacks. However, existing works mainly focus on the adversarial robustness of VLP models in the white-box setting. In this work, we investigate the robustness of VLP models in the black-box setting from the perspective of adversarial transferability. We propose Set-level Guidance Attack (SGA), which generates highly transferable adversarial examples aimed at VLP models.

Quick Start

1. Install dependencies

Install the packages listed in requirements.txt, for example with pip install -r requirements.txt.

2. Prepare datasets and models

Download the Flickr30k and MSCOCO datasets (the annotations are provided in ./data_annotation/). Set the root path of the dataset in the image_root field of ./configs/Retrieval_flickr.yaml.
The checkpoints of the fine-tuned VLP models are available from ALBEF, TCL, and CLIP.
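
Before launching an evaluation, it can help to confirm that the config file and checkpoints are where the commands in this README expect them. The following is a minimal sketch, not part of the repository: find_missing is a hypothetical helper, and the paths are simply the ones used in the example commands, so adjust them to your own layout.

```python
from pathlib import Path

# Paths copied from the example commands in this README (assumed layout).
EXPECTED_FILES = [
    "./configs/Retrieval_flickr.yaml",
    "./checkpoint/albef_retrieval_flickr.pth",
    "./checkpoint/tcl_retrieval_flickr.pth",
]


def find_missing(paths, base="."):
    """Return the entries of `paths` that do not exist under `base`."""
    base = Path(base)
    return [p for p in paths if not (base / p).exists()]


if __name__ == "__main__":
    # Report anything that still needs to be downloaded or moved.
    for path in find_missing(EXPECTED_FILES):
        print(f"missing: {path}")
```

Run it from the repository root; it prints one line per missing file and nothing when everything is in place.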

3. Attack evaluation

From ALBEF to TCL on the Flickr30k dataset:

python eval_albef2tcl_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ALBEF  --source_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--target_model TCL --target_ckpt ./checkpoint/tcl_retrieval_flickr.pth \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5

From ALBEF to CLIP_ViT on the Flickr30k dataset:

python eval_albef2clip-vit_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ALBEF  --source_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--target_model ViT-B/16 --original_rank_index ./std_eval_idx/flickr30k/ \
--scales 0.5,0.75,1.25,1.5

From CLIP_ViT to ALBEF on the Flickr30k dataset:

python eval_clip-vit2albef_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ViT-B/16  --target_model ALBEF \
--target_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5

From CLIP_ViT to CLIP_CNN on the Flickr30k dataset:

python eval_clip-vit2clip-cnn_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ViT-B/16  --target_model RN101 \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5
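
Each command above passes --scales as a comma-separated string such as 0.5,0.75,1.25,1.5. How the evaluation scripts consume this value is not shown in this README; the sketch below only illustrates one way such a string could be turned into numeric scale factors. parse_scales is a hypothetical helper, not a function from the repository.

```python
def parse_scales(arg):
    """Turn a string such as "0.5,0.75,1.25,1.5" into a list of floats."""
    # Skip empty tokens so a trailing comma does not raise an error.
    scales = [float(tok) for tok in arg.split(",") if tok.strip()]
    if any(s <= 0 for s in scales):
        raise ValueError("scale factors must be positive")
    return scales


if __name__ == "__main__":
    print(parse_scales("0.5,0.75,1.25,1.5"))
```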

Transferability Evaluation

Existing adversarial attacks for VLP models cannot generate highly transferable adversarial examples.
(Note: Sep-Attack denotes the simple combination of two unimodal adversarial attacks, PGD and BERT-Attack.)

Attack	ALBEF* TR R@1	ALBEF* IR R@1	TCL TR R@1	TCL IR R@1	CLIP_ViT TR R@1	CLIP_ViT IR R@1	CLIP_CNN TR R@1	CLIP_CNN IR R@1
Sep-Attack	65.69	73.95	17.60	32.95	31.17	45.23	32.82	45.49
Sep-Attack + MI	58.81	65.25	16.02	28.19	23.07	36.98	26.56	39.31
Sep-Attack + DIM	56.41	64.24	16.75	29.55	24.17	37.60	25.54	38.77
Sep-Attack + PNA_PO	40.56	53.95	18.44	30.98	22.33	37.02	26.95	38.63
Co-Attack	77.16	83.86	15.21	29.49	23.60	36.48	25.12	38.89
Co-Attack + MI	64.86	75.26	25.40	38.69	24.91	37.11	26.31	38.97
Co-Attack + DIM	47.03	62.28	22.23	35.45	25.64	38.50	26.95	40.58
SGA	97.24	97.28	45.42	55.25	33.38	44.16	34.93	46.57

Performance of SGA on four VLP models (ALBEF, TCL, CLIP_ViT, and CLIP_CNN) on the Flickr30k dataset.

Source	Attack	ALBEF TR R@1	ALBEF IR R@1	TCL TR R@1	TCL IR R@1	CLIP_ViT TR R@1	CLIP_ViT IR R@1	CLIP_CNN TR R@1	CLIP_CNN IR R@1
ALBEF	PGD	52.45*	58.65*	3.06	6.79	8.96	13.21	10.34	14.65
	BERT-Attack	11.57*	27.46*	12.64	28.07	29.33	43.17	32.69	46.11
	Sep-Attack	65.69*	73.95*	17.60	32.95	31.17	45.23	32.82	45.49
	Co-Attack	77.16*	83.86*	15.21	29.49	23.60	36.48	25.12	38.89
	SGA	97.24±0.22*	97.28±0.15*	45.42±0.60	55.25±0.06	33.38±0.35	44.16±0.25	34.93±0.99	46.57±0.13
TCL	PGD	6.15	10.78	77.87*	79.48*	7.48	13.72	10.34	15.33
	BERT-Attack	11.89	26.82	14.54*	29.17*	29.69	44.49	33.46	46.07
	Sep-Attack	20.13	36.48	84.72*	86.07*	31.29	44.65	33.33	45.80
	Co-Attack	23.15	40.04	77.94*	85.59*	27.85	41.19	30.74	44.11
	SGA	48.91±0.74	60.34±0.10	98.37±0.08*	98.81±0.07*	33.87±0.18	44.88±0.54	37.74±0.27	48.30±0.34
CLIP_ViT	PGD	2.50	4.93	4.85	8.17	70.92*	78.61*	5.36	8.44
	BERT-Attack	9.59	22.64	11.80	25.07	28.34*	39.08*	30.40	37.43
	Sep-Attack	9.59	23.25	11.38	25.60	79.75*	86.79*	30.78	39.76
	Co-Attack	10.57	24.33	11.94	26.69	93.25*	95.86*	32.52	41.82
	SGA	13.40±0.07	27.22±0.06	16.23±0.45	30.76±0.07	99.08±0.08*	98.94±0.00*	38.76±0.27	47.79±0.58
CLIP_CNN	PGD	2.09	4.82	4.00	7.81	1.10	6.60	86.46*	92.25*
	BERT-Attack	8.86	23.27	12.33	25.48	27.12	37.44	30.40*	40.10*
	Sep-Attack	8.55	23.41	12.64	26.12	28.34	39.43	91.44*	95.44*
	Co-Attack	8.79	23.74	13.10	26.07	28.79	40.03	94.76*	96.89*
	SGA	11.42±0.07	24.80±0.28	14.91±0.08	28.82±0.11	31.24±0.42	42.12±0.11	99.24±0.18*	99.49±0.05*
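
Both tables are indexed by R@1 for text retrieval (TR) and image retrieval (IR). For readers unfamiliar with the notation, the sketch below shows how recall at rank 1 is generally computed from a query-by-candidate similarity matrix. It is illustrative only: recall_at_1 and its data layout are assumptions made for this example, it is not the evaluation code in this repository, and it does not define how the table entries are derived.

```python
def recall_at_1(similarity, ground_truth):
    """Fraction of queries whose top-ranked candidate is a correct match.

    similarity[q][c] is the score between query q and candidate c;
    ground_truth[q] is the set of candidate indices that are correct for q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Index of the highest-scoring candidate for this query.
        top = max(range(len(scores)), key=scores.__getitem__)
        if top in ground_truth[q]:
            hits += 1
    return hits / len(similarity)


if __name__ == "__main__":
    # Toy example: three queries, three candidates.
    sim = [
        [0.9, 0.1, 0.2],  # top candidate 0, correct
        [0.3, 0.2, 0.8],  # top candidate 2, incorrect
        [0.1, 0.7, 0.4],  # top candidate 1, correct
    ]
    gt = [{0}, {1}, {1}]
    print(recall_at_1(sim, gt))
```

For TR the queries would be images and the candidates captions; for IR the roles are swapped.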

Visualization

Citation

Kindly include a reference to this paper in your publications if it helps your research:

@misc{lu2023setlevel,
    title={Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models},
    author={Dong Lu and Zhiqiang Wang and Teng Wang and Weili Guan and Hongchang Gao and Feng Zheng},
    year={2023},
    eprint={2307.14061},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
