This is the code repository for DECREE, from the CVPR 2023 paper "Detecting Backdoors in Pre-trained Encoders". DECREE is the first backdoor detection method against self-supervised learning (SSL) backdoor attacks.
If you find our work and code useful in your research, please consider citing:
@InProceedings{Feng_2023_CVPR,
author = {Feng, Shiwei and Tao, Guanhong and Cheng, Siyuan and Shen, Guangyu and Xu, Xiangzhe and Liu, Yingqi and Zhang, Kaiyuan and Ma, Shiqing and Zhang, Xiangyu},
title = {Detecting Backdoors in Pre-Trained Encoders},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {16352-16362}
}
In this work, we focus on 3 types of SSL attacks on vision encoders:
Image-on-Image[2]: conducted on single-modal image encoders; the attack target is an image.
Image-on-Pair[2]: conducted on multi-modal SSL encoders; the attack target is an image.
Text-on-Pair[1]: conducted on multi-modal SSL encoders; the attack target is text.
Here is an illustration of backdoor attacks on SSL encoders:
Our testing environment: Python 3.8.5, torch 1.10.0, torchvision 0.11.1, numpy 1.18.5, pandas 1.1.5, pillow 7.2.0, and tqdm 4.64.0.
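To compare a local environment against the versions listed above, a small check along these lines may help. It is a minimal sketch rather than part of the repository: the names `EXPECTED`, `installed_version`, and `find_mismatches` are hypothetical, and the comparison assumes that a local build suffix such as `+cu113` can be ignored.

```python
from importlib import metadata

# Versions listed in the testing environment above.
EXPECTED = {
    "torch": "1.10.0",
    "torchvision": "0.11.1",
    "numpy": "1.18.5",
    "pandas": "1.1.5",
    "pillow": "7.2.0",
    "tqdm": "4.64.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose version differs to a (wanted, found) pair."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Assumption: ignore local build suffixes such as "+cu113".
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {found}")
```

Other versions may well work; a mismatch reported here only means the environment differs from the one we tested.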
Download the encoders and shadow datasets from here and unzip them in the current path ./DECREE/.
Unzip imagenet.zip at ./DECREE/data/.
The final layout should look like this:
DECREE
├── data
│   ├── cifar10
│   │   ├── test.npz
│   │   └── train.npz
│   └── imagenet
│       ├── ILSVRC2012_devkit_t12.tar.gz
│       ├── ...
│       └── val
├── output
│   ├── cifar10_resnet18
│   │   └── ...
│   └── CLIP_text
│       └── ...
├── README.md
├── main.py
├── ...
└── .gitignore
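The layout above can be sanity-checked before running anything. The sketch below is not part of the repository: `REQUIRED` and `missing_entries` are hypothetical names, and only the entries spelled out in the tree are checked (the `...` placeholders are skipped).

```python
from pathlib import Path

# Entries named explicitly in the layout above; "..." placeholders are skipped.
REQUIRED = [
    "data/cifar10/test.npz",
    "data/cifar10/train.npz",
    "data/imagenet/ILSVRC2012_devkit_t12.tar.gz",
    "data/imagenet/val",
    "output/cifar10_resnet18",
    "output/CLIP_text",
    "README.md",
    "main.py",
]


def missing_entries(root, required=REQUIRED):
    """Return the relative paths from `required` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the directory that contains DECREE/.
    for rel in missing_entries("DECREE"):
        print(f"missing: DECREE/{rel}")
```

An empty result suggests the archives were unzipped in the expected places.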
We leverage the repo of BadEncoder[2].
Since Carlini et al.[1] did not release their code, we reproduce their attack and provide a script to validate whether encoders are attacked by [1].
We follow the description in [1] to reproduce their attack. Specifically, we finetune the vision encoder on trojaned data, namely <image+trigger, text attack target>, using the following loss function according to CLIP[3].
Please refer to function train_text in file attack_encoder.py for more details.
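For reference, the symmetric contrastive objective described in the CLIP paper[3] can be written as below. This is a sketch in our own notation, and the exact form implemented in train_text may differ. Here $u_i$ is the L2-normalized image embedding of the $i$-th (triggered) image, $v_i$ is the L2-normalized text embedding of its paired caption (the attack target text for trojaned samples), $N$ is the batch size, and $\tau$ is the temperature.

```latex
\mathcal{L}
= -\frac{1}{2N} \sum_{i=1}^{N} \left[
    \log \frac{\exp\left(\langle u_i, v_i \rangle / \tau\right)}
              {\sum_{j=1}^{N} \exp\left(\langle u_i, v_j \rangle / \tau\right)}
  + \log \frac{\exp\left(\langle u_i, v_i \rangle / \tau\right)}
              {\sum_{j=1}^{N} \exp\left(\langle u_j, v_i \rangle / \tau\right)}
\right]
```

The first term treats each image as a query over all texts in the batch, and the second treats each text as a query over all images.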
To reproduce the attack, run:
python -u scripts/run_attack_encoder.py
To validate whether encoders are attacked by Carlini et al.[1], run:
python -u validate/script_compute_zscore.py
The z-score results will be written to valid_cliptxt_zscore.txt. In our experiments, encoders with z-score > 2.5 are considered trojaned.
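The thresholding step can be illustrated as follows. The exact statistic computed by validate/script_compute_zscore.py is not described here, so this sketch assumes one plausible definition: how many standard deviations the mean similarity between triggered images and the target text lies above the mean similarity of clean pairs. The function names and inputs are hypothetical.

```python
from statistics import mean, stdev


def zscore(triggered_sims, clean_sims):
    """z-score of the mean triggered similarity against the clean similarities.

    Assumption: both arguments are lists of cosine similarities; the
    repository's script may compute its statistic differently.
    """
    return (mean(triggered_sims) - mean(clean_sims)) / stdev(clean_sims)


def is_trojaned(z, threshold=2.5):
    """Apply the z-score > 2.5 rule stated above."""
    return z > threshold


if __name__ == "__main__":
    clean = [0.1, 0.2, 0.3]   # illustrative values only
    triggered = [0.6, 0.6]    # illustrative values only
    z = zscore(triggered, clean)
    print(f"z-score = {z:.2f}, trojaned = {is_trojaned(z)}")
```

The numbers in the usage example are made up for illustration and are not experimental results.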
To run DECREE:
python run_decree.py
For the detection results, you can find:
(1) the inverted triggers in trigger_inv/,
(2) the optimization process in detect_log/, and
(3) the final L1-norm of the inverted triggers in trigger_norm/.
The $\mathcal{PL}^1$-norm can then be easily computed from the L1-norm.
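As a sketch of that last step: as we understand the paper, the $\mathcal{PL}^1$-norm is the ratio of the inverted trigger's L1-norm to the maximum L1-norm of the encoder's input space. The code below assumes that trigger values lie in [0, 1], so the maximum L1-norm equals the number of input elements. The function names, the input shape, and the default threshold of 0.1 are assumptions for illustration; check them against the paper and the files in trigger_norm/.

```python
from math import prod


def pl1_norm(l1_norm, input_shape):
    """Ratio of the trigger's L1-norm to the maximum L1-norm of the input space.

    Assumption: values lie in [0, 1], so the maximum L1-norm is the
    number of elements in an input of shape `input_shape`.
    """
    max_l1 = prod(input_shape)
    return l1_norm / max_l1


def looks_backdoored(l1_norm, input_shape, threshold=0.1):
    """Flag an encoder whose PL1-norm falls below `threshold` (assumed value)."""
    return pl1_norm(l1_norm, input_shape) < threshold


if __name__ == "__main__":
    # Illustrative L1-norm for a 3x32x32 input; not a real result.
    print(pl1_norm(150.0, (3, 32, 32)))
    print(looks_backdoored(150.0, (3, 32, 32)))
```

A small inverted trigger relative to the input size is what suggests a backdoor, which is why the check uses a "less than" comparison.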
Our work and code are inspired by the following repositories: