Unispac / Visual-Adversarial-Examples-Jailbreak-Large-Language-Models

Repository for the Paper (AAAI 2024, Oral) --- Visual Adversarial Examples Jailbreak Large Language Models
132 stars 10 forks source link

Test on the RealToxicityPrompts Dataset #15

Open roywang021 opened 3 months ago

roywang021 commented 3 months ago

We fed clean.jpeg (panda) into the llava model to calculate the toxic scores and found that there were also 50~60%, do you have the same results

Unispac commented 3 months ago

Yes, if you directly use the pre-trained model from the official repo, the harmfulness score will be high. This is because the visual-instruction tuning will cause the original safety of llama2 to regress. See relevant papers that discuss this: [1] Qi, Xiangyu, et al. "Fine-tuning aligned language models compromises safety, even when users do not intend to!." arXiv preprint arXiv:2310.03693 (2023). [2] Zong, Yongshuo, et al. "Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models." arXiv preprint arXiv:2402.02207 (2024).

So, during the experiment, I used the official repo of llava to fine-tune on my own by freezing the llama-2 LLM backbone. This can make the resulting model safer.