Test on the RealToxicityPrompts Dataset

Yes, if you directly use the pre-trained model from the official repo, the harmfulness score will be high. This is because the visual-instruction tuning will cause the original safety of llama2 to regress. See relevant papers that discuss this: [1] Qi, Xiangyu, et al. "Fine-tuning aligned language models compromises safety, even when users do not intend to!." arXiv preprint arXiv:2310.03693 (2023). [2] Zong, Yongshuo, et al. "Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models." arXiv preprint arXiv:2402.02207 (2024).

So, during the experiment, I used the official repo of llava to fine-tune on my own by freezing the llama-2 LLM backbone. This can make the resulting model safer.

Unispac / Visual-Adversarial-Examples-Jailbreak-Large-Language-Models

Test on the RealToxicityPrompts Dataset #15