Unispac / Visual-Adversarial-Examples-Jailbreak-Large-Language-Models

Repository for the Paper (AAAI 2024, Oral) --- Visual Adversarial Examples Jailbreak Large Language Models

Problem of reproducing the jailbreaking results #11

Closed keven980716 closed 6 months ago

keven980716 commented 6 months ago

Hello! Congratulations on the great work! I am really interested in your work and have been trying to follow it. However, when I used the provided images (e.g., prompt_unconstrained.bmp) together with the harmful instructions (manual_harmful_instructions.csv) to prompt MiniGPT-4, the model always refused to answer. Do you have any advice? Thank you!
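For context, the evaluation described above amounts to pairing one fixed adversarial image with every instruction in the CSV and querying the model. A minimal sketch of that loop is below; the file names come from this thread, while `load_instructions`, `build_queries`, and the commented-out `minigpt4_chat` call are hypothetical placeholders, not the repo's actual script.

```python
import csv
import io

# Provided adversarial image referenced in this thread (path is illustrative).
ADV_IMAGE = "adversarial_images/prompt_unconstrained.bmp"

def load_instructions(csv_text):
    """Read one instruction per row from CSV text shaped like
    manual_harmful_instructions.csv (first column = instruction)."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row[0] for row in reader if row]

def build_queries(image_path, instructions):
    """Pair the fixed adversarial image with every text instruction."""
    return [{"image": image_path, "text": t} for t in instructions]

if __name__ == "__main__":
    sample = "instruction one\ninstruction two\n"
    for q in build_queries(ADV_IMAGE, load_instructions(sample)):
        # response = minigpt4_chat(q["image"], q["text"])  # hypothetical model call
        print(q["text"])
```

The key point for debugging refusals is that only the image varies between the clean and adversarial settings; if the model refuses in both, the problem is likely the model version or the chat template rather than the instructions.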

Unispac commented 6 months ago

Hi, could you provide more details on how you tested the provided samples with the MiniGPT-4 model? Was it via some online demo, or did you run our code?

keven980716 commented 6 months ago

Thanks for the reply! I tested with the online demo (https://huggingface.co/spaces/Vision-CAIR/minigpt4). However, I also ran the attack in a local environment with LLaVA-1.5, and those results were fine.

Unispac commented 6 months ago

Thanks. I checked the demo you tested. It seems that they are hosting MiniGPT-4 v2, which is different from the initial model I tested several months ago. So the adversarial examples generated on the older model are likely to work worse on the new model due to the transferability gap.