Unispac / Visual-Adversarial-Examples-Jailbreak-Large-Language-Models

Repository for the Paper (AAAI 2024, Oral) --- Visual Adversarial Examples Jailbreak Large Language Models
183 stars 16 forks source link

Problem of re-producing the jailbreaking results #11

Closed keven980716 closed 10 months ago

keven980716 commented 10 months ago

Hello~Congratulations on the great work! I am really interested in your work, and try to follow it. However, when I used the provided images (e.g., prompt_unconstrained.bmp) and the harmful instructions (manual_harmful_instructions.csv) to prompt MiniGPT-4, I observed that MiniGPT-4 always refused to answer. Do you have any advice? Thank you!

Unispac commented 10 months ago

Hi, could you provide more details on how you test the provided samples with the MiniGPT-4 model? Is it via some online demo or do you run our code?

keven980716 commented 10 months ago

Thanks for the reply! I tested with the online demo (https://huggingface.co/spaces/Vision-CAIR/minigpt4). However, I also ran the attack in the local environment with LLaVa-1.5, these results were fine.

Unispac commented 10 months ago

Thanks. I checked with the demo you tested. It seems that they are hosting MiniGPT-4 v2, which is different from the initial model I tested several months ago. So, the adv examples generated on the older model are likely to work worse on the new model due to the transferability gap.