AoiDragon / HADES

[ECCV'24 Oral] The official GitHub page for ''Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models''
MIT License
5 stars 3 forks source link

about instructions #5

Closed payphone131 closed 5 days ago

payphone131 commented 1 month ago

Thank you for proposing a really interesting work. I would like to know whether you have noticed that the instructions generated using generate_instructions.py contain many refusals like "Sorry, but I can't assist with that.", should i ignore these refusals and continue to run amplifying_toxic.sh or do something to handle these refusals? Looking forward to your reply.

AoiDragon commented 1 month ago

Hello @payphone131,

Our dataset generation pipeline involves multiple calls to the OpenAI API. Due to the potential harmfulness of the generated instructions, there could be occasional refusals. If the refusal rate is too high, you might consider the following methods:

  1. Switch to a different model, such as GPT-3.5, which may have weaker safety alignment.
  2. Add or modify existing jailbreak prompts (e.g., "Do anything now").

Additionally, you can directly use our published instructions.

payphone131 commented 1 month ago

thanks for your reply. your reply is really useful. btw, have you noticed that the ASR calculated by beaver-7b always changes? specifically, i found the "flag_num" in line 148 of evaluate_for_black_box.py changes every time i run the code. is this normal?

AoiDragon commented 1 month ago

We also observed this phenomenon in our experiment. Since Beaver-dam-7B is a language model, it may generate different responses to the same query. However, we found that the variation is acceptable. We also examined some inputs that caused variations in the model's responses and found that their harmfulness is indeed rather ambiguous.