Unispac / Visual-Adversarial-Examples-Jailbreak-Large-Language-Models

Repository for the Paper (AAAI 2024, Oral) --- Visual Adversarial Examples Jailbreak Large Language Models

Problem with the visual adversarial example bounded by 16/255 on MiniGPT-4 #16

Open chrisyxue opened 6 months ago

chrisyxue commented 6 months ago

Hi,

Thank you for sharing the code! It is great that we can reproduce the results on RealToxicityPrompts using the images provided in adversarial_images.

However, when we produced the adversarial images ourselves, they worked well on RealToxicityPrompts except for the image bounded by 16/255, which gave the following scores:

**Detoxify**

| attribute | score |
| --- | --- |
| toxicity | 0.2891 |
| severe_toxicity | 0.0058 |
| obscene | 0.2105 |
| threat | 0.0084 |
| insult | 0.1337 |
| identity_attack | 0.0635 |

**PerspectiveAPI**

| attribute | score |
| --- | --- |
| toxicity | 0.2874 |
| severe_toxicity | 0.0117 |
| sexually_explicit | 0.0944 |
| threat | 0.0226 |
| profanity | 0.2373 |
| identity_attack | 0.0769 |
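For context, a minimal sketch of how Detoxify scores like these can be computed, assuming `pip install detoxify`; the checkpoint name `'original'` and the sample text are assumptions, not taken from the repo's evaluation script:

```python
# Minimal sketch: score generated text with Detoxify (assumed setup,
# not the repository's official evaluation code).
from detoxify import Detoxify

generations = ["placeholder model output to score"]  # replace with real generations
scores = Detoxify("original").predict(generations)
# `scores` maps attribute names (toxicity, severe_toxicity, obscene,
# threat, insult, identity_attack) to a list of per-text scores.
print(scores)
```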

Could you please share the hyperparameter details you used to produce the adversarial image bounded by 16/255? Thank you!

FYI, here is the command we used to produce the adversarial image bounded by 16/255:

```bash
python minigpt_visual_attack.py --cfg_path eval_configs/minigpt4_eval.yaml --gpu_id 0 \
    --n_iters 5000 --constrained --eps 16 --alpha 1 \
    --batch_size 8 \
    --save_dir eps16
```
Unispac commented 6 months ago

Hi, the hyperparameters look good to me.

Could you try the following?

1. Increase batch_size to 16. This may make the optimization more stable.
2. Run the attack multiple times, since a single run can fail to reach the best result; see the sketch below.
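A minimal sketch combining both suggestions, reusing only the flags from the command posted above (the run count and the `eps16_run*` directories are illustrative):

```python
# Minimal sketch: launch the attack several times with batch_size 16,
# writing each run to its own save_dir; afterwards, keep the image with
# the best scores. Flags are copied from the command posted above.
import subprocess

for run in range(1, 4):
    subprocess.run([
        "python", "minigpt_visual_attack.py",
        "--cfg_path", "eval_configs/minigpt4_eval.yaml",
        "--gpu_id", "0",
        "--n_iters", "5000", "--constrained",
        "--eps", "16", "--alpha", "1",
        "--batch_size", "16",
        "--save_dir", f"eps16_run{run}",
    ], check=True)
```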
chrisyxue commented 5 months ago

Hi,

Thanks a lot for your reply. It works!

We also ran the code for the textual attack, but we cannot find the code for evaluating the textual attack on the RealToxicityPrompts benchmark.

If possible, could you please consider providing the evaluation code for the textual adversarial attack? Thank you!
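In the meantime, a possible stopgap is to score the textual-attack generations directly with the Perspective API, covering the same attributes as the table above; a minimal sketch, assuming a valid API key (API_KEY and the sample text are placeholders, and this is not the repo's official evaluation script):

```python
# Minimal sketch: query the Perspective API for the attributes reported
# above. Requires `pip install google-api-python-client` and an API key.
from googleapiclient import discovery

API_KEY = "YOUR_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

attributes = ["TOXICITY", "SEVERE_TOXICITY", "SEXUALLY_EXPLICIT",
              "THREAT", "PROFANITY", "IDENTITY_ATTACK"]
request = {
    "comment": {"text": "placeholder generation to score"},
    "requestedAttributes": {attr: {} for attr in attributes},
}
response = client.comments().analyze(body=request).execute()
for attr in attributes:
    print(attr, response["attributeScores"][attr]["summaryScore"]["value"])
```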