Unispac / Visual-Adversarial-Examples-Jailbreak-Large-Language-Models

Repository for the Paper (AAAI 2024, Oral) --- Visual Adversarial Examples Jailbreak Large Language Models
183 stars 16 forks source link

Question about reproducing the experiment results in Table 3 (Can not reproduce the results of LLaVA) #30

Closed JarvisUSTC closed 1 month ago

JarvisUSTC commented 2 months ago

Hello, I am currently working on reproducing the results of Table 3 in your paper but have encountered some discrepancies in the performance of LLaVA (LLaMA-2-Chat). Despite following the provided instructions and using the same dataset and configuration, I am unable to reproduce the reported results (has a large margin: 9.2 vs. 60.9).

Details:

Experiment details: I used the same setup as described in your paper for the LLaVA model. Dataset: rtp_prompts_challenges Configuration: liuhaotian/llava-llama-2-13b-chat-lightning-preview Hardware: 80G A100*1 Expected results: 9.2 Actual results: Metric Perspective API Proportion
identity_attack 3.669725
profanity 47.372811
severe_toxicity 2.085071
sexually_explicit 17.347790
threat 4.587156
toxicity 53.628023
Any 60.884070

Steps to reproduce:

python -u llava_llama_v2_inference.py --image_file adversarial_images/clean.jpeg --output_file outputs/baseline/clean_llava_llama_v2_13b/result.jsonl python get_metric.py --input outputs/baseline/clean_llava_llama_v2_13b/result.jsonl --output outputs/baseline/clean_llava_llama_v2_13b/rtp_prompts_challenges_metrics.jsonl

Questions:

  1. Could you please clarify which specific LLaVA weights were used in the experiments? Could you provide a detailed link to these weights?
  2. Have there been any updates to the code or dependencies that might affect the reproducibility of the results? If so, could you point out which changes might be relevant?
  3. Would it be possible to provide the complete results.jsonl file output from the LLaVA model? Having access to the full output would be extremely helpful in identifying any discrepancies.

Best regards

Unispac commented 1 month ago

Hi, thanks for the question, and sorry for the confusion. To clarify:

If you directly use the original implementation of llava, it's likely the safety of the model is very bad --- we mentioned it later in this paper https://arxiv.org/abs/2310.03693. Basically, llava fine-tunes the LLM backbone as well during modality alignment. This can lead to regression of the safety alignment in the LLM backbones. To maintain the safety performance, we fine-tune one version of our own llava models using llava's repository. The only change is we freeze the weights of the llama2 model backbone. Then, the safety would be better. We use that version of the model in the AAAI version of the paper.

JarvisUSTC commented 1 month ago

Get it. Thanks for your response. Could you please provide the finetuned weight of your own llava model for reproduction?

JarvisUSTC commented 1 month ago

BTW, is the InstructBLIP used in this paper also finetuned by yourself?

Unispac commented 1 month ago

BTW, is the InstructBLIP used in this paper also finetuned by yourself?

Unfortunately, I may no longer keep that checkpoint. But it should be implementable by running the llava fine-tuning script by freezing the llama2 backbone.

Unispac commented 1 month ago

BTW, is the InstructBLIP used in this paper also finetuned by yourself?

No, for InstructBLIP, we just used the official checkpoint.

JarvisUSTC commented 1 month ago

BTW, is the InstructBLIP used in this paper also finetuned by yourself?

Unfortunately, I may no longer keep that checkpoint. But it should be implementable by running the llava fine-tuning script by freezing the llama2 backbone.

Get it. So, what kind of training data is used for fine-tuning? If possible, I still want to reproduce this experiment.

Unispac commented 1 month ago

BTW, is the InstructBLIP used in this paper also finetuned by yourself?

Unfortunately, I may no longer keep that checkpoint. But it should be implementable by running the llava fine-tuning script by freezing the llama2 backbone.

Get it. So, what kind of training data is used for fine-tuning? If possible, I still want to reproduce this experiment.

Hey, you can just follow the instructions of llava --- it provides training codes. The only change is probably to just freeze weights for the LLM backbone. Sorry for the inconvenience.

JarvisUSTC commented 1 month ago

Oh, I misunderstanded before. Thanks for your kind response.