DAMO-NLP-SG / VCD

[CVPR 2024 Highlight] Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Apache License 2.0

Inquiry Regarding Discrepancy in Results Using the Same Model and Methodology as Presented in Your Paper #16

Open obananas opened 1 month ago

obananas commented 1 month ago

Could you please confirm if there is an issue with the data on the Pope and MME datasets (for LLaVA-1.5-7b) in your study? I have been unable to replicate the results presented in the tables using the same model and methodology you described. Additionally, the results from your paper are not only inconsistent with my findings but also seem to be outperformed by a model with no modifications at all. Is this approach intended to be a method that negatively impacts performance?

LengSicong commented 1 month ago

Could you provide more experimental details?

Moreover, our method is a decoding strategy that can be applied to any model to mitigate its hallucinations.
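For readers following along, the core idea can be sketched roughly as below: contrast the next-token logits conditioned on the original image against logits conditioned on a distorted (e.g. noised) copy of the image, then sample from the contrasted distribution. This is a minimal illustration, not the repository's exact code; the `alpha` value is an assumption, and the paper's adaptive plausibility constraint is omitted here.

```python
import torch

def vcd_adjusted_logits(logits_original, logits_distorted, alpha=1.0):
    """Contrast logits from the original image against logits from a
    distorted image, amplifying visually grounded tokens.
    Both tensors: [batch, vocab] next-token logits from the same LVLM."""
    return (1 + alpha) * logits_original - alpha * logits_distorted

def sample_next_token(logits_original, logits_distorted, alpha=1.0):
    # Sample directly from the contrasted distribution (no temperature,
    # top-k, or top-p), matching the "direct sampling" setup discussed below.
    contrasted = vcd_adjusted_logits(logits_original, logits_distorted, alpha)
    probs = torch.softmax(contrasted, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```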

obananas commented 1 month ago

Specifically, we evaluated the unmodified LLaVA-1.5-7B model on the MSCOCO dataset using the POPE evaluation, with the following results (taking accuracy as an example): random-88.5, popular-87.3, adversarial-85.2. In contrast, the results for the Regular method in the VCD paper were: random-83.3, popular-81.8, adversarial-78.96, and after applying VCD, the results were: random-87.7, popular-85.38, adversarial-80.88. From these numbers, our evaluation not only surpasses the Regular results but also exceeds the VCD results. It is particularly important to note that the images and questions for the above evaluations come from the same POPE settings (VCD also uses the same images and questions), and the model used is likewise LLaVA-1.5-7B. Therefore, I would like to ask the authors whether the VCD method can have a negative impact.
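For context, the accuracies quoted here come from POPE's binary object-existence questions, where the model answers "yes" or "no" per question. A minimal sketch of that computation (function and argument names are illustrative, not the official evaluation script):

```python
def pope_accuracy(predictions, labels):
    """predictions/labels: equal-length lists of 'yes'/'no' strings per question."""
    assert len(predictions) == len(labels)
    correct = sum(
        p.strip().lower().startswith(l.strip().lower())
        for p, l in zip(predictions, labels)
    )
    return correct / len(labels)

# The numbers above (e.g. random-88.5) are accuracies of this form, as percentages.
```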

LengSicong commented 1 month ago

Hi, may I know the decoding strategy you are applying? Different decoding strategies may affect the baseline performance quite significantly. You can refer to our Appendix for more ablations.

obananas commented 1 month ago

Our decoding strategy is greedy decoding. In your Appendix, the ACC with VCD is 88.49, which is lower than the regular result of 88.5.

LengSicong commented 1 month ago

There are quite a few reasons that may cause this kind of small difference in performance (e.g., torch versions).

You can try to reproduce our method in your environment to see if our method can bring benefits.

If you have further questions, you can also upload your evaluation scripts and ckpts for discussion. Thanks.

obananas commented 1 month ago

What temperature coefficient was set for the LLaVA-1.5-7B model to obtain the results in Table 1 of your paper?

LengSicong commented 1 month ago

Our main paper states that we use direct sampling, without temperature scaling, top-k, or top-p sampling.
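To make the distinction concrete, here is a rough sketch of what "direct sampling" versus greedy decoding looks like with the HuggingFace `generate` API. The parameter values are illustrative assumptions, not the exact settings used in the paper, and `model`/`inputs` are assumed to be an already-loaded LVLM and its prepared prompt.

```python
# `model` and `inputs` are assumed to exist (e.g. a loaded LLaVA model and
# its tokenized multimodal prompt); values below are illustrative.

# Direct (multinomial) sampling: draw from the raw softmax distribution,
# i.e. temperature 1.0 with no top-k / top-p truncation.
out_sampled = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    top_p=1.0,
    top_k=0,          # 0 disables top-k filtering
    max_new_tokens=64,
)

# Greedy decoding: always pick the argmax token; deterministic given the inputs.
out_greedy = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=64,
)
```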

obananas commented 1 month ago

Could the authors please explain why all metrics dropped by 2-4% after employing the VCD method on LLaVA-1.5-7B/13B?

LengSicong commented 1 month ago

Could you please provide more details?

Moreover, for future inquiries, please include details up front to avoid confusion and save time for both of us. Thanks.

HaozheZhao commented 1 month ago

> Could you please provide more details?
>
> Moreover, for future inquiries, please include details up front to avoid confusion and save time for both of us. Thanks.

Hi Sicong, I also have a question: why are the results reported in your paper much lower than those reported in the original LLaVA-1.5 paper?

For example, the original LLaVA-1.5 paper reports POPE F1 scores on the three splits (Random, Adversarial, and Popular) of 87.3, 86.1, and 84.2.

But in your paper, the results are 81.33, 77.57, and 80.06.

lowestbuaaer commented 1 week ago

> Could you please provide more details? Moreover, for future inquiries, please include details up front to avoid confusion and save time for both of us. Thanks.

> Hi Sicong, I also have a question: why are the results reported in your paper much lower than those reported in the original LLaVA-1.5 paper?
>
> For example, the original LLaVA-1.5 paper reports POPE F1 scores on the three splits (Random, Adversarial, and Popular) of 87.3, 86.1, and 84.2.
>
> But in your paper, the results are 81.33, 77.57, and 80.06.

I'm also very confused. Could @LengSicong please explain and give more details?

LengSicong commented 6 days ago

Hi, the checkpoint version and the decoding strategy likely contribute the most.

Different decoding configurations, such as the temperature, top-p, or top-k values, can greatly affect the results.

lowestbuaaer commented 6 days ago

> Hi, the checkpoint version and the decoding strategy likely contribute the most.
>
> Different decoding configurations, such as the temperature, top-p, or top-k values, can greatly affect the results.

Thanks for your reply.

In my understanding, both the MME and POPE evaluations use greedy search to ensure reproducibility of results, so there shouldn't be issues related to sampling parameters like top-k. May I ask whether the weights you are using are the official LLaVA-v1.5-7B weights?

LengSicong commented 6 days ago

No, it's direct sampling.

And we've conducted 5 runs for each experiment and reported the average and standard deviation for reproducibility.
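Since direct sampling is stochastic, one way to mirror that protocol when reproducing the numbers is to repeat the evaluation under several seeds and report the mean and standard deviation. A sketch under that assumption; `run_pope_eval` is a hypothetical callable standing in for whatever evaluation script you are using:

```python
import statistics
import torch

def evaluate_over_seeds(run_pope_eval, seeds=(0, 1, 2, 3, 4)):
    """Run the stochastic evaluation once per seed and aggregate the results."""
    accuracies = []
    for seed in seeds:
        torch.manual_seed(seed)           # fix the sampling randomness for this run
        accuracies.append(run_pope_eval())  # assumed to return a single accuracy value
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```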