obananas opened 1 month ago
Could you provide more experimental details?
Moreover, our method is a decoding strategy that can be applied to any model to mitigate its hallucinations.
Specifically, we evaluated the performance of the unmodified LLaVA1.5-7B model on the MSCOCO dataset using the POPE evaluation, with the following results (taking accuracy as an example): random-88.5, popular-87.3, adversarial-85.2. In contrast, the results for the Regular method in the VCD paper were: random-83.3, popular-81.8, adversarial-78.96, and after applying VCD, the results were: random-87.7, popular-85.38, adversarial-80.88. From the above, the results we measured not only surpass the Regular results but also exceed the VCD results. It is particularly important to note that the images and questions for these evaluations come from the same POPE settings (VCD also uses the same images and questions), and the model is also LLaVA1.5-7B. Therefore, I would like to ask the authors whether the VCD method can have a negative impact.
Hi, may I know the decoding strategy you are applying? Different decoding strategies may affect the baseline performance quite significantly. You can refer to our Appendix for more ablations.
We used greedy decoding. In your Appendix, the ACC with VCD is 88.49, which is lower than the regular result of 88.5.
There are quite a few reasons that may cause this kind of small difference in performance (e.g., torch versions).
You can try to reproduce our method in your environment to see if our method can bring benefits.
If you have further questions, you can also upload your evaluation scripts and ckpts for discussion. Thanks.
What is the temperature coefficient set on the LLaVA-1.5-7B model to obtain the results in Table 1 of your paper?
Our main paper states that we use direct sampling without temperature normalization, top-k, or top-p sampling.
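To illustrate the distinction being discussed here (this is a toy sketch with my own function names and made-up logits, not the paper's code): greedy decoding always takes the argmax token, while direct sampling draws from the full softmax distribution with no temperature scaling or top-k/top-p truncation, so repeated runs can differ.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def greedy_step(logits):
    # Greedy decoding: deterministically pick the highest-logit token.
    return int(np.argmax(logits))

def direct_sample_step(logits, rng):
    # "Direct sampling": draw from the full softmax distribution,
    # with no temperature scaling and no top-k/top-p truncation.
    return int(rng.choice(len(logits), p=softmax(logits)))

logits = np.array([2.0, 1.0, 0.5, -1.0])
rng = np.random.default_rng(0)

print(greedy_step(logits))  # always token 0, every run
print(sorted({direct_sample_step(logits, rng) for _ in range(2000)}))
```

This is also why sampling-based results are usually reported as an average over several runs, whereas greedy results are single-run reproducible.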
Could the author please explain why all metrics dropped by 2-4% after employing the VCD method on LLaVA1.5-7B/13B?
Could you please provide more details?
Moreover, for future inquiries, please include details up front to avoid confusion and save time for both of us. Thanks.
Hi Sicong, I also wonder why the results reported in your paper are much lower than the performance reported in the original LLaVA-1.5 paper.
For example, the original LLaVA-1.5 paper reports POPE F1 scores on the three splits (Random, Adversarial, and Popular) as 87.3, 86.1, and 84.2.
But in your paper, the results are 81.33, 77.57, and 80.06.
Could you please provide more details? Moreover, for future inquiries, please include details up front to avoid confusion and save time for both of us. Thanks.
I'm also very confused. Could @LengSicong please explain and give more details?
Hi, the checkpoint version and the decoding strategies should contribute the most.
Different decoding configurations, such as the temperature, top-p, or top-k values, can greatly affect the results.
Thanks for your reply.
In my understanding, both the MME and POPE evaluations use greedy search to ensure the reproducibility of results, so there shouldn't be issues related to sampling parameters like top-k. May I ask if the weights you are using are the official weights of LLAVA-v1.5-7B?
No, it's direct sampling.
And we've conducted 5 runs for each experiment and reported the avg and std for reproducibility.
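For readers reproducing this, reporting over multiple sampling runs can look like the following (the accuracies here are placeholder numbers, not from the paper):

```python
import statistics

# Hypothetical per-run accuracies from 5 independent sampling runs
# (placeholder values, purely illustrative).
runs = [87.7, 87.5, 87.9, 87.6, 87.8]
avg = statistics.mean(runs)
std = statistics.stdev(runs)
print(f"{avg:.2f} ± {std:.2f}")  # prints "87.70 ± 0.16"
```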
Could you please confirm if there is an issue with the data on the Pope and MME datasets (for LLaVA-1.5-7b) in your study? I have been unable to replicate the results presented in the tables using the same model and methodology you described. Additionally, the results from your paper are not only inconsistent with my findings but also seem to be outperformed by a model with no modifications at all. Is this approach intended to be a method that negatively impacts performance?