DAMO-NLP-SG / VCD

[CVPR 2024 Highlight] Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Apache License 2.0

Multinomial Sampling May Not Ensure the Reliability of the Answer #10

Closed Eason-Qin closed 2 months ago

Eason-Qin commented 3 months ago

Dear Author,

Thank you for your inspiring work. I noticed a problem that might be worth a recheck or your clarification.

In your implementation, you use the prompt "Please answer this question with one word.", which means the model will very likely output only a single token, "yes" or "no".

However, your sampling method uses multinomial sampling instead of greedy decoding (https://github.com/DAMO-NLP-SG/VCD/blob/65a8fd771e9fbb9e26be5633c9d51db99222fbe7/experiments/eval/object_hallucination_vqa_llava.py#L75). This strategy samples the next token from a multinomial distribution rather than taking the argmax, so the predicted token (simply "yes" or "no") is not deterministic for the same question even when the random seed is fixed: the draw depends on the RNG state at the moment the question is answered.
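
For reference, here is a minimal sketch of the two decoding rules (PyTorch; the toy two-token vocabulary and numbers are illustrative, not taken from the repository):

```python
import torch

# Toy next-token distribution over a two-token vocabulary: ["yes", "no"]
logits = torch.tensor([1.2, 1.0])
probs = torch.softmax(logits, dim=-1)               # roughly [0.55, 0.45]

greedy_token = torch.argmax(probs).item()           # always index 0 ("yes")
sampled_token = torch.multinomial(probs, 1).item()  # index 0 ~55% of the time, index 1 ~45%
```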

For example, if you run question 7 of the POPE random set on its own, versus running the same question in POPE's original order, the model might give different answers.
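
To make the ordering effect concrete, a hypothetical sketch (not the repository's code): with the same seed, the draw a question receives depends on how many random numbers were consumed before it.

```python
import torch

probs = torch.tensor([0.55, 0.45])    # hypothetical "yes"/"no" distribution for question 7

# Question 7 evaluated on its own
torch.manual_seed(0)
alone = torch.multinomial(probs, 1).item()

# Question 7 evaluated after six earlier questions that also drew samples
torch.manual_seed(0)
for _ in range(6):
    torch.multinomial(probs, 1)       # questions 1-6 advance the RNG state
in_order = torch.multinomial(probs, 1).item()

print(alone, in_order)                # the two answers can differ despite the fixed seed
```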

Thank you.

LengSicong commented 2 months ago

Hi Eason, thanks for your interest in our work. Indeed, sampling can introduce randomness into the outputs.

That's why we ran each experiment several times and reported the average scores and standard deviations, to make the comparison as robust and fair as possible.
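
For instance (placeholder numbers only, not the paper's results), the reported figures are just the mean and standard deviation across repeated runs:

```python
import statistics

# Hypothetical accuracies from repeating one POPE evaluation three times
accuracies = [0.871, 0.868, 0.874]
print(f"{statistics.mean(accuracies):.3f} ± {statistics.stdev(accuracies):.3f}")
```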

You may also refer to our experiments that report results under greedy decoding, where the outputs and scores are always fixed.
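
For anyone who wants a fully deterministic run, one option is to switch the linked evaluation script to greedy decoding by turning sampling off in the `generate` call. A sketch assuming the Hugging Face `transformers` generation API that the LLaVA code builds on; `model` and `input_ids` stand in for the script's own variables:

```python
# Greedy decoding: each step takes the argmax token, so repeated runs
# (and different question orderings) give identical outputs.
output_ids = model.generate(
    input_ids,
    do_sample=False,     # disable multinomial sampling
    num_beams=1,         # plain greedy search, no beam search
    max_new_tokens=16,   # short budget; the answer is expected to be a single word
)
```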