Closed Eason-Qin closed 2 months ago
Hi Eason, thanks for your interest in our work. Indeed, sampling can introduce variance in the outputs.
That's why we ran each experiment several times and reported the average scores with standard deviations, to make the comparison as robust and fair as possible.
You may also refer to our experiments that report greedy-decoding results, where the output and score are fully deterministic.
Dear Author,
Thank you for your inspiring work. I noticed an issue that may be worth a recheck or a clarification.
In your implementation, you used the prompt " Please answer this question with one word.", so the model will very likely output only a single token, "yes" or "no".
However, your sampling method uses multinomial sampling instead of greedy decoding (https://github.com/DAMO-NLP-SG/VCD/blob/65a8fd771e9fbb9e26be5633c9d51db99222fbe7/experiments/eval/object_hallucination_vqa_llava.py#L75). This decoding strategy draws the next token from the multinomial distribution over the vocabulary rather than taking the argmax. Therefore the predicted token (simply "yes" or "no") is not deterministic for the same question, even when the random seed is fixed, because the sample depends on the RNG state at the moment that question is answered.
For example, if you run question 7 of the POPE random set on its own, versus running it in POPE's original order, the model may give different answers.
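To make the point concrete, here is a minimal pure-Python sketch (not the VCD code; the token probabilities are made up) showing why multinomial sampling with a fixed seed can still flip a yes/no answer when the evaluation order changes, while greedy decoding cannot:

```python
import random

def sample_token(probs, rng):
    """Multinomial sampling: draw the next token from the full distribution."""
    r = rng.random()
    cum = 0.0
    for tok, p in probs.items():
        cum += p
        if r < cum:
            return tok
    return tok  # fallback for floating-point round-off

def greedy_token(probs):
    """Greedy decoding: always pick the highest-probability token."""
    return max(probs, key=probs.get)

# Hypothetical next-token distribution for a POPE-style yes/no answer.
probs = {"yes": 0.6, "no": 0.4}

# Run 1: question answered first, right after seeding.
rng = random.Random(42)
first = sample_token(probs, rng)

# Run 2: same seed, but one earlier question already consumed a draw,
# so the generator is in a different state when this question arrives.
rng2 = random.Random(42)
_ = rng2.random()
shifted = sample_token(probs, rng2)

print(first, shifted)           # the two sampled answers can differ
print(greedy_token(probs))      # greedy is always "yes" for this distribution
```

With seed 42, the first draw happens to land on "no" and the shifted draw on "yes", so the same question gets two different answers under the same seed; greedy decoding is unaffected by RNG state.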
Thank you.