Closed QiushiYang closed 6 months ago
Hi, thanks for your interest in our work.
We apologize for a typo in Fig. 2: the star emoji 🌟 should be placed next to the token candidate _arrow, which is the token that faithfully describes the image (even though it contradicts common sense). We will fix this typo immediately.
In Eq. 6, we add the scaled logit-subtraction result to the original logits predicted for the test image, helping the MLLM to 'spot the difference' without excessively modifying the original logit distribution.
We do not deny that token candidates with high confidence scores across all images can be genuinely correct visual content. However, in the qualitative evidence we provide, we observe that such candidates are more likely to be visual hallucinations (e.g., the pen does not appear in the images in Fig. 2, and the juice does not appear in Fig. 9) than candidates with a high confidence score only in the test image.
We also tried adding the scaled SUM/MEAN predictions to the original prediction (exp. 5 in Table 4), i.e., encouraging the MLLM to emphasize the common visual concepts, but all metrics dropped.
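For clarity, here is a minimal sketch of the two strategies discussed above (the subtraction in Eq. 6 vs. the SUM/MEAN ablation). Function names, the scaling factor `alpha`, and the plain-list representation of logits are all assumptions for illustration, not the paper's actual implementation:

```python
def spot_the_difference(test_logits, similar_logits, alpha=0.5):
    """Eq. 6 sketch (assumed form): add the scaled difference between the
    test image's logits and the mean logits of similar images back onto
    the test image's original logits."""
    # per-token mean over the retrieved similar images
    mean_sim = [sum(col) / len(col) for col in zip(*similar_logits)]
    return [lt + alpha * (lt - ms) for lt, ms in zip(test_logits, mean_sim)]


def sum_mean_variant(test_logits, similar_logits, alpha=0.5):
    """Ablation sketch (exp. 5, Table 4): add the scaled MEAN prediction
    instead of subtracting it, emphasizing common visual concepts."""
    mean_sim = [sum(col) / len(col) for col in zip(*similar_logits)]
    return [lt + alpha * ms for lt, ms in zip(test_logits, mean_sim)]
```

Intuitively, a token that scores high on every image (a candidate hallucination) gains nothing from the subtraction, while a token that scores high only on the test image is amplified; the SUM/MEAN variant does the opposite, which matches the metric drop we observed.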
Thank you so much for your quick reply and detailed explanations. I now understand the insight behind the proposed method: it likely relies on the fact that visual hallucinations tend to be caused by visual bias (a table or book usually co-occurs with a pen), while the corresponding ground-truth content (arrow) rarely occurs.
Overall, Pensieve is an interesting work. Thanks a lot again!
The idea of this work is very interesting!
However, I have two questions about the method:
(1) What is the ground-truth caption of the image in Fig. 2? Is the word "feather" correct? (I am not sure whether the image shows a feather pen or just a common red pen.)
(2) This work claims that "semantically similar images may share similar visual hallucinations", but those similar images also share similar correct visual content. Regarding the visual-concept comparison in Eq. (6), I am concerned that subtracting the similar images' predictions from the original image's may remove correct predictions (e.g., both the original and similar images give high confidence scores to "pen"). Instead, I would guess that the SUM/MEAN of predictions from the original and similar images could mitigate hallucinations while preserving the correct parts?