Open laserwave opened 4 months ago
Hi, I apologize for the delayed reply; I am currently occupied with graduation preparations and related travel.
Thanks for your kind opinion. In my view, the POPE benchmark may not be optimal for evaluating hallucination because its scores are excessively high and vary very little. Alternative benchmarks may indeed be more suitable for these assessments (for more information, please refer to https://arxiv.org/pdf/2312.00849). After my vacation, I will add evaluation results on these related benchmarks if possible.
Hi, nice work.
In Table 7, you report the POPE results, which decreased in some sets of experiments (comparing with and without). Since your method assigns low weights to contradictory text tokens, I would expect the hallucination benchmark metrics to increase.
Do you have any comments on this? Thank you.