Peilun-Li opened 8 months ago
Hey @Peilun-Li! Thanks for the detailed description (the most detailed I've seen!). Yes, the expected behavior is 1 for your scenario; this is something we're also looking to fix.
I see the problem with the hallucination detection. It seems we can improve it by providing more examples in the prompt template to identify factual misalignments, but there will undoubtedly still be a few cases where it's not 100% accurate.
I'll reproduce it on my end. In the meantime, you can try GEval for summarization - it was originally evaluated on a summarization task in the paper, so it should work well.
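For reference, here's a minimal sketch of what a GEval-based summarization check might look like (the criteria wording and parameter choices are illustrative, not a prescribed configuration):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Illustrative criteria; tune the wording for your own summarization task.
summarization_geval = GEval(
    name="Summarization",
    criteria=(
        "Determine whether the actual output is a factually accurate and "
        "sufficiently complete summary of the input."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(input="<source document>", actual_output="<candidate summary>")
summarization_geval.measure(test_case)
print(summarization_geval.score, summarization_geval.reason)
```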
Describe the bug
(First, thanks for the awesome work! We found a potential analytical caveat in the summarization metric during our prototyping and want to report it in case it helps.) When provided with exactly the same input and output, the summarization metric generates a score lower than 1.
To Reproduce
Here is some synthetic call transcript data generated by GPT:
Using this data with the summarization metric, providing it as both the input and actual_output (a reproduction sketch follows):
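A minimal sketch of the setup, assuming the deepeval SummarizationMetric API; the transcript placeholder and the grader model choice are assumptions standing in for the elided details:

```python
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

# Placeholder for the GPT-generated synthetic call transcript (elided here).
transcript = "..."

# Same text passed as both the source document and the "summary".
test_case = LLMTestCase(input=transcript, actual_output=transcript)

metric = SummarizationMetric(model="gpt-4")  # grader model choice is an assumption
metric.measure(test_case)
print(metric.score, metric.reason)
```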
Based on the algorithm, we'd expect the score to be 1. However, there is some randomness: across multiple runs the score is sometimes 1, but more often falls in the range of 0.7-0.95, even with the model temperature set to 0.
Digging a bit deeper, the cause seems to be randomness in how the grader model generates claims. E.g., for a run with the following score and reason:
The internal values are
Pulling out the problematic claim for better visualization
From the raw text we can see that Jordan does work from home, so ideally this fact would be counted as aligned. It looks like randomness or paraphrasing by the grader model unexpectedly impacts the score.
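For anyone digging into the same internals, here's a rough sketch of how they can be inspected after measure(); the attribute names are assumptions and may differ across deepeval versions:

```python
# Attribute names below are assumptions; check your deepeval version's SummarizationMetric.
metric.measure(test_case)
print("score:", metric.score)
print("reason:", metric.reason)
# Claims extracted from the input vs. from the actual_output, if exposed.
print("truths:", getattr(metric, "truths", None))
print("claims:", getattr(metric, "claims", None))
```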
Expected behavior
Summarization score = 1 for exactly the same input and output.
Additional context
There's another meta-question: is the expected behavior above (that the summarization score = 1 for exactly the same input and output) actually right for the summarization task? Logically this is a hack (a do-nothing summary) that should arguably score low because of its low (1:1) compression rate. Any plans to improve the summarization metric to incorporate such aspects? Thanks!
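For what it's worth, a purely hypothetical sketch of the kind of compression-rate penalty being suggested (not an existing deepeval feature; the target ratio and linear decay are illustrative choices):

```python
def compression_penalty(source: str, summary: str, target_ratio: float = 0.3) -> float:
    """Multiplier that penalizes summaries that are not much shorter than the source.

    Purely illustrative: a 1:1 "do-nothing" summary (ratio ~1.0) gets a multiplier
    near 0, while summaries at or below the target compression ratio are untouched.
    """
    ratio = len(summary) / max(len(source), 1)
    if ratio <= target_ratio:
        return 1.0
    # Linearly decay from 1.0 at the target ratio down to 0.0 at ratio 1.0.
    return max(0.0, (1.0 - ratio) / (1.0 - target_ratio))

# Example: identical input and output gives ratio 1.0, so the multiplier is 0.0.
adjusted_score = metric.score * compression_penalty(transcript, transcript)
```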