confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

Summarization metric generates <1 score for exactly the same input and output #565

Open Peilun-Li opened 8 months ago

Peilun-Li commented 8 months ago

Describe the bug
(First, thanks for the awesome work! We found a potential analytical caveat in the summarization metric during our prototyping and want to report it in case it helps.) When provided with exactly the same input and output, the summarization metric generates a score lower than 1.

To Reproduce
Here is a synthetic call transcript generated by GPT:

call_transcript = """
Agent: Good afternoon! This is Alex from Sunshine Realty. How may I assist you today?

Buyer: Hi Alex, my name is Jordan. I'm looking to buy a house in the Greater Seattle area and I came across your contact online.

Agent: Wonderful to hear from you, Jordan! Seattle is a great choice. Do you have any particular neighborhoods in mind?

Buyer: I've been looking at a few areas like Bellevue, Kirkland, and Queen Anne. I'm looking for a place that's family-friendly since I have two kids.

Agent: Those are excellent choices, each with its unique charm and amenities. What type of home are you looking for, and what's your budget?

Buyer: I'm interested in a single-family home, preferably with at least three bedrooms and a nice backyard. My budget is around $800,000 to $1,000,000.

Agent: That's a healthy budget for those areas. We can certainly find something that meets your needs. How soon are you looking to move?

Buyer: Ideally, I'd like to move within the next six months. I'm starting a new job in the area and would like to settle in as soon as possible.

Agent: Six months gives us a good timeframe to work with. Have you been pre-approved for a mortgage yet?

Buyer: Not yet, but I have a meeting with my bank next week to discuss the pre-approval process.

Agent: That's a great first step. Getting pre-approved will give you a better idea of what you can afford and makes you a more attractive buyer to sellers.

Buyer: I've also been doing some research on schools in the area. Do you have any insights on the school districts?

Agent: Absolutely, the Bellevue School District is highly rated, and both Kirkland and Queen Anne have excellent schools as well. I can provide you with more detailed information on each district if you'd like.

Buyer: That would be great, thanks. I also work from home, so having a space that can be used as a home office is important.

Agent: Understood. Many homes in these areas have extra rooms or spaces that can be easily converted into a home office. I'll make sure to include that in our search criteria.

Buyer: Perfect. What are the next steps from here?

Agent: I'll start by sending you a list of current listings that match your criteria. We can then schedule some viewings for you to get a feel for the homes and neighborhoods.

Buyer: Sounds good. How do I get the listings?

Agent: I'll email them to you. Can I get the best email address to send those to?

Buyer: Sure, it's jordan@email.com.

Agent: Great, I'll send those over shortly. In the meantime, feel free to reach out if you have any questions or if you see a listing elsewhere that you're curious about.

Buyer: Will do. Thank you for your help, Alex. I'm looking forward to working with you.

Agent: It's my pleasure, Jordan. I'm here to help you find the perfect home for your family. We'll be in touch soon. Have a great day!

Buyer: You too, goodbye.

Agent: Goodbye!
"""

Using this data with the summarization metric, providing it as both the input and the actual_output:

from langchain_openai import ChatOpenAI
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

# Grader model with temperature 0 to minimize randomness
model = ChatOpenAI(temperature=0, model_name="gpt-4-0125-preview")

# The transcript is passed as both the input and the actual_output
test_case = LLMTestCase(input=call_transcript, actual_output=call_transcript)
metric = SummarizationMetric(
    threshold=0.5,
    model=model
)
metric.measure(test_case)

Based on the algorithm, we'd expect the score to be 1. However, there is some randomness in the score: across multiple runs it is sometimes 1, but much more often falls in the range of 0.7-0.95, even with the model temperature set to 0.
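A loop like the following (reusing the metric setup and test case above; the run count and example scores are purely illustrative) can be used to observe the spread:

# Illustrative only: re-run the metric a few times on the identical
# input/output pair and collect the scores to see the variance.
scores = []
for _ in range(5):
    m = SummarizationMetric(threshold=0.5, model=model)
    m.measure(test_case)
    scores.append(m.score)

print(scores)  # e.g. [1.0, 0.95, 0.89, ...] -- actual values vary per run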

Digging a bit deeper, the cause appears to be randomness when the grader model generates claims. For example, one run produced the following score and reason:

print(metric.score)
print(metric.reason)
0.9473684210526315
The score is 0.95 because the summary accurately reflects the content of the original text with only a minor contradiction regarding Jordan's work situation. It mistakenly claims Jordan works from home, whereas the original text only mentions the importance of having a home office space without specifying Jordan's work location. There is no extra information added in the summary that wasn't mentioned in the original text, indicating a high level of fidelity to the source material.

The internal values are:

alignment_verdicts [SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='no', reason='The summary claims Jordan works from home, which is not mentioned in the original text. The original text states having a space that can be used as a home office is important to Jordan, but does not specify that Jordan works from home.'), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None), SummarizationAlignmentVerdict(verdict='yes', reason=None)]
claims ['Alex is from Sunshine Realty.', 'Jordan is looking to buy a house in the Greater Seattle area.', "Jordan found Alex's contact online.", 'Jordan is interested in neighborhoods like Bellevue, Kirkland, and Queen Anne.', 'Jordan is looking for a family-friendly place.', 'Jordan has two kids.', 'Jordan is interested in a single-family home with at least three bedrooms and a nice backyard.', "Jordan's budget is around $800,000 to $1,000,000.", 'Jordan is looking to move within the next six months.', 'Jordan is starting a new job in the area.', 'Jordan has not been pre-approved for a mortgage yet.', 'Jordan has a meeting with the bank next week to discuss the pre-approval process.', 'The Bellevue School District is highly rated.', 'Kirkland and Queen Anne have excellent schools.', 'Jordan works from home.', "Alex will send Jordan a list of current listings that match Jordan's criteria.", 'Alex will schedule some viewings for Jordan.', 'Alex will email the listings to Jordan at jordan@email.com.', 'Jordan is looking forward to working with Alex.']
truths ['Alex is from Sunshine Realty.', 'Jordan is looking to buy a house in the Greater Seattle area.', "Jordan found Alex's contact online.", 'Jordan is interested in neighborhoods like Bellevue, Kirkland, and Queen Anne.', 'Jordan is looking for a family-friendly place.', 'Jordan has two kids.', 'Jordan is interested in a single-family home with at least three bedrooms and a nice backyard.', "Jordan's budget for the house is around $800,000 to $1,000,000.", 'Jordan is looking to move within the next six months.', 'Jordan is starting a new job in the area.', 'Jordan has not been pre-approved for a mortgage yet.', 'Jordan has a meeting with the bank next week to discuss the pre-approval process.', 'Jordan has been doing research on schools in the area.', 'The Bellevue School District is highly rated.', 'Kirkland and Queen Anne have excellent schools.', 'Having a space that can be used as a home office is important to Jordan.', "Alex will send Jordan a list of current listings that match Jordan's criteria.", 'Alex will schedule some viewings for Jordan.', 'Alex will email the listings to Jordan at jordan@email.com.', 'Jordan is looking forward to working with Alex.']
score_breakdown {'Alignment': 0.9473684210526315, 'Coverage': 1.0}
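For context, the relationship between these internal values and the final score appears to be as follows (a minimal sketch inferred from the numbers above, not a copy of deepeval's implementation): alignment is the fraction of extracted claims with a "yes" verdict, and the final score is the minimum of alignment and coverage.

# Minimal sketch, assuming score = min(alignment, coverage) as implied
# by the score_breakdown above (not deepeval's actual source code).
yes_verdicts = 18                          # 'yes' alignment verdicts above
total_claims = 19                          # total extracted claims
alignment = yes_verdicts / total_claims    # 0.9473684210526315
coverage = 1.0                             # from score_breakdown
score = min(alignment, coverage)           # 0.9473684210526315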

Pulling out the problematic claim for better visualization:

alignment_verdicts [SummarizationAlignmentVerdict(verdict='no', reason='The summary claims Jordan works from home, which is not mentioned in the original text. The original text states having a space that can be used as a home office is important to Jordan, but does not specify that Jordan works from home.')]
claims ['Jordan works from home.']
truths ['Having a space that can be used as a home office is important to Jordan.']

From the raw text we can see that Jordan does work from home ("I also work from home..."), so ideally this claim should be judged as aligned. It looks like randomness or paraphrasing by the grader model impacts the score unexpectedly.

Expected behavior
Summarization score = 1 for exactly the same input and output.

Additional context
There's also a meta-question: is the expected behavior (a summarization score of 1 for exactly the same input and output) actually right for the summarization task? Logically, a do-nothing summary is a hack that arguably should score low because of its 1:1 compression rate. Are there any plans to improve the summarization metric to incorporate such aspects? Thanks!
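To illustrate the compression-rate point, a naive length-ratio check (purely hypothetical, not part of deepeval) would immediately flag a do-nothing summary:

# Hypothetical helper, not part of deepeval: a do-nothing summary has a
# compression ratio of 1.0, i.e. no reduction at all.
def compression_ratio(source: str, summary: str) -> float:
    return len(summary.split()) / max(len(source.split()), 1)

print(compression_ratio(call_transcript, call_transcript))  # 1.0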

penguine-ip commented 8 months ago

Hey @Peilun-Li! Thanks for the detailed description (the most detailed I've seen!). Yes, the expected behavior is 1 for your scenario; this is something we're also looking to fix.

I see the problem with the hallucination detection. It seems like we can improve it by providing more examples in the prompt template to identify factual misalignments, but there will undoubtedly still be a few cases where it's not 100% accurate.
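To make that concrete, the kind of few-shot example that could be added to a claim-verification prompt might look like this (illustrative wording only, not deepeval's actual prompt template):

# Illustrative only -- not deepeval's prompt template. A few-shot example
# teaching the grader that a paraphrased claim can still be supported.
FEW_SHOT_EXAMPLE = """
Original text: "I also work from home, so having a space that can be used as a home office is important."
Claim: "Jordan works from home."
Verdict: yes (the claim is directly supported, even though it is paraphrased)
"""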

I'll reproduce it on my end, and in the meantime you can try GEval for summarization; it was originally tested on a summarization task in the paper, so it should work well.
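A rough sketch of the GEval route might look like the following (the criteria string and parameter choices are just examples, not an official recommendation):

# Rough sketch of using GEval for summarization; the criteria wording and
# parameter values here are illustrative, not an official recipe.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

summarization_geval = GEval(
    name="Summarization",
    criteria="Determine whether the actual output is a faithful and concise summary of the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
)
test_case = LLMTestCase(input=call_transcript, actual_output=call_transcript)
summarization_geval.measure(test_case)
print(summarization_geval.score, summarization_geval.reason)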