confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/

SummarizationMetric changing score and reasoning methodology at random #937

Open VVinayak opened 1 month ago

VVinayak commented 1 month ago

Describe the bug

  1. Summarization scores change depending on:
    • the type of assessment questions being asked
    • the number of assessment questions provided
    • whether the developer provides the questions or they are generated by the evaluator LLM
  2. The scores stay constant as long as none of the above parameters change, which suggests some kind of unintended caching of LLM responses (see the sketch after this list), despite what is stated at the top of the DeepEval Summarization page: https://docs.confident-ai.com/docs/metrics-summarization
  3. The reasoning behind the scores is inconsistent: the metric sometimes assigns a score of 0 when there are no contradictions, unclear answers, or hallucinations, whereas the documentation implies a score of 1 (or at least a value greater than 0) in such cases
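
To separate genuine LLM non-determinism from the suspected caching, a minimal sketch along the lines below can be used. It assumes the same document and summary strings as in the reproduction further down and only uses the SummarizationMetric / LLMTestCase API shown there. If a second, freshly constructed metric reproduces the first score exactly on every rerun, that points at caching rather than ordinary judge-model variance.

from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

# Placeholders: substitute the full document/summary strings from the
# reproduction below.
document = "The 'coverage score' is calculated as the percentage of ..."
summary = "The coverage score quantifies how well a summary captures ..."

test_case = LLMTestCase(input=document, actual_output=summary)

questions = [
    "Is the coverage score based on a percentage of 'yes' answers?",
    "Does a higher score mean a more comprehensive summary?",
]

# Two independent metric instances with identical configuration.
# Identical scores and reasons on every rerun would suggest cached results;
# some spread is expected from an LLM-as-judge metric.
for run in range(2):
    metric = SummarizationMetric(
        threshold=0.5,
        model="gpt-4",
        assessment_questions=questions,
    )
    metric.measure(test_case)
    print(run, metric.score, metric.reason)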

To Reproduce

Steps to reproduce the behavior:

from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

# This is the original text to be summarized
input = """
The 'coverage score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher coverage score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original content.
"""

# This is the summary, replace this with the actual output from your LLM application
actual_output="""
The coverage score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""

assessment_questions_3 = [
    "Is the coverage score based on a percentage of 'yes' answers?",
    "Does the score ensure the summary's accuracy with the source?",
    "Does a higher score mean a more comprehensive summary?"
]

assessment_questions_5 = [
    "Is the coverage score calculated as a percentage of assessment questions?",
    "Does the coverage score reflect how well a summary represents the original text?",
    "Does a higher coverage score indicate a more comprehensive summary?",
    "Does the coverage score quantify how poorly a summary captures key information?",
    "Does a higher coverage score signify that a summary effectively encapsulates crucial points?"
]

# Case 1: 3 assessment questions provided by the developer
test_case_3questions = LLMTestCase(input=input, actual_output=actual_output)
metric_3q = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    assessment_questions=assessment_questions_3
)

metric_3q.measure(test_case_3questions)
print(metric_3q.score)
print(metric_3q.reason)

# Case 2: 5 assessment questions provided by the developer
test_case_5questions = LLMTestCase(input=input, actual_output=actual_output)
metric_5q = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    assessment_questions=assessment_questions_5
)

metric_5q.measure(test_case_5questions)
print(metric_5q.score)
print(metric_5q.reason)

# Case 3: 5 assessment questions generated by the evaluator LLM
test_case_5own = LLMTestCase(input=input, actual_output=actual_output)
metric_5ownq = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    n=5
)

metric_5ownq.measure(test_case_5own)
print(metric_5ownq.score)
print(metric_5ownq.reason)

# Case 4: 3 assessment questions generated by the evaluator LLM
test_case_3own = LLMTestCase(input=input, actual_output=actual_output)
metric_3ownq = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    n=3
)

metric_3ownq.measure(test_case_3own)
print(metric_3ownq.score)
print(metric_3ownq.reason)

Expected behavior

For a fixed input, summary, and metric configuration, the score and reason should be stable across runs (or vary only within normal LLM variance), and they should follow the documented coverage methodology, e.g. a score of 1 rather than 0 when no contradictions, unclear answers, or hallucinations are reported.
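
For reference, the coverage calculation quoted from the documentation in the 'input' text above (the percentage of assessment questions to which both the summary and the original document answer 'yes') reduces to a simple fraction. The helper below is only an illustration of that expectation, not deepeval's internal implementation:

from typing import List, Tuple

def expected_coverage_score(answers: List[Tuple[bool, bool]]) -> float:
    # answers[i] = (summary answers 'yes', original document answers 'yes')
    # for assessment question i, per the docs excerpt quoted in 'input' above.
    if not answers:
        return 0.0
    both_yes = sum(1 for summary_yes, source_yes in answers if summary_yes and source_yes)
    return both_yes / len(answers)

# Three questions, all answered 'yes' by both texts -> 1.0, which is why a
# returned score of 0 with no reported issues looks inconsistent.
print(expected_coverage_score([(True, True), (True, True), (True, True)]))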

Screenshots

(Screenshots omitted; their captions were:)

  • 0.67 is "bad"
  • 0.00 is "good"
  • 5 questions
  • 3 questions
  • Questions are provided by me
  • LLM produces assessment questions


sam-fletcher commented 3 weeks ago

I've encountered the same behaviour. The scores often contradict the explanation given. The feature is effectively unusable at the moment.