confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

GEval not focusing on expected_output & Relying on OpenAI instead #1149

Open pavan-growexxer opened 4 days ago

pavan-growexxer commented 4 days ago

BUG While testing DeepEval's GEval metric on complex queries, especially ones where LLMs fail to answer, I ran into an issue where DeepEval overlooks the provided expected_output and instead relies on, and scores according to, the knowledge of the LLM used for evaluation.

Below are two of the queries, with the code and the evaluations returned by DeepEval, where DeepEval failed to score according to the evaluation steps provided.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase

correctness_metric = GEval(
    name="Correctness",
    evaluation_steps=[
        "Compare the 'actual output' directly with the 'expected output' to determine if the main answer aligns factually without focusing on length.",
        "Consider only the main answer's factual content in 'actual output' and ignore any additional details, reasoning, or verbosity beyond the expected output.",
        "Ensure the 'actual output' does not introduce any factual errors or contradictions in relation to the 'expected output'.",
        "Treat 'expected output' as the ideal and only correct answer; do not reference any external knowledge or other answers in scoring.",
        "Do not penalize for missing explanation as long as the main factual answer in 'actual output' is accurate and agrees entirely with the 'expected output'."
    ],
    evaluation_params=[LLMTestCaseParams.EXPECTED_OUTPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

test_case1 = LLMTestCase(
    input="If I am not not not not not hungry, do I want to eat?",
    actual_output="I'm not hungry.",
    expected_output="If you're "not not not not not hungry," you do not want to eat.")
correctness_metric.measure(test_case1)
print(correctness_metric.score)
print(correctness_metric.reason)

Output: 0.3142030521814375
Reason: The actual output contradicts the expected output, which states not wanting to eat, while the actual output states not being hungry, implying a different meaning.

Expected behaviour: The actual_output generated by the LLM means the same as the expected answer and should be rated close to 1.

test_case2 = LLMTestCase(
    input="Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you “Do you want to pick door No. 2 instead?” Is it to your advantage to switch your choice?",
    actual_output="No.",
    expected_output="It is not an advantage to switch. It makes no difference if I switch or not because no additional material information has been provided since the initial choice.")
correctness_metric.measure(test_case2)
print(correctness_metric.score)
print(correctness_metric.reason)

Output: 0.6261116457045666
Reason: The main answer 'No.' implies that switching is not advantageous, aligning with the expected output, but lacks the additional context provided in the expected output.

Expected behaviour: The score should be close to 1, since the evaluation steps explicitly say not to penalize for missing reasoning/context when the actual_output agrees with the expected_output.


Could anyone help explain the reason for this, and how I can get custom scoring according to my rules?

penguine-ip commented 2 days ago

@pavan-growexxer Hey! I don't think your examples are convincing enough to call them bugs. For example, in the first one the expected output is "If you're "not not not not not hungry," you do not want to eat.", which is different from "I'm not hungry.". Now, they both imply the same meaning, but your evaluation steps say to judge based on "if the main answer aligns factually". Without looking at the input (which is the case here, since you didn't supply LLMTestCaseParams.INPUT to GEval), I would actually say the actual output is far from the expected output.
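
If you want the judge to see the question as well, a minimal tweak (your metric unchanged, just with INPUT added to evaluation_params) would look something like this:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    evaluation_steps=[
        # same steps as in your snippet, trimmed here for brevity
        "Compare the 'actual output' directly with the 'expected output' to determine if the main answer aligns factually without focusing on length.",
        "Do not penalize for missing explanation as long as the main factual answer in 'actual output' is accurate and agrees entirely with the 'expected output'."
    ],
    # INPUT lets the judge interpret the answer in the context of the question
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ]
)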

What are your thoughts?

pavan-growexxer commented 1 day ago

I have made a few changes to the code:

But it still penalizes for missing reasoning, even though the steps explicitly say not to penalize for it. Could you explain the reason for that?

Below is the updated code, which returns G-Eval scores around 0.65-0.8:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase

correctness_metric = GEval(
    name="Correctness",
    evaluation_steps=[
        "Compare the 'actual output' directly with the 'expected output' to determine if the main answer aligns factually."
        "Do not penalize for any missing explanation, details, reasoning, or verbosity.",
        "Ensure the 'actual output' does not introduce any factual errors or contradictions in relation to the 'expected output'.",
    ],
    # evaluation_steps=[
    #     "Compare the 'actual output' directly with the 'expected output' to determine if the main answer aligns factually without focusing on length.",
    #     "Consider only the main answer's factual content in 'actual output' and ignore any additional details, reasoning, or verbosity beyond the expected output.",
    #     "Ensure the 'actual output' does not introduce any factual errors or contradictions in relation to the 'expected output'.",
    #     "Treat 'expected output' as the ideal and only correct answer; do not reference any external knowledge or other answers in scoring.",
    #     "Do not penalize for missing explanation as long as the main factual answer in 'actual output' is accurate and agrees entirely with the 'expected output'."
    # ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.EXPECTED_OUTPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

test_case1 = LLMTestCase(
    input="If I am not not not not not hungry, do I want to eat?",
    actual_output="I'm not hungry.",
    expected_output="""If you're "not not not not not hungry," you are not hungry.""")

correctness_metric.measure(test_case1)
print(correctness_metric.score)
print(correctness_metric.reason)

test_case2 = LLMTestCase(
    input="Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you “Do you want to pick door No. 2 instead?” Is it to your advantage to switch your choice?",
    actual_output="No.",
    expected_output="""It is not an advantage to switch. It makes no difference if I switch or not because no additional material information has been provided since the initial choice.""")

correctness_metric.measure(test_case2)
print(correctness_metric.score)
print(correctness_metric.reason)

These updated scores help, but they are still not completely accurate. Could you provide some clarity on how to configure the metric to score entirely by our conditions, i.e. (for my use case) focus only on the expected answer without external knowledge, and give full score to a factually correct answer without penalising missing explanations, perhaps with an example if possible?
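
For example, would forcing an all-or-nothing score be the recommended way? The sketch below is only my guess from the GEval parameters I found (threshold and strict_mode), so please correct me if that is not what they are meant for:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

binary_correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Compare the 'actual output' directly with the 'expected output' to determine if the main answer aligns factually.",
        "Do not penalize for any missing explanation, details, reasoning, or verbosity.",
        "Ensure the 'actual output' does not introduce any factual errors or contradictions in relation to the 'expected output'.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.5,     # pass/fail cut-off
    strict_mode=True,  # my understanding: forces a binary 0/1 score instead of a graded one
)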