confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

answer_relevancy_ metric always "None" #981

Open cpolcino opened 3 months ago

cpolcino commented 3 months ago

Describe the bug I'm working locally with Ollama and want to evaluate my model, but the returned score is always None. The strangest part is that the debugging output does show a relevancy score.

To Reproduce

Steps to reproduce the behavior:

import requests
import logging
import traceback

from deepeval.models import DeepEvalBaseLLM
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

class CustomLlama3_1(DeepEvalBaseLLM):
    def __init__(self):
        self.model_name = "llama3.1:latest"
        self.api_url = "http://localhost:11434/api/generate"

    def generate(self, prompt: str) -> str:
        logger.debug(f"Generating response for prompt: {prompt}")
        payload = {
            "model": self.model_name,
            "prompt": prompt,
            "stream": False
        }
        response = requests.post(self.api_url, json=payload)
        if response.status_code == 200:
            generated_text = response.json()["response"]
            logger.debug(f"Generated response: {generated_text}")
            return generated_text
        else:
            raise Exception(f"Error calling Ollama API: {response.text}")

    def load_model(self):
        return self

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "Llama 3.1"

def test_answer_relevancy():
    custom_llm = CustomLlama3_1()
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5, model=custom_llm)

    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        expected_output="Paris is the capital of France.",
        context=["Paris is the capital and most populous city of France."]
    )

    logger.info("Measuring answer relevancy...")
    try:
        result = answer_relevancy_metric.measure(test_case)
        logger.info(f"Answer relevancy score: {result}")

        if result is None:
            logger.error("AnswerRelevancyMetric returned None. This might indicate an internal error.")
            return

        assert result >= answer_relevancy_metric.threshold, f"Answer relevancy score {result} is below the threshold {answer_relevancy_metric.threshold}"
        logger.info("Test passed successfully!")
    except Exception as e:
        logger.error(f"An error occurred during the measurement: {str(e)}")
        logger.error(traceback.format_exc())

if name == "main": test_answer_relevancy()

Expected behavior I would like to get a number (a score) as output, not None.

Screenshots Output screen:

INFO:__main__:Measuring answer relevancy...

Event loop is already running. Applying nest_asyncio patch to allow async execution...

DEBUG:__main__:Generating response for prompt: Given the text, breakdown and generate a list of statements presented. Ambiguous statements and single words can also be considered as statements.

Example: Example text: Shoes. The shoes can be refunded at no extra cost. Thanks for asking the question!

{ "statements": ["Shoes.", "Shoes can be refunded at no extra cost", "Thanks for asking the question!"] } ===== END OF EXAMPLE ======

IMPORTANT: Please make sure to only return in JSON format, with the "statements" key mapping to a list of strings. No words or explanation is needed.

Text: The capital of France is Paris.

JSON:

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:11434
DEBUG:urllib3.connectionpool:http://localhost:11434 "POST /api/generate HTTP/1.1" 200 1098
DEBUG:__main__:Generated response: { "statements": ["The capital of France", "is Paris."] }

DEBUG:__main__:Generating response for prompt: For the provided list of statements, determine whether each statement is relevant to address the input. Please generate a list of JSON with two keys: verdict and reason. The 'verdict' key should STRICTLY be either a 'yes', 'idk' or 'no'. Answer 'yes' if the statement is relevant to addressing the original input, 'no' if the statement is irrelevant, and 'idk' if it is ambiguous (eg., not directly relevant but could be used as a supporting point to address the input). The 'reason' is the reason for the verdict. Provide a 'reason' ONLY if the answer is 'no'. The provided statements are statements made in the actual output.

** IMPORTANT: Please make sure to only return in JSON format, with the 'verdicts' key mapping to a list of JSON objects. Example input: What should I do if there is an earthquake? Example statements: ["Shoes.", "Thanks for asking the question!", "Is there anything else I can help you with?", "Duck and hide"] Example JSON: { "verdicts": [ { "verdict": "no", "reason": "The 'Shoes.' statement made in the actual output is completely irrelevant to the input, which asks about what to do in the event of an earthquake." }, { "verdict": "idk" }, { "verdict": "idk" }, { "verdict": "yes" } ]
}

Since you are going to generate a verdict for each statement, the number of 'verdicts' SHOULD BE STRICTLY EQUAL to the number of statements. **

Input: What is the capital of France?

Statements: ['The capital of France', 'is Paris.']

JSON:

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:11434
DEBUG:urllib3.connectionpool:http://localhost:11434/ "POST /api/generate HTTP/1.1" 200 None
DEBUG:__main__:Generated response: Here is the JSON output based on the input and statements provided:

{ "verdicts": [ { "verdict": "idk", "reason": "" }, { "verdict": "yes" } ] }

Explanation for each verdict:

  1. The 'The capital of France' statement is a partial answer, it does not directly address the question of what the capital of France is, so the verdict is 'idk'.
  2. The 'is Paris.' statement directly answers the question, providing the capital of France as Paris, so the verdict is 'yes'.

DEBUG:__main__:Generating response for prompt: Given the answer relevancy score, the list of reasons of irrelevant statements made in the actual output, and the input, provide a CONCISE reason for the score. Explain why it is not higher, but also why it is at its current score. The irrelevant statements represent things in the actual output that is irrelevant to addressing whatever is asked/talked about in the input. If there is nothing irrelevant, just say something positive with an upbeat encouraging tone (but don't overdo it otherwise it gets annoying).

IMPORTANT: Please make sure to only return in JSON format, with the 'reason' key providing the reason. Example JSON: { "reason": "The score is because ." }

Answer Relevancy Score: 1.00

Reasons why the score can't be higher based on irrelevant statements in the actual output: []

Input: What is the capital of France?

JSON:

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:11434
DEBUG:urllib3.connectionpool:http://localhost:11434/ "POST /api/generate HTTP/1.1" 200 1596
DEBUG:__main__:Generated response: { "reason": "The score is 1.00 because all relevant information is present and there are no irrelevant statements made in response to your question about the capital of France." }

INFO:__main__:Answer relevancy score: None
ERROR:__main__:AnswerRelevancyMetric returned None. This might indicate an internal error.

Desktop (please complete the following information):

Linux, Python notebook


penguine-ip commented 2 months ago

Hey @cpolcino, when does it show None? It shows the score is 1 based on what you pasted.

cpolcino commented 2 months ago

Thank you for the answer @penguine-ip. I think the score of 1 is from the debug phase; if you look at the last two lines of the output, you can see that the answer relevancy returned None in the main. Do you agree?
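One more thing from the output above: for the verdicts step the model returned extra prose around the JSON ("Here is the JSON output..." plus the "Explanation for each verdict" section). I don't know whether that is what breaks the score computation, but in case it is, a possible workaround (only a sketch; extract_json is a helper made up for illustration, not part of deepeval) would be to trim the raw Ollama response down to the outermost JSON object before returning it from generate():

import json

def extract_json(text: str) -> str:
    # Naive helper: keep only the substring between the first '{' and the
    # last '}' and check that it parses; fall back to the original text
    # if no valid JSON object is found.
    start = text.find("{")
    end = text.rfind("}")
    if start != -1 and end > start:
        candidate = text[start:end + 1]
        try:
            json.loads(candidate)
            return candidate
        except json.JSONDecodeError:
            pass
    return text

# inside CustomLlama3_1.generate(), after reading the Ollama response:
# generated_text = extract_json(response.json()["response"])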