Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing for ML & LLM systems
https://docs.giskard.ai
Apache License 2.0

RAG Metrics parse_json_output Key Error #2030

Closed: AidanNell closed this issue 1 week ago

AidanNell commented 1 month ago

Issue Type

Bug

Source

source

Giskard Library Version

2.5.1

OS Platform and Distribution

No response

Python version

3.9.11

Installed python packages

No response

Current Behaviour?

The expected keys seem to exist in the output when evaluating the metric; however, they are nested under a `response` or `answer` key, which prevents them from being found at the top level, so an error is thrown.

It looks like the example output format isn't being followed:

CORRECTNESS_FALSE_EXAMPLE_OUTPUT = (
    """{"correctness": false, "correctness_reason": "The capital of Denmark is Copenhagen, not Paris."}"""
)

Instead, you get the following:

{'response': {'correctness': True, 'correctness_reason': '', 'explanation': 'Climate services can contribute to reducing vulnerability and exposure of human systems by providing accurate and timely information on climate-related risks and opportunities. For example, climate services can help farmers make informed decisions about when to plant and harvest crops based on weather patterns, reducing the risk of crop failure. They can also help city planners design infrastructure that is resilient to extreme weather events, such as floods and heatwaves. By reducing vulnerability and exposure, climate services can help communities adapt to the impacts of climate change and build more sustainable and resilient societies.'}}
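
Until this is fixed upstream, a possible workaround is to lift the nested keys to the top level before Giskard validates them. The monkeypatch below is an unofficial sketch: it assumes `parse_json_output` keeps the signature shown in the traceback and that `correctness.py` imports the function by name (which the traceback suggests):

```python
# Unofficial workaround sketch: unwrap one level of nesting, e.g.
# {'response': {...}} or {'answer': {...}}, before key validation.
from giskard.rag.question_generators import utils
from giskard.rag.metrics import correctness

_original_parse_json_output = utils.parse_json_output

def parse_json_output_unwrapping(raw_json, llm_client, keys=None, caller_id=None):
    # Parse without key validation first, then unwrap if the keys are nested.
    parsed = _original_parse_json_output(raw_json, llm_client, keys=None, caller_id=caller_id)
    if keys is not None and any(k not in parsed for k in keys):
        for value in parsed.values():
            if isinstance(value, dict) and all(k in value for k in keys):
                parsed = value  # lift the inner dict to the top level
                break
    if keys is not None and any(k not in parsed for k in keys):
        raise ValueError(f"Keys {keys} not found in the JSON output: {parsed}")
    return parsed

# Patch both the defining module and the metric module, since the metric
# imports the function by name.
utils.parse_json_output = parse_json_output_unwrapping
correctness.parse_json_output = parse_json_output_unwrapping
```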

Standalone code OR list down the steps to reproduce the issue

from langchain_openai.chat_models import AzureChatOpenAI
from llama_index.core.llms import ChatMessage, MessageRole  # llama_index >= 0.10 import path

azure_model = AzureChatOpenAI(...)

# Pass the LLM to the chat engine (`index` is an existing llama_index index)
chat_engine = index.as_chat_engine(llm=azure_model, chat_mode="context")

def answer_fn(question, history=None):
    if history:
        # Convert the history dicts into llama_index ChatMessage objects
        chat_history = [
            ChatMessage(
                role=MessageRole.USER if msg["role"] == "user" else MessageRole.ASSISTANT,
                content=msg["content"],
            )
            for msg in history
        ]
    else:
        chat_history = []
    return chat_engine.chat(question, chat_history=chat_history)

from giskard.rag import AgentAnswer

def get_answer_fn(question: str, history=None) -> str:
    """A function representing your RAG agent."""

    # Get the answer and the documents
    agent_output = answer_fn(question, history)

    # Following llama_index syntax, you can get the answer and the retrieved documents
    answer = agent_output.response
    documents = agent_output.source_nodes

    # Instead of returning a plain string, we return an AgentAnswer object,
    # which lets us attach the retrieved context used by the RAGAS metrics
    return AgentAnswer(
        message=answer,
        documents=documents
    )

from giskard.rag.metrics.ragas_metrics import ragas_context_recall, ragas_context_precision, ragas_faithfulness, ragas_answer_relevancy

metrics = [ragas_context_recall, ragas_context_precision]

from giskard.rag import evaluate

report = evaluate(get_answer_fn,
                  testset=testset,
                  knowledge_base=knowledge_base,
                  metrics=metrics)

Relevant log output

```shell
ValueError                                Traceback (most recent call last)
File ~\Desktop\giskard\python3.9.19\lib\site-packages\giskard\rag\metrics\correctness.py:99, in CorrectnessMetric.__call__(self, question_sample, answer)
     77     out = llm_client.complete(
     78         messages=[
     79             ChatMessage(
   (...)
     97         format="json",
     98     )
---> 99     return parse_json_output(
    100         out.content,
    101         llm_client=llm_client,
    102         keys=["correctness", "correctness_reason"],
    103         caller_id=self.__class__.__name__,
    104     )
    106 except Exception as err:

File ~\Desktop\giskard\python3.9.19\lib\site-packages\giskard\rag\question_generators\utils.py:61, in parse_json_output(raw_json, llm_client, keys, caller_id)
     60 if keys is not None and any([k not in parsed_dict for k in keys]):
---> 61     raise ValueError(f"Keys {keys} not found in the JSON output: {parsed_dict}")
     63 return parsed_dict

ValueError: Keys ['correctness', 'correctness_reason'] not found in the JSON output: {'response': {'correctness': True, 'correctness_reason': '', 'explanation': 'Climate services can contribute to reducing vulnerability and exposure of human systems by providing accurate and timely information on climate-related risks and opportunities. For example, climate services can help farmers make informed decisions about when to plant and harvest crops based on weather patterns, reducing the risk of crop failure. They can also help city planners design infrastructure that is resilient to extreme weather events, such as floods and heatwaves. By reducing vulnerability and exposure, climate services can help communities adapt to the impacts of climate change and build more sustainable and resilient societies.'}}

The above exception was the direct cause of the following exception:

LLMGenerationError                        Traceback (most recent call last)
Cell In[24], line 3
      1 from giskard.rag import evaluate
----> 3 report = evaluate(get_answer_fn,
      4                   testset=testset,
      5                   knowledge_base=knowledge_base,
      6                   metrics=metrics)

File ~\Desktop\giskard\python3.9.19\lib\site-packages\giskard\rag\evaluate.py:105, in evaluate(answer_fn, testset, knowledge_base, llm_client, agent_description, metrics)
     98         metric_name = metric.__name__
    100     for sample, answer in maybe_tqdm(
    101         zip(testset.to_pandas().to_records(index=True), model_outputs),
    102         desc=f"{metric_name} evaluation",
    103         total=len(model_outputs),
    104     ):
--> 105         metrics_results[sample["id"]].update(metric(sample, answer))
    107 report = RAGReport(testset, model_outputs, metrics_results, knowledge_base)
    108 recommendation = get_rag_recommendation(
    109     report.topics,
    110     report.correctness_by_question_type().to_dict()[metrics[0].name],
    111     report.correctness_by_topic().to_dict()[metrics[0].name],
    112     llm_client,
    113 )

File ~\Desktop\giskard\python3.9.19\lib\site-packages\giskard\rag\metrics\correctness.py:107, in CorrectnessMetric.__call__(self, question_sample, answer)
     99     return parse_json_output(
    100         out.content,
    101         llm_client=llm_client,
    102         keys=["correctness", "correctness_reason"],
    103         caller_id=self.__class__.__name__,
    104     )
    106 except Exception as err:
--> 107     raise LLMGenerationError("Error while evaluating the agent") from err

LLMGenerationError: Error while evaluating the agent
```

**Another error example**

```shell
ValueError: Keys ['correctness', 'correctness_reason'] not found in the JSON output: {'answer': {'correctness': True, 'correctness_reason': '', 'response': "The agent's answer provides a comprehensive and accurate response to the question, including specific examples of how climate services can contribute to reducing vulnerability and exposure of human systems. The answer also mentions policy mixes and integrating climate adaptation into social protection programs, which are additional ways to reduce vulnerability and exposure. Therefore, the agent's answer is correct."}}
```
henchaves commented 3 weeks ago

Hello @AidanNell, thanks for reporting this issue. It seems to be an occasional random error, in which the LLM client nested the output under an {'answer': ...} key instead of returning a JSON object with correctness and correctness_reason as its top-level keys. Usually, trying again works. Also, could you share which model you are using? If you are not using gpt-4o, we recommend switching to it, as it provides better results.
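
Since the failure is intermittent, a simple retry around the evaluation is one way to get past an occasionally malformed JSON response. Below is a minimal sketch; the `evaluate_with_retries` helper and its `max_attempts` parameter are hypothetical conveniences, not part of Giskard's API, and note that each attempt re-runs the full evaluation:

```python
from giskard.rag import evaluate

def evaluate_with_retries(answer_fn, testset, knowledge_base, metrics, max_attempts=3):
    """Hypothetical helper: retry evaluate() on transient LLM generation errors."""
    last_err = None
    for _ in range(max_attempts):
        try:
            return evaluate(
                answer_fn,
                testset=testset,
                knowledge_base=knowledge_base,
                metrics=metrics,
            )
        except Exception as err:  # LLMGenerationError wraps the parse failure
            last_err = err
    raise last_err

report = evaluate_with_retries(get_answer_fn, testset, knowledge_base, metrics)
```

If you do switch the evaluator model, recent Giskard versions expose `giskard.llm.set_llm_model("gpt-4o")` to select it; check the documentation for your installed version.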

Hello @AidanNell, thanks for reporting this issue. It seems to be just a random error, in which the LLM client appended {'answer': instead of outputting the JSON with correctness and correctness_reason as the first keys. Usually, trying again should work well. Also, could you share which model are you trying to use? If you are not using gpt-4o, we recommend you to use it, as it provides better results.