RAG Metrics parse_json_output Key Error

Issue Type

Bug

Source

source

Giskard Library Version

2.5.1

OS Platform and Distribution

No response

Python version

3.9.11

Installed python packages

No response

Current Behaviour?

The keys seems to exist in the output when evaluating the metric, however there's a response or answer key which prevents the keys from being found. So an error is thrown.

It looks like the example output isn't being followed

CORRECTNESS_FALSE_EXAMPLE_OUTPUT = (
    """{"correctness": false, "correctness_reason": "The capital of Denmark is Copenhagen, not Paris."}"""
)

instead you get the following

{'response': {'correctness': True, 'correctness_reason': '', 'explanation': 'Climate services can contribute to reducing vulnerability and exposure of human systems by providing accurate and timely information on climate-related risks and opportunities. For example, climate services can help farmers make informed decisions about when to plant and harvest crops based on weather patterns, reducing the risk of crop failure. They can also help city planners design infrastructure that is resilient to extreme weather events, such as floods and heatwaves. By reducing vulnerability and exposure, climate services can help communities adapt to the impacts of climate change and build more sustainable and resilient societies.'}}

Standalone code OR list down the steps to reproduce the issue

from langchain_openai.chat_models import AzureChatOpenAI

azure_model = AzureChatOpenAI(...)

# Pass the LLM to the chat engine
chat_engine = index.as_chat_engine(llm=azure_model, chat_mode="context")

def answer_fn(question, history=None):
    if history:
        answer = chat_engine.chat(question, chat_history=[ChatMessage(role=MessageRole.USER if msg["role"] =="user" else MessageRole.ASSISTANT,
                                                          content=msg["content"]) for msg in history])
    else:
        answer = chat_engine.chat(question, chat_history=[])
    return answer

import json
from giskard.rag import AgentAnswer

def get_answer_fn(question: str, history=None) -> str:
    """A function representing your RAG agent."""

    # Get the answer and the documents
    agent_output = answer_fn(question, history)

    # Following llama_index syntax, you can get the answer and the retrieved documents
    answer = agent_output.response
    documents = agent_output.source_nodes

    # Instead of returning a simple string, we return the AgentAnswer object which
    # allows us to specify the retrieved context which is used by RAGAS metrics
    return AgentAnswer(
        message=answer,
        documents=documents
    )

from giskard.rag.metrics.ragas_metrics import ragas_context_recall, ragas_context_precision, ragas_faithfulness, ragas_answer_relevancy

metrics=[ragas_context_recall, ragas_context_precision]

from giskard.rag import evaluate

report = evaluate(get_answer_fn,
                  testset=testset,
                  knowledge_base=knowledge_base,
                  metrics=metrics)


### Relevant log output

```shell
ValueError                                Traceback (most recent call last)
File ~\Desktop\giskard\python3.9.19\lib\site-packages\giskard\rag\metrics\correctness.py:99, in CorrectnessMetric.__call__(self, question_sample, answer)
     77     out = llm_client.complete(
     78         messages=[
     79             ChatMessage(
   (...)
     97         format="json",
     98     )
---> 99     return parse_json_output(
    100         out.content,
    101         llm_client=llm_client,
    102         keys=["correctness", "correctness_reason"],
    103         caller_id=self.__class__.__name__,
    104     )
    106 except Exception as err:

File ~\Desktop\giskard\python3.9.19\lib\site-packages\giskard\rag\question_generators\utils.py:61, in parse_json_output(raw_json, llm_client, keys, caller_id)
     60 if keys is not None and any([k not in parsed_dict for k in keys]):
---> 61     raise ValueError(f"Keys {keys} not found in the JSON output: {parsed_dict}")
     63 return parsed_dict

ValueError: Keys ['correctness', 'correctness_reason'] not found in the JSON output: {'response': {'correctness': True, 'correctness_reason': '', 'explanation': 'Climate services can contribute to reducing vulnerability and exposure of human systems by providing accurate and timely information on climate-related risks and opportunities. For example, climate services can help farmers make informed decisions about when to plant and harvest crops based on weather patterns, reducing the risk of crop failure. They can also help city planners design infrastructure that is resilient to extreme weather events, such as floods and heatwaves. By reducing vulnerability and exposure, climate services can help communities adapt to the impacts of climate change and build more sustainable and resilient societies.'}}

The above exception was the direct cause of the following exception:

LLMGenerationError                        Traceback (most recent call last)
Cell In[24], line 3
      1 from giskard.rag import evaluate
----> 3 report = evaluate(get_answer_fn,
      4                   testset=testset,
      5                   knowledge_base=knowledge_base,
      6                   metrics=metrics)

File ~\Desktop\giskard\python3.9.19\lib\site-packages\giskard\rag\evaluate.py:105, in evaluate(answer_fn, testset, knowledge_base, llm_client, agent_description, metrics)
     98         metric_name = metric.__name__
    100     for sample, answer in maybe_tqdm(
    101         zip(testset.to_pandas().to_records(index=True), model_outputs),
    102         desc=f"{metric_name} evaluation",
    103         total=len(model_outputs),
    104     ):
--> 105         metrics_results[sample["id"]].update(metric(sample, answer))
    107 report = RAGReport(testset, model_outputs, metrics_results, knowledge_base)
    108 recommendation = get_rag_recommendation(
    109     report.topics,
    110     report.correctness_by_question_type().to_dict()[metrics[0].name],
    111     report.correctness_by_topic().to_dict()[metrics[0].name],
    112     llm_client,
    113 )

File ~\Desktop\giskard\python3.9.19\lib\site-packages\giskard\rag\metrics\correctness.py:107, in CorrectnessMetric.__call__(self, question_sample, answer)
     99     return parse_json_output(
    100         out.content,
    101         llm_client=llm_client,
    102         keys=["correctness", "correctness_reason"],
    103         caller_id=self.__class__.__name__,
    104     )
    106 except Exception as err:
--> 107     raise LLMGenerationError("Error while evaluating the agent") from err

LLMGenerationError: Error while evaluating the agent

**Another error example**
ValueError: Keys ['correctness', 'correctness_reason'] not found in the JSON output: {'answer': {'correctness': True, 'correctness_reason': '', 'response': "The agent's answer provides a comprehensive and accurate response to the question, including specific examples of how climate services can contribute to reducing vulnerability and exposure of human systems. The answer also mentions policy mixes and integrating climate adaptation into social protection programs, which are additional ways to reduce vulnerability and exposure. Therefore, the agent's answer is correct."}}

Giskard-AI / giskard

RAG Metrics parse_json_output Key Error #2030

Issue Type

Source

Giskard Library Version

OS Platform and Distribution

Python version

Installed python packages

Current Behaviour?

Standalone code OR list down the steps to reproduce the issue