explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

How to evaluate the json data #1649

Closed landhu closed 1 week ago

landhu commented 1 week ago

[ ] I checked the documentation and related resources and couldn't find an answer to my question.

How do I evaluate JSON data?

  1. My RAG system supports charts.
  2. The API response looks like the one below, and the UI converts it into a chart:

{
  "id": "4db70ba6-2085-45ac-aefb-8aa211961806",
  "answer": {
    "chart": {
      "chart": { "type": "column" },
      "series": [
        { "data": [915, 938, 853], "name": "Offline Devices" }
      ],
      "title": { "text": "Offline Devices by Type" },
      "xAxis": {
        "categories": ["DELL", "HP", "Others"],
        "title": { "text": "Type" }
      },
      "yAxis": {
        "min": 0,
        "title": { "text": "Number of Offline Devices" }
      }
    }
  },
  "content": [],
  "timestamp": "2024-11-09T00:35:54.389+00:00",
  "type": ["chart"]
}

Additional context: I extract the answer JSON, dump it to a string, and then evaluate that string. But even when the numbers are wrong, the results (answer_correctness, answer_similarity) are still high.
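
Roughly, my flattening step looks like this (a simplified sketch; the variable names are illustrative):

import json

# Abbreviated version of the API payload shown above
api_response = {
    "answer": {
        "chart": {
            "series": [{"data": [915, 938, 853], "name": "Offline Devices"}],
            "title": {"text": "Offline Devices by Type"},
        }
    }
}

# Dump the chart JSON to a plain string and pass it as the
# answer/response field when running the ragas metrics
answer_str = json.dumps(api_response["answer"]["chart"])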

Please help. Very urgent.

sahusiddharth commented 1 week ago

Hi @landhu, the high values for answer_correctness and answer_similarity make sense because of how they are calculated:

  1. Answer correctness checks whether the statements in the AI response match the correct answer (ground truth). It classifies each statement as:
    • True Positive (TP): the statement appears in the response and is supported by the correct answer.
    • False Positive (FP): the statement appears in the response but is not supported by the correct answer.
    • False Negative (FN): the statement appears in the correct answer but is missing from the response.
    The factual score computed from these counts (an F1 score) is then blended with answer similarity to produce the final answer_correctness value.
  2. Answer similarity measures how similar the AI response is to the correct answer using cosine similarity between their embeddings. Even if the numbers are wrong, if the overall meaning of the response is close to the correct answer, it will still get a high similarity score.

So, if the AI's response is close to the correct answer in meaning, both metrics will score high even when details like the numbers are wrong. The sketch below shows roughly how the two signals are combined.
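
As a rough illustration, here is a minimal sketch of the blending, assuming the default weighting of about 0.75 for factuality and 0.25 for similarity (the counts and the similarity value are made up):

tp, fp, fn = 3, 1, 1   # statement counts produced by the LLM judge
similarity = 0.95      # cosine similarity of response vs. ground-truth embeddings

# F1 over the statement counts gives the factuality score
f1 = tp / (tp + 0.5 * (fp + fn)) if tp else 0.0

# Weighted blend (assumed defaults: 0.75 factuality, 0.25 similarity)
answer_correctness = 0.75 * f1 + 0.25 * similarity
# f1 = 0.75, answer_correctness = 0.8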

sahusiddharth commented 1 week ago

In your case, it would make sense to design a custom evaluation approach that focuses more on matching specific parts of the response—like titles, text, and values—rather than relying solely on general metrics like answer correctness or answer similarity. Here's how you might approach it:
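
For example, a minimal sketch that parses the chart payload and checks only the fields that matter, the title text, the axis categories, and the series values (the field paths mirror your chart JSON; the equal-weight scoring is just an assumption):

def score_chart_response(response: dict, ground_truth: dict) -> float:
    """Score only the chart fields we care about: title, categories, values."""
    resp = response["answer"]["chart"]
    truth = ground_truth["answer"]["chart"]

    checks = [
        resp["title"]["text"] == truth["title"]["text"],
        resp["xAxis"]["categories"] == truth["xAxis"]["categories"],
        [s["data"] for s in resp["series"]] == [s["data"] for s in truth["series"]],
    ]
    # Equal weight per check; wrong numbers now pull the score down.
    return sum(checks) / len(checks)

With a payload like yours, a response that gets the title and categories right but one number wrong would score about 0.67 instead of the near-perfect answer_similarity you are seeing.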

landhu commented 1 week ago

@sahusiddharth thanks very much. But I have two more questions:

sahusiddharth commented 1 week ago

"for numeric data, so I need to modify ragas's answer correctness prompts/INSTRUCTIONS, right?"

Yes, you can change the correctness prompt, but I am not sure the LLM and embeddings would be able to accurately capture the exact values.

I was thinking more along the lines of element-wise dictionary matching.

Something like this:

response = { "data": [ 915, 938, 853 ]}
ground_truth =  { "data": [ 915, 938, 823 ]}

binary_comparision = ground_truth  == response
# 0

response = { "data": [ 915, 938, 853 ]}
ground_truth = { "data": [ 915, 938, 823 ]}

accurate_capture = 0
total_elements = len(ground_truth["data"])

# Element-wise comparison
for i in range(total_elements):
    if ground_truth["data"][i] == response["data"][i]:
        accurate_capture += 1

# Calculate the accuracy percentage
accuracy_percentage = (accurate_capture / total_elements) * 100
# 66.67 (2 of 3 values match)
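
If you want to extend that to charts with several series, something like the helper below could work (a sketch; the function name and the choice to count missing positions as misses are assumptions on my part):

def series_accuracy(response_series: list, truth_series: list) -> float:
    """Fraction of numeric values that match, position by position, across all series."""
    matches, total = 0, 0
    for resp, truth in zip(response_series, truth_series):
        truth_data = truth.get("data", [])
        resp_data = resp.get("data", [])
        total += len(truth_data)
        # Positions missing from the response simply count as misses
        matches += sum(
            1 for i, value in enumerate(truth_data)
            if i < len(resp_data) and resp_data[i] == value
        )
    return matches / total if total else 0.0

# e.g. series_accuracy(response["answer"]["chart"]["series"],
#                      ground_truth["answer"]["chart"]["series"])
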
github-actions[bot] commented 1 week ago

It seems the issue was answered, closing this now.