explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

answer_correctness giving inconsistent result #1049

Open chboudry opened 2 weeks ago

chboudry commented 2 weeks ago

Describe the bug

The answer_correctness metric does not seem to give consistent results.

Code to Reproduce

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_correctness

    question = ["1", "2", "3", "4", "5"]
    answers = ["yes", "no", "no", "yes", "no"]
    contexts = [["1", "2", "3", "4"], ["1", "2", "3", "4"], ["1", "2", "3", "4"],
                ["1", "2", "3", "4"], ["1", "2", "3", "4"]]
    ground_truth = ["yes", "no", "yes", "no", "no"]

    ds = Dataset.from_dict({
        "question": question,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truth,
    })

    # `metrics` was not defined in the original snippet; presumably it contains
    # the metric under discussion, answer_correctness.
    metrics = [answer_correctness]

    result_ragas = evaluate(ds, metrics=metrics, raise_exceptions=False)
    result_ragas

    results = result_ragas.to_pandas()
    results.head()

Result: (screenshot of the evaluation results table)

How come row 1 and row 4 do not get the same answer correctness score?

dosubot[bot] commented 2 weeks ago

Hey @chboudry! I'm here to assist you with any bugs, questions, or contributions. Let's tackle this issue together!

The difference in the "answer_correctness" value between row 1 and row 4, even though the rows contain similar data, can be attributed to several factors:

  1. F1 Score Calculation: The F1 score is computed from the statements classified as true positives (TP), false positives (FP), and false negatives (FN). This classification is performed by an LLM, so it can vary between runs even for identical inputs, and a single statement changing category changes the F1 score.

  2. Answer Similarity Score: If the weight assigned to semantic similarity is not zero, an embedding-based answer similarity score is also factored in. Differences in the measured similarity between the answers and the ground truth affect the final score.

  3. Weighted Average: The final "answer_correctness" score is a weighted average of the F1 score and the similarity score. Variations in either of these components can lead to different final scores.

  4. Factual and Semantic Similarity: Answer correctness encompasses both factual and semantic similarity between the generated answer and the ground truth. Differences in either of these aspects can lead to variations in the final score [1][2].

Therefore, even if the data appears similar, differences in the classification of statements, the calculation of similarity scores, and the weighted combination of these factors can result in different "answer_correctness" values for row 1 and row 4.
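To make the combination concrete, here is a minimal sketch (not the ragas implementation) of how a statement-level F1 and a semantic similarity score are blended into a final value. The `statement_f1` and `answer_correctness_score` helpers are illustrative names, and the 0.75/0.25 split is assumed to mirror ragas's default, user-configurable weights; the TP/FP/FN counts come from an LLM judge, which is why the score can fluctuate between otherwise identical rows.

    # Illustrative sketch only; helper names and default weights are assumptions,
    # not the actual ragas API.

    def statement_f1(tp: int, fp: int, fn: int) -> float:
        """F1 over LLM-classified statements (TP/FP/FN counts)."""
        denom = tp + 0.5 * (fp + fn)
        return tp / denom if denom > 0 else 0.0

    def answer_correctness_score(tp: int, fp: int, fn: int, similarity: float,
                                 weights: tuple = (0.75, 0.25)) -> float:
        """Weighted average of the factuality F1 and the semantic similarity score."""
        w_fact, w_sim = weights
        return (w_fact * statement_f1(tp, fp, fn) + w_sim * similarity) / (w_fact + w_sim)

    # Same answer, same ground truth, same similarity: if the LLM judge classifies
    # the single statement as TP on one run but as FP/FN on another, the scores diverge.
    print(answer_correctness_score(tp=1, fp=0, fn=0, similarity=0.9))  # ~0.975
    print(answer_correctness_score(tp=0, fp=1, fn=1, similarity=0.9))  # ~0.225

With such a weighting, a single statement flipping category moves the score far more than any plausible change in the similarity component.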

To continue talking to Dosu, mention @dosu.