Open Simeon-Krah opened 4 weeks ago
A similar issue was discussed in a closed thread: metrics reproducibility: improve answer_correctness [1].
The poor score in the `answer_correctness` metric, despite good scores in `context_recall`, `context_precision`, and `faithfulness`, can be attributed to the classification of statements into True Positives (TP), False Positives (FP), and False Negatives (FN). The `answer_correctness` metric evaluates both the factuality and the semantic similarity of the answer compared to the ground truth. If the answer contains statements that are not directly supported by the ground truth (FP), or misses statements that are present in the ground truth (FN), the `answer_correctness` score will suffer.
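For intuition, here is a simplified sketch of that TP/FP/FN partition. Note this is not the actual Ragas implementation, which uses an LLM to decompose answers into statements and judge support; plain set operations are used here purely to illustrate the definitions:

```python
# Simplified sketch of the TP/FP/FN partition behind answer_correctness.
# Ragas performs this classification with an LLM over decomposed statements;
# exact set operations are used here only to illustrate the three buckets.
answer_statements = {
    "The Eiffel Tower is in Paris",
    "It was completed in 1887",          # wrong year, so not in ground truth
}
ground_truth_statements = {
    "The Eiffel Tower is in Paris",
    "It was completed in 1889",
}

tp = answer_statements & ground_truth_statements   # claims supported by ground truth
fp = answer_statements - ground_truth_statements   # unsupported claims in the answer
fn = ground_truth_statements - answer_statements   # ground-truth facts the answer missed

print(len(tp), len(fp), len(fn))  # 1 1 1
```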
Here is the relevant code snippet from the `AnswerCorrectness` class that shows how the metric is computed:
```python
def _compute_statement_presence(
    self, prediction: AnswerCorrectnessClassification
) -> float:
    tp = len(prediction.TP)
    fp = len(prediction.FP)
    fn = len(prediction.FN)
    score = tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0
    return score
```
This function calculates an F1-style score from the number of TP, FP, and FN statements. Even if the context-related metrics are high, the presence of FP and FN statements will lower the `answer_correctness` score [2].
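To see concretely why a few misclassified statements hurt, here is the same formula applied to small numbers (a standalone re-implementation mirroring the snippet above, not the Ragas code itself):

```python
def statement_presence_score(tp: int, fp: int, fn: int) -> float:
    """F1-style score: TP weighed against half of (FP + FN)."""
    return tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0.0

# All 4 answer statements supported, nothing missed -> perfect score
print(statement_presence_score(tp=4, fp=0, fn=0))  # 1.0

# Same 4 TPs, but 2 unsupported claims and 2 missed facts -> 4 / 6 ≈ 0.667
print(statement_presence_score(tp=4, fp=2, fn=2))  # 0.6666...
```

So an answer that "sounds similar" to the ground truth but adds unsupported claims or omits facts is penalized twice, through both FP and FN, which explains a low `answer_correctness` alongside strong context metrics.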
To improve the `answer_correctness` score, consider the following best practices:

- Refine the `CORRECTNESS_PROMPT` and `CORRECTNESS_INSTRUCTIONS` to provide clearer and more detailed guidance for the classification task.
- Ensure the `AnswerSimilarity` object is properly initialized and used to compute the similarity score.

By following these practices, you can improve the `answer_correctness` score in Ragas, ensuring more accurate and reliable evaluation of your RAG pipelines.
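For additional intuition: the final `answer_correctness` value combines the factuality (F1) score with the embedding-based similarity score as a weighted average. In Ragas 0.1.x the default weights are `[0.75, 0.25]` (factuality first); verify against your installed version. A minimal sketch of that combination:

```python
def answer_correctness_score(
    factuality: float, similarity: float, weights=(0.75, 0.25)
) -> float:
    # Weighted average of the statement-level F1 score and the
    # embedding-based semantic similarity. The 0.75/0.25 split is the
    # documented default in ragas 0.1.x; this helper is an illustration,
    # not the library's implementation.
    w_fact, w_sim = weights
    return (w_fact * factuality + w_sim * similarity) / (w_fact + w_sim)

# High semantic similarity cannot fully compensate for low factuality:
print(answer_correctness_score(factuality=0.4, similarity=0.95))  # 0.5375
```

If semantic similarity matters more for your use case, Ragas lets you construct the metric with custom weights (e.g. `AnswerCorrectness(weights=[0.6, 0.4])`); check the API of the version you have installed.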
To continue talking to Dosu, mention @dosu.
@Simeon-Krah can you show me a couple of examples in the following format?
[ ] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug The answer generated by my model is similar to the ground_truths, but my answer_correctness score is very poor, even though I have good scores in the other metrics like context_recall, context_precision and faithfulness.
Ragas version: 0.1.18 Python version: 3.12
Code to Reproduce

```python
scores = evaluate(
    eval_dataset,
    metrics=[context_recall, context_precision, faithfulness, answer_correctness],
    llm=llm,
    embeddings=GoogleGenerativeAIEmbeddings(
        google_api_key="<REDACTED>",  # API key removed from the report
        model="models/embedding-001",
    ),
)
```
Error trace No error but poor answer correctness metric score
Expected behavior To get a good score for the answer_correctness like I got for the others
Additional context Add any other context about the problem here.