explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Why are Context Precision and Context Recall yielding a binary score instead of a range between 0 and 1? #1167

Open Anjalisoni99-byte opened 1 month ago

Anjalisoni99-byte commented 1 month ago
[Screenshot attached: 2024-08-05 124328]
dosubot[bot] commented 1 month ago

Hey @Anjalisoni99-byte! I'm here to help you with any questions or issues you have. I can assist you with bug fixes, answer questions, and guide you on how to become a contributor. Let me know how I can assist you!

The binary scores (0 or 1) for Context Precision and Context Recall are likely due to the way the metrics are calculated. Each sentence in the ground truth is checked to see whether it can be attributed to the retrieved context: if it can, it counts as 1; if not, it counts as 0. The overall score is the average of these per-sentence binary decisions, so it can come out as exactly 0 or 1 when the ground truth contains only a few sentences or when every sentence falls the same way.

For example, if all sentences in the ground truth can be attributed to the retrieved context, the score will be 1. If none can be attributed, the score will be 0. For mixed cases, the score will be a fraction between 0 and 1.

Here's the relevant formula and example from the document:

$$\text{context recall} = \frac{|\text{GT sentences that can be attributed to context}|}{|\text{number of sentences in GT}|}$$

Example calculation:
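
As an illustration (the counts here are hypothetical, chosen only to show how a fractional score arises), suppose the ground truth contains 4 sentences and 2 of them can be attributed to the retrieved context:

$$\text{context recall} = \frac{2}{4} = 0.5$$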

In this example, the score is 0.5, but if all sentences were attributable, the score would be 1, and if none were, it would be 0.

Additionally, the Context Precision metric in the RAGAS framework uses a binary verdict (0 or 1) for each retrieved context chunk to determine whether that chunk was useful in arriving at the given answer, based on whether it contains key information relevant to the answer. The overall score is an average over these per-chunk verdicts, so it can come out as exactly 0 or 1 when only one chunk is provided or when every chunk receives the same verdict [1][2][3][4].
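
To see why a small number of chunks collapses the score to 0 or 1, here is a simplified sketch of averaging binary verdicts (this is an illustration of the idea described above, not the exact ragas implementation; `context_precision_sketch` is a hypothetical helper):

```python
# Simplified sketch: one binary verdict per retrieved chunk, averaged into a score.
from typing import List

def context_precision_sketch(verdicts: List[int]) -> float:
    """verdicts[k] is 1 if chunk k was judged useful for the answer, else 0."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

print(context_precision_sketch([1]))               # single chunk -> only 0.0 or 1.0 possible
print(context_precision_sketch([1, 0, 1, 0, 0]))   # five chunks  -> 0.4
```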

To continue talking to Dosu, mention @dosu.

jjmachan commented 1 month ago

@shahules786 can you help out here?

@Anjalisoni99-byte which models are you using for evaluation?

Anjalisoni99-byte commented 1 month ago

@jjmachan I am using GPT-3.5 for evaluation.

shahules786 commented 1 month ago

Hey @Anjalisoni99-byte, are you providing more than one chunk as context in your samples? If not, the formula can only yield 0 or 1.

Please close the issue if this is the case.

Anjalisoni99-byte commented 1 month ago

Hello @shahules786, we have a retriever that collects the top 5 documents and passes them as context for QA. So yes, we are providing more than one chunk as context.
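
One thing worth double-checking here: if those 5 documents are joined into a single string before evaluation, ragas will still see only one chunk. Below is a minimal sketch of passing them as separate chunks, using the column names from the ragas 0.1.x quickstart (the question, answer, and document texts are made up, and an evaluation LLM must be configured for this to actually run):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

sample = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    # One list entry per retrieved document: five separate chunks, not one joined string.
    "contexts": [[
        "Paris is the capital and most populous city of France.",
        "France is a country in Western Europe.",
        "The Eiffel Tower is located in Paris.",
        "Lyon is the third-largest city in France.",
        "The Louvre is the world's most-visited museum.",
    ]],
    "ground_truth": ["Paris is the capital of France."],
}

result = evaluate(Dataset.from_dict(sample), metrics=[context_precision, context_recall])
print(result)
```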

shahules786 commented 1 month ago

@Anjalisoni99-byte Are you using any tracing tool to validate the evaluations? Can you share a sample data point as a JSON file for me to inspect? Also, please make sure you're using the latest version of ragas.

sahusiddharth commented 1 month ago

Hello @shahules786 and @jjmachan, I was wondering if we could break the context into smaller, simpler sentences as part of context precision and recall and then do the evaluation, similar to what has been done with faithfulness.

Do you think we can do that?

shahules786 commented 1 month ago

@sahusiddharth Sure, but then you'll also have to think about an ideal way to break up the context so that it doesn't negatively affect the results. This pre-processing can be done before sending the chunks into ragas for evaluation.
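
For example, a pre-processing step along these lines could be run before building the evaluation dataset (the splitter below is a naive, hypothetical helper; a proper sentence splitter such as spaCy or nltk would be preferable on real data):

```python
import re
from typing import List

def split_into_sentences(text: str) -> List[str]:
    # Naive split on sentence-ending punctuation; good enough for an illustration.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

retrieved_docs = [
    "Paris is the capital of France. It is known for the Eiffel Tower.",
    "France is in Western Europe.",
]

# Each sentence becomes its own context chunk before the data is sent to ragas.
contexts = [sentence for doc in retrieved_docs for sentence in split_into_sentences(doc)]
print(contexts)
```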