Anjalisoni99-byte opened this issue 1 month ago
Hey @Anjalisoni99-byte! I'm here to help you with any questions or issues you have. I can assist you with bug fixes, answer questions, and guide you on how to become a contributor. Let me know how I can assist you!
The binary scores (0 or 1) for Context Precision and Context Recall are likely due to the way the metrics are calculated. Each sentence in the ground truth is checked to see if it can be attributed to the retrieved context. If a sentence can be attributed, it contributes positively (1), otherwise negatively (0). This binary decision for each sentence results in the overall score being an average of these binary decisions, which can sometimes yield binary scores for small datasets or specific cases.
For example, if all sentences in the ground truth can be attributed to the retrieved context, the score will be 1. If none can be attributed, the score will be 0. For mixed cases, the score will be a fraction between 0 and 1.
Here's the relevant formula and example from the document:
\text{context recall} = \frac{|\text{GT sentences that can be attributed to context}|}{|\text{Number of sentences in GT}|}
Example calculation:
Step 1: Break the ground truth answer into individual statements.
Step 2: For each of the ground truth statements, verify if it can be attributed to the retrieved context.
Step 3: Use the formula depicted above to calculate context recall.
\text{context recall} = \frac{1}{2} = 0.5
In this example, the score is 0.5, but if all statements were attributable, the score would be 1, and if none were, it would be 0.
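The calculation above can be sketched in a few lines of Python. This is a simplified illustration, not the actual ragas implementation: in ragas the attribution check for each ground-truth sentence is performed by an LLM, whereas here the per-sentence verdicts are passed in directly as 0/1 flags.

```python
# Sketch of context recall as an average of binary per-sentence verdicts.
# In ragas the verdicts come from an LLM judging attribution; here they
# are supplied directly for illustration.

def context_recall(gt_sentences, verdicts):
    """verdicts[i] is 1 if gt_sentences[i] can be attributed to the
    retrieved context, else 0."""
    if not gt_sentences:
        return 0.0
    return sum(verdicts) / len(gt_sentences)

# Two ground-truth sentences, only one attributable -> 0.5
print(context_recall(["s1", "s2"], [1, 0]))  # 0.5

# A single-sentence ground truth can only ever score 0 or 1
print(context_recall(["only sentence"], [1]))  # 1.0
```

This also shows why small samples tend to produce extreme scores: with one or two ground-truth sentences, the average of binary verdicts can only land on 0, 0.5, or 1.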
Additionally, the Context Precision metric in the RAGAS framework uses a binary verdict (0 or 1) to determine if the context was useful in arriving at the given answer. This binary decision is based on whether the context includes key information relevant to the answer. The overall score is then calculated as an average of these binary decisions, which can sometimes result in binary scores for small datasets or specific cases [1][2][3][4].
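The averaging described above can be sketched the same way: one binary usefulness verdict per retrieved chunk, averaged into the final score. This is a simplified illustration of the point being made in this thread (in ragas the verdicts again come from an LLM), and it makes the single-chunk behaviour discussed below obvious.

```python
# Sketch of context precision as an average of per-chunk binary verdicts.
# With a single retrieved chunk, the only possible scores are 0 and 1.

def context_precision(verdicts):
    """verdicts: one 0/1 usefulness flag per retrieved chunk."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

print(context_precision([1]))        # single chunk: 1.0 (or 0.0)
print(context_precision([1, 0, 1]))  # multiple chunks: 0.666...
```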
@shahules786 can you help out here?
@Anjalisoni99-byte Which models are you using for the evaluations?
@jjmachan I am using GPT-3.5 for evaluation.
Hey @Anjalisoni99-byte, are you providing more than one chunk as context in your samples? If not, then according to the formula the score can only be 0 or 1.
Please close the issue if this is the case.
Hello @shahules786, we have a retriever that collects the top 5 documents and passes them as context for QA. So yes, we are providing more than one chunk as context.
@Anjalisoni99-byte Are you using any tracing tool to validate the evaluations? Can you share a sample data point as a JSON file for me to inspect? Also, please ensure that you're using the latest version of ragas.
Hello @shahules786 and @jjmachan, I was wondering if we could break the context into smaller, simpler sentences as part of context precision and recall and then do the evaluation, similar to what has been done with faithfulness?
Do you think we can do that?
@sahusiddharth Sure, then you'll also have to think about an ideal way to break up the context so that it won't affect the results in a negative way. This process can be done prior to sending the chunks into ragas to evaluate.