The discrepancy you're noticing might be due to the specific implementation details in the code. The `ContextRecallClassificationAnswers.parse_obj` function parses a list of dictionaries into a `ContextRecallClassificationAnswers` object. Each dictionary in the input list represents one classification answer with the keys `statement`, `reason`, and `attributed`. The expected output is a `ContextRecallClassificationAnswers` object that can be converted back into a list of dictionaries using the `dicts` method.
Here is an example of how it works:
```python
classification = ContextRecallClassificationAnswers.parse_obj(
    [
        {
            "statement": "Albert Einstein, born on 14 March 1879, was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time.",
            "reason": "The date of birth of Einstein is mentioned clearly in the context.",
            "attributed": 1,
        },
        {
            "statement": "He received the 1921 Nobel Prize in Physics for his services to theoretical physics.",
            "reason": "The exact sentence is present in the given context.",
            "attributed": 1,
        },
        {
            "statement": "He published 4 papers in 1905.",
            "reason": "There is no mention about papers he wrote in the given context.",
            "attributed": 0,
        },
        {
            "statement": "Einstein moved to Switzerland in 1895.",
            "reason": "There is no supporting evidence for this in the given context.",
            "attributed": 0,
        },
    ]
).dicts()
```
The input is a list of dictionaries, and the output of `.dicts()` is a list of dictionaries representing the parsed classification answers [1].
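To make the link to the TP/FN description in the docs concrete, here is a minimal sketch (not the library's code; the variable names are illustrative) of how the `attributed` flags above turn into a recall score, assuming the score is simply the fraction of ground-truth statements attributed to the context:

```python
# Output of .dicts() for the Einstein example above, shortened for brevity.
classification_dicts = [
    {"statement": "Albert Einstein, born on 14 March 1879, ...", "attributed": 1},
    {"statement": "He received the 1921 Nobel Prize in Physics ...", "attributed": 1},
    {"statement": "He published 4 papers in 1905.", "attributed": 0},
    {"statement": "Einstein moved to Switzerland in 1895.", "attributed": 0},
]

# TP = attributed statements, FN = unattributed statements,
# so recall = TP / (TP + FN) = attributed / total.
attributed = sum(item["attributed"] for item in classification_dicts)
context_recall = attributed / len(classification_dicts)
print(context_recall)  # 2 / 4 = 0.5
```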
Additionally, recent updates in the Ragas codebase might have affected the calculation or interpretation of the Context Recall metric. Specifically, the `ContextEntityRecall` class calculates recall based on the entities present in the ground truth and the context. Its `_compute_score` method calculates recall by taking the intersection of the entities in the context and the ground truth and dividing by the number of entities in the ground truth [2]. This might explain the differences you're seeing between the documentation and the code implementation.
@dosu Can you justify the mismatch between doc and codebase?
The mismatch between the documentation and the codebase regarding the calculation of the Context Recall metric can be justified by examining the provided code and the updated documentation. The documentation states that the Context Recall metric is estimated by calculating True Positives (TP) and False Negatives (FN) using the annotated answer and retrieved context. However, the code implementation shows that the metric is calculated based on the presence of entities in the context and ground truth.
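Written out, the two definitions being compared are (this is only a restatement of the descriptions above, with CN the set of entities in the context and GN the set of entities in the ground truth):

$$\text{context recall} = \frac{TP}{TP + FN} = \frac{|\text{ground-truth sentences attributable to the retrieved context}|}{|\text{ground-truth sentences}|}$$

$$\text{context entity recall} = \frac{|CN \cap GN|}{|GN|}$$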
Here is the relevant code snippet that performs this calculation:
```python
import typing as t

from ragas.metrics.base import MetricWithLLM


class ContextEntityRecall(MetricWithLLM):
    """
    Calculates recall based on entities present in ground truth and context.

    Let CN be the set of entities present in the context and
    GN be the set of entities present in the ground truth.
    Then we can define the context entity recall as follows:

        Context Entity Recall = | CN ∩ GN | / | GN |

    If this quantity is 1, we can say that the retrieval mechanism has
    retrieved context which covers all entities present in the ground truth,
    thus being a useful retrieval. Thus this can be used to evaluate retrieval
    mechanisms in specific use cases where entities matter, for example, a
    tourism help chatbot.
    """

    def _compute_score(
        self, ground_truth_entities: t.Sequence[str], context_entities: t.Sequence[str]
    ) -> float:
        num_entities_in_both = len(
            set(context_entities).intersection(set(ground_truth_entities))
        )
        return num_entities_in_both / (len(ground_truth_entities) + 1e-8)
```
In this code, the score is the size of the intersection between the entities found in the context and those found in the ground truth, divided by the number of ground-truth entities (with a small epsilon to avoid division by zero). This approach aligns with the description in the updated documentation but may differ from the traditional definition of context recall involving TP and FN. Therefore, the mismatch arises from the difference between the conceptual understanding of how context recall is calculated and its actual implementation in the code [1][2].
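As a quick numerical illustration of `_compute_score` (the entity lists are hypothetical and would normally be extracted by the LLM step beforehand):

```python
# Hypothetical entities extracted from the ground truth and the retrieved context.
ground_truth_entities = ["Paris", "France", "Eiffel Tower"]
context_entities = ["Paris", "France", "Seine"]

# Mirrors _compute_score: |CN ∩ GN| / |GN|, with a tiny epsilon to avoid division by zero.
num_entities_in_both = len(set(context_entities).intersection(set(ground_truth_entities)))
score = num_entities_in_both / (len(ground_truth_entities) + 1e-8)
print(round(score, 2))  # 2 / 3 ≈ 0.67: the context misses the "Eiffel Tower" entity
```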
@shahules786 can you take a look at this?
Hi @Anjalisoni99-byte, I have raised a fix for this: https://github.com/explodinggradients/ragas/pull/1173. To expand a little more on why the question is needed, there could be examples in your evaluation dataset like:
"question": "What is the capital of France?"
"ground_truth": "Paris"
"context": "France is in western.......nd Paris it's capital"
Here, by also providing the question along with the ground_truth, the LLM gets a better idea of the answer itself: it can rephrase the ground_truth to "Paris is the capital of France" and run the inference against the provided context. Check out a few of the few-shot examples here.
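For reference, here is a minimal sketch of how such a sample could be fed to the metric; the column names (`question`, `contexts`, `ground_truth`) and the `evaluate`/`context_recall` entry points follow the usual Ragas quickstart pattern, so treat the exact shape as an assumption rather than a quote from the repo:

```python
from datasets import Dataset

from ragas import evaluate
from ragas.metrics import context_recall

# Hypothetical single-row evaluation set mirroring the France example above.
data = {
    "question": ["What is the capital of France?"],
    "contexts": [["France is in western Europe and Paris is its capital."]],
    "ground_truth": ["Paris"],
}
dataset = Dataset.from_dict(data)

# With the question available, the LLM can expand the terse ground truth
# ("Paris" -> "Paris is the capital of France") before checking whether it is
# attributable to the retrieved context.
result = evaluate(dataset, metrics=[context_recall])
print(result)
```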
Hi @jjmachan and @shahules786, correct me if I am wrong:
The way I understood context recall, it only takes the context and the answer and doesn't require the question.
Have a look at the instruction prompt used in the codebase for context recall:
instruction="""Given a context, and an answer, analyze each sentence in the answer and classify if the sentence can be attributed to the given context or not. Use only "Yes" (1) or "No" (0) as a binary classification. Output json with reason."""
Check out the few-shot examples pointed to in this comment here. Not only the instruction but also the few-shot examples influence the results of LLMs.
According to the Ragas docs, Context Recall is calculated based on the ground truth and the context. But in the GitHub repo, it is using the question, the context, and the answer (the model response, not the ground truth).
Can anyone explain, or am I interpreting this incorrectly?