explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Context Precision Metric Binary values #1320

Open kaan9700 opened 2 weeks ago

kaan9700 commented 2 weeks ago

I have noticed that, looking at the evaluated dataset, the metric is computed individually for each data point, and the console output is then the mean of all these per-sample values. With Context Precision, the individual values (the values per data point) are always binary, i.e. 0 or 1. Is this normal? Is that the only possible range of values for individual data points, or have I made a mistake?

The whole thing can be reproduced by running the small test set from the RAGAS documentation and saving the generated dataframe as a CSV; a minimal sketch follows.
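
For reference, a minimal reproduction sketch along these lines (assuming the standard ragas evaluate API and a hypothetical single-row dataset with the usual question/contexts/answer/ground_truth columns; an LLM API key must be configured, since Context Precision is judged by an LLM):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision

# Hypothetical single-row sample; the column names mirror the docs' test set.
data = {
    "question": ["When was the first Super Bowl played?"],
    "contexts": [["The first AFL-NFL World Championship Game was played on January 15, 1967."]],
    "answer": ["It was played on January 15, 1967."],
    "ground_truth": ["The first Super Bowl was played on January 15, 1967."],
}

result = evaluate(Dataset.from_dict(data), metrics=[context_precision])
df = result.to_pandas()            # one row per data point with its individual score
df.to_csv("ragas_scores.csv", index=False)
print(result)                      # the console output is the mean over all rows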

dosubot[bot] commented 2 weeks ago

Yes, it is normal for the individual values of the Context Precision metric to always be binary (0 or 1) for each data point. This is confirmed by the ContextPrecisionVerification class, which defines the verdict field as a binary (0/1) value:

class ContextPrecisionVerification(BaseModel):
    """Answer for the verification task whether the context was useful."""

    reason: str = Field(..., description="Reason for verification")
    verdict: int = Field(..., description="Binary (0/1) verdict of verification")

Additionally, the prompt for the CONTEXT_PRECISION metric instructs the model to give a verdict of "1" if the context was useful and "0" if not:

CONTEXT_PRECISION = Prompt(
    name="context_precision",
    instruction="""Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.""",
    ...
)

The Context Precision metric evaluates whether all of the ground-truth-relevant chunks present in the contexts are ranked near the top of the retrieved results. It is computed from the question, ground_truth, and contexts, with scores ranging between 0 and 1, where higher scores indicate better precision. The relevance indicator at each rank, denoted $v_k$, is binary (0 or 1), confirming that the individual values are indeed binary [1][2].
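
Concretely, the per-sample score is aggregated from those binary verdicts roughly as $\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@k} \times v_k\right)}{\text{total number of relevant items in the top } K}$. A small sketch of that aggregation (not the library's internal code) shows why a row with a single retrieved context, or with uniform verdicts, collapses to exactly 0 or 1:

# Sketch of the aggregation above; verdicts[k-1] is the binary verdict v_k
# for the context chunk at rank k.
def context_precision_at_k(verdicts):
    total_relevant = sum(verdicts)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    seen_relevant = 0
    for k, v in enumerate(verdicts, start=1):
        if v:
            seen_relevant += 1
            score += seen_relevant / k   # Precision@k, counted only at relevant ranks
    return score / total_relevant

print(context_precision_at_k([1]))        # 1.0  -> a single context can only score 0 or 1
print(context_precision_at_k([0, 1]))     # 0.5  -> fractional scores appear with >1 context
print(context_precision_at_k([1, 0, 1]))  # ~0.83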

To continue talking to Dosu, mention @dosu.