explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Metric values are very different from one evaluation to another (run after 5 minutes using the same dataset) #1125

Closed cristinastanusiu closed 1 month ago

cristinastanusiu commented 1 month ago

Hello,

I am using the RAGAS framework to evaluate a RAG pipeline and to measure the progress of RAG improvements. After several Ragas evaluations, I observe that the metric values for context precision, faithfulness, answer relevancy and context recall vary from run to run. I could not find any reasonable explanation, so I tested the RAGAS evaluation on explodinggradients/amnesty_qa, which I load as follows:

    amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
    dataset = amnesty_qa["eval"]

I see the same problem even with this static dataset: high inconsistencies between evaluation runs (made 5 minutes apart) for faithfulness, context recall and context precision. The average differences between the first and second evaluation runs are:

{
    "context_precision": "0.16666666667500019 (out of 1)",
    "answer_cosine_similarity": "1.0000001192092896 (out of 15)",
    "contexts_cosine_similarity": "1.0 (out of 15)",
    "faithfulness": "0.32894094209883684 (out of 8)",
    "context_recall": "0.5615740740740741 (out of 6)"
}

As a double-check, I compute the cosine similarity of the answers and contexts between the two evaluations. The 1.0 values are as expected.
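For context, here is a minimal sketch of this kind of cross-run embedding check (this is not the author's original script; the embedding model and the "answer" column name are assumptions):

    import numpy as np
    import pandas as pd
    from sentence_transformers import SentenceTransformer

    def mean_cosine_similarity(texts_a, texts_b, model_name="all-MiniLM-L6-v2"):
        # Embed both runs and take the row-wise cosine similarity;
        # identical texts should give values of (almost) exactly 1.0.
        model = SentenceTransformer(model_name)
        emb_a = model.encode(texts_a, normalize_embeddings=True)
        emb_b = model.encode(texts_b, normalize_embeddings=True)
        return float(np.mean(np.sum(emb_a * emb_b, axis=1)))

    run_1 = pd.read_csv("ragas_previous.csv")
    run_2 = pd.read_csv("ragas_current.csv")
    print(mean_cosine_similarity(run_1["answer"].tolist(), run_2["answer"].tolist()))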

Your Question: Why are there such big differences between two evaluation runs with the same data and the same OpenAI LLM?

Code Examples: code used for the evaluation:

import os

from datasets import Dataset
from ragas import evaluate
from ragas.run_config import RunConfig
from ragas.metrics import (
    context_precision,
    faithfulness,
    answer_relevancy,
    context_recall,
)


class RagasEvaluator:
    def __init__(self, key=None, name="RagasEvaluator"):
        self.name = name
        # Use the key passed in, or rely on an OPENAI_API_KEY already set in the environment.
        if key is not None:
            os.environ["OPENAI_API_KEY"] = key

    def eval(self, data: Dataset):
        result = evaluate(
            data,
            raise_exceptions=True,
            run_config=RunConfig(
                timeout=60,
                max_retries=10,
                max_wait=180,  # default is 60
                max_workers=1,
            ),
            metrics=[
                context_precision,
                faithfulness,
                answer_relevancy,
                context_recall,
            ],
        )
        return result.to_pandas()

from datasets import load_dataset

# loading the V2 dataset
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
dataset = amnesty_qa["eval"]

result = RagasEvaluator().eval(dataset)
result.to_csv("ragas.csv")

The two compared output files are ragas_current.csv and ragas_previous.csv.
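As an illustration only (not part of the original issue), the per-metric differences between the two CSVs could be computed roughly like this; the column names are assumed to match the metric names above:

    import pandas as pd

    previous = pd.read_csv("ragas_previous.csv")
    current = pd.read_csv("ragas_current.csv")

    # Mean absolute per-sample difference for each metric column present in both runs.
    for col in ["context_precision", "faithfulness", "answer_relevancy", "context_recall"]:
        diff = (current[col] - previous[col]).abs().mean()
        print(f"{col}: mean absolute difference = {diff:.4f}")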

Additional context: Ragas 0.1.11, Python 3.12.2

dosubot[bot] commented 1 month ago

@cristinastanusiu Hello there! I'm here to help you with any bugs, questions, or contribution-related inquiries. Let me know how I can assist you!

The significant differences between two different evaluation runs with the same data and OpenAI LLM when using the RAGAS framework can be attributed to several factors:

  1. Stochastic Nature of LLMs: Language models like OpenAI's GPT-3.5-turbo-16k and GPT-4 have inherent variability in their outputs. Even with the same input, the model might generate different outputs on different runs due to their stochastic nature.

  2. Testset Generation Variability: The process of generating test sets in RAGAS involves some randomness, especially when generating synthetic test data. This can lead to variations in the test cases used for evaluation.

  3. Evaluation Metrics Sensitivity: Metrics such as faithfulness, answer relevancy, context_precision, and context_recall can be sensitive to slight changes in the generated answers, leading to noticeable differences in the scores.

  4. Online Evaluation Variability: When using RAGAS for online evaluations, the lack of ground-truth data and the use of production traces can introduce variability. The evaluation is performed on real-time data, which can vary significantly between runs.

These factors combined can lead to significant differences in evaluation results between different runs [1][2][3][4].
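One common mitigation for point 1 (not suggested verbatim in this thread) is to pass an explicit judge LLM with temperature 0 to evaluate(). A minimal sketch, assuming the ragas 0.1.x evaluate() signature that accepts an llm argument and the dataset loaded earlier:

    from langchain_openai import ChatOpenAI
    from ragas import evaluate
    from ragas.metrics import (
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    )

    # Temperature 0 reduces (but does not eliminate) run-to-run drift of the judge LLM.
    judge_llm = ChatOpenAI(model="gpt-4o", temperature=0)

    result = evaluate(
        dataset,
        metrics=[context_precision, faithfulness, answer_relevancy, context_recall],
        llm=judge_llm,
    )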

To continue talking to Dosu, mention @dosu.

cristinastanusiu commented 1 month ago

What is the best practice to decrease this sensitivity and variability?

dosubot[bot] commented 1 month ago

The best practice to decrease the sensitivity and variability in the evaluation results when using the RAGAS framework to evaluate a RAG pipeline is to create an evaluation dataset that serves as a reference point for evaluating the RAG pipeline. This approach is highlighted in Atita Arora's tutorial on building and evaluating RAG using Qdrant and Langchain, where it is suggested to:

  1. Create an Evaluation Dataset: This dataset should be comprehensive and representative of the types of queries and documents the RAG pipeline will handle. Additionally, using synthetic test data generation can significantly reduce the time and effort required to curate high-quality and diverse data points, as mentioned in the Phoenix (Arize) tutorial. An ideal test data set should contain samples that closely mirror your real-world use case.

  2. Understand and Use RAGAS Evaluation Metrics: Familiarize yourself with the RAGAS evaluation metrics and apply them consistently to measure the performance of the RAG pipeline. Ragas provides several metrics to evaluate various aspects of your RAG systems, such as context_precision, context_recall, faithfulness, and answer_relevancy.

Supporting evidence from the documents:

Additionally, the Ragas framework employs an evolutionary generation paradigm to create diverse and complex questions from the provided documents. This approach ensures comprehensive coverage of the performance of various components within your pipeline, resulting in a more robust evaluation process. Techniques such as reasoning, conditioning, and multi-context are used to systematically craft questions of varying difficulty levels, enhancing the thoroughness of the evaluation.
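For reference, a minimal sketch of synthetic test set generation with the ragas 0.1.x testset API; the document loader, the placeholder path, and the evolution distribution shown here are assumptions, not from this thread:

    from langchain_community.document_loaders import DirectoryLoader
    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context

    # Load your own corpus; the path is a placeholder.
    documents = DirectoryLoader("data/").load()

    generator = TestsetGenerator.with_openai()
    testset = generator.generate_with_langchain_docs(
        documents,
        test_size=10,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )
    testset_df = testset.to_pandas()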


To continue talking to Dosu, mention @dosu.

jjmachan commented 1 month ago

@cristinastanusiu this is a known issue. One thing you can do is use the in_ci parameter of evaluate() (a short example follows the docstring below):

    in_ci: bool
        Whether the evaluation is running in CI or not. If set to True then some
        metrics will be run to increase the reproducibility of the evaluations. This
        will increase the runtime and cost of evaluations. Default is False.

which should help with reproducibility. I'll add @shahules786 too for his inputs.
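A minimal sketch of enabling the flag with the same metric list as in the issue (dataset being the Hugging Face split loaded above):

    from ragas import evaluate
    from ragas.metrics import (
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    )

    # in_ci=True re-runs some metrics to make scores more reproducible,
    # at the cost of extra runtime and tokens (see the docstring above).
    result = evaluate(
        dataset,
        metrics=[context_precision, faithfulness, answer_relevancy, context_recall],
        in_ci=True,
    )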

cristinastanusiu commented 1 month ago

Hi @jjmachan, thank you for your advice! Indeed, setting in_ci=True and using the GPT-4o model with temperature 0, I got lower metric differences on the amnesty_qa dataset:

{
    "context_precision": "0.1666666666083334 (out of 1)",
    "answer_cosine_similarity": "1.0 (out of 4)",
    "contexts_cosine_similarity": "1.0 (out of 4)",
    "faithfulness": "0.2564798932723461 (out of 3)"
}

As observed, there are still minor differences, but they are indeed smaller.

Q: Is there anything important to consider when running the Ragas evaluator in a CI pipeline?

jjmachan commented 1 month ago

this is one major thing, but I'll tag @shahules786, who did a lot of research on the reproducibility of LLM metrics

cristinastanusiu commented 1 month ago

Thanks, looking forward to any other insights.

shahules786 commented 1 month ago

Hi @cristinastanusiu. In short, there are a few ways to improve this:

  1. Running metrics multiple times and aggregating the results. I did some work here and introduced reproducibility parameters in a few metrics, which ensemble results from different runs. This has improved reproducibility. (A manual approximation of this idea is sketched below.)

  2. Using high-quality models like GPT-4 instead of GPT-3.5 or worse. I have observed that better models tend to arrive at the same conclusion much more often than weaker models.

  3. Aligning metrics with human evaluators. Metrics also change because many instances in a dataset are genuinely ambiguous to judge. Take faithfulness as a metric: say context = "I have a car" and output = "I have a blue car". This can be considered faithful or unfaithful at the same time; the required strictness depends on the developer and the domain. In fintech, for example, the definition of what is faithful may be stricter than in general-purpose domains. When run through an LLM, such samples give different verdicts on alternative runs because of this inherent ambiguity. We intend to introduce a method by which developers can align Ragas metrics to their domain needs after v0.2, which should also improve reproducibility.

I hope this was helpful.
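A rough external approximation of point 1 above (this is not the built-in reproducibility parameter mentioned in the comment, just a manual ensemble across runs):

    import pandas as pd
    from ragas import evaluate
    from ragas.metrics import faithfulness, context_precision, context_recall

    def averaged_scores(dataset, metrics, n_runs=3):
        # Run the same evaluation n_runs times and average the per-sample scores,
        # which smooths out run-to-run judge variability.
        runs = [evaluate(dataset, metrics=metrics).to_pandas() for _ in range(n_runs)]
        metric_cols = [m.name for m in metrics]
        stacked = pd.concat([r[metric_cols] for r in runs], keys=range(n_runs))
        return stacked.groupby(level=1).mean()

    scores = averaged_scores(dataset, [faithfulness, context_precision, context_recall])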

cristinastanusiu commented 1 month ago

Thank you for your insights, this helps me understand and explain the differences. The human metric alignment would indeed be extremely useful for the LLM evaluator.