explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

If I am using a local RAG, which context am I supposed to use during evaluation? #1337

Closed zeoxleh closed 1 hour ago

zeoxleh commented 2 hours ago

Your Question

From my understanding, the "create your own test set" documentation uses OpenAI models and embeddings to create the question/context/ground-truth triples. In this case, the contexts are obtained using OpenAI embeddings.

When I use my own RAG pipeline, I use GenAI embeddings, which will provide different contexts.

So I'm just wondering: during evaluation, should I include the contexts retrieved by my own RAG pipeline, or the contexts obtained during test set generation?

Or maybe I am misunderstanding something; if so, please do correct me. Thank you.

dosubot[bot] commented 2 hours ago

Hey @zeoxleh! I'm here to help you with any bugs, questions, or contributions. If you need assistance, feel free to ask!

You should use the contexts generated by your RAG pipeline for evaluation. The document describes the process of building a RAG pipeline and evaluating it using the contexts retrieved by the pipeline itself. This approach ensures that the evaluation metrics reflect the actual performance of your RAG system in retrieving and generating relevant content.

Supporting evidence from the document:

## The Data

For this tutorial, we'll use an example dataset from one of the baselines we created for the [Amnesty QA](https://huggingface.co/datasets/explodinggradients/amnesty_qa) dataset. The dataset contains the following columns:

- question: `list[str]` - These are the questions your RAG pipeline will be evaluated on.
- context: `list[list[str]]` - The contexts which were passed into the LLM to answer the question.
- ground_truth: `list[str]` - The ground truth answers to the questions.
## Metrics

Ragas provides several metrics to evaluate various aspects of your RAG systems:

1. Retriever: Offers `context_precision` and `context_recall` that measure the performance of your retrieval system.
2. Generator (LLM): Provides `faithfulness` that measures hallucinations and `answer_relevancy` that measures how relevant the answers are to the question.

This indicates that the evaluation is performed using the contexts retrieved by the RAG pipeline built with the specified embeddings [1].
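As a rough illustration (not part of the original docs excerpt), here is a minimal sketch of what that evaluation call can look like, assuming the ragas 0.1-style column names (`question`, `answer`, `contexts`, `ground_truth`); the placeholder rows stand in for questions and ground truths from your generated test set combined with answers and contexts produced by your own pipeline:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)

# Placeholder rows: question/ground_truth come from test set generation,
# answer/contexts come from *your own* RAG pipeline, not from the generator.
data = {
    "question": ["What rights were violated in the report?"],
    "answer": ["The report documents violations of the right to a fair trial."],
    "contexts": [["Excerpt retrieved by your own retriever ..."]],
    "ground_truth": ["Violations of the right to a fair trial were documented."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(result)
```

Depending on your ragas version, `evaluate()` also accepts `llm=` and `embeddings=` arguments if you want the metrics themselves computed with local models instead of the OpenAI defaults.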

To continue talking to Dosu, mention @dosu.

zeoxleh commented 2 hours ago

I see.

So effectively, the question and ground-truth portion of my test set is obtained from the "create your own test set" documentation, while the answer and context portion is obtained from my own RAG pipeline?

dosubot[bot] commented 1 hour ago

Yes, the question and ground truth portion of your test set should be obtained from the "create your own test set" documentation, while the answer and context portion should be obtained from your own RAG pipeline [1][2].
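To make the split concrete, here is a small sketch (not from the original thread) of how the two sources are combined; `my_rag_pipeline`, `testset_questions`, and `testset_ground_truths` are hypothetical placeholders for your own RAG stack and the output of the test set generation step:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, faithfulness, answer_relevancy

# question / ground_truth: from the "create your own test set" step (hypothetical names below)
# answer / contexts: produced by running *your own* pipeline over those questions
rows = {"question": [], "ground_truth": [], "answer": [], "contexts": []}
for question, ground_truth in zip(testset_questions, testset_ground_truths):
    answer, contexts = my_rag_pipeline(question)  # contexts retrieved by your retriever
    rows["question"].append(question)
    rows["ground_truth"].append(ground_truth)
    rows["answer"].append(answer)
    rows["contexts"].append(contexts)  # list[str] for each question

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
```

This way the retrieval metrics score the contexts your retriever actually returned, and the generation metrics score the answers your LLM actually produced.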

To continue talking to Dosu, mention @dosu.