explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

RAGAS Framework into CI/CD Model for RAG Pipeline #1104

Open maraevfd opened 1 month ago

maraevfd commented 1 month ago

[x] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question: How can we use RAGAS within a CI/CD model for a RAG pipeline?

Additional context: Our team is currently working on a Retrieval-Augmented Generation (RAG) pipeline to process our extensive repository of presentations and other documents. For testing purposes, we decided to try the RAGAS framework to integrate end-to-end automated tests into the pipeline and evaluate its metrics dynamically in practice. However, we haven't found any examples of using RAGAS within a test automation framework, only examples of integrating it into an existing RAG pipeline.

From your perspective, is it feasible to apply RAGAS in this manner, and are we on the right track? Or is it necessary to fully integrate RAGAS into the system to derive any benefit from it?

dosubot[bot] commented 1 month ago

Hey @maraevfd! I'm here to help you with any questions or issues you have. Let's tackle this together!

Yes, it is feasible to apply the RAGAS framework for test automation within a CI/CD model for a Retrieval-Augmented Generation (RAG) pipeline. The documentation provides a detailed guide on how to integrate RAGAS evaluations into your CI pipeline using Pytest. By setting the in_ci argument for the evaluate() function to True, you can run RAGAS metrics in a special mode that ensures more reproducible metrics, albeit at a higher cost. This allows you to continuously monitor the qualitative performance of your RAG pipeline without needing to fully integrate RAGAS into the system. Additionally, you can use Pytest markers to tag and selectively run these end-to-end tests as part of your CI process [1][2].
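
For example, here is a minimal sketch of such a Pytest test; the ragas_ci marker name, the sample data, and the score thresholds are illustrative assumptions rather than RAGAS defaults:

import pytest
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, faithfulness

@pytest.mark.ragas_ci  # custom marker (register it in pytest.ini) so CI can select these tests with `pytest -m ragas_ci`
def test_rag_pipeline_quality():
    # In practice, build this dataset by running your RAG pipeline over a fixed set of evaluation questions.
    data_samples = {
        "question": ["When was the first super bowl?"],
        "answer": ["The first superbowl was held on Jan 15, 1967"],
        "contexts": [["The First AFL-NFL World Championship Game was played on January 15, 1967."]],
        "ground_truth": ["The first superbowl was held on January 15, 1967"],
    }
    dataset = Dataset.from_dict(data_samples)

    # in_ci=True runs the metrics in a more reproducible (but costlier) mode.
    result = evaluate(dataset, metrics=[faithfulness, answer_correctness], in_ci=True)

    # Fail the build if the aggregate scores drop below the (assumed) thresholds.
    assert result["faithfulness"] >= 0.8
    assert result["answer_correctness"] >= 0.7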

To continue talking to Dosu, mention @dosu.

maraevfd commented 1 month ago

Hi @dosu,

Thank you for your response. While your guidance is appreciated, it seems we need more detailed insights into integrating RAGAS into our CI/CD pipeline effectively.

Questions:

Practical Implementation: Could you provide more practical steps or examples on how to integrate RAGAS into our pipeline? We are already using the in_ci argument in evaluate(), but we need more granular details on how to implement it correctly.

Ground Truth and Contexts: What are the best practices for obtaining Ground Truth and Contexts? Any guidelines or methodologies would be highly beneficial.

Metric Stability: We've noticed significant variability in metrics with unchanged data. What strategies can we use to stabilize these metrics and ensure more consistent results?

Your detailed insights on these aspects will greatly help us refine our implementation.

Thanks again for your assistance!

dosubot[bot] commented 1 month ago

Practical Implementation Steps

To integrate the RAGAS framework into your CI/CD pipeline for a Retrieval-Augmented Generation (RAG) pipeline, follow these steps:

  1. Installation: Install RAGAS in your environment using pip:

    pip install ragas
  2. Set Up Environment Variables: Ensure you have your OpenAI API key set up in your environment:

    import os
    os.environ["OPENAI_API_KEY"] = "your-openai-key"
  3. Prepare Your Dataset: Create a dataset with questions, answers, contexts, and ground truth:

    from datasets import Dataset
    
    data_samples = {
       'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
       'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
       'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
       ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
       'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
    }
    
    dataset = Dataset.from_dict(data_samples)
  4. Evaluate the Dataset: Use RAGAS to evaluate the dataset with desired metrics:

    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_correctness
    
    score = evaluate(dataset, metrics=[faithfulness, answer_correctness], in_ci=True)
    score.to_pandas()
  5. Integrate with CI/CD: Add the evaluation script to your CI/CD pipeline configuration (e.g., in a GitHub Actions workflow); a sketch of the referenced evaluate_ragas.py script follows this list:

    name: RAGAS Evaluation
    
    on: [push]
    
    jobs:
     evaluate:
       runs-on: ubuntu-latest
       steps:
         - uses: actions/checkout@v2
         - name: Set up Python
           uses: actions/setup-python@v2
           with:
             python-version: '3.x'
         - name: Install dependencies
           run: |
             python -m pip install --upgrade pip
             pip install ragas
          - name: Run RAGAS evaluation
            env:
              # Assumes the OpenAI key is stored as a repository secret named OPENAI_API_KEY.
              OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
            run: |
              python evaluate_ragas.py
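
The workflow above assumes an evaluate_ragas.py script at the repository root. A minimal sketch of what that script could contain is shown below; the failure thresholds are illustrative assumptions, not RAGAS defaults:

# evaluate_ragas.py -- minimal sketch for CI usage.
import sys

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, faithfulness

# In practice, build this dataset by running your RAG pipeline over a fixed
# set of evaluation questions (see step 3 for the expected columns).
data_samples = {
    "question": ["When was the first super bowl?"],
    "answer": ["The first superbowl was held on Jan 15, 1967"],
    "contexts": [["The First AFL-NFL World Championship Game was played on January 15, 1967."]],
    "ground_truth": ["The first superbowl was held on January 15, 1967"],
}
dataset = Dataset.from_dict(data_samples)

score = evaluate(dataset, metrics=[faithfulness, answer_correctness], in_ci=True)
print(score)

# Exit non-zero so the CI job fails when quality regresses below the thresholds.
if score["faithfulness"] < 0.8 or score["answer_correctness"] < 0.7:
    sys.exit(1)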

Best Practices for Obtaining Ground Truth and Contexts

  1. Ground Truth:

    • Ensure ground truth answers are accurate and verified by domain experts.
    • Use multiple sources to cross-verify the correctness of the ground truth.
  2. Contexts:

    • Collect context data from reliable and authoritative sources.
    • Ensure the context is relevant and directly related to the questions being asked.
    • Regularly update the context data to reflect the most current information (one way to keep this consistent across CI runs is sketched below).
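
One practical way to apply these points in CI (see also "Consistent Data Collection" below) is to version the expert-reviewed evaluation set alongside the code, for example as a JSON file. A minimal sketch, assuming a file named eval_set.json with question, contexts, and ground_truth fields:

import json

from datasets import Dataset

# eval_set.json (assumed name) holds the expert-reviewed records.
with open("eval_set.json", encoding="utf-8") as f:
    records = json.load(f)

dataset = Dataset.from_dict({
    "question": [r["question"] for r in records],
    "contexts": [r["contexts"] for r in records],
    "ground_truth": [r["ground_truth"] for r in records],
    # The "answer" column is added at evaluation time by running the RAG pipeline.
})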

Strategies for Stabilizing Metrics

  1. Consistent Data Collection:

    • Use a standardized process for collecting and updating datasets to maintain consistency.
    • Regularly review and clean the dataset to remove any outdated or incorrect information.
  2. Regular Evaluation:

    • Integrate regular evaluations in your CI/CD pipeline to continuously monitor performance.
    • Track metrics over time to identify trends and areas for improvement.
  3. Feedback Loop:

    • Implement a feedback loop where the evaluation results are reviewed by a team to make necessary adjustments.
    • Use the insights gained from evaluations to refine and improve the RAG pipeline.
  4. Proper Metric Initialization:

    • Properly initialize metrics with the chosen LLMs and embeddings to ensure consistent metric performance.

Here is an example of initializing metrics with OpenAI:

from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import answer_correctness, faithfulness
from ragas.metrics.base import MetricWithEmbeddings, MetricWithLLM
from ragas.run_config import RunConfig

llm = ChatOpenAI()
emb = OpenAIEmbeddings()

metrics = [faithfulness, answer_correctness]

def init_ragas_metrics(metrics, llm, embedding):
    """Attach the chosen LLM and embeddings to each metric, then initialize it."""
    run_config = RunConfig()
    for metric in metrics:
        if isinstance(metric, MetricWithLLM):
            metric.llm = llm
        if isinstance(metric, MetricWithEmbeddings):
            metric.embeddings = embedding
        metric.init(run_config)

init_ragas_metrics(
    metrics,
    llm=LangchainLLMWrapper(llm),
    embedding=LangchainEmbeddingsWrapper(emb),
)

By following these steps and best practices, you can effectively integrate the RAGAS framework into your CI/CD pipeline and ensure the stability and accuracy of your Retrieval-Augmented Generation pipeline [1][2][3][4].

To continue talking to Dosu, mention @dosu.

rafaelaraujo-hotmart commented 1 month ago

Hey @maraevfd, I solved the metric stability problem simply by lowering the temperature of the generator LLM: BaseChatOpenAI defaults to a temperature of 0.7, which makes repeated inferences from the same LLM vary considerably.

generator_llm = ChatOpenAI(model=generator_model, temperature=0.0)
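
For reference, here is a sketch of passing such a low-temperature model through to the evaluation itself; it assumes the llm and embeddings arguments of evaluate() (available in recent ragas versions), which wrap Langchain models internally:

from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import answer_correctness, faithfulness

# temperature=0.0 reduces run-to-run variance of the judge model.
judge_llm = ChatOpenAI(temperature=0.0)
embeddings = OpenAIEmbeddings()

score = evaluate(
    dataset,  # the evaluation Dataset built earlier in this thread
    metrics=[faithfulness, answer_correctness],
    llm=judge_llm,
    embeddings=embeddings,
    in_ci=True,
)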

jjmachan commented 1 month ago

@maraevfd may I suggest a call sometime so that I can understand your situation a bit better and suggest some steps that our current users are following? This is something we are adding back into the core library to make it easier, so your input will help us a lot as well 🙂

@rafaelaraujo-hotmart thanks a lot for sharing that tip. This might be a bug on our end too, because internally we are supposed to override the temperature to 1e-8, which should have helped. Will check this and fix it in #998.