maraevfd opened this issue 1 month ago
Hey @maraevfd! I'm here to help you with any questions or issues you have. Let's tackle this together!
Yes, it is feasible to apply the RAGAS framework for test automation within a CI/CD model for a Retrieval-Augmented Generation (RAG) pipeline. The documentation provides a detailed guide on integrating RAGAS evaluations into your CI pipeline using Pytest. By setting the `in_ci` argument of the `evaluate()` function to `True`, you can run RAGAS metrics in a special mode that produces more reproducible results, albeit at a higher cost. This allows you to continuously monitor the qualitative performance of your RAG pipeline without fully integrating RAGAS into the system. Additionally, you can use Pytest markers to tag and selectively run these end-to-end tests as part of your CI process [1][2].
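One way to wire this into Pytest is to gate the build on metric thresholds. The sketch below is a minimal illustration, not an official pattern: the threshold values and the `ragas_ci` marker name are assumptions you would adapt to your own pipeline.

```python
# Sketch of a pytest-style CI gate for RAGAS scores. The metric names and
# thresholds below are illustrative assumptions, not official recommendations.
THRESHOLDS = {"faithfulness": 0.80, "answer_correctness": 0.70}

def failing_metrics(scores: dict, thresholds: dict) -> list:
    """Return the names of metrics whose score falls below their threshold."""
    return sorted(m for m, t in thresholds.items() if scores.get(m, 0.0) < t)

# In the real test (marked e.g. @pytest.mark.ragas_ci and selected in CI with
# `pytest -m ragas_ci`), you would compute the scores with:
#     result = evaluate(dataset, metrics=[faithfulness, answer_correctness], in_ci=True)
#     scores = {name: result[name] for name in THRESHOLDS}
# and then fail the build on any metric below its threshold:
#     assert not failing_metrics(scores, THRESHOLDS)
```

Keeping the threshold check as a plain function makes it trivial to unit-test the gate itself without calling any LLM.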
To continue talking to Dosu, mention @dosu.
Hi @dosu,
Thank you for your response. While your guidance is appreciated, it seems we need more detailed insights into integrating RAGAS into our CI/CD pipeline effectively.
Questions:
Practical Implementation: Could you provide more practical steps or examples on how to integrate RAGAS into our pipeline? We are already using the in_ci argument in evaluate(), but we need more granular details on how to implement it correctly.
Ground Truth and Contexts: What are the best practices for obtaining Ground Truth and Contexts? Any guidelines or methodologies would be highly beneficial.
Metric Stability: We've noticed significant variability in metrics with unchanged data. What strategies can we use to stabilize these metrics and ensure more consistent results?
Your detailed insights on these aspects will greatly help us refine our implementation.
Thanks again for your assistance!
To integrate the RAGAS framework into your CI/CD pipeline for a Retrieval-Augmented Generation (RAG) pipeline, follow these steps:
Installation: Install RAGAS in your environment using pip:
```shell
pip install ragas
```
Set Up Environment Variables: Ensure you have your OpenAI API key set up in your environment:
```python
import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"
```
Prepare Your Dataset: Create a dataset with questions, answers, contexts, and ground truth:
```python
from datasets import Dataset

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967',
               'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
                 ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967',
                     'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
```
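Because RAGAS expects the four columns to be aligned row by row, a quick sanity check before building the dataset can catch preparation mistakes early. This helper is a pure-Python sketch; the required column names follow the example above and may differ in other ragas versions.

```python
# Sanity-check a RAGAS-style sample dict before building the Dataset.
# Column names follow the example above; adjust if your ragas version differs.
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")

def validate_samples(samples: dict) -> int:
    """Ensure all required columns exist, have equal lengths, and that each
    'contexts' entry is a list of strings. Returns the number of rows."""
    missing = [c for c in REQUIRED_COLUMNS if c not in samples]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {c: len(samples[c]) for c in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    if not all(isinstance(ctx, list) for ctx in samples["contexts"]):
        raise ValueError("each 'contexts' entry must be a list of strings")
    return lengths["question"]
```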
Evaluate the Dataset: Use RAGAS to evaluate the dataset with desired metrics:
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

score = evaluate(dataset, metrics=[faithfulness, answer_correctness], in_ci=True)
score.to_pandas()
```
Integrate with CI/CD: Add the evaluation script to your CI/CD pipeline configuration (e.g., in a GitHub Actions workflow):
```yaml
name: RAGAS Evaluation
on: [push]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install ragas
      - name: Run RAGAS evaluation
        run: |
          python evaluate_ragas.py
```
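A common way to make such a workflow step actionable is to have the evaluation script compare the current scores against a committed baseline and exit nonzero on regression. The sketch below is illustrative: the baseline file name, tolerance, and `gate` helper are assumptions, not part of the RAGAS API.

```python
import json

# Sketch of an `evaluate_ragas.py` CI gate: compare the current run's scores
# against a committed baseline file and fail the build on regression.
# The baseline file name and tolerance are illustrative assumptions.

def regressions(current: dict, baseline: dict, tolerance: float = 0.05) -> dict:
    """Return {metric: (baseline_score, current_score)} for every metric that
    dropped more than `tolerance` below its baseline."""
    return {
        m: (b, current[m])
        for m, b in baseline.items()
        if m in current and current[m] < b - tolerance
    }

def gate(scores: dict, baseline_path: str = "ragas_baseline.json") -> int:
    """Exit code for CI: 0 if no metric regressed, 1 otherwise."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    bad = regressions(scores, baseline)
    for metric, (base, cur) in bad.items():
        print(f"REGRESSION {metric}: {base:.3f} -> {cur:.3f}")
    return 1 if bad else 0

# In the real script you would obtain `scores` from RAGAS, e.g.:
#     result = evaluate(dataset, metrics=[faithfulness, answer_correctness], in_ci=True)
#     raise SystemExit(gate({k: result[k] for k in ("faithfulness", "answer_correctness")}))
```

Using a tolerance rather than exact equality absorbs the residual run-to-run noise that remains even with `in_ci=True`.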
Ground Truth: curate reference answers manually or with domain-expert review, so that metrics such as answer_correctness have a reliable target to compare against.
Contexts: record the exact passages your retriever actually returned for each question, rather than hand-picking documents, so the evaluation reflects real pipeline behavior.
Consistent Data Collection: keep the evaluation dataset fixed between runs so that score changes reflect pipeline changes rather than data drift.
Regular Evaluation: run the evaluation on every relevant change (or on a schedule) to catch regressions early.
Feedback Loop: feed low-scoring examples back into prompt, retrieval, or data improvements.
Proper Metric Initialization: initialize each metric explicitly with a fixed LLM and embedding model so results are comparable across runs.
Here is an example of initializing metrics with OpenAI:
```python
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import faithfulness, answer_correctness
from ragas.metrics.base import MetricWithLLM, MetricWithEmbeddings
from ragas.run_config import RunConfig

llm = ChatOpenAI()
emb = OpenAIEmbeddings()

metrics = [faithfulness, answer_correctness]

def init_ragas_metrics(metrics, llm, embedding):
    # Attach the LLM/embeddings to each metric that needs them,
    # then initialize the metric with a run configuration.
    for metric in metrics:
        if isinstance(metric, MetricWithLLM):
            metric.llm = llm
        if isinstance(metric, MetricWithEmbeddings):
            metric.embeddings = embedding
        run_config = RunConfig()
        metric.init(run_config)

init_ragas_metrics(
    metrics,
    llm=LangchainLLMWrapper(llm),
    embedding=LangchainEmbeddingsWrapper(emb),
)
```
By following these steps and best practices, you can effectively integrate the RAGAS framework into your CI/CD pipeline and ensure the stability and accuracy of your Retrieval-Augmented Generation pipeline [1][2][3][4].
Hey @maraevfd, I solved the metric stability problem just by changing the generator LLM's temperature parameter: BaseChatOpenAI has a default temperature of 0.7, which makes the results of identical LLM inferences vary greatly.

```python
generator_llm = ChatOpenAI(model=generator_model, temperature=0.0)
```
@maraevfd may I suggest a call sometime so that I can understand a bit more about your situation and suggest some steps that some of our current users are following? This is something we are adding back into the core lib to make it easier, so your input will help us a lot as well 🙂
@rafaelaraujo-hotmart thanks a lot for sharing that tip. This might be a bug on our end too, because internally we are supposed to override the temperature to 1e-8, which should have helped. Will check this and fix it in #998.
[x] I checked the documentation and related resources and couldn't find an answer to my question.
Your Question: How can we use RAGAS within a CI/CD model for a RAG pipeline?
Additional context: Our team is currently working on a Retrieval-Augmented Generation (RAG) pipeline to process our extensive repository of presentations and various documents. For testing purposes, we decided to try out the RAGAS framework to integrate end-to-end automated tests into the pipeline and dynamically evaluate its metrics in practice. However, we haven't found any examples of using RAGAS in the context of test automation frameworks, only its integration into existing RAG pipelines as shown in the examples.
From your perspective, is it feasible to apply RAGAS in this manner, and are we on the right track? Or is it necessary to fully integrate RAGAS into the system to derive any benefit from it?