explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Can Ragas be used to evaluate Amazon Bedrock Agents w/ managed Knowledge Base? #1234

Open danielesalvatore opened 2 months ago

danielesalvatore commented 2 months ago

[X] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question

I just discovered the metrics-driven development (MDD) approach for RAG development that Ragas implements, and it is fascinating.

I have a conversational chatbot based on an Amazon Bedrock Agent that uses claude-3-haiku-20240307-v1 as the LLM, configured to use a Bedrock Knowledge Base (AWS managed). The KB is built on top of Amazon OpenSearch Serverless with titan-embed-text-v1 as the embedding model (chunking setup: fixed-size chunks, max 300 tokens, 20% overlap).

Can Ragas be used to evaluate this RAG setup? If so, do you have any examples, or can you point me to relevant documentation or approaches?

Thank you!

Code Examples

The chatbot infrastructure is scripted using AWS CDK (Python). I can provide relevant snippets if required.
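As a starting point, here is a simplified sketch of how the agent is queried at runtime (assuming the boto3 bedrock-agent-runtime client; agent and alias IDs are placeholders):

import uuid
import boto3

# Bedrock Agents are invoked through the bedrock-agent-runtime client
client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.invoke_agent(
    agentId="AGENT_ID",             # placeholder
    agentAliasId="AGENT_ALIAS_ID",  # placeholder
    sessionId=str(uuid.uuid4()),    # one session per conversation
    inputText="What is our refund policy?",
)

# The response is an event stream; concatenate the streamed text chunks
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)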

Additional context

I opted for this setup since AWS manages every service.

dosubot[bot] commented 2 months ago

Hey @danielesalvatore! Welcome to the repository! I'm here to help you with any bugs, questions, or contributions while you're waiting for a human maintainer. If you need any assistance, feel free to ask.

Yes, Ragas can be used to evaluate your Amazon Bedrock Agent setup with a managed Knowledge Base. The provided documentation includes a detailed example of how to use Ragas for evaluation with Amazon Bedrock.

Here is a relevant example from the documentation:

# Import necessary libraries
from datasets import load_dataset
from ragas.metrics import (
    context_precision,
    faithfulness,
    context_recall,
)
from ragas.metrics.critique import harmfulness
from langchain_community.chat_models import BedrockChat
from langchain_community.embeddings import BedrockEmbeddings
from ragas import evaluate
import nest_asyncio

# Load sample dataset
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")

# Define metrics
metrics = [
    faithfulness,
    context_recall,
    context_precision,
    harmfulness,
]

# Configure Bedrock model
config = {
    "credentials_profile_name": "your-profile-name",  # E.g "default"
    "region_name": "your-region-name",  # E.g. "us-east-1"
    "model_id": "your-model-id",  # E.g "anthropic.claude-v2"
    "model_kwargs": {"temperature": 0.4},
}

bedrock_model = BedrockChat(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    endpoint_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
    model_id=config["model_id"],
    model_kwargs=config["model_kwargs"],
)

# Initialize embeddings
bedrock_embeddings = BedrockEmbeddings(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
)

# Apply nest_asyncio if running in a Jupyter notebook
nest_asyncio.apply()

# Evaluate the model
result = evaluate(
    amnesty_qa["eval"].select(range(3)),
    metrics=metrics,
    llm=bedrock_model,
    embeddings=bedrock_embeddings,
)

# Convert results to pandas DataFrame
df = result.to_pandas()
df.head()

This example demonstrates how to load a dataset, configure the Bedrock model, define evaluation metrics, and run the evaluation using Ragas [1].

Additionally, Ragas provides tools and techniques for continual learning in RAG applications: you can synthetically generate diverse test datasets, use LLM-assisted evaluation metrics, monitor the quality of your apps in production, and use those insights to iterate on and improve your application. For more detailed information, refer to the Core Concepts section of the Ragas documentation [2].
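For instance, here is a minimal sketch of synthetic test-set generation with the ragas 0.1 TestsetGenerator API, reusing the Bedrock model and embeddings configured above (test size and question distributions are illustrative):

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.from_langchain(
    generator_llm=bedrock_model,
    critic_llm=bedrock_model,
    embeddings=bedrock_embeddings,
)

# documents: a list of LangChain Documents loaded from your source files
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
testset.to_pandas().head()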

For more community insights and examples, you can explore the Community section of the Ragas documentation, which features various blog posts, tutorials, and case studies on how different teams and individuals are using Ragas to enhance their RAG systems [3].

To continue talking to Dosu, mention @dosu.

danielesalvatore commented 2 months ago

@dosu

Can you give an example that strictly includes the following requirements:

  * Amazon Bedrock Agents
  * Amazon Bedrock Knowledge Bases
  * a PDF used as the testing dataset

If you cannot provide such an example, explain in depth the implications, in terms of test validity, of using BedrockChat and BedrockEmbeddings instead of the actual infrastructure based on Amazon Bedrock Agents and Knowledge Bases. Thanks!

dosubot[bot] commented 2 months ago

To integrate Ragas with Amazon Bedrock Agent and Amazon Bedrock Knowledge Bases, and use a PDF as a testing dataset, follow these steps:

  1. Install Required Packages:

    !pip install ragas langchain_community boto3 pypdf
  2. Set Up the Environment:

    import nest_asyncio
    
    # AWS credentials are resolved from the AWS profile referenced below
    # via credentials_profile_name; no separate API key is needed.
    # nest_asyncio is only required when running inside a Jupyter notebook.
    nest_asyncio.apply()
  3. Load the PDF Dataset (shown here with LangChain's PyPDFLoader):

    from langchain_community.document_loaders import PyPDFLoader
    
    # Load the PDF into LangChain Documents (one per page)
    loader = PyPDFLoader("path/to/your/document.pdf")
    documents = loader.load()
    
    # These documents can also seed synthetic question generation
    # (see the TestsetGenerator sketch earlier in this thread)
  4. Configure the Bedrock Model and Knowledge Base Retriever:

    from langchain_community.retrievers import AmazonKnowledgeBasesRetriever
    from langchain_community.chat_models import BedrockChat
    
    config = {
        "credentials_profile_name": "your-profile-name",  # e.g. "default"
        "region_name": "your-region-name",  # e.g. "us-east-1"
        "model_id": "your-model-id",  # e.g. "anthropic.claude-v2"
        "model_kwargs": {"temperature": 0.4},
    }
    
    bedrock_model = BedrockChat(
        credentials_profile_name=config["credentials_profile_name"],
        region_name=config["region_name"],
        endpoint_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
        model_id=config["model_id"],
        model_kwargs=config["model_kwargs"],
    )
    
    # The embedding model (e.g. titan-embed-text-v1) is configured on the
    # Knowledge Base itself, so the retriever only needs the KB id.
    retriever = AmazonKnowledgeBasesRetriever(
        knowledge_base_id="your-knowledge-base-id",
        credentials_profile_name=config["credentials_profile_name"],
        region_name=config["region_name"],
        retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
    )
  5. Define Evaluation Metrics:

    from ragas.metrics import (
        context_precision,
        faithfulness,
        context_recall,
    )
    from ragas.metrics.critique import harmfulness
    
    metrics = [
        faithfulness,
        context_recall,
        context_precision,
        harmfulness,
    ]
  6. Build an Evaluation Dataset and Run the Evaluation (ragas scores questions, answers, retrieved contexts, and ground truths rather than raw documents):

    from datasets import Dataset
    from ragas import evaluate
    
    questions = ["..."]      # test questions about the PDF content
    ground_truths = ["..."]  # reference answers for each question
    
    rows = []
    for question, ground_truth in zip(questions, ground_truths):
        docs = retriever.invoke(question)  # contexts from the Knowledge Base
        contexts = [doc.page_content for doc in docs]
        context_block = "\n".join(contexts)
        answer = bedrock_model.invoke(
            f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"
        ).content
        rows.append({"question": question, "answer": answer,
                     "contexts": contexts, "ground_truth": ground_truth})
    
    result = evaluate(
        Dataset.from_list(rows),
        metrics=metrics,
        llm=bedrock_model,
    )
    
    # Convert results to pandas DataFrame
    df = result.to_pandas()
    df.head()

Implications of Using BedrockChat and BedrockEmbeddings Instead of Amazon Bedrock Agents and Knowledge Bases

Substituting BedrockChat and BedrockEmbeddings for the actual Amazon Bedrock Agents and Knowledge Bases can affect the validity of your tests in several ways:

  1. Contextual Relevance: Bedrock Agents and Knowledge Bases are designed to work together, leveraging the managed knowledge base for more accurate and contextually relevant responses. A stand-alone BedrockChat plus BedrockEmbeddings pipeline re-implements retrieval outside the Knowledge Base, so its evaluation scores may not reflect the answers your production agent actually produces.

  2. Performance: The managed Knowledge Base in Amazon Bedrock is optimized for retrieval tasks, potentially offering better performance and more relevant document retrieval compared to a generic embedding model.

  3. Integration: The integration between Bedrock Agents and Knowledge Bases is seamless, ensuring that the retrieval and generation processes are tightly coupled. Using separate components like BedrockChat and BedrockEmbeddings might introduce inconsistencies or inefficiencies in the workflow.

  4. Evaluation Metrics: The evaluation metrics might not fully capture the capabilities of the managed Knowledge Base if the retrieval process is not aligned with how the Knowledge Base is intended to be used.

In short, evaluating a re-implemented pipeline measures a proxy rather than the production system. By evaluating against the actual Knowledge Base retriever (and, where possible, the deployed agent itself, as sketched below), your test results stay aligned with what users actually experience.
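As an illustration, here is a minimal sketch of querying the live Knowledge Base directly through boto3 so that evaluation uses the production retrieval path (the knowledge base ID is a placeholder):

import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def retrieve_contexts(question: str) -> list[str]:
    # Ask the managed Knowledge Base for the same passages the
    # production agent would retrieve
    response = runtime.retrieve(
        knowledgeBaseId="KB_ID",  # placeholder
        retrievalQuery={"text": question},
    )
    return [r["content"]["text"] for r in response["retrievalResults"]]

# Pair each test question with the agent's answer (e.g. via invoke_agent,
# as in the snippet earlier in this thread) and these contexts, then
# evaluate with ragas exactly as in step 6 above.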

To continue talking to Dosu, mention @dosu.

shahules786 commented 2 months ago

Hey @danielesalvatore, we are working on supporting conversational and agentic workflows in ragas by v0.2. I'm working on the initial PR #1222. I would love to talk to you to further understand your use-case and ensure that we support it too. If you're free, here's my cal

danielesalvatore commented 2 months ago

Hey @shahules786, thanks for pinging! I just submitted a meeting request for Monday, September 2, 2024, 10:00–10:30 (Rome, Italy time).