explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

How to get started question #1229

Open botchagalupeai opened 2 weeks ago

botchagalupeai commented 2 weeks ago

I am trying to follow the Get Started documentation on

https://docs.ragas.io/en/latest/getstarted/testset_generation.html

I apologize for being a newbie upfront.

I was able to create the synthetic test set from my data. However, I can't figure out how to convert the dataset (testset) into the format that the next step in the tutorial, "Evaluating Using Your Test Set," expects (i.e., the format used in the documentation example).

https://docs.ragas.io/en/latest/getstarted/evaluation.html

It would enhance the tutorial if there were an intermediate step (example) to make the process more seamless. Any assistance in this matter would be greatly appreciated.

dosubot[bot] commented 2 weeks ago

Hey @botchagalupeai! I'm here to assist you with any questions, bugs, or contributions you have. Let's work together to resolve your issue.

To convert the synthetic test set generated using Ragas into a format suitable for evaluation, follow these steps:

  1. Generate the Synthetic Test Set: Ensure you have generated the synthetic test set as described in the "Generate a Synthetic Test Set" section of the documentation. Here is an example:

    from langchain_community.document_loaders import DirectoryLoader
    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    
    # Load documents
    loader = DirectoryLoader("your-directory")
    documents = loader.load()
    
    # Ensure metadata includes 'filename'
    for document in documents:
       document.metadata['filename'] = document.metadata['source']
    
    # Initialize generator with OpenAI models
    generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
    critic_llm = ChatOpenAI(model="gpt-4")
    embeddings = OpenAIEmbeddings()
    
    generator = TestsetGenerator.from_langchain(
       generator_llm,
       critic_llm,
       embeddings
    )
    
    # Generate test set
    testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
    
    # Export to Pandas DataFrame
    test_df = testset.to_pandas()
  2. Convert the Test Set for Evaluation: Use the following code snippet to convert the test set into the required format for evaluation:

    from datasets import Dataset
    from tqdm.auto import tqdm
    import pandas as pd
    
    def generate_response(query_engine, question):
       response = query_engine.query(question)
       return {
           "answer": response.response,
           "contexts": [c.node.get_content() for c in response.source_nodes],
       }
    
    def generate_ragas_dataset(query_engine, test_df):
       test_questions = test_df["question"].values
       responses = [generate_response(query_engine, q) for q in tqdm(test_questions)]
    
       dataset_dict = {
           "question": test_questions,
           "answer": [response["answer"] for response in responses],
           "contexts": [response["contexts"] for response in responses],
           "ground_truth": test_df["ground_truth"].values.tolist(),
       }
       ds = Dataset.from_dict(dataset_dict)
       return ds
    
    ragas_eval_dataset = generate_ragas_dataset(query_engine, test_df)
    ragas_evals_df = pd.DataFrame(ragas_eval_dataset)
    ragas_evals_df.head()
  3. Evaluate the RAG Pipeline: Follow these steps to evaluate your RAG pipeline:

    import os
    os.environ["OPENAI_API_KEY"] = "your-openai-key"
    
    from ragas.metrics import (
       answer_relevancy,
       faithfulness,
       context_recall,
       context_precision,
    )
    
    from ragas import evaluate
    
    result = evaluate(
       ragas_eval_dataset,
       metrics=[
           context_precision,
           faithfulness,
           answer_relevancy,
           context_recall,
       ],
    )
    
    df = result.to_pandas()
    df.head()

By following these steps, you will have a synthetic test set ready for evaluation and can analyze the performance of your RAG pipeline using the provided metrics [1][2][3].

To continue talking to Dosu, mention @dosu.

botchagalupeai commented 2 weeks ago

Where was query_engine supposed to be defined?

dosubot[bot] commented 2 weeks ago

The query_engine is defined in the build_query_engine function. Here is the relevant code snippet to define and initialize the query_engine:

import nest_asyncio
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.embeddings import OpenAIEmbedding

nest_asyncio.apply()

def build_query_engine(documents):
    vector_index = VectorStoreIndex.from_documents(
        documents,
        service_context=ServiceContext.from_defaults(chunk_size=512),
        embed_model=OpenAIEmbedding(),
    )

    query_engine = vector_index.as_query_engine(similarity_top_k=2)
    return query_engine

# Example usage
documents = [...]  # Load your documents here
query_engine = build_query_engine(documents)

Make sure you have your documents loaded and passed to the build_query_engine function to initialize the query_engine [1].

To continue talking to Dosu, mention @dosu.

botchagalupeai commented 2 weeks ago

@dosu Where was query_engine supposed to be defined in these code examples?

dosubot[bot] commented 2 weeks ago

The query_engine is defined in the build_query_engine function. Here is the relevant code snippet to define and initialize the query_engine:

import nest_asyncio
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.embeddings import OpenAIEmbedding

nest_asyncio.apply()

def build_query_engine(documents):
    vector_index = VectorStoreIndex.from_documents(
        documents,
        service_context=ServiceContext.from_defaults(chunk_size=512),
        embed_model=OpenAIEmbedding(),
    )

    query_engine = vector_index.as_query_engine(similarity_top_k=2)
    return query_engine

# Example usage
documents = [...]  # Load your documents here
query_engine = build_query_engine(documents)

Make sure you have your documents loaded and passed to the build_query_engine function to initialize the query_engine [1][2].

To continue talking to Dosu, mention @dosu.

botchagalupeai commented 2 weeks ago

@dosu

This is confusing. In the examples on the web page there is no vector database; it's just the synthetic dataset passed to the evaluate function.

dosubot[bot] commented 2 weeks ago

The webpage examples do not include a vector database because they focus on loading documents using langchain and llama-index document loaders. The synthetic dataset is passed to the evaluate function by first generating the test set using the TestsetGenerator and then converting it to a pandas DataFrame.

Here is the relevant code snippet:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

TEST_SIZE = 25

# generator with openai models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

# set question type distribution
distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}
# generate testset
testset = generator.generate_with_llamaindex_docs(
    documents, test_size=TEST_SIZE, distributions=distribution
)
test_df = testset.to_pandas()
test_df.head()

This code demonstrates how to generate a synthetic test dataset using TestsetGenerator and convert it to a pandas DataFrame for evaluation [1].
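
Note that this snippet assumes documents has already been loaded. As a minimal sketch (not taken from the docs page itself), you could load them with LlamaIndex's SimpleDirectoryReader, assuming your source files sit in a local folder:

from llama_index import SimpleDirectoryReader

# Hypothetical path; point this at the folder containing your source documents
documents = SimpleDirectoryReader("your-directory").load_data()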

To continue talking to Dosu, mention @dosu.

botchagalupeai commented 2 weeks ago

@dosu

I created my synthetic dataset and I want to pass it to the evaluate function instead of the amnesty_qa dataset.

dosubot[bot] commented 2 weeks ago

To pass your synthetic dataset to the evaluate function instead of using the amnesty_qa dataset, you need to replace the amnesty_qa["eval"] with your synthetic dataset. Here is an example of how you can do it:

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

# Assuming your synthetic dataset is loaded into a variable named `synthetic_dataset`
result = evaluate(
    synthetic_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

result

Make sure that your synthetic_dataset is in the same format as expected by the evaluate function. The dataset should contain the following columns: question, contexts, answer, and ground_truth.
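
To make that concrete, here is a small illustrative sketch (placeholder values, not real data) of the shape the evaluate function expects:

from datasets import Dataset

# Illustrative single-row dataset; 'contexts' holds a list of retrieved passages per question
synthetic_dataset = Dataset.from_dict({
    "question": ["What is ragas?"],
    "contexts": [["ragas is an evaluation framework for RAG pipelines."]],
    "answer": ["ragas is a framework for evaluating RAG pipelines."],
    "ground_truth": ["ragas is an open-source evaluation framework for Retrieval Augmented Generation pipelines."],
})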

Additionally, ensure you have your OpenAI key set in your environment if you are using metrics that require it:

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"

If you are using Amazon Bedrock instead of OpenAI, you will need to configure the BedrockChat and BedrockEmbeddings classes from Langchain and pass them to the evaluate function. Here is an example setup:

from langchain_community.chat_models import BedrockChat
from langchain_community.embeddings import BedrockEmbeddings

config = {
    "credentials_profile_name": "your-profile-name",  # E.g "default"
    "region_name": "your-region-name",  # E.g. "us-east-1"
    "model_id": "your-model-id",  # E.g "anthropic.claude-v2"
    "model_kwargs": {"temperature": 0.4},
}

bedrock_model = BedrockChat(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    endpoint_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
    model_id=config["model_id"],
    model_kwargs=config["model_kwargs"],
)

bedrock_embeddings = BedrockEmbeddings(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
)

result = evaluate(
    synthetic_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
    llm=bedrock_model,
    embeddings=bedrock_embeddings,
)

result

This setup ensures that you can use Amazon Bedrock models and embeddings for evaluation [1][2][3].

To continue talking to Dosu, mention @dosu.

botchagalupeai commented 2 weeks ago

@dosu

My question is: how can I convert my synthetic dataset so that it's compatible with evaluate? I can't just pass the synthetic dataset from the synthetic data page to the evaluate function.

dosubot[bot] commented 2 weeks ago

To convert your synthetic dataset so that it's compatible with the evaluate function, you need to ensure that your dataset is in the format expected by the evaluate function. Here is an example of how to load a sample dataset and use it with the evaluate function:

from datasets import load_dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    faithfulness,
    context_recall,
    answer_relevancy,
)
import os
import nest_asyncio

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-openai-key"

# Load your synthetic dataset
synthetic_dataset = load_dataset("path_to_your_synthetic_dataset")

# Define the metrics
metrics = [
    faithfulness,
    context_recall,
    context_precision,
    answer_relevancy,
]

# Apply nest_asyncio if running in a Jupyter notebook
nest_asyncio.apply()

# Evaluate the dataset
result = evaluate(
    synthetic_dataset["eval"],
    metrics=metrics,
)

# Convert the result to a pandas DataFrame
df = result.to_pandas()
df.head()

Ensure that your synthetic dataset is structured similarly to the amnesty_qa dataset used in the example. The dataset should contain the following columns: question, contexts, answer, and ground_truth.
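
For reference, a quick way to see that shape is to load the amnesty_qa dataset from the docs example and inspect its columns (a small sketch; the dataset name and config follow the docs and require access to the Hugging Face Hub):

from datasets import load_dataset

amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
print(amnesty_qa["eval"].column_names)  # expect question, contexts, answer, ground_truth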

Additionally, you can refer to the following example from the ragas framework to understand how to structure your dataset and upload it for evaluation:

import pandas as pd
from uuid import UUID
from langsmith import Client

client = Client()

# Sample questions and answers
questions = [
    "What are the advantages of remote work? Why does gitlab do it?",
    "what are the dis-advantages of remote work in gitlab? How do you mitigate that?",
    "What does it mean to be 'all-remote'? Why is it important?",
    "How does compensation work in an 'all-remote' setting?",
    "How to run effective meetings in 'all-remote' setting",
]
answers = [
    "Remote work offers numerous advantages including increased flexibility, the ability to hire top talent globally without geographical constraints, enhanced productivity due to fewer distractions, and significant cost savings on office space and related expenses. GitLab adopts an all-remote model to leverage these benefits, ensuring a more inclusive and diverse workforce, fostering a culture that values output over presence, and providing employees the freedom to work in environments that best suit their personal and professional needs. This approach not only supports individual employee well-being and work-life balance but also positions GitLab as a resilient and adaptive organization in a rapidly changing global work landscape.",
    "GitLab's remote work challenges include onboarding difficulties, loneliness, communication breakdowns, work/life balance issues, time zone conflicts, and the need for strong time management skills. To mitigate these, GitLab employs strategies such as providing comprehensive onboarding resources, fostering community through virtual coffee chats and social calls, prioritizing asynchronous communication, reimbursing coworking spaces, empowering employees to manage their own schedules, focusing on results rather than hours, and screening for self-discipline during the hiring process. These measures aim to enhance productivity and employee satisfaction in a remote work setting.",
    "Being 'all-remote' means that an organization empowers every individual to work from any location where they feel most fulfilled, without the need to report to a company-owned office, thereby treating all employees equally regardless of their physical location. This approach is important because it eliminates location hierarchy, allowing for a more inclusive work environment where team members have the autonomy to create their ideal workspace and can maintain their job regardless of life changes, such as relocations due to family commitments. It supports a diverse workforce, including caregivers, working parents, and military spouses, by providing them with the flexibility to work from anywhere, fostering equality among all employees and enabling a global talent pool without the constraints of geographical boundaries.",
    "In an 'all-remote' setting, such as at GitLab, compensation is structured around local rates rather than a single global standard, which means employees are paid based on the cost of living and market rates in their respective locations. This approach allows the company to hire globally without being bound by the high salary standards of any particular region, like San Francisco. GitLab uses a compensation calculator to ensure transparency and fairness in pay, adjusting salaries based on a combination of factors including location, experience, and market data. Payments are typically made in the local currency of the employee, and for countries where direct employment isn't feasible, GitLab utilizes professional employment organizations or hires contractors. This model supports GitLab's global talent acquisition strategy while managing compensation costs effectively.",
    "To run effective meetings in an 'all-remote' setting, it's crucial to be intentional about meeting necessity, provide clear agendas and supporting materials in advance, start and end on time, document discussions in real time, and make attendance optional to respect time zones and individual schedules. Recording meetings for asynchronous viewing, using reliable communication tools like Zoom, and ensuring active participation through video feedback are also key practices. This approach aligns with GitLab's guidelines for maximizing efficiency, inclusivity, and collaboration in a remote work environment.",
]
dataset = {"question": questions, "ground_truth": answers}

# Creating a pandas DataFrame from the dataset dictionary
df = pd.DataFrame(dataset)

# Upload to langsmith
langsmith_dataset = client.upload_dataframe(
    name="temporal_dataset",
    description="temporal dataset for testing langsmith",
    df=df,
    input_keys=["question"],
    output_keys=["ground_truth"],
)

# Clean up the dataset after use
client.delete_dataset(dataset_id=langsmith_dataset.id)

This example demonstrates how to create a dataset with questions and ground truth answers, upload it to a service, and clean up afterward. Ensure your dataset follows a similar structure to be compatible with the evaluate function [1][2][3][4][5].

To continue talking to Dosu, mention @dosu.

botchagalupeai commented 2 weeks ago

@dosu

Let me try this again.

How do I convert the dataset created by the following command:

testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

so that it can be used by this code?

from ragas import evaluate

result = evaluate(
    amnesty_qa["eval"],
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

result

dosubot[bot] commented 2 weeks ago

To convert the dataset created by the command testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25}) to be used by the evaluate function from the ragas library, you need to convert the testset to a format that the evaluate function can accept. The TestDataset class provides a method to_dataset() that converts the test data to a Dataset object.

Here is how you can do it:

# Assuming `testset` is the result of the generate_with_langchain_docs method
dataset = testset.to_dataset()

# Now `dataset` can be used with the `evaluate` function from the `ragas` library
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    faithfulness,
    answer_relevancy,
    context_recall,
)

result = evaluate(
    dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

print(result)

The to_dataset() method converts the test data into a Dataset object, which is compatible with the evaluate function [1][2].
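
As a quick sanity check before evaluating, you can inspect which columns the converted dataset actually contains (a small sketch using the Hugging Face datasets API); the metrics above collectively expect question, contexts, answer, and ground_truth:

# Print the column names and the first row of the converted dataset
print(dataset.column_names)
print(dataset[0])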

To continue talking to Dosu, mention @dosu.

botchagalupeai commented 2 weeks ago

@dosu

Getting closer. Now I get this error


ValueError                                Traceback (most recent call last)
Cell In[16], line 10
      2 from ragas import evaluate
      3 from ragas.metrics import (
      4     context_precision,
      5     faithfulness,
      6     answer_relevancy,
      7     context_recall,
      8 )
---> 10 result = evaluate(
     11     dataset,
     12     metrics=[
     13         context_precision,
     14         faithfulness,
     15         answer_relevancy,
     16         context_recall,
     17     ],
     18 )
     20 print(result)

File ~/.local/lib/python3.12/site-packages/ragas/_analytics.py:129, in track_was_completed.<locals>.wrapper(*args, **kwargs)
    126 @wraps(func)
    127 def wrapper(*args: P.args, **kwargs: P.kwargs) -> t.Any:
    128     track(IsCompleteEvent(event_type=func.__name__, is_completed=False))
...
     99     f"additional columns {list(required_columns - available_columns)} "
    100     f"to be present in the dataset. {extra_msg}"
    101 )

ValueError: The metric [faithfulness] that that is used requires the following additional columns ['answer'] to be present in the dataset.

dosubot[bot] commented 2 weeks ago

To convert the dataset created by testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25}) to be compatible with the evaluate function from the ragas library, ensuring it includes the required columns such as 'answer', you can follow these steps:

  1. Convert the generated testset to a pandas DataFrame.
  2. Ensure the DataFrame includes the 'answer' column.

Here is the code to achieve this:

# Convert the testset to a pandas DataFrame
test_df = testset.to_pandas()

# Ensure the DataFrame includes the 'answer' column
# Assuming 'answer' is one of the fields in the generated testset
if 'answer' not in test_df.columns:
    test_df['answer'] = test_df.apply(lambda row: row['expected_answer'], axis=1)  # Adjust based on actual field names

# Collect questions and answers
test_questions = test_df['question'].values.tolist()
test_answers = [[item] for item in test_df['answer'].values.tolist()]

# Now you can use the DataFrame with the evaluate function
from ragas.llama_index import evaluate
import nest_asyncio

# Only used when running on a Jupyter notebook
nest_asyncio.apply()

# Define your metrics
from ragas.metrics import context_precision, context_recall
metrics = [context_precision, context_recall]

# Build your query engine (example with OpenAI embeddings)
from langchain.embeddings import OpenAIEmbeddings
from llama_index import VectorStoreIndex, ServiceContext

def build_query_engine(embed_model):
    vector_index = VectorStoreIndex.from_documents(
        documents, service_context=ServiceContext.from_defaults(chunk_size=512),
        embed_model=embed_model,
    )
    query_engine = vector_index.as_query_engine(similarity_top_k=2)
    return query_engine

openai_model = OpenAIEmbeddings()
query_engine = build_query_engine(openai_model)

# Evaluate
result = evaluate(query_engine, metrics, test_questions, test_answers)

result

This code ensures that the 'answer' column is included in the DataFrame, making it compatible with the evaluate function from the ragas library. Adjust the lambda function in the apply method based on the actual field names in your testset. Additionally, it demonstrates how to build a query engine and evaluate the test questions and answers using the ragas library [1][2][3][4][5].

To continue talking to Dosu, mention @dosu.

botchagalupeai commented 2 weeks ago

@dosu

Can you give me a working example of how to generate synthetic data and run the evaluate function?

dosubot[bot] commented 2 weeks ago

Here is a working example of how to generate synthetic data and run the evaluate function using the ragas library:

  1. Generate Synthetic Data: Use the TestsetGenerator to create a synthetic dataset.

  2. Convert the Dataset: Ensure the dataset is in the correct format for evaluation.

  3. Run the Evaluate Function: Use the evaluate function to evaluate the synthetic dataset.

Here is the complete code:

import os
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness, context_recall
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-openai-key"

# Initialize OpenAI models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

# Create a generator instance
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# Load documents
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("your-directory")
documents = loader.load()

# Ensure metadata includes 'filename'
for document in documents:
    document.metadata['filename'] = document.metadata['source']

# Generate test set
testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

# Convert to Pandas DataFrame
test_df = testset.to_pandas()

# Ensure the DataFrame includes the 'answer' column
if 'answer' not in test_df.columns:
    test_df['answer'] = test_df.apply(lambda row: row['expected_answer'], axis=1)  # Adjust based on actual field names

# Convert the DataFrame to a Dataset
dataset = Dataset.from_pandas(test_df)

# Run the evaluate function
result = evaluate(
    dataset,
    metrics=[answer_relevancy, context_precision, faithfulness, context_recall],
)

print(result)

This example demonstrates how to generate synthetic data, convert it to the required format, and run the evaluate function using the ragas library [1][2][3][4][5].

To continue talking to Dosu, mention @dosu.

botchagalupeai commented 2 weeks ago

@dosu

this code gets an error

# Ensure the DataFrame includes the 'answer' column
if 'answer' not in test_df.columns:
    test_df['answer'] = test_df.apply(lambda row: row['expected_answer'], axis=1)  # Adjust based on actual field names

KeyError                                  Traceback (most recent call last)

File ~/.local/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'expected_answer'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[8], line 3
      1 # Ensure the DataFrame includes the 'answer' column
      2 if 'answer' not in test_df.columns:
----> 3     test_df['answer'] = test_df.apply(lambda row: row['expected_answer'], axis=1)  # Adjust based on actual field names
...
   3815 # InvalidIndexError. Otherwise we fall through and re-raise
   3816 # the TypeError.
   3817 self._check_indexing_error(key)

KeyError: 'expected_answer'

dosubot[bot] commented 2 weeks ago

To convert the dataset created by testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25}) to be used by the evaluate function without encountering a KeyError for the 'expected_answer' column, you need to ensure that the 'expected_answer' column is present in the DataFrame. You can achieve this by adding a default value for the 'expected_answer' column if it is missing.

Here is how you can do it:

# Generate the test dataset
testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

# Convert the test dataset to a pandas DataFrame
df = testset.to_pandas()

# Ensure the 'expected_answer' column is present
if 'expected_answer' not in df.columns:
    df['expected_answer'] = None  # or any default value you prefer

# Ensure the 'answer' column is present
if 'answer' not in df.columns:
    df['answer'] = df['expected_answer']

# Convert the DataFrame to a Dataset
from datasets import Dataset
dataset = Dataset.from_pandas(df)

# Now you can use the DataFrame with the evaluate function
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness, context_recall

result = evaluate(
    dataset,
    metrics=[answer_relevancy, context_precision, faithfulness, context_recall],
)

print(result)

This code ensures that the 'expected_answer' column is added to the DataFrame if it is missing, and then it assigns the 'expected_answer' values to the 'answer' column if 'answer' is not already present. This will prevent a KeyError when using the evaluate function [1].

To continue talking to Dosu, mention @dosu.

botchagalupeai commented 2 weeks ago

@dosu

ValueError                                Traceback (most recent call last)
Cell In[9], line 20
     17 from ragas import evaluate
     18 from ragas.metrics import answer_relevancy, context_precision, faithfulness, context_recall
---> 20 result = evaluate(
     21     dataset,
     22     metrics=[answer_relevancy, context_precision, faithfulness, context_recall],
     23 )
     25 print(result)

File ~/.local/lib/python3.12/site-packages/ragas/_analytics.py:129, in track_was_completed.<locals>.wrapper(*args, **kwargs)
    126 @wraps(func)
    127 def wrapper(*args: P.args, **kwargs: P.kwargs) -> t.Any:
    128     track(IsCompleteEvent(event_type=func.__name__, is_completed=False))
--> 129     result = func(*args, **kwargs)
    130     track(IsCompleteEvent(event_type=func.__name__, is_completed=True))
    132     return result

File ~/.local/lib/python3.12/site-packages/ragas/evaluation.py:165, in evaluate(dataset, metrics, llm, embeddings, callbacks, in_ci, run_config, token_usage_parser, raise_exceptions, column_map)
    163 dataset = handle_deprecated_ground_truths(dataset)
    164 validate_evaluation_modes(dataset, metrics)
--> 165 validate_column_dtypes(dataset)
    167 # set the llm and embeddings
    168 if isinstance(llm, LangchainLLM):
...
     48 )
     50 for column_names in ["contexts"]:
     51     if column_names in ds.features:

ValueError: Dataset feature "answer" should be of type string

dosubot[bot] commented 2 weeks ago

To convert the dataset created by testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25}) to be compatible with the evaluate function from the ragas library, ensuring that the 'answer' column is of type string, you can follow these steps:

  1. Convert the testset to a pandas DataFrame.
  2. Ensure the 'answer' column is of type string.
  3. Convert the DataFrame back to a datasets.Dataset object.
  4. Validate the column data types to ensure compatibility with the ragas library.

Here is the code to achieve this:

import pandas as pd
from datasets import Dataset

# Convert the testset to a pandas DataFrame
test_df = testset.to_pandas()

# Ensure the 'answer' column is of type string
if 'answer' not in test_df.columns:
    test_df['answer'] = test_df['expected_answer'].astype(str)  # Adjust based on actual field names
else:
    test_df['answer'] = test_df['answer'].astype(str)

# Convert the DataFrame back to a datasets.Dataset object
test_dataset = Dataset.from_pandas(test_df)

# Validate the column data types
from ragas.validation import validate_column_dtypes

validate_column_dtypes(test_dataset)

# Now test_dataset is ready to be used with the evaluate function from the ragas library
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness, context_recall

result = evaluate(
    test_dataset,
    metrics=[answer_relevancy, context_precision, faithfulness, context_recall],
)

print(result)

This will ensure that the 'answer' column in your dataset is of type string and that the dataset is validated for compatibility with the evaluate function from the ragas library [1][2].
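
Note that casting placeholder values to strings only satisfies the type check; for meaningful faithfulness and answer_relevancy scores, the answer and contexts columns should hold real output from your RAG pipeline. A minimal sketch reusing the pattern from the first comment (it assumes a query_engine built as shown earlier in this thread):

from datasets import Dataset

test_questions = test_df["question"].tolist()

answers, contexts = [], []
for q in test_questions:
    response = query_engine.query(q)  # real pipeline output, as in the first comment
    answers.append(str(response.response))
    contexts.append([c.node.get_content() for c in response.source_nodes])

eval_dataset = Dataset.from_dict({
    "question": test_questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": test_df["ground_truth"].tolist(),
})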

To continue talking to Dosu, mention @dosu.