explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Test data generation Answer column is missing #1084

Closed: manojpraba closed this issue 2 months ago

manojpraba commented 3 months ago

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
When I create test data using the code below, the answer column is missing (it is not generated).

import nest_asyncio
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

nest_asyncio.apply()

generator = TestsetGenerator.from_langchain(
    generator_llm=azure_model,
    critic_llm=azure_model,
    embeddings=azure_embeddings,
)

testset = generator.generate_with_langchain_docs(
    lchain_docs,
    test_size=2,
    raise_exceptions=False,
    is_async=True,
    with_debugging_logs=False,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

Ragas version:
Python version: 3.10

Code to Reproduce
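The call that raises the error, as shown in the trace below, is roughly this (dataset1 and metrics1 are defined earlier in my notebook, and metrics1 includes faithfulness):

```python
import nest_asyncio
from ragas import evaluate

nest_asyncio.apply()

# dataset1 is the evaluation dataset, metrics1 the list of metrics (includes faithfulness)
result = evaluate(
    dataset1,
    metrics=metrics1,
    llm=azure_model,
    embeddings=azure_embeddings,
    raise_exceptions=False,
)
```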

Error trace

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[178], line 3
      1 import nest_asyncio
      2 nest_asyncio.apply()
----> 3 result = evaluate(
      4     dataset1, metrics=metrics1, llm=azure_model, embeddings=azure_embeddings, raise_exceptions=False)

File c:\DL projects\ai-kb-bot\.venv\lib\site-packages\ragas\evaluation.py:156, in evaluate(dataset, metrics, llm, embeddings, callbacks, in_ci, is_async, run_config, raise_exceptions, column_map)
    154 # validation
    155 dataset = handle_deprecated_ground_truths(dataset)
--> 156 validate_evaluation_modes(dataset, metrics)
    157 validate_column_dtypes(dataset)
    159 # set the llm and embeddings

File c:\DL projects\ai-kb-bot\.venv\lib\site-packages\ragas\validation.py:97, in validate_evaluation_modes(ds, metrics, evalmode_to_columns)
     91 if (
     92     isinstance(m, ContextPrecision)
     93     and "ground_truth" not in available_columns
     94 ):
     95     extra_msg = "Looks like you're trying to use 'context_precision' without ground_truth. Please use consider using `context_utilization' instead."
---> 97 raise ValueError(
     98     f"The metric [{m.name}] that that is used requires the following "
     99     f"additional columns {list(required_columns - available_columns)} "
    100     f"to be present in the dataset. {extra_msg}"
    101 )

ValueError: The metric [faithfulness] that that is used requires the following additional columns ['answer'] to be present in the dataset.

Expected behavior
Expecting a workaround for generating the answer column.

Additional context
[screenshot]

dosubot[bot] commented 3 months ago

Hey @manojpraba! I'm here to help you with any issues you're experiencing. I can assist with bug fixes, answer questions, and guide you to become a contributor. Let me know how I can assist you!

To generate the "answer" column in your dataset and avoid the ValueError, you need to generate a response for each question in your test dataset; the test set generator produces questions, contexts, and ground truths, but not answers. Here is a step-by-step guide to achieve this:

  1. Generate Responses for Questions: Use a function to query your query_engine with each question and compile the results into a DataFrame that includes the "answer" column.

  2. Create the Dataset: Use the generated responses to create a dataset that includes the "answer" column.

Here is an example code snippet to help you generate the "answer" column:

from datasets import Dataset
from tqdm.auto import tqdm
import pandas as pd

def generate_response(query_engine, question):
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }

def generate_ragas_dataset(query_engine, test_df):
    test_questions = test_df["question"].values
    responses = [generate_response(query_engine, q) for q in tqdm(test_questions)]

    dataset_dict = {
        "question": test_questions,
        "answer": [response["answer"] for response in responses],
        "contexts": [response["contexts"] for response in responses],
        "ground_truth": test_df["ground_truth"].values.tolist(),
    }
    ds = Dataset.from_dict(dataset_dict)
    return ds

# Assuming `query_engine` and `test_df` are already defined
ragas_eval_dataset = generate_ragas_dataset(query_engine, test_df)
ragas_evals_df = pd.DataFrame(ragas_eval_dataset)
print(ragas_evals_df.head())
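With the "answer" column now present, the evaluate call that previously failed should pass validation. A sketch using the faithfulness metric from your error trace, reusing your azure_model and azure_embeddings:

```python
from ragas import evaluate
from ragas.metrics import faithfulness

result = evaluate(
    ragas_eval_dataset,
    metrics=[faithfulness],
    llm=azure_model,
    embeddings=azure_embeddings,
    raise_exceptions=False,
)
print(result)
```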

Additionally, if you are using custom LLMs or embeddings, you can replace the default models with your own when instantiating the TestsetGenerator. Here is an example:

from ragas.testset.generator import TestsetGenerator
from langchain_together import Together
from langchain_together.embeddings import TogetherEmbeddings

together_key = "<your-key-here>"

embeddings = TogetherEmbeddings(model="togethercomputer/m2-bert-80M-8k-retrieval")

together_completion = Together(
    model="NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT",
    temperature=0.7,
    max_tokens=4000,
    top_k=1,
    together_api_key=together_key
)

# generator with custom llm and embeddings
generator = TestsetGenerator.from_langchain(
    generator_llm=together_completion,
    critic_llm=together_completion,
    embeddings=embeddings,
)

This ensures that the TestsetGenerator uses your specified LLM and embeddings for generating the test set; the "answer" column still needs to be produced by querying your own pipeline, as shown above [1][2][3].
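For Azure OpenAI specifically, the same pattern applies. A rough sketch of how azure_model and azure_embeddings could be built with the langchain-openai package (endpoint, deployment names, API version, and key below are placeholders, not values from this issue):

```python
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas.testset.generator import TestsetGenerator

# Placeholder Azure settings -- replace with your own resource and deployments.
azure_model = AzureChatOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    azure_deployment="<your-chat-deployment>",
    openai_api_version="2024-02-01",
    api_key="<your-key>",
)

azure_embeddings = AzureOpenAIEmbeddings(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    azure_deployment="<your-embedding-deployment>",
    openai_api_version="2024-02-01",
    api_key="<your-key>",
)

# generator with Azure LLM and embeddings
generator = TestsetGenerator.from_langchain(
    generator_llm=azure_model,
    critic_llm=azure_model,
    embeddings=azure_embeddings,
)
```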

To continue talking to Dosu, mention @dosu.

manojpraba commented 3 months ago

@dosu what should be the value for query_engine if using the Azure OpenAI API?

dosubot[bot] commented 3 months ago

To use the Azure OpenAI API for generating the "answer" column with the TestsetGenerator, you need to set up the query_engine accordingly. Here is how you can do it:

import nest_asyncio
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.service_context import ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding
from datasets import Dataset

nest_asyncio.apply()

def build_query_engine(documents):
    vector_index = VectorStoreIndex.from_documents(
        documents,
        service_context=ServiceContext.from_defaults(chunk_size=512),
        embed_model=OpenAIEmbedding(api_type="azure"),
    )

    query_engine = vector_index.as_query_engine(similarity_top_k=3)
    return query_engine

def generate_responses(query_engine, test_questions, test_answers):
    responses = [query_engine.query(q) for q in test_questions]

    answers = []
    contexts = []
    for r in responses:
        answers.append(r.response)
        contexts.append([c.node.get_content() for c in r.source_nodes])
    dataset_dict = {
        "question": test_questions,
        "answer": answers,
        "contexts": contexts,
    }
    if test_answers is not None:
        dataset_dict["ground_truth"] = test_answers
    ds = Dataset.from_dict(dataset_dict)
    return ds

reader = SimpleDirectoryReader(PATH, num_files_limit=30, required_exts=[".pdf"])
documents = reader.load_data()

test_questions = eval_dataset["question"].values.tolist()
test_answers = eval_dataset["ground_truth"].values.tolist()

query_engine = build_query_engine(documents)
result_ds = generate_responses(query_engine, test_questions, test_answers)

In this setup, the api_type parameter is set to "azure" to indicate that the Azure OpenAI API is being used. This aligns with the approach described in the ragas library documentation for integrating with LlamaIndex and generating test datasets [1][2][3][4][5].
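If OpenAIEmbedding(api_type="azure") does not pick up your Azure credentials, a more explicit setup is possible with the dedicated Azure classes. A sketch, assuming the llama-index-llms-azure-openai and llama-index-embeddings-azure-openai packages are installed, with placeholder deployment names:

```python
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

# Placeholder Azure settings -- replace with your own resource and deployments.
azure_llm = AzureOpenAI(
    model="gpt-35-turbo",
    deployment_name="<your-chat-deployment>",
    api_key="<your-key>",
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_version="2024-02-01",
)

azure_embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="<your-embedding-deployment>",
    api_key="<your-key>",
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_version="2024-02-01",
)

# Pass the embedding model when building the index and the LLM at query time, e.g.:
# vector_index = VectorStoreIndex.from_documents(documents, embed_model=azure_embed_model)
# query_engine = vector_index.as_query_engine(llm=azure_llm, similarity_top_k=3)
```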

To continue talking to Dosu, mention @dosu.

Zzrjin commented 3 months ago

Same issue here, did you solve this problem?

Gwenn-LR commented 2 months ago

As mentioned in the documentation, "Faithfulness - Measures the factual consistency of the answer to the context based on the question." So you need to provide an answer for each query in order to compute this metric.
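To make the requirement concrete, here is a minimal sketch of a dataset that has the columns faithfulness needs (dummy values, just to show the expected shape):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# One dummy row with the columns faithfulness requires: question, answer, contexts.
data = {
    "question": ["What does the warranty cover?"],
    "answer": ["The warranty covers manufacturing defects for two years."],
    "contexts": [["The warranty covers manufacturing defects for a period of two years."]],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness],
    llm=azure_model,
    embeddings=azure_embeddings,
)
```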

To do so you need to generate responses with your RAG/LLM and append them to your dataset:

  1. Generate responses

    [!NOTE]
    I'm using LangChain terminology, feel free to adapt my llm.invoke(question).content to your configuration. I've also taken your testset variable, i.e. the test set returned by generator.generate_with_langchain_docs (the questions live there, not on the TestsetGenerator itself):

    testset_df = testset.to_pandas()
    answers = [llm.invoke(question).content for question in testset_df["question"]]

  2. Append them to your dataset

    [!NOTE]
    Since the generated test set doesn't seem to have a method to add a column, I've exported it to a pandas.DataFrame object and added an "answer" column:

    testset_df["answer"] = answers

    [!NOTE]
    According to the datasets API, the evaluate method expects a Dataset object, so I've converted it back to this type:

    from datasets import Dataset

    generated_dataset = Dataset.from_pandas(testset_df)


Finally, you can use your new dataset to evaluate your RAG/LLM:
```python
results = evaluate(generated_dataset, metrics=metrics, llm=azure_model, embeddings=azure_embeddings, raise_exceptions=False)
```

[!WARNING]
Since you use the same LLM to generate your synthetic dataset's ground_truth and your answer, I think the results of this evaluation might be biased. I haven't carried out a comparative study, but it could have an impact on how you interpret the scores.
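One way to limit that risk (a sketch I haven't benchmarked; judge_model is a hypothetical second model, e.g. a different Azure deployment) is to pass a different judge LLM to evaluate than the one that generated the test set and the answers:

```python
from ragas import evaluate

# judge_model scores the answers, so the generator is not grading its own output.
results = evaluate(
    generated_dataset,
    metrics=metrics,
    llm=judge_model,
    embeddings=azure_embeddings,
    raise_exceptions=False,
)
```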

github-actions[bot] commented 2 months ago

It seems the issue was answered, closing this now.