shayefarti opened this issue 3 months ago
@shayefarti Hello there! I'm here to help you with any issues you encounter. Let me know how I can assist you.
To get an 'answer' column for evaluation, you need to generate the answers with your own pipeline and add them to the dataset produced by generate_with_langchain_docs. Here is how you can do it:
1. Generate responses for the questions: use a function to query your query_engine with each question and compile the results into a DataFrame that includes the "answer" column.
2. Create the dataset: use the generated responses to create a dataset that includes the "answer" column.
Here is an example code snippet to help you generate the "answer" column:
from datasets import Dataset
from tqdm.auto import tqdm
import pandas as pd

def generate_response(query_engine, question):
    # Query a (LlamaIndex-style) query_engine and collect the generated answer
    # plus the retrieved source contexts.
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }

def generate_ragas_dataset(query_engine, test_df):
    # Run every synthetic question through the query_engine and assemble the
    # columns ragas expects: question, answer, contexts, ground_truth.
    test_questions = test_df["question"].values
    responses = [generate_response(query_engine, q) for q in tqdm(test_questions)]
    dataset_dict = {
        "question": test_questions,
        "answer": [response["answer"] for response in responses],
        "contexts": [response["contexts"] for response in responses],
        "ground_truth": test_df["ground_truth"].values.tolist(),
    }
    ds = Dataset.from_dict(dataset_dict)
    return ds

# Assuming `query_engine` and `test_df` are already defined
ragas_eval_dataset = generate_ragas_dataset(query_engine, test_df)
ragas_evals_df = pd.DataFrame(ragas_eval_dataset)
print(ragas_evals_df.head())
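Note that the snippet above assumes a LlamaIndex-style query_engine (response.response, response.source_nodes). Since the original question uses Azure OpenAI with LangChain, here is a rough sketch of an equivalent per-question helper for a LangChain setup; retriever stands for a vector-store retriever you have already built, the llm argument would be the AzureChatOpenAI instance configured further below, and the prompt is purely illustrative:

def generate_response_langchain(retriever, llm, question):
    # Retrieve supporting documents, then ask the chat model to answer
    # using only that context.
    docs = retriever.invoke(question)
    contexts = [d.page_content for d in docs]
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n\n".join(contexts) + "\n\n"
        "Question: " + question
    )
    answer = llm.invoke(prompt).content  # AIMessage -> plain string
    return {"answer": answer, "contexts": contexts}

The rest of generate_ragas_dataset stays the same; only the per-question call changes.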
Additionally, ensure that your TestsetGenerator
is correctly configured to use the Azure OpenAI models:
# ragas 0.1.x-style imports; `azure_configs` is assumed to hold your Azure
# endpoint and deployment names.
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

azure_model = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_endpoint=azure_configs["base_url"],
    azure_deployment=azure_configs["model_deployment"],
    model=azure_configs["model_name"],
    validate_base_url=False,
)

azure_embeddings = AzureOpenAIEmbeddings(
    openai_api_version="2023-05-15",
    azure_endpoint=azure_configs["base_url"],
    azure_deployment=azure_configs["embedding_deployment"],
    model=azure_configs["embedding_name"],
)

generator = TestsetGenerator.from_langchain(
    generator_llm=azure_model, critic_llm=azure_model, embeddings=azure_embeddings
)

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    raise_exceptions=False,
    with_debugging_logs=False,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
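If you want to confirm which columns the generator actually produced, you can inspect the returned test set directly; as far as I recall, the TestDataset returned by ragas 0.1.x exposes a to_pandas() helper:

test_df = testset.to_pandas()
print(test_df.columns)
# Expect question, contexts, ground_truth (plus metadata such as evolution_type),
# but no 'answer' column -- that one has to come from your own RAG, as shown above.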
This configuration generates the synthetic test set; combined with the response-generation step above, it gives you a dataset that includes the 'answer' column [1][2].
To continue talking to Dosu, mention @dosu.
@dosu Hi, so if I understand correctly, it looks like a bug, doesn't it?
The issue you're experiencing, where the 'answer' column is not present in the test set produced by the generate_with_langchain_docs method, is likely due to a missing step in the configuration or generation process rather than a bug. The method relies on the generate method, which is responsible for the overall generation process.
To ensure the 'answer' column ends up in your evaluation set, you should verify the following:
- The distributions parameter is correctly set to include the necessary evolutions.
- The init_evolution method properly initializes the evolutions with the required LLMs and docstore.
Here is an example of how to generate the 'answer' column:
from datasets import Dataset
from tqdm.auto import tqdm
import pandas as pd

def generate_response(query_engine, question):
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }

def generate_ragas_dataset(query_engine, test_df):
    test_questions = test_df["question"].values
    responses = [generate_response(query_engine, q) for q in tqdm(test_questions)]
    dataset_dict = {
        "question": test_questions,
        "answer": [response["answer"] for response in responses],
        "contexts": [response["contexts"] for response in responses],
        "ground_truth": test_df["ground_truth"].values.tolist(),
    }
    ds = Dataset.from_dict(dataset_dict)
    return ds

# Assuming `query_engine` and `test_df` are already defined
ragas_eval_dataset = generate_ragas_dataset(query_engine, test_df)
ragas_evals_df = pd.DataFrame(ragas_eval_dataset)
print(ragas_evals_df.head())
Ensure that your TestsetGenerator is correctly configured to use the Azure OpenAI models and that the distributions parameter includes the evolutions you want represented in your test set [1][2].
To continue talking to Dosu, mention @dosu.
Hi @shayefarti,
You can look at this previous answer to help you solve this issue: https://github.com/explodinggradients/ragas/issues/1084#issuecomment-2248219601
To deal with your example in more depth: if I recall correctly, I've faced this issue with the tutorials, and I came to the conclusion that this example tutorial was written back when TestsetGenerator.generate_with_langchain_docs still generated an answer column. Nowadays, that column name is used to hold the results of your RAG, while the generated responses to the synthetic question/context pairs live in the column named "ground_truth". So to evaluate your RAG you need to generate answer yourself if you want to measure the quality (answer_relevancy, answer_similarity, answer_correctness, faithfulness) of the RAG's responses.
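To make that last step concrete, here is a minimal sketch of the evaluation call, reusing the generate_ragas_dataset helper and query_engine from the earlier comments and the azure_model / azure_embeddings objects defined above; the exact evaluate() signature may vary slightly across ragas versions:

from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    answer_similarity,
    faithfulness,
)

# The generated test set supplies question, contexts and ground_truth;
# the 'answer' column is filled in by running your own RAG over each question.
test_df = testset.to_pandas()
ragas_eval_dataset = generate_ragas_dataset(query_engine, test_df)

result = evaluate(
    ragas_eval_dataset,
    metrics=[faithfulness, answer_relevancy, answer_similarity, answer_correctness],
    llm=azure_model,
    embeddings=azure_embeddings,
)
print(result)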
@shayefarti Thanks for bringing this up. It is actually a documentation error; please refer to @Gwenn-LR's excellent explanation (thanks a lot @Gwenn-LR ❤️). These are the columns now (the generated test set contains "ground_truth" rather than "answer", as described above).
@shahules786 should we update the documentation with the new table?
Your Question
Synthetic Test Data generation doesn't output the ['answer'] column as in this Example.

Code Examples
Load Doc
Load Modules (I use Azure OpenAI)
Generate Test
Display dataset/testset columns
Run evaluate on the testset_to_dataset