Closed mariamaslam closed 10 months ago
df["contexts"]=df["contexts"].apply(lambda x : [x])
Hi @sango-07, I'm still getting this error. Please let me know what went wrong.
I am trying to get results and I am getting this error ValueError: Dataset feature "contexts" should be of type Sequence[string], got <class 'datasets.features.features.Value'>
Ragas version: Python version: 3
Code to Reproduce I am working on a custom model. I have three different CSV files that I am stitching together, and I am trying to retrieve information.
Error trace
import pandas as pd
import numpy as np
# Load each CSV file into a separate DataFrame
hr_queries = pd.read_csv('test1.csv')
ground_truth = pd.read_csv('ground_truth.csv')
context = pd.read_csv('context.csv')
questions = hr_queries['questions']
answers = hr_queries['answers']
# Extract the first column from the 'ground_truth' DataFrame
ground_truth = ground_truth.iloc[:, 0]
# Extract the 'context' column from the 'context' DataFrame
context = context['context']
# Create a data dictionary
data_dict = {
'question': questions,
'answer': answers,
'ground_truths': ground_truth,
'contexts': context,
}
# Concatenate the DataFrames in the data dictionary
df = pd.concat(data_dict, axis=1)
# Save the concatenated DataFrame to a new CSV file
df.to_csv('test2.csv', index=False)
from datasets import load_dataset
with open('test2.csv') as f:
    data = f.read()
dataset = load_dataset('csv', data_files='test2.csv')
print(dataset)
from ragas.metrics import (
context_precision,
answer_relevancy,
faithfulness,
context_recall,
)
from ragas.metrics.critique import harmfulness
# list of metrics we're going to use
metrics = [
faithfulness,
answer_relevancy,
context_recall,
context_precision,
harmfulness,
]
for m in metrics:
    m.__setattr__("llm", ragas_azure_model)
from ragas import evaluate
result = evaluate(
dataset["train"],
metrics=metrics,
)
result
ValueError: Dataset feature "contexts" should be of type Sequence[string], got <class 'datasets.features.features.Value'>
Expected behavior I should be able to evaluate with [context_recall]
i am still getting this after adding what you suggested
@sango-07
Hi @mariamaslam, the underlying issue is that your contexts column is not of the required type, i.e. list[str]. This will be evident if you check it manually by doing
dataset["contexts"][0]
You can either convert it to list[str] using the .map function or make sure the data is in the correct format before transforming it to an HF dataset.
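A minimal sketch of both options, assuming each row carries a single context string. The helper name `wrap_context` and the sample rows are made up for illustration. `Dataset.from_dict` and `Dataset.map` come from the `datasets` library, and that part is skipped if the library is not installed.

```python
def wrap_context(row):
    """Return a copy of `row` whose "contexts" value is a list of strings."""
    value = row["contexts"]
    if isinstance(value, str):
        value = [value]  # a bare string becomes a one-element list
    return {**row, "contexts": [str(v) for v in value]}


# Hypothetical rows, shaped like what a CSV load yields (contexts is a plain str).
rows = [
    {"question": "q1", "answer": "a1", "contexts": "context for q1"},
    {"question": "q2", "answer": "a2", "contexts": "context for q2"},
]

# Option 1: fix the data before it becomes a Hugging Face dataset.
fixed_rows = [wrap_context(r) for r in rows]
print(fixed_rows[0]["contexts"])

# Option 2: fix an existing dataset with .map (requires `datasets`).
try:
    from datasets import Dataset
except ImportError:
    Dataset = None

if Dataset is not None:
    ds = Dataset.from_dict({key: [r[key] for r in rows] for key in rows[0]})
    ds = ds.map(wrap_context)  # each "contexts" cell is now a list of strings
    print(ds[0]["contexts"])
```

Either way, the goal is that every `contexts` cell is a list, even when it holds only one string.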
Hi @shahules786, thank you for your prompt response. Can you please help by providing a sample, if possible? I did try .map and converting it manually to list[str] as well, but it didn't work out for me.
import pandas as pd
import numpy as np
# Load each CSV file into a separate DataFrame
hr_queries = pd.read_csv('queries.csv')
ground_truth = pd.read_csv('ground_truth.csv')
context = pd.read_csv('context.csv')
# Extract the 'question' and 'answer' columns from the 'hr_queries' DataFrame
questions = hr_queries['questions']
answers = hr_queries['answers']
# Extract the first column from the 'ground_truth' DataFrame
ground_truth = ground_truth.iloc[:, 0]
# Extract the 'context' column from the 'context' DataFrame
context = context['context'] # Replace 'context' with the actual column name if it's different
# Create a data dictionary
data_dict = {
'question': questions,
'answer': answers,
'ground_truths':ground_truth,
'contexts': context,
}
# Concatenate the DataFrames in the data dictionary
df = pd.concat(data_dict, axis=1)
df = df.astype({"contexts": str}) # Explicitly convert to string column
# Save the concatenated DataFrame to a new CSV file
df.to_csv('queries_ragas.csv', index=False)
df = df.astype({"contexts": str}) # Explicitly convert to string column
contexts needs to be list[str] and not str
Would this help? https://docs.ragas.io/en/latest/howtos/applications/data_preparation.html
So my data is in CSV, but I need to manually put it into this format? I'm sorry for the silly questions. I am new to this and I'm still learning.
from datasets import Dataset
data_samples = {
'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
'answer': ['The first superbowl was held on January 15, 1967', 'The most super bowls have been won by The New England Patriots'],
'contexts' : [['The Super Bowl....season since 1966,','replacing the NFL...in February.'],
['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
'ground_truths': [['The first superbowl was held on January 15, 1967'], ['The New England Patriots have won the Super Bowl a record six times']]
}
dataset = Dataset.from_dict(data_samples)
No worries @mariamaslam, the document shows the required format. You can programmatically convert your CSV to a dict and check whether it's in the required format. If not, do the necessary manipulations on the dict to match the required format. Then load the dict as a dataset as shown.
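As a rough sketch of that CSV-to-dict route, assuming the column names used earlier in this thread and one context and one ground truth per row: the inline `csv_text` stands in for the real file, and the final `Dataset.from_dict` step only runs if `datasets` is installed.

```python
import csv
import io

# Stand-in for open('queries_ragas.csv'); the contents are illustrative only.
csv_text = """question,answer,ground_truths,contexts
q1,a1,gt1,context for q1
q2,a2,gt2,context for q2
"""

data = {"question": [], "answer": [], "contexts": [], "ground_truths": []}
for row in csv.DictReader(io.StringIO(csv_text)):
    data["question"].append(row["question"])
    data["answer"].append(row["answer"])
    # Wrap the single string in a list so the column becomes list[str].
    data["contexts"].append([row["contexts"]])
    data["ground_truths"].append([row["ground_truths"]])

# Check the format before handing the dict to the evaluation.
contexts_ok = all(
    isinstance(cell, list) and all(isinstance(item, str) for item in cell)
    for cell in data["contexts"]
)
print("contexts is list[str] per row:", contexts_ok)

try:
    from datasets import Dataset
    dataset = Dataset.from_dict(data)
except ImportError:
    dataset = None  # `datasets` not installed; the dict itself is still valid
```

Building the dict in memory and passing it straight to `Dataset.from_dict` avoids writing the lists back to a CSV file.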
from datasets import load_dataset
with open('hr_queries_ragas.csv') as f:
    data = f.read()
dataset = load_dataset('csv', data_files='hr_queries_ragas.csv')
contexts = dataset['train']['contexts'] # Replace 'train' with the correct split if it's different
contexts = [str(context) for context in contexts]
# Print the first few items of the list
print(contexts[:5])
# Print the dataset
print(dataset)
from ragas import evaluate
result = evaluate(
dataset["train"],
metrics=metrics,
)
@shahules786 this is still giving me same error.
this is also giving me similar error :( @shahules786
from datasets import load_dataset
dataset = load_dataset('csv', data_files='queries_ragas.csv')
data_dict = dataset['train'].to_dict()
print(data_dict)
from ragas import evaluate
result = evaluate(
dataset["train"],
metrics=metrics,
)
I'm still stuck and way more confused now. I don't know whether I should keep my data in CSV, convert it to JSON, or convert it into a data_dict.
ValueError: Dataset feature "contexts" should be of type Sequence[string], got <class 'datasets.features.features.Value'>
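One possible reason the attempts above keep failing (an assumption, since the thread does not confirm it) is that a CSV cell can only hold text. A list written to CSV comes back as a string, so any list wrapping done before `to_csv` is lost on reload. A small standard-library sketch of that round trip, with made-up data:

```python
import ast
import csv
import io

# Write one row whose "contexts" cell is a Python list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["question", "contexts"])
writer.writerow(["q1", ["context for q1"]])

# Read it back: the cell is now text that merely looks like a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["contexts"]).__name__)

# If the file has to stay CSV, the text can be parsed back into a list.
restored = ast.literal_eval(row["contexts"])
print(type(restored).__name__)
```

Under that assumption, the list conversion has to happen after the CSV is loaded, not before it is saved.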
@mariamaslam Hi there !
As @shahules786 mentioned 5 days ago, the error message is due to your context data not being in the format of list[str].
Today, I created the dataset with the code below and resolved the same error.
from datasets import Dataset
...
my_chain = ...(I made it with LangChain)
retriever = ...(I made it with Weaviate)
...
# Make questions
questions = [
"question 1",
"question 2",
"question 3",
]
# Make ground_truths
ground_truths = [
["ground_truth to the question 1"],
["ground_truth to the question 2"],
["ground_truth to the question 3"],
]
answers = []
contexts = []
for query in questions:
    # Make answers
    answers.append(my_chain.invoke({"question":query}))
    # Make contexts
    unit_context = ''
    for docs in retriever.get_relevant_documents(query):
        unit_context += docs.page_content # type of page_content is 'str'
    contexts.append([unit_context]) # type of unit_context is 'str'
# To dict
data = {
"question": questions,
"answer": answers,
"contexts": contexts,
"ground_truths": ground_truths
}
# Convert dict to dataset
dataset = Dataset.from_dict(data)
# Evaluation Result
result = evaluate(
dataset = dataset,
metrics=[
faithfulness,
answer_relevancy,
context_precision,
context_recall,
],
)
df = result.to_pandas()
df
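The loop above joins every retrieved document into one string per question. A possible variation, sketched here with a stand-in retriever, keeps each document as its own list entry, which matches the multi-string contexts in the documentation sample. `FakeDoc` and `fake_retrieve` are hypothetical placeholders for the real LangChain documents and retriever.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document with a page_content field."""
    page_content: str


def fake_retrieve(query):
    """Hypothetical stand-in for retriever.get_relevant_documents(query)."""
    return [
        FakeDoc(f"first passage for {query}"),
        FakeDoc(f"second passage for {query}"),
    ]


questions = ["question 1", "question 2"]

contexts = []
for query in questions:
    # One list per question, one string per retrieved document.
    contexts.append([doc.page_content for doc in fake_retrieve(query)])

print(contexts[0])
```

Both shapes are list[str] per row, so either one should satisfy the type check from the error message.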
It looks like you need to modify the part of your code that generates contexts.
Your data_samples should be changed as follows.
# BEFORE
data_samples = {
'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
'answer': ['The first superbowl was held on January 15, 1967', 'The most super bowls have been won by The New England Patriots'],
'contexts' : [['The Super Bowl....season since 1966,','replacing the NFL...in February.'],
['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
'ground_truths': [['The first superbowl was held on January 15, 1967'], ['The New England Patriots have won the Super Bowl a record six times']]
}
dataset = Dataset.from_dict(data_samples)
# AFTER
data_samples = {
'question': ['question 1', 'question 2', 'question 3'],
'answer': ['answer to the question 1', 'answer to the question 2', 'answer to the question 3'],
'contexts': [['context 1'], ['context 2'], ['context 3']],
'ground_truths': [['ground_truth to the question 1'], ['ground_truth to the question 2'], ['ground_truth to the question 3']]
}
dataset = Dataset.from_dict(data_samples)
Additionally, this post will be helpful for your test.
Hope this helps sincerely!
Thanks a million @code-sum for helping @mariamaslam. It's an honour to have valuable community members like yourself in ragas 🙂 ❤️
@mariamaslam let me know if it's fixed and if there are any issues. I'm closing this for now, but if there is anything else feel free to reopen it.
Hi @jjmachan, thank you for your kind words. My issue has been resolved.
My only question is regarding the results: how can I make them more presentable, and what do they indicate?
That is a pandas dataframe, but you can do further analysis in there. Could you explain a bit more about the ways you would like to represent the information?
Hi @jjmachan, I want to understand the scoring. What do 1.00, 0.9, and 0.88 indicate? In the end I want to display this in a graph with answer_relevancy and faithfulness, and state that if the score is in a certain range it is acceptable, and if it is below a specific range it needs to be improved.
Understood 🙂 - now this is tricky but you will need 2 things
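For the thresholding part of the question, a minimal sketch follows. It assumes the scores lie between 0 and 1 with higher being better. The 0.8 cut-off and the score values are invented for illustration and are not a Ragas recommendation.

```python
ACCEPTABLE = 0.8  # hypothetical cut-off; choose one that suits your use case

# Made-up scores in the style of the numbers mentioned above.
scores = {
    "faithfulness": 1.00,
    "answer_relevancy": 0.90,
    "context_recall": 0.88,
    "context_precision": 0.65,
}


def label(score, threshold=ACCEPTABLE):
    """Classify a score as acceptable or needing improvement."""
    return "acceptable" if score >= threshold else "needs improvement"


# Print a simple text bar chart with a verdict per metric.
for name, score in scores.items():
    bar = "#" * round(score * 20)
    print(f"{name:<18} {score:.2f} {bar:<20} {label(score)}")
```

The same `label` function could be applied to each metric column of the results dataframe before plotting.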