Open adityamity opened 2 months ago
Hello @adityamity! I'm a bot that can help you with bugs, questions, and becoming a contributor. I'm here to assist you while waiting for a human maintainer.
I found some similar issues and solutions that might help you:
Duplicated Questions: Modify the TestsetGenerator to ensure unique questions by checking for duplicates before adding a new question (see the code below).
Incorrect 'Ground Truth' Markings: The AnswerCorrectness metric uses a detailed prompt to classify statements, which can sometimes lead to incorrect markings if statements are not accurately categorized [3].
import pandas as pd
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

TEST_SIZE = 25

generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()
generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

# `documents` is the list of llama-index documents you have already loaded.
unique_questions = set()
testset = []
while len(testset) < TEST_SIZE:
    new_testset = generator.generate_with_llamaindex_docs(
        documents, test_size=TEST_SIZE - len(testset), distributions=distribution
    )
    # Convert the generated TestDataset to plain records so duplicates can be filtered.
    for item in new_testset.to_pandas().to_dict(orient="records"):
        if item["question"] not in unique_questions:
            unique_questions.add(item["question"])
            testset.append(item)

test_df = pd.DataFrame(testset)
test_df.head()
Ensure that the 'ground_truth' markings are correctly assigned by verifying the ground truth answers:
def verify_ground_truth(test_df):
    # Print each generated question with its ground truth for manual inspection.
    for index, row in test_df.iterrows():
        question = row["question"]
        ground_truth = row["ground_truth"]
        print(f"Question: {question}")
        print(f"Ground Truth: {ground_truth}")

verify_ground_truth(test_df)
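If verification surfaces rows whose ground truth is the "not present in the context" message, one practical option is to filter them out before evaluation so they do not pull context_precision and context_recall down to 0. The sketch below assumes the test_df built above with a 'ground_truth' column; drop_unanswerable_rows and REJECT_PHRASE are names introduced here for illustration, not part of the Ragas API.
REJECT_PHRASE = "The answer to the given question is not present in the context"

def drop_unanswerable_rows(test_df):
    # Keep only rows whose ground_truth does not contain the reject phrase.
    mask = ~test_df["ground_truth"].astype(str).str.contains(REJECT_PHRASE, regex=False)
    return test_df[mask].reset_index(drop=True)

clean_df = drop_unanswerable_rows(test_df)
print(f"Dropped {len(test_df) - len(clean_df)} rows with unanswerable ground truth")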
Additionally, refer to the Ragas documentation for more detailed guidance on creating synthetic test data and evaluating different LLMs [4][5][6][7].
To continue talking to Dosu, mention @dosu.
@dosu - Please refer to the image. My query is regarding the generation of the 'ground_truth'. Although the context is available, I'm still getting 'The answer to the given question is not present in the context'. I'd like to understand how and on what basis this 'ground_truth' is being generated. If the 'ground_truth' generated by the RAGAS TestsetGenerator is incorrect, it could affect the accuracy of my overall metrics.
The 'ground_truth' in the RAGAS TestsetGenerator is generated by segmenting the reference text into simpler statements using a sentence segmenter. These segmented statements are then used to create a prompt for the LLM to classify the statements in the answer as true positive (TP), false positive (FP), or false negative (FN).
The reason it might incorrectly state that the answer is not present in the context, even when it is, could be due to the segmentation process or the LLM's interpretation of the statements. If the segmentation does not accurately capture the context or if the LLM misinterprets the statements, it could lead to incorrect classifications.
Here is the relevant code snippet:
def _create_statements_prompt(self, question: str, text: str) -> PromptValue:
    assert self.sentence_segmenter is not None, "sentence_segmenter is not set"
    sentences = self.sentence_segmenter.segment(text)
    sentences = [
        sentence for sentence in sentences if sentence.strip().endswith(".")
    ]
    sentences = "\n".join([f"{i}:{x}" for i, x in enumerate(sentences)])
    prompt_value = self.long_form_answer_prompt.format(
        question=question, answer=text, sentences=sentences
    )
    return prompt_value

async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
    assert self.llm is not None, "LLM must be set"

    question = row["user_input"]
    statements = {}
    for item in ["response", "reference"]:
        p_value = self._create_statements_prompt(question, row[item])
        item_statement = await self.llm.generate(p_value, callbacks=callbacks)
        statements[item] = await _statements_output_parser.aparse(
            item_statement.generations[0][0].text,
            p_value,
            self.llm,
            self.max_retries,
        )
        statements[item] = (
            statements[item].dicts() if statements[item] is not None else []
        )

    if not all([val == [] for val in statements.values()]):
        ground_truth = [
            statement
            for item in statements["reference"]
            for statement in item["simpler_statements"]
        ]
        answer = [
            statement
            for item in statements["response"]
            for statement in item["simpler_statements"]
        ]
        p_value = self.correctness_prompt.format(
            question=question,
            ground_truth=ground_truth,
            answer=answer,
        )
        is_statement_present = await self.llm.generate(p_value, callbacks=callbacks)
        result_text = is_statement_present.generations[0][0].text

        answers = await _output_parser.aparse(
            result_text, p_value, self.llm, self.max_retries
        )
        if answers is None:
            return np.nan

        f1_score = self._compute_statement_presence(answers)
    else:
        f1_score = 1.0
This process involves generating statements from both the response and reference texts, and then using the LLM to classify these statements, which can sometimes lead to incorrect classifications if the segmentation or interpretation is flawed [1].
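For intuition, once the statements have been classified as TP, FP, or FN, the factuality part of the score is an F1-style aggregation over those counts. The snippet below is a simplified standalone sketch of that aggregation; statement_f1 is a name introduced here for illustration, and the exact internals of _compute_statement_presence may differ.
def statement_f1(tp: int, fp: int, fn: int) -> float:
    # Standard F1 written in terms of statement counts:
    # F1 = 2*TP / (2*TP + FP + FN) == TP / (TP + 0.5*(FP + FN))
    if tp == 0:
        return 0.0
    return tp / (tp + 0.5 * (fp + fn))

# Example: 3 reference statements covered by the answer (TP), 1 answer statement
# not supported by the reference (FP), 2 reference statements missing (FN).
print(statement_f1(tp=3, fp=1, fn=2))  # 0.666...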
Additionally, the RAGAS synthetic data generation feature can be used to create questions and ground truth answers, which might help in formulating a more accurate test dataset. This feature is particularly useful when starting out and can be found in the RAGAS documentation under the testset generation section [2].
To continue talking to Dosu, mention @dosu.
@dosu How can I handle cases where it incorrectly classifies the ground truth as "The answer to the given question is not present in the context"?
To handle cases where the TestsetGenerator incorrectly classifies "The answer to the given question is not present in the context", you can modify the critic_question method to include additional checks or improve the prompt used for generating the critic's verdict. Here is an example of how you might enhance the critic_question method:
async def critic_question(self, question: str) -> bool:
    assert self.llm is not None, "LLM is not initialized"
    result = await self.llm.generate(self.critic_question_prompt.format(question=question))
    raw_text = result.generations[0][0].text
    # Additional check: reject questions whose critic output flags the answer as
    # missing from the context, before parsing the JSON verdict.
    if "The answer to the given question is not present in the context" in raw_text:
        return False
    output = json.loads(raw_text)
    return all(score >= 2 for score in output.values())
In this modification, an additional check is added to see if the output contains the phrase "The answer to the given question is not present in the context". If this phrase is detected, the method returns False, indicating that the question is not valid. This should help in handling cases where the TestsetGenerator incorrectly classifies the presence of the answer.
Additionally, you can refer to the AbstractQA class in the ragas framework, which provides a structure for generating and validating questions. The critic_question method in this class is designed to evaluate the quality of generated questions, and you can customize it further to improve the accuracy of the classification [1][2].
To continue talking to Dosu, mention @dosu.
I am currently using Ragas to evaluate my RAG application, which is built using llama index. I've encountered a few issues in the generated results:
1- When generating queries using TestsetGenerator, I've noticed that many of the questions are duplicated.
2- In several instances, the 'ground_truth' is incorrectly marked as "The answer to the given question is not present in the context," even though the context is actually available. This leads to a 'context_precision' and 'context_recall' of 0, negatively impacting the overall score.
Could you advise on how to address these issues?