explodinggradients / ragas


Addressing Duplicates Question and Incorrect 'Ground Truth' in RAG Evaluation with Ragas #1244

Open adityamity opened 1 month ago

adityamity commented 1 month ago

I am currently using Ragas to evaluate my RAG application, which is built with LlamaIndex. I've encountered a few issues in the generated results:

1. When generating queries using TestsetGenerator, I've noticed that many of the questions are duplicated.
2. In several instances, the 'ground_truth' is incorrectly marked as "The answer to the given question is not present in the context," even though the context is actually available. This leads to a 'context_precision' and 'context_recall' of 0, negatively impacting the overall score.

Could you advise on how to address these issues?

dosubot[bot] commented 1 month ago

Hello @adityamity! I'm a bot that can help you with bugs, questions, and becoming a contributor. I'm here to assist you while waiting for a human maintainer.

I found some similar issues and solutions that might help you:

  1. Duplicated Questions:

    • There is an open issue where users reported that Ragas generates test cases from initially uploaded documents even after updating them, leading to mixed or old documents being used [1].
    • A suggested solution is to divide the code into main and worker scripts to ensure unique questions are generated.
  2. Incorrect 'Ground Truth' Markings:

    • Common causes include deprecated column names, incorrect data types, and missing required columns. Ensure that the 'ground_truth' column is correctly named and of type string [2]; a quick schema check is sketched just after this list.
    • The AnswerCorrectness metric uses a detailed prompt to classify statements, which can sometimes lead to incorrect markings if not accurately categorized [3].
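
As a quick guard against the second cause, you can validate the dataset schema before running any metric. This is a minimal sketch, assuming your generated test set has already been converted to a pandas DataFrame called test_df (the helper name and the expected column list are illustrative, not part of the Ragas API):

import pandas as pd

def check_testset_schema(test_df: pd.DataFrame) -> None:
    # Adjust the expected column names to the ones your Ragas version uses.
    required = ["question", "contexts", "ground_truth"]
    missing = [col for col in required if col not in test_df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # 'ground_truth' should hold plain strings, not lists or NaN.
    non_string = test_df[~test_df["ground_truth"].apply(lambda v: isinstance(v, str))]
    if not non_string.empty:
        print(f"{len(non_string)} rows have a non-string 'ground_truth': {non_string.index.tolist()}")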

Steps to Address These Issues:

Ensuring Unique Questions

Modify the TestsetGenerator to ensure unique questions by checking for duplicates before adding a new question:

import pandas as pd

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

TEST_SIZE = 25

generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

unique_questions = set()
rows = []

# `documents` is the list of llama_index documents your application indexes.
while len(rows) < TEST_SIZE:
    new_testset = generator.generate_with_llamaindex_docs(
        documents, test_size=TEST_SIZE - len(rows), distributions=distribution
    )
    # Convert the TestDataset to a DataFrame and keep only unseen questions.
    for _, item in new_testset.to_pandas().iterrows():
        if item["question"] not in unique_questions:
            unique_questions.add(item["question"])
            rows.append(item)

test_df = pd.DataFrame(rows)
test_df.head()

Verifying Ground Truth Markings

Ensure that the 'ground_truth' markings are correctly assigned by verifying the ground truth answers:

def verify_ground_truth(test_df):
    for index, row in test_df.iterrows():
        question = row['question']
        ground_truth = row['ground_truth']
        print(f"Question: {question}")
        print(f"Ground Truth: {ground_truth}")

verify_ground_truth(test_df)
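
Beyond printing each pair, you can also flag the exact failure you described and drop those rows before computing metrics. This is a minimal sketch, assuming test_df is the DataFrame built above; drop_unanswerable_rows is an illustrative helper, not a Ragas API:

NOT_IN_CONTEXT = "The answer to the given question is not present in the context"

def drop_unanswerable_rows(test_df):
    # Rows where the generator claimed the answer is absent from the context.
    mask = test_df["ground_truth"].str.contains(NOT_IN_CONTEXT, na=False, regex=False)
    print(f"Dropping {mask.sum()} of {len(test_df)} rows with an unusable ground_truth")
    return test_df[~mask].reset_index(drop=True)

clean_df = drop_unanswerable_rows(test_df)

Filtering these rows keeps the bad ground truths from dragging context_precision and context_recall to 0; alternatively, you can regenerate only those questions.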

Additionally, refer to the Ragas documentation for more detailed guidance on creating synthetic test data and evaluating different LLMs [4][5][6][7].

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu - Please refer to the image. My query is regarding the generation of the 'ground_truth'. Although the context is available, I'm still getting 'The answer to the given question is not present in the context'. I'd like to understand how and on what basis this 'ground_truth' is being generated. If the 'ground_truth' generated by RAGAS TestsetGenerator is incorrect, it could affect the accuracy of my overall metrics.

[image attachment]
dosubot[bot] commented 1 month ago

The 'ground_truth' in the RAGAS TestsetGenerator is generated by segmenting the reference text into simpler statements using a sentence segmenter. These segmented statements are then used to create a prompt for the LLM to classify the statements in the answer as true positive (TP), false positive (FP), or false negative (FN).

The reason it might incorrectly state that the answer is not present in the context, even when it is, could be due to the segmentation process or the LLM's interpretation of the statements. If the segmentation does not accurately capture the context or if the LLM misinterprets the statements, it could lead to incorrect classifications.

Here is the relevant code snippet:

def _create_statements_prompt(self, question: str, text: str) -> PromptValue:
    assert self.sentence_segmenter is not None, "sentence_segmenter is not set"

    sentences = self.sentence_segmenter.segment(text)
    sentences = [
        sentence for sentence in sentences if sentence.strip().endswith(".")
    ]
    sentences = "\n".join([f"{i}:{x}" for i, x in enumerate(sentences)])
    prompt_value = self.long_form_answer_prompt.format(
        question=question, answer=text, sentences=sentences
    )
    return prompt_value

async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
    assert self.llm is not None, "LLM must be set"

    question = row["user_input"]
    statements = {}
    for item in ["response", "reference"]:
        p_value = self._create_statements_prompt(question, row[item])
        item_statement = await self.llm.generate(p_value, callbacks=callbacks)
        statements[item] = await _statements_output_parser.aparse(
            item_statement.generations[0][0].text,
            p_value,
            self.llm,
            self.max_retries,
        )
        statements[item] = (
            statements[item].dicts() if statements[item] is not None else []
        )
    if not all([val == [] for val in statements.values()]):
        ground_truth = [
            statement
            for item in statements["reference"]
            for statement in item["simpler_statements"]
        ]
        answer = [
            statement
            for item in statements["response"]
            for statement in item["simpler_statements"]
        ]
        p_value = self.correctness_prompt.format(
            question=question,
            ground_truth=ground_truth,
            answer=answer,
        )
        is_statement_present = await self.llm.generate(p_value, callbacks=callbacks)
        result_text = is_statement_present.generations[0][0].text

        answers = await _output_parser.aparse(
            result_text, p_value, self.llm, self.max_retries
        )
        if answers is None:
            return np.nan

        f1_score = self._compute_statement_presence(answers)
    else:
        f1_score = 1.0

This process involves generating statements from both the response and reference texts, and then using the LLM to classify these statements, which can sometimes lead to incorrect classifications if the segmentation or interpretation is flawed [1].
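
For intuition, those TP/FP/FN labels feed an F1-style score. If your installed version follows the same scheme, _compute_statement_presence roughly reduces to the sketch below (the exact weighting may differ between releases):

def statement_presence_f1(tp: int, fp: int, fn: int) -> float:
    # TP: answer statements supported by the ground truth
    # FP: answer statements not supported by the ground truth
    # FN: ground-truth statements missing from the answer
    return tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0.0

# e.g. 3 supported, 1 unsupported, 2 missing -> 3 / (3 + 1.5) ≈ 0.67
print(statement_presence_f1(3, 1, 2))

This is why a single mis-generated ground truth, such as the 'not present in the context' placeholder, can pull a row's score down to 0.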

Additionally, the RAGAS synthetic data generation feature can be used to create questions and ground truth answers, which might help in formulating a more accurate test dataset. This feature is particularly useful when starting out and can be found in the RAGAS documentation under the testset generation section [2].

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu How can I handle cases where it incorrectly classifies the ground truth as "The answer to the given question is not present in the context"?

dosubot[bot] commented 1 month ago

To handle cases where the TestsetGenerator incorrectly classifies "The answer to the given question is not present in the context", you can modify the critic_question method to include additional checks or improve the prompt used for generating the critic's verdict. Here is an example of how you might enhance the critic_question method:

async def critic_question(self, question: str) -> bool:
    assert self.llm is not None, "LLM is not initialized"
    response = await self.llm.generate(self.critic_question_prompt.format(question=question))
    raw_text = response.generations[0][0].text

    # Additional check: reject the question outright if the critic output
    # says the answer is missing from the context.
    if "The answer to the given question is not present in the context" in raw_text:
        return False

    output = json.loads(raw_text)
    return all(score >= 2 for score in output.values())

In this modification, an additional check inspects the raw critic output for the phrase "The answer to the given question is not present in the context" before parsing it as JSON. If the phrase is detected, the method returns False, indicating that the question is not valid. This should help in handling cases where the TestsetGenerator incorrectly classifies the presence of the answer.

Additionally, you can refer to the AbstractQA class in the ragas framework, which provides a structure for generating and validating questions. The critic_question method in this class is designed to evaluate the quality of generated questions, and you can customize it further to improve the accuracy of the classification [1][2].

To continue talking to Dosu, mention @dosu.