explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

Black box insights of answer correctness #1193

Closed · knpunk closed 1 month ago

knpunk commented 1 month ago

@dosu How can I view the TP, FP, and FN that the LLM generates to calculate the F1 score? I want to verify them manually.

dosubot[bot] commented 1 month ago

To view the True Positives (TP), False Positives (FP), and False Negatives (FN) generated by the LLM, you can use the AnswerCorrectnessClassification class and the CORRECTNESS_PROMPT to classify the statements. Here is an example of how you can do this:

from ragas.metrics._answer_correctness import AnswerCorrectnessClassification, CORRECTNESS_PROMPT
from ragas.llms.output_parser import RagasoutputParser

# Example data
question = "What powers the sun and what is its primary function?"
answer = [
    "The sun is powered by nuclear fission, similar to nuclear reactors on Earth.",
    "The primary function of the sun is to provide light to the solar system.",
]
ground_truth = [
    "The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium.",
    "This fusion process in the sun's core releases a tremendous amount of energy.",
    "The energy from the sun provides heat and light, which are essential for life on Earth.",
    "The sun's light plays a critical role in Earth's climate system.",
    "Sunlight helps to drive the weather and ocean currents.",
]

# Create the prompt value
prompt_value = CORRECTNESS_PROMPT.format(
    question=question,
    ground_truth=ground_truth,
    answer=answer,
)

# Assuming you have an LLM instance; replace this with however you invoke your model.
# If your LLM's generate method is async, await it instead.
llm = ...  # your LLM instance
result = llm.generate(prompt_value)

# Parse the result
output_parser = RagasoutputParser(pydantic_object=AnswerCorrectnessClassification)
classification = output_parser.parse(result.generations[0][0].text)

# View the TP, FP, and FN
tp = classification.TP
fp = classification.FP
fn = classification.FN

print("True Positives (TP):", tp)
print("False Positives (FP):", fp)
print("False Negatives (FN):", fn)

This code will classify the statements in the answer compared to the ground truth and output the True Positives, False Positives, and False Negatives [1].
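If you also want to reproduce the factuality part of the score from these counts, it is an F1 over the classified statements; a minimal sketch, assuming the tp / (tp + 0.5 * (fp + fn)) formula used by _compute_statement_presence:

# Sketch: factuality F1 from the classified statement counts, assuming the
# tp / (tp + 0.5 * (fp + fn)) formula used by _compute_statement_presence.
tp_count = len(classification.TP)
fp_count = len(classification.FP)
fn_count = len(classification.FN)
f1 = tp_count / (tp_count + 0.5 * (fp_count + fn_count)) if tp_count > 0 else 0.0
print("Factuality F1:", f1)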

To continue talking to Dosu, mention @dosu.

knpunk commented 1 month ago

@dosu It's providing the number of such statements, but for the POC I also want the specific statements. Can you help with classifying which statements are TP, FP, and FN?

dosubot[bot] commented 1 month ago

To see the specific statements that fall into each category (TP, FP, FN), you can modify the _ascore method of AnswerCorrectness to print them after parsing. Here is an example:

# Modified _ascore of AnswerCorrectness (ragas.metrics._answer_correctness);
# it relies on that module's helpers (_statements_output_parser, _output_parser, np).
async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
    assert self.llm is not None, "LLM must be set"

    question = row["question"]
    statements = {}
    for item in ["answer", "ground_truth"]:
        p_value = self._create_statements_prompt(row)
        item_statement = await self.llm.generate(p_value, callbacks=callbacks)
        statements[item] = await _statements_output_parser.aparse(
            item_statement.generations[0][0].text,
            p_value,
            self.llm,
            self.max_retries,
        )
        statements[item] = (
            statements[item].dicts() if statements[item] is not None else []
        )

    if not all([val == [] for val in statements.values()]):
        ground_truth = [
            statement
            for item in statements["ground_truth"]
            for statement in item["simpler_statements"]
        ]
        answer = [
            statement
            for item in statements["answer"]
            for statement in item["simpler_statements"]
        ]
        p_value = self.correctness_prompt.format(
            question=question,
            ground_truth=ground_truth,
            answer=answer,
        )
        is_statement_present = await self.llm.generate(p_value, callbacks=callbacks)
        result_text = is_statement_present.generations[0][0].text

        answers = await _output_parser.aparse(
            result_text, p_value, self.llm, self.max_retries
        )
        if answers is None:
            return np.nan

        # Display the classified statements
        print("True Positives (TP):")
        for tp in answers.TP:
            print(f"Statement: {tp['statement']}, Reason: {tp['reason']}")

        print("False Positives (FP):")
        for fp in answers.FP:
            print(f"Statement: {fp['statement']}, Reason: {fp['reason']}")

        print("False Negatives (FN):")
        for fn in answers.FN:
            print(f"Statement: {fn['statement']}, Reason: {fn['reason']}")

        f1_score = self._compute_statement_presence(answers)
    else:
        f1_score = 1.0

    if self.weights[1] == 0:
        similarity_score = 0.0
    else:
        assert self.answer_similarity is not None, "AnswerSimilarity must be set"

        similarity_score = await self.answer_similarity.ascore(
            row, callbacks=callbacks
        )

    score = np.average(
        [f1_score, similarity_score],
        weights=self.weights,
    )

    return float(score)

This modification prints each statement classified as TP, FP, or FN along with the model's reason, so you can inspect the classification directly [1][2].
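A minimal usage sketch follows, assuming you bind the modified method onto the built-in metric; the type(answer_correctness)._ascore assignment is a hypothetical monkey-patch, while the dataset columns and evaluate call follow the standard ragas API:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

# Hypothetical monkey-patch: bind the modified coroutine above onto the metric class.
type(answer_correctness)._ascore = _ascore

data = {
    "question": ["What powers the sun and what is its primary function?"],
    "answer": ["The sun is powered by nuclear fission, similar to nuclear reactors on Earth."],
    "ground_truth": ["The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium."],
}

# evaluate needs a configured LLM (it defaults to OpenAI via OPENAI_API_KEY);
# the TP/FP/FN statements are printed while the metric runs.
result = evaluate(Dataset.from_dict(data), metrics=[answer_correctness])
print(result)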

To continue talking to Dosu, mention @dosu.

knpunk commented 1 month ago

Thanks, works fine :)