knpunk closed this issue 1 month ago
To view the True Positives (TP), False Positives (FP), and False Negatives (FN) generated by the LLM, you can use the AnswerCorrectnessClassification class together with the CORRECTNESS_PROMPT to classify the statements. Here is an example of how you can do this:
```python
from ragas.metrics._answer_correctness import (
    AnswerCorrectnessClassification,
    CORRECTNESS_PROMPT,
)
from ragas.llms.output_parser import RagasoutputParser

# Example data
question = "What powers the sun and what is its primary function?"
answer = [
    "The sun is powered by nuclear fission, similar to nuclear reactors on Earth.",
    "The primary function of the sun is to provide light to the solar system.",
]
ground_truth = [
    "The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium.",
    "This fusion process in the sun's core releases a tremendous amount of energy.",
    "The energy from the sun provides heat and light, which are essential for life on Earth.",
    "The sun's light plays a critical role in Earth's climate system.",
    "Sunlight helps to drive the weather and ocean currents.",
]

# Create the prompt value
prompt_value = CORRECTNESS_PROMPT.format(
    question=question,
    ground_truth=ground_truth,
    answer=answer,
)

# Assuming you have an LLM instance and a method to generate the output
llm = ...  # Your LLM instance
result = llm.generate(prompt_value)

# Parse the result into an AnswerCorrectnessClassification object
output_parser = RagasoutputParser(pydantic_object=AnswerCorrectnessClassification)
classification = output_parser.parse(result.generations[0][0].text)

# View the TP, FP, and FN
tp = classification.TP
fp = classification.FP
fn = classification.FN

print("True Positives (TP):", tp)
print("False Positives (FP):", fp)
print("False Negatives (FN):", fn)
```
This code will classify the statements in the answer compared to the ground truth and output the True Positives, False Positives, and False Negatives [1].
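For reference, the parsed result is just a small object holding TP, FP, and FN lists of `{statement, reason}` entries. Below is a minimal, ragas-free sketch of that shape so you can see what to expect after parsing; `Classification` here is a hypothetical stand-in for `AnswerCorrectnessClassification` (the real class is a Pydantic model whose exact fields can vary by ragas version), and the raw JSON is an invented example response:

```python
import json
from dataclasses import dataclass, field

@dataclass
class Classification:
    # Stand-in for ragas's AnswerCorrectnessClassification
    TP: list = field(default_factory=list)
    FP: list = field(default_factory=list)
    FN: list = field(default_factory=list)

# Hypothetical raw LLM response in the JSON shape the prompt asks for
raw = """
{
  "TP": [{"statement": "The sun provides light.", "reason": "Supported by ground truth."}],
  "FP": [{"statement": "The sun is powered by fission.", "reason": "Contradicts ground truth."}],
  "FN": [{"statement": "Fusion releases energy.", "reason": "Missing from the answer."}]
}
"""

classification = Classification(**json.loads(raw))
print(len(classification.TP), len(classification.FP), len(classification.FN))
```

If the model returns malformed JSON, `RagasoutputParser` retries via the LLM; in a manual sketch like this, `json.loads` would simply raise instead.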
To continue talking to Dosu, mention @dosu.
@dosu it's providing the number of such statements, but for my POC I also want the specific statements. Can you help with classifying which statements are TP, FP, and FN?
To classify and display the specific statements that fall into each category (TP, FP, FN), you can modify the _ascore method to include the classification and display of these statements. Here is an example:
```python
async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
    assert self.llm is not None, "LLM must be set"

    question = row["question"]
    statements = {}
    for item in ["answer", "ground_truth"]:
        p_value = self._create_statements_prompt(row)
        item_statement = await self.llm.generate(p_value, callbacks=callbacks)
        statements[item] = await _statements_output_parser.aparse(
            item_statement.generations[0][0].text,
            p_value,
            self.llm,
            self.max_retries,
        )
        statements[item] = (
            statements[item].dicts() if statements[item] is not None else []
        )

    if not all([val == [] for val in statements.values()]):
        ground_truth = [
            statement
            for item in statements["ground_truth"]
            for statement in item["simpler_statements"]
        ]
        answer = [
            statement
            for item in statements["answer"]
            for statement in item["simpler_statements"]
        ]
        p_value = self.correctness_prompt.format(
            question=question,
            ground_truth=ground_truth,
            answer=answer,
        )
        is_statement_present = await self.llm.generate(p_value, callbacks=callbacks)
        result_text = is_statement_present.generations[0][0].text

        answers = await _output_parser.aparse(
            result_text, p_value, self.llm, self.max_retries
        )
        if answers is None:
            return np.nan

        # Display the classified statements
        print("True Positives (TP):")
        for tp in answers.TP:
            print(f"Statement: {tp['statement']}, Reason: {tp['reason']}")
        print("False Positives (FP):")
        for fp in answers.FP:
            print(f"Statement: {fp['statement']}, Reason: {fp['reason']}")
        print("False Negatives (FN):")
        for fn in answers.FN:
            print(f"Statement: {fn['statement']}, Reason: {fn['reason']}")

        f1_score = self._compute_statement_presence(answers)
    else:
        f1_score = 1.0

    if self.weights[1] == 0:
        similarity_score = 0.0
    else:
        assert self.answer_similarity is not None, "AnswerSimilarity must be set"
        similarity_score = await self.answer_similarity.ascore(
            row, callbacks=callbacks
        )

    score = np.average(
        [f1_score, similarity_score],
        weights=self.weights,
    )
    return float(score)
```
This modification ensures that the specific statements classified as TP, FP, and FN are printed out along with their reasons, providing a detailed classification and display of the statements [1][2].
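To verify the score manually from the printed statements, you can reproduce the arithmetic the metric applies to the TP/FP/FN counts. The sketch below assumes ragas's statement-presence score is the standard F1 over counts, `tp / (tp + 0.5 * (fp + fn))`, and that the final score is a weighted average of F1 and semantic similarity with default weights `[0.75, 0.25]`; check `_compute_statement_presence` and the `weights` attribute in your installed version to confirm, and note the similarity value here is made up:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    # Standard F1 expressed over statement counts:
    # tp / (tp + 0.5 * (fp + fn)); defined as 0 when there are no TPs.
    return tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0.0

# Suppose the printed classification showed 1 TP, 1 FP, and 4 FN
f1 = f1_from_counts(1, 1, 4)  # 1 / (1 + 0.5 * 5)

# Blend with a (hypothetical) answer-similarity score using the
# assumed default weights [0.75, 0.25]
similarity = 0.8
score = 0.75 * f1 + 0.25 * similarity
print(round(f1, 4), round(score, 4))
```

If the numbers you compute this way don't match the metric's output, the usual culprits are non-default `weights` or a `weights[1] == 0` configuration that skips the similarity term entirely.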
Thanks, works fine :)