Closed Anjalisoni99-byte closed 2 months ago
The correctness value is low because the evaluation of answer correctness involves both semantic similarity and factual similarity. Even if the model correctly identified tag 82 as representing the Application Interchange Profile (AIP), the overall correctness score can still be low if there are discrepancies in other aspects of the answer or if the semantic similarity is not high enough.
According to the AnswerCorrectness metric from the ragas evaluation framework, the correctness score combines factuality and semantic similarity, with default weights of 0.75 for factuality and 0.25 for semantic similarity. The metric classifies statements as true positives (TP: present in the answer and supported by the ground truth), false positives (FP: present in the answer but not supported by the ground truth), or false negatives (FN: present in the ground truth but missing from the answer). Every FP or FN lowers the factuality component, and with it the correctness score. Noise sensitivity is a separate ragas metric: it measures how often a system gives incorrect responses when using relevant or irrelevant retrieved documents, and it rises when the answer includes claims that the retrieved context does not support. It is not part of the AnswerCorrectness calculation, although the same unsupported claims tend to hurt both. Therefore, even if one part of the answer is correct, the overall score can be low due to other unsupported statements or a lack of semantic alignment with the ground truth [1][2][3][4].
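The weighting described above can be sketched in a few lines. This is an illustration of the arithmetic only, not the ragas implementation: the function name `answer_correctness_score` is hypothetical, and it assumes the factuality component is the F1 score computed from the TP/FP/FN counts.

```python
def answer_correctness_score(tp, fp, fn, similarity, weights=(0.75, 0.25)):
    """Weighted mix of statement-level F1 and semantic similarity.

    Assumption: factuality is F1 = TP / (TP + 0.5 * (FP + FN)).
    `weights` are the defaults quoted in the thread (factuality, similarity).
    """
    total = tp + 0.5 * (fp + fn)
    f1 = tp / total if total > 0 else 0.0
    w_fact, w_sim = weights
    return w_fact * f1 + w_sim * similarity


# Hypothetical counts: 3 supported statements, 1 unsupported, 1 missing.
score = answer_correctness_score(tp=3, fp=1, fn=1, similarity=0.9)
print(round(score, 4))
```

Because factuality carries three quarters of the weight, a handful of extra unsupported statements can pull the score down sharply even when similarity is high.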
To continue talking to Dosu, mention @dosu.
What is the acceptable range of answer correctness score?
No, I mean what correctness value is accepted for a QA accuracy on a context based dataset?
The ragas documentation does not specify a universally accepted threshold for the AnswerCorrectness score, as it can vary depending on the specific requirements and context of the dataset. However, the score ranges from 0 to 1, with higher scores indicating better correctness. Generally, a score closer to 1 is preferred, but the acceptable threshold depends on the use case and the level of accuracy required for your application.
For example, in high-stakes applications like medical or legal QA systems, you might require a score very close to 1. In less critical applications, a lower threshold might be acceptable. It is recommended to evaluate the performance of your QA system in the context of your specific needs and adjust the acceptable threshold accordingly [1].
Ground truth: Tag 82 represents the Application Interchange Profile (AIP).
LLM response: In EMV contactless transactions, tag 82 represents the Application Interchange Profile (AIP). The AIP is a set of persistent data elements representing the context of an ongoing online transaction. The Online Transaction Context is saved by the Kernel before returning the Online Request. The AIP is identified by the tag '82' in the Static Data to be Authenticated (item 9d of Table 6.5). If the SDA Tag List is present
Answer Correctness: 0.437377148125797
Answer Similarity: 0.892365735360333
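To see where the 0.437 comes from, the factuality component can be backed out of the two reported scores. The sketch below assumes the default 0.75/0.25 weights and an F1-style factuality score; the thread does not show the actual statement classification, so the TP and error counts it lists are only candidates consistent with the numbers.

```python
W_FACTUALITY, W_SIMILARITY = 0.75, 0.25  # default weights quoted in the thread

answer_correctness = 0.437377148125797
answer_similarity = 0.892365735360333

# Solve correctness = W_FACTUALITY * f1 + W_SIMILARITY * similarity for f1.
implied_f1 = (answer_correctness - W_SIMILARITY * answer_similarity) / W_FACTUALITY


def f1(tp, errors):
    """F1 from true positives and the combined FP + FN count (assumed formula)."""
    return tp / (tp + 0.5 * errors)


# Small (TP, FP + FN) combinations that reproduce the implied F1.
candidates = [
    (tp, errors)
    for tp in range(1, 4)
    for errors in range(0, 11)
    if abs(f1(tp, errors) - implied_f1) < 1e-6
]

print(round(implied_f1, 4))  # 0.2857
print(candidates)            # [(1, 5), (2, 10)]
```

The simplest reading is one true positive (the tag 82 / AIP statement) and five statements counted as FP or FN, which fits a response that adds several claims the one-sentence ground truth does not support.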