explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0
7.09k stars · 719 forks

Why is the correctness value low even if the model responded correctly? #1241

Closed · Anjalisoni99-byte closed this 2 months ago

Anjalisoni99-byte commented 2 months ago

Ground truth: Tag 82 represents the Application Interchange Profile (AIP).

LLM response: In EMV contactless transactions, tag 82 represents the Application Interchange Profile (AIP). The AIP is a set of persistent data elements representing the context of an ongoing online transaction. The Online Transaction Context is saved by the Kernel before returning the Online Request. The AIP is identified by the tag '82' in the Static Data to be Authenticated (item 9d of Table 6.5). If the SDA Tag List is present

Answer correctness: 0.437377148125797
Answer similarity: 0.892365735360333

dosubot[bot] commented 2 months ago

The correctness value is low because the evaluation of answer correctness involves both semantic similarity and factual similarity. Even if the model correctly identified tag 82 as representing the Application Interchange Profile (AIP), the overall correctness score can still be low if there are discrepancies in other aspects of the answer or if the semantic similarity is not high enough.

According to the AnswerCorrectness metric from the ragas evaluation framework, the correctness score is calculated using a combination of factuality and semantic similarity, with default weights of 0.75 for factuality and 0.25 for semantic similarity. The metric classifies statements in the answer as true positives (TP), false positives (FP), or false negatives (FN) based on their presence and support in the ground truth. If the answer contains statements that are not supported by the ground truth (FP), or misses statements that are present in the ground truth (FN), the correctness score will be negatively impacted.

Additionally, noise sensitivity measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents. If the answer includes claims that are not supported by the relevant retrieved context, the noise sensitivity score will increase, further lowering the overall correctness value. Therefore, even if one part of the answer is correct, the overall score can be low due to other inaccuracies, lack of semantic alignment with the ground truth, or high noise sensitivity [1][2][3][4].
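As a sketch of how these weights combine, assuming the default F1-style factuality formula described above: the statement counts below are hypothetical (the actual decomposition is done by an LLM), but this particular classification is numerically consistent with the scores reported in this issue.

```python
def answer_correctness_score(tp, fp, fn, similarity,
                             w_factuality=0.75, w_similarity=0.25):
    """Combine an F1-style factuality score with semantic similarity.

    tp/fp/fn are counts of statements in the answer classified against
    the ground truth (true positive, false positive, false negative).
    """
    factuality = tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0.0
    return w_factuality * factuality + w_similarity * similarity

# Hypothetical decomposition: 1 supported statement (tag 82 = AIP) plus
# 5 extra claims absent from the one-sentence ground truth (FP), with
# the reported similarity of ~0.8924.
score = answer_correctness_score(tp=1, fp=5, fn=0,
                                 similarity=0.892365735360333)
print(round(score, 4))  # → 0.4374, matching the reported correctness
```

This illustrates the failure mode in the question: a verbose answer whose core fact is right can still score low, because every unsupported extra statement counts as a false positive against a terse ground truth.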

To continue talking to Dosu, mention @dosu.

Anjalisoni99-byte commented 2 months ago

What is the acceptable range of answer correctness score?

dosubot[bot] commented 2 months ago

The acceptable range of the answer correctness score is from 0 to 1 [1].

To continue talking to Dosu, mention @dosu.

Anjalisoni99-byte commented 2 months ago

No, I mean: what correctness value is considered acceptable for QA accuracy on a context-based dataset?

dosubot[bot] commented 2 months ago

The ragas documentation does not specify a universally accepted threshold for the AnswerCorrectness score, as it can vary depending on the specific requirements and context of the dataset. However, the score ranges from 0 to 1, with higher scores indicating better correctness. Generally, a higher score closer to 1 is preferred, but the acceptable threshold can depend on the use case and the level of accuracy required for your application.

For example, in high-stakes applications like medical or legal QA systems, you might require a score very close to 1. In less critical applications, a lower threshold might be acceptable. It is recommended to evaluate the performance of your QA system in the context of your specific needs and adjust the acceptable threshold accordingly [1].
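One way to apply such a use-case-specific threshold is to gate aggregate results on it. A minimal sketch, where both the per-sample scores and the threshold values are illustrative choices, not ragas defaults:

```python
def pass_rate(scores, threshold):
    """Fraction of evaluated samples whose answer-correctness score
    meets an application-chosen threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)

# Hypothetical per-sample answer-correctness scores from one eval run
scores = [0.44, 0.81, 0.93, 0.67]
print(pass_rate(scores, threshold=0.9))  # strict bar (high-stakes QA)
print(pass_rate(scores, threshold=0.5))  # lenient bar
```

Tracking a pass rate against a fixed threshold over time tends to be more actionable than watching the mean score, since a single very low outlier can drag the mean without signaling how many answers actually fell below your bar.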

To continue talking to Dosu, mention @dosu.