Closed lalehsg closed 5 months ago
Hey @lalehsg, does this occur frequently? Can you share a sample for which you observe this behavior?
It happened 2 times in my sample of 18. These were the only cases where the LLM made up an answer without having proper context, though; I assume it will happen again whenever this situation arises. This is an example:
question: What should be included in a new hire's Welcome Kit?
answer:
Note: The context information provided does not contain specific information about what should be included in a new hire's Welcome Kit, the above answer is based on general practices and common items included in a Welcome Kit.
context: ['N']
ground truth: A new hire's Welcome Kit should include a calendar for the first two weeks, an overview of the area, an organization chart, a phone directory, and materials unique to the individual's role.
Hope this info helps. I noticed that when retrieval fails and the LLM makes up an answer starting with "I'm sorry..." or "I'm unable...", the metric comes back as zero or close to zero with GPT-3.5 and as NaN with GPT-4. Based on your comment on my other question, NaN is the correct value.
I actually realized something: this is probably not a Ragas issue. I can avoid confusing the metric by not calling the LLM at all when no context is retrieved, and just printing "sorry i'm not able to...".
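The workaround described above can be sketched as a guard in the RAG pipeline; this is a minimal illustration, where `answer_question`, `call_llm`, and the fallback string are all hypothetical names, not actual Ragas or application APIs:

```python
def answer_question(question, retrieved_contexts, call_llm):
    """Skip the generation step entirely when retrieval came back empty.

    `call_llm` stands in for whatever generation call the pipeline uses;
    the fallback string mirrors the one mentioned in the comment above.
    """
    # Treat an empty list, or contexts that are only whitespace, as failed retrieval.
    if not any(c.strip() for c in retrieved_contexts):
        return "sorry i'm not able to answer this with the available context"
    return call_llm(question, retrieved_contexts)
```

With this guard, the made-up answers that confused the faithfulness metric never get generated in the first place.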
I also encountered this problem, and I think it's a bug.
[ ] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug I have noticed that in some cases, when the retrieved context is empty and the LLM has made up an answer, the returned faithfulness score is 1. This happens with both GPT-3.5 Turbo and GPT-4 as LLM judges.
Looking into the code, I believe the problem is in the NLI part. I think if you add an example here https://github.com/explodinggradients/ragas/blob/4c31c0f12fd0c4e945ba2e1bad78181f72b92c49/src/ragas/metrics/_faithfulness.py#L56 and instruct the LLM to return 0 when the context is empty, it could fix the issue.
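An alternative to prompting the judge is to short-circuit before the NLI step ever runs. The sketch below is purely illustrative and does not reflect the actual Ragas internals; `faithfulness_with_guard` and `nli_score` are hypothetical names standing in for the metric's verdict step:

```python
import math

def faithfulness_with_guard(contexts, statements, nli_score):
    """Return NaN when no context was retrieved, instead of letting the
    judge LLM score made-up statements against nothing.

    `nli_score` stands in for the NLI verdict call that normally produces
    the faithfulness value.
    """
    if not any(c.strip() for c in contexts):
        return float("nan")  # faithfulness is undefined without context
    return nli_score(contexts, statements)
```

Returning NaN matches the behavior the maintainer described as correct for GPT-4 in the comments above, and it avoids relying on the judge model to notice the empty context on its own.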
Ragas version:
Python version:
Code to Reproduce An example with no context and a made-up answer from the LLM that doesn't start with "I'm sorry" or "I'm unable".
Error trace
Expected behavior I expect to get zero for the faithfulness metric.
Additional context Add any other context about the problem here.