explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Faithfulness return 1 when it should not #793

Closed. lalehsg closed this issue 5 months ago

lalehsg commented 8 months ago

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug: I have noticed that in some cases, when the retrieved context is empty and the LLM has made up an answer, the returned faithfulness is 1. This happens with both GPT-3.5 Turbo and GPT-4 as LLM judges.

Looking into the code, I believe the problem is in the NLI part. I think that adding an example here https://github.com/explodinggradients/ragas/blob/4c31c0f12fd0c4e945ba2e1bad78181f72b92c49/src/ragas/metrics/_faithfulness.py#L56 that instructs the LLM to return 0 when the context is empty could fix the issue.

Ragas version:
Python version:

Code to Reproduce: An example with no context and a made-up answer from the LLM that doesn't start with "I'm sorry" or "I'm unable".

Error trace

Expected behavior: I expect to get zero for the faithfulness metric.
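To make the expected behavior concrete, here is a minimal sketch of a guard placed around whatever scoring call is in use. It is not Ragas code: `has_usable_context`, `guarded_faithfulness`, `score_fn`, and `placeholders` are hypothetical names, and `score_fn` stands in for any function that returns a faithfulness score. The `placeholders` argument is an assumption that lets a caller treat stub values from their own retriever as empty.

```python
from typing import Callable, Iterable, Sequence


def has_usable_context(contexts: Sequence[str], placeholders: Iterable[str] = ()) -> bool:
    """Return True if at least one context chunk carries real text."""
    # Placeholder strings are compared case-insensitively after stripping.
    ignored = {p.strip().lower() for p in placeholders}
    for chunk in contexts:
        text = chunk.strip()
        if text and text.lower() not in ignored:
            return True
    return False


def guarded_faithfulness(
    answer: str,
    contexts: Sequence[str],
    score_fn: Callable[[str, Sequence[str]], float],
    placeholders: Iterable[str] = (),
) -> float:
    """Score faithfulness, forcing 0.0 when there is nothing to ground the answer in."""
    if not has_usable_context(contexts, placeholders):
        # No context means no statement can be supported, so skip the judge call.
        return 0.0
    return score_fn(answer, contexts)
```

With this guard, the LLM judge is only consulted when there is some context to check the answer against, so an ungrounded answer cannot come back with a score of 1.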


shahules786 commented 8 months ago

Hey @lalehsg, does it occur frequently? Can you share a sample for which you observe this behavior?

lalehsg commented 8 months ago

It happened 2 times in my sample of 18. These were the only cases where the LLM made up an answer without having proper context, though, so I assume it will happen again whenever this situation occurs. This is an example:

question: What should be included in a new hire's Welcome Kit?

answer:

  1. Company Handbook: A company handbook that includes information about the company's culture, values, policies, and benefits.
  2. Company Swag: Items such as t-shirts, mugs, or pens with the company's logo to help the new hire feel like part of the team.
  3. Business Cards: If appropriate, a set of business cards with the new hire's contact information.
  4. Office Supplies: Necessary office supplies such as notebooks, pens, and highlighters.
  5. Technology: Any necessary technology such as a laptop, phone, or access badge.
  6. Training Materials: Any training materials or instructions for using company software and systems.
  7. Welcome Letter: A welcome letter from the company's leadership or the new hire's manager.
  8. Contact Information: A list of key contacts and their contact information to help the new hire get settled and started in their new role.
  9. Snacks and Drinks: A selection of snacks and drinks to help the new hire feel comfortable and at home in their new workspace.

Note: The context information provided does not contain specific information about what should be included in a new hire's Welcome Kit, the above answer is based on general practices and common items included in a Welcome Kit.

context: ['N']

ground truth: A new hire's Welcome Kit should include a calendar for the first two weeks, an overview of the area, an organization chart, a phone directory, and materials unique to the individual's role.

Hope this info helps. I noticed that when retrieval fails and the LLM produces an answer starting with "I'm sorry..." or "I'm unable...", the metric comes back as zero or close to zero with GPT-3.5 and as NaN with GPT-4. Based on your comment on my other question, NaN is the correct value.

lalehsg commented 8 months ago

I actually realized something: this is probably not a Ragas issue. I can avoid confusing the metric by not calling the LLM at all when no context is retrieved and just printing "Sorry, I'm not able to...".
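A rough sketch of that workaround, assuming the generation step is an ordinary Python callable. `answer_with_fallback`, `generate`, and `FALLBACK_ANSWER` are hypothetical names, and the fallback wording is only a placeholder for whatever refusal message the application uses:

```python
from typing import Callable, Sequence

# Placeholder refusal text (an assumption, not taken from any library).
FALLBACK_ANSWER = "Sorry, I'm not able to answer this from the available documents."


def answer_with_fallback(
    question: str,
    contexts: Sequence[str],
    generate: Callable[[str, Sequence[str]], str],
) -> str:
    """Call the LLM only when retrieval returned some non-blank context."""
    usable = [chunk for chunk in contexts if chunk.strip()]
    if not usable:
        # Retrieval failed: return a fixed refusal instead of letting the LLM guess.
        return FALLBACK_ANSWER
    return generate(question, usable)
```

Because the refusal is produced without an LLM call, the pipeline never emits an ungrounded answer for the metric to score.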

wanggithub08 commented 2 months ago

I also encountered this problem, and I think it's a bug.