explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Faithfulness and Response Relevancy have high parse error rate with non-OpenAI model #1631

Open ahgraber opened 2 weeks ago

ahgraber commented 2 weeks ago

[x] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
On the same dataset:

- OpenAI has a 4% error rate in parsing the faithfulness metric
- Anthropic has a ~40% error rate in parsing the faithfulness metric
- Llama3.1-70B-instruct (via TogetherAI) has a ~40% error rate in parsing the faithfulness metric and a ~20% error rate in parsing response relevancy

Ragas version: 0.22
Python version: 3.11

Code to Reproduce Share code to reproduce the issue

Error trace

Expected behavior
Model-agnostic prompts / parsers provide equivalent (low) error rates.

Additional context
I'll try to provide some more experimental results for context when I can.

shahules786 commented 2 weeks ago

Hey @ahgraber, have you tried tuning/modifying the default prompts using the set_prompts and get_prompts methods in ragas?
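
Something like this, roughly (a sketch assuming ragas 0.2.x; the exact prompt keys differ between versions, so check `get_prompts().keys()` first rather than trusting the key name used below):

```python
from ragas.metrics import Faithfulness

faithfulness = Faithfulness()

# Inspect the prompts the metric currently uses (statement generation + NLI verdicts)
prompts = faithfulness.get_prompts()
print(prompts.keys())

# Hypothetical key name for illustration; substitute one of the keys printed above.
nli_key = "nli_statement_prompt"
nli_prompt = prompts[nli_key]
nli_prompt.instruction += (
    "\nRespond with valid JSON only, emitting exactly one verdict object per input statement."
)

# Write the modified prompt back onto the metric
faithfulness.set_prompts(**{nli_key: nli_prompt})
```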

ahgraber commented 2 weeks ago

I haven't tried tuning the default prompts yet; I'm still trying to understand why I'm getting the errors:

It seems like the first step in faithfulness, which decomposes the answer into simple statements, (typically) works fine. The problem occurs when there are many decomposed statements (20+): the models have a hard time responding with the proper schema for the NLI verdicts; it seems like the number of statements overwhelms the schema instructions. Unfortunately, I've run out of time at this point to continue investigating; my recommendation to my team will be to use OpenAI models, as that seems to be the service this was primarily designed for.
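
For reference, this is roughly the structured output the NLI step expects the judge model to produce (the class/field names here are my approximation of the ragas 0.2.x internals, not the exact ones):

```python
from pydantic import BaseModel, Field


class StatementVerdict(BaseModel):
    statement: str = Field(description="The decomposed statement being checked")
    reason: str = Field(description="Why the statement is or is not supported by the context")
    verdict: int = Field(description="1 if the statement is supported by the context, 0 otherwise")


class NLIOutput(BaseModel):
    # One entry per decomposed statement; with 20+ statements a single malformed
    # entry makes the whole JSON payload fail to parse, which matches the error
    # rates reported above.
    statements: list[StatementVerdict]
```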

Secondly (and this should probably be its own issue), when the "repair" prompt triggers, it escapes all JSON characters, and then re-escapes them to deeper and deeper depths every time the repair prompt is sent, while also repeating the prompt and few-shot examples. It's possible I don't notice the repair prompt working when it succeeds on the first try, but I see a lot of logs that look like failed_repair.txt and ultimately fail out.
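
The compounding escaping looks a lot like the output being re-serialized as a string on every repair round trip. Purely as an illustration of that failure mode (this is not ragas code):

```python
import json

# A JSON payload that already contains escaped quotes
payload = '{"verdict": 1, "reason": "uses \\"quotes\\""}'

for i in range(3):
    # Treating the existing string as plain text and dumping it again escapes
    # every quote and backslash once more, doubling the backslashes each pass.
    payload = json.dumps(payload)
    print(i, payload)
```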

jjmachan commented 2 weeks ago

Would you be able to use something like https://github.com/Arize-ai/phoenix to view the traces? I wanted to check whether out-of-context errors might be happening.

but we can get on a call sometime to debug more too
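
If it helps, this is roughly how I'd wire up tracing (a sketch assuming the arize-phoenix and openinference-instrumentation-langchain packages, and that the judge LLM goes through ragas's LangChain wrapper so the LangChain instrumentation picks it up):

```python
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Start a local Phoenix UI (usually http://localhost:6006) and point OTel at it
session = px.launch_app()
tracer_provider = register()

# Capture all LangChain LLM calls, including the ones ragas makes during evaluation
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# ...run the ragas evaluation as usual, then inspect the prompts/completions in Phoenix...
```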

ahgraber commented 2 weeks ago

I am running out of context sometimes, but that is not the root cause of the initial error loop. I'll see if I can get to adding traces, but I'm not sure what timing looks like (might take a week or two)