ahgraber opened this issue 1 week ago
Hey @ahgraber, have you tried tuning/modifying the default prompts using the `set_prompts` and `get_prompts` methods in ragas?
I haven't tried tuning the default prompts yet; I'm still trying to understand why I'm getting the errors.
It seems like the first step in faithfulness, which decomposes the answer into simple statements, typically works fine.
The problem occurs when there are many decomposed statements (20+): the models have a hard time responding with the proper schema for the NLI verdicts; the number of statements seems to overwhelm the schema instructions. Unfortunately, I've run out of time to continue investigating; my recommendation to my team will be to use OpenAI models, since that seems to be the service this was primarily designed for.
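For context, the NLI verdict payload is shaped roughly like this (field names are approximate, not copied from the ragas source). With 20+ statements, the model has to emit one entry per statement inside a single JSON object, which is where parsing breaks down:

```python
# Approximation of the NLI verdict schema (names are assumptions, not
# verbatim from ragas). The model must return one entry per decomposed
# statement in one JSON object.
from typing import List
from pydantic import BaseModel

class StatementVerdict(BaseModel):
    statement: str
    reason: str
    verdict: int  # 1 = supported by the context, 0 = not supported

class NLIOutput(BaseModel):
    statements: List[StatementVerdict]
```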
Secondly (and this should probably be its own issue), when the "repair" prompt triggers, it escapes all JSON characters, then re-escapes them to deeper and deeper depths every time the repair prompt is re-sent, while also repeating the prompt and few-shot examples. It's possible I don't notice the repair prompt when it succeeds on the first try, but I see a lot of logs that look like failed_repair.txt and ultimately fail out.
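To illustrate the escaping loop (this is not ragas's actual repair code, just the failure mode):

```python
# Minimal illustration: re-serializing an already-serialized string adds
# one layer of escapes per round trip, so each "repair" attempt makes the
# payload harder to parse than the last.
import json

payload = '{"statement": "the sky is blue", "verdict": 1}'
for attempt in range(3):
    payload = json.dumps(payload)  # each repair round escapes again
    print(f"attempt {attempt}: {payload[:60]}...")
```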
Would you be able to use something like https://github.com/Arize-ai/phoenix to view the traces? I wanted to check whether out-of-context errors are happening (sketch below).
But we can also get on a call sometime to debug further.
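Something like this would do it (assuming a LangChain-wrapped LLM; the instrumentor package name below is an assumption, so adjust it to whatever client ragas is driving in your setup):

```python
# Rough sketch of trace collection with Arize Phoenix. Assumes a
# LangChain-wrapped LLM and the openinference-instrumentation-langchain
# package; swap the instrumentor for your actual client library.
import phoenix as px
from openinference.instrumentation.langchain import LangChainInstrumentor

session = px.launch_app()             # local UI for browsing traces
LangChainInstrumentor().instrument()  # capture prompts/completions per call
print(session.url)
```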
I am running out of context sometimes, but that is not the root cause of the initial error loop. I'll see if I can get to adding traces, but I'm not sure what my timing looks like (it might take a week or two).
[x] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug
On the same dataset:
- OpenAI has a 4% error rate in parsing the faithfulness metric
- Anthropic has a ~40% error rate in parsing the faithfulness metric
- Llama3.1-70B-instruct (via TogetherAI) has a ~40% error rate in parsing faithfulness and a ~20% error rate in parsing response relevance
Ragas version: 0.22
Python version: 3.11
Code to Reproduce
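No code was shared in the original report; the following is an illustrative sketch of the likely setup (the dataset contents, model name, and LangChain wrappers are assumptions):

```python
# Hedged sketch of the likely evaluation setup (no code was shared in the
# report; dataset contents, model names, and wrappers are assumptions).
from langchain_anthropic import ChatAnthropic
from ragas import evaluate, EvaluationDataset
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

evaluator_llm = LangchainLLMWrapper(
    ChatAnthropic(model="claude-3-5-sonnet-20240620")
)

dataset = EvaluationDataset.from_list([
    {
        "user_input": "What color is the sky?",
        "response": "The sky is blue due to Rayleigh scattering.",
        "retrieved_contexts": [
            "The sky appears blue because of Rayleigh scattering."
        ],
    },
])

result = evaluate(dataset, metrics=[Faithfulness()], llm=evaluator_llm)
print(result)
```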
Error trace
Expected behavior
Model-agnostic prompts/parsers provide equivalent (low) error rates.
Additional context
I'll try to provide more experimental results when I can.