explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Q: Aspect Critique: Multiple Verdicts? #1046

Open dkhundley opened 4 months ago

dkhundley commented 4 months ago

[X] I checked the documentation and related resources and couldn't find an answer to my question.

https://docs.ragas.io/en/stable/concepts/metrics/critique.html

Your Question: In the documentation (linked above), there is a calculation example showing 3 different verdicts collected from 3 LLM calls. The strictness parameter then seems to determine how to produce the aggregate final score for the particular aspect critique.

Looking at the source code, I'm having trouble finding how the multiple verdicts are derived. The prompt engineering only seems to ask for a single final verdict per aspect critique. Interestingly, there does seem to be specific code that looks at the "commonality" of verdicts per the strictness parameter. I'm struggling to see how the LLM could produce more than one verdict given how the code and prompt engineering are currently written.

(I really like the idea, which is why I'm asking. I'd like to use it with the strictness parameter working. 😃)

Code Examples: Linked source code in the material above

Additional context: None

dosubot[bot] commented 4 months ago

Hey there, @dkhundley! I'm here to help with any bugs, questions, or contributions you have in mind. Let's squash those bugs together! 😄

The multiple verdicts in the AspectCritique class are derived by evaluating the submission multiple times, as determined by the strictness parameter. The strictness parameter dictates the number of self-consistency checks performed. The final score is computed using a majority vote from these multiple evaluations.

Here's how it works:

  1. Initialization and Adjustment of Strictness: The strictness parameter is initialized with a default value of 1 and adjusted to be an odd number to avoid ties in majority voting.

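    # `field` comes from `dataclasses`; these lines live inside the
    # AspectCritique dataclass definition.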
    strictness: int = field(default=1, repr=False)
    
    def __post_init__(self: t.Self):
        if self.name == "":
            raise ValueError("Expects a name")
        if self.definition == "":
            raise ValueError("Expects definition")
    
        # ensure odd number of checks to avoid tie in majority vote.
        self.strictness = (
            self.strictness if self.strictness % 2 != 0 else self.strictness + 1
        )
  2. Prompting the LLM Multiple Times: The LLM is prompted multiple times (as specified by the strictness value) to ensure reliability in the verdicts. Each prompt is formatted with the given question, answer, and optional context.

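    # `self.critic_prompt` is the prompt template attribute defined
    # elsewhere on the class; this method only fills it in.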
    def prompt_format(
        self: t.Self,
        question: str,
        answer: str,
        context: t.Optional[str | list[str]] = None,
    ):
        if context is not None:
            if isinstance(context, list):
                context = "\n".join(context)
            question = f"{question} answer using context: {context}"
        return self.critic_prompt.format(
            input=question, submission=answer, criteria=self.definition
        )
  3. Computing the Final Score: The final score is determined by taking the most common verdict from the multiple evaluations if strictness is greater than 1. If strictness is 1, the score is simply the verdict from the single evaluation.

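    # `Counter` comes from `collections`; `CriticClassification` is the parsed
    # response object whose `.verdict` field carries each individual verdict.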
    def _compute_score(self, safe_loaded_responses: t.List[CriticClassification]):
        if self.strictness > 1:
            score = Counter(
                [item.verdict for item in safe_loaded_responses]
            ).most_common(1)[0][0]
        else:
            score = safe_loaded_responses[0].verdict
    
        return score
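
Putting the three pieces together, here is a minimal self-contained sketch of the same pattern. The class name MajorityVoteCritique and the fake_llm_verdict stub are illustrative stand-ins (not ragas APIs); only the strictness adjustment and the majority vote mirror the excerpts above:

    from collections import Counter
    from dataclasses import dataclass, field


    @dataclass
    class MajorityVoteCritique:
        name: str
        definition: str
        strictness: int = field(default=1, repr=False)

        def __post_init__(self):
            # Same adjustment as AspectCritique: force an odd number of
            # checks so the majority vote cannot tie.
            self.strictness = (
                self.strictness if self.strictness % 2 != 0 else self.strictness + 1
            )

        def fake_llm_verdict(self, question: str, answer: str) -> int:
            # Stand-in for one LLM call that returns a single 0/1 verdict.
            # A real LLM sampled several times may disagree with itself,
            # which is exactly what the majority vote smooths out.
            return 1

        def score(self, question: str, answer: str) -> int:
            # One verdict per self-consistency check ...
            verdicts = [
                self.fake_llm_verdict(question, answer)
                for _ in range(self.strictness)
            ]
            # ... then the most common verdict wins.
            if self.strictness > 1:
                return Counter(verdicts).most_common(1)[0][0]
            return verdicts[0]


    critique = MajorityVoteCritique(
        name="harmfulness", definition="Is the submission harmful?", strictness=2
    )
    print(critique.strictness)  # 3 -- bumped from 2 to the next odd number
    print(critique.score("Is water wet?", "Yes."))  # 1

With a deterministic stub the vote is trivially unanimous; the self-consistency benefit only shows up when the real LLM's verdicts can vary across calls.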

In summary, the strictness parameter ensures that the evaluation is performed multiple times to maintain self-consistency, and the final score is derived using a majority vote from these evaluations [1][2].
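
For completeness, all of this is driven by how the metric is constructed in user code. Based on the documentation page linked above, a custom critique with strictness set would look roughly like this (the import path and constructor arguments follow those docs):

    from ragas.metrics.critique import AspectCritique

    # strictness=2 is bumped to 3 internally by __post_init__, as shown above
    my_critique = AspectCritique(
        name="conciseness",
        definition="Is the submission concise and to the point?",
        strictness=2,
    )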

To continue talking to Dosu, mention @dosu.