Open · dkhundley opened 4 months ago
Hey there, @dkhundley! I'm here to help with any bugs, questions, or contributions you have in mind. Let's squash those bugs together! 😄
The multiple verdicts in the `AspectCritique` class are derived by evaluating the submission multiple times, as determined by the `strictness` parameter. The `strictness` parameter dictates the number of self-consistency checks performed, and the final score is computed using a majority vote from these evaluations.
Here's how it works:
**Initialization and Adjustment of Strictness:** The `strictness` parameter is initialized with a default value of `1` and adjusted to be an odd number to avoid ties in majority voting.
```python
strictness: int = field(default=1, repr=False)

def __post_init__(self: t.Self):
    if self.name == "":
        raise ValueError("Expects a name")
    if self.definition == "":
        raise ValueError("Expects definition")

    # ensure odd number of checks to avoid tie in majority vote.
    self.strictness = (
        self.strictness if self.strictness % 2 != 0 else self.strictness + 1
    )
```
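As a quick illustration of that adjustment (the values below are hypothetical, not taken from the Ragas source):

```python
# Hypothetical demonstration of the odd-number adjustment above.
for requested in [1, 2, 3, 4]:
    adjusted = requested if requested % 2 != 0 else requested + 1
    print(f"strictness={requested} -> {adjusted} self-consistency checks")
# strictness=1 -> 1, strictness=2 -> 3, strictness=3 -> 3, strictness=4 -> 5
```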
**Prompting the LLM Multiple Times:** The LLM is prompted multiple times (as specified by the `strictness` value) to ensure reliability in the verdicts. Each prompt is formatted with the given question, answer, and optional context.
```python
def prompt_format(
    self: t.Self,
    question: str,
    answer: str,
    context: t.Optional[str | list[str]] = None,
):
    if context is not None:
        if isinstance(context, list):
            context = "\n".join(context)
        question = f"{question} answer using context: {context}"
    return self.critic_prompt.format(
        input=question, submission=answer, criteria=self.definition
    )
```
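The way a single prompt can yield several verdicts comes down to sampling: the same formatted prompt is submitted `strictness` times (or requested as multiple completions in one call), and a non-zero sampling temperature makes the completions vary. Here is a minimal sketch of that idea, assuming a hypothetical `call_llm(prompt) -> str` helper; this is not the actual Ragas LLM wrapper:

```python
def collect_verdicts(call_llm, prompt: str, strictness: int) -> list[str]:
    # Send the identical prompt `strictness` times; with temperature > 0,
    # each completion (and hence each parsed verdict) can differ.
    return [call_llm(prompt) for _ in range(strictness)]
```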
**Computing the Final Score:** If `strictness` is greater than 1, the final score is determined by taking the most common verdict across the multiple evaluations. If `strictness` is 1, the score is simply the verdict from the single evaluation.
```python
def _compute_score(self, safe_loaded_responses: t.List[CriticClassification]):
    if self.strictness > 1:
        score = Counter(
            [item.verdict for item in safe_loaded_responses]
        ).most_common(1)[0][0]
    else:
        score = safe_loaded_responses[0].verdict
    return score
```
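For instance, with `strictness` adjusted to 3 and parsed verdicts of 1, 1, and 0 (illustrative values only), the majority vote resolves to 1:

```python
from collections import Counter

verdicts = [1, 1, 0]  # e.g., three parsed verdicts with strictness = 3
score = Counter(verdicts).most_common(1)[0][0]
print(score)  # 1 -- the majority verdict wins 2 to 1
```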
In summary, the `strictness` parameter ensures that the evaluation is performed multiple times to maintain self-consistency, and the final score is derived using a majority vote from these evaluations [1][2].
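Putting the three steps together, a compressed, self-contained sketch of the whole flow might look like this. The `fake_llm` stub stands in for the real model call and is not part of Ragas:

```python
import random
from collections import Counter

def fake_llm(prompt: str) -> int:
    # Stand-in for a real model call; returns a 0/1 verdict at random
    # to mimic sampling variation across completions.
    return random.choice([0, 1])

def aspect_critique_score(prompt: str, strictness: int = 1) -> int:
    # Step 1: force an odd number of checks to avoid majority-vote ties.
    strictness = strictness if strictness % 2 != 0 else strictness + 1
    # Step 2: collect one verdict per self-consistency check.
    verdicts = [fake_llm(prompt) for _ in range(strictness)]
    # Step 3: majority vote (or the lone verdict when strictness == 1).
    if strictness > 1:
        return Counter(verdicts).most_common(1)[0][0]
    return verdicts[0]

print(aspect_critique_score("Is the submission concise?", strictness=3))
```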
To continue talking to Dosu, mention @dosu.
- [x] I checked the documentation and related resources and couldn't find an answer to my question.
https://docs.ragas.io/en/stable/concepts/metrics/critique.html
**Your Question**

In the documentation (linked above), there is a calculation reference to collecting 3 different verdicts from 3 LLM calls. It seems that the `strictness` parameter then determines how to produce the aggregate final score for the particular aspect critique. Looking at the source code, I'm having trouble finding how the multiple verdicts are derived. The prompt engineering only seems to indicate that it's looking for a single final verdict per aspect critique. Interestingly, there does seem to be specific code to look at the "commonality" of verdicts per the `strictness` parameter. I'm struggling to connect how the LLM could produce more than one verdict given how the code and prompt engineering are currently written.
(I really like the idea, which is why I'm asking. I'd like to implement this with the `strictness` parameter working. 😃)
**Code Examples**

Linked source code in the material above
**Additional context**

None