
Arize-ai / phoenix

AI Observability & Evaluation

https://docs.arize.com/phoenix


[experiments] pairwise evaluator #3738

Open mikeldking opened 5 months ago

mikeldking commented 5 months ago

Implement a pairwise evaluator that uses an LLM as a judge to compare two generations against each other. In the case of experiments, this would presumably perform the judgment against the expected…

https://docs.llamaindex.ai/en/stable/examples/evaluation/pairwise_eval/

Note that there should be a parameter for consensus, e.g. force the LLM to judge the pair again with the two answers flipped and see whether its verdict stays the same.
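To make the consensus idea concrete, here is a minimal sketch of how the flipped-order check could work. It is not Phoenix's API and not the LlamaIndex implementation: `pairwise_evaluate`, `PairwiseResult`, `enforce_consensus`, and both stand-in judges are hypothetical names invented for illustration, and a real judge would call an LLM instead of applying a fixed rule.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> verdict.
# The verdict is "A" if the first answer is preferred, "B" for the second, or "tie".
Judge = Callable[[str, str, str], str]

# Maps a verdict from the flipped run back to the original ordering.
_FLIP = {"A": "B", "B": "A", "tie": "tie"}


@dataclass
class PairwiseResult:
    winner: str       # "A" = experiment output, "B" = expected answer, or "tie"
    consistent: bool  # False only when the flipped run disagreed


def pairwise_evaluate(
    judge: Judge,
    question: str,
    output: str,
    expected: str,
    enforce_consensus: bool = True,  # hypothetical name for the consensus parameter
) -> PairwiseResult:
    """Compare `output` against `expected`, optionally re-judging in flipped order."""
    first = judge(question, output, expected)
    if not enforce_consensus:
        return PairwiseResult(winner=first, consistent=True)

    # Ask again with the answers swapped, then translate the verdict back.
    second = _FLIP[judge(question, expected, output)]
    if first == second:
        return PairwiseResult(winner=first, consistent=True)
    # The verdict depended on position, so fall back to a tie.
    return PairwiseResult(winner="tie", consistent=False)


# Stand-in judges (assumptions for the demo; no LLM is called).
def longer_answer_judge(question: str, a: str, b: str) -> str:
    """Order-independent rule: prefer the longer answer."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


def position_biased_judge(question: str, a: str, b: str) -> str:
    """Always prefers whichever answer is shown first."""
    return "A"


stable = pairwise_evaluate(longer_answer_judge, "q", "a detailed answer", "short")
biased = pairwise_evaluate(position_biased_judge, "q", "a detailed answer", "short")
```

In this sketch a disagreement between the two orderings is reported as a tie with `consistent=False`. That is one possible policy; an implementation could just as well surface both verdicts or raise an error.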

dosubot[bot] commented 5 days ago

Hi, @mikeldking. I'm Dosu, and I'm helping the Arize Phoenix team manage their backlog. I'm marking this issue as stale.

Issue Summary:

Proposal to develop a pairwise evaluator using a large language model.
Suggestion includes adding a consensus parameter for consistency checks.
No comments or further activity since the issue was opened.

Next Steps:

Please confirm if this issue is still relevant to the latest version of the Arize Phoenix repository by commenting here.
If there is no further activity, the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!