Initial hypothesis on inconsistent results

atla-ai / judge-arena


Initial hypothesis on inconsistent results #6

Open EwoutH opened 2 days ago

EwoutH commented 2 days ago

The current Judge Arena results, as of November 21, 2024 at 12:00 UTC, with 3069 total votes cast, are as follows:

[Image: Judge Arena updated leaderboard]

While the confidence intervals are still quite large, some initial separation is visible in the Elo scores. I was curious whether the authors (@kaikaidai @mauriceburg @RomanEngeler1805 @maxbartolo @clefourrier @TobyDrane @mathias-atla @jacksongolden) have any reflections, interpretations, or hypotheses on these results.

I find it odd that some of the larger and generally more capable models are scoring relatively low. For example, Claude 3 Sonnet is far below Claude 3 Haiku, and the same pattern shows up with the 3.5 models.
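For context on how sensitive such scores can be, here is a minimal sketch of the standard Elo update applied to head-to-head votes. It is an illustration only, not the arena's actual scoring code: the starting rating of 1500, the K-factor of 32, and the function names are all assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote.

    k=32 is an assumed K-factor, not a value taken from the arena.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


# Two hypothetical judges starting from the same assumed rating.
small, large = 1500.0, 1500.0
for _ in range(50):  # one voter repeatedly preferring the same judge
    small, large = update_elo(small, large, a_won=True)
print(round(small), round(large))
```

Because every vote moves both ratings, a long run of one-sided votes can open a sizeable gap between two models that started level.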

kaikaidai commented 2 days ago

Hey @EwoutH, thanks for bringing this up - you're completely right. Unfortunately, we've been seeing misuse of the arena, with over a thousand votes coming from one IP address, which skewed the leaderboard. We're going to remove the majority of this individual's votes today so that the leaderboard is more representative of the community's preferences - will ping you here once that's done!

EDIT: This has been done. Please see the latest results, with ~1.4k votes from a single individual removed.
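For illustration, one way this kind of cleanup could be done is to cap the number of votes kept per source. This is only a sketch under assumptions: the vote record layout (`ip`, `winner`), the function name, and the cap value are hypothetical and do not describe how the votes were actually removed.

```python
from collections import defaultdict


def cap_votes_per_source(votes, max_per_source=50):
    """Keep at most `max_per_source` votes from each source, in original order.

    The "ip" key and the default cap are assumptions for this sketch.
    """
    seen = defaultdict(int)
    kept = []
    for vote in votes:
        source = vote["ip"]
        if seen[source] < max_per_source:
            seen[source] += 1
            kept.append(vote)
    return kept


# Hypothetical data: one heavy voter and one ordinary voter.
votes = (
    [{"ip": "203.0.113.7", "winner": "judge_a"}] * 5
    + [{"ip": "198.51.100.2", "winner": "judge_b"}] * 2
)
filtered = cap_votes_per_source(votes, max_per_source=3)
print(len(filtered))  # 5 (3 kept from the heavy voter, 2 from the other)
```

A per-source cap keeps some signal from a prolific voter while limiting how far a single source can move the ratings.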

MauriceBurg commented 2 days ago

@EwoutH One pattern that I believe may contribute to smaller, generally less capable models being preferred is that they offer shorter, more to-the-point critiques. This is closer to how humans write critiques of answers than the more verbose output of generally more capable models. This may account for some of the difference, but it's early days on the arena. I think we need more votes before we can draw conclusions.
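A rough way to probe this length hypothesis, once more votes are in, would be to measure how often the shorter critique wins. The sketch below assumes a hypothetical vote record with `critique_a`, `critique_b`, and `winner` fields and uses word count as a crude length measure; none of this reflects the arena's real data format.

```python
def shorter_critique_win_rate(votes):
    """Fraction of votes in which the shorter critique (by word count) won.

    Votes where both critiques have the same length are skipped.
    Returns NaN if no vote can be counted.
    """
    shorter_wins = 0
    counted = 0
    for v in votes:
        len_a = len(v["critique_a"].split())
        len_b = len(v["critique_b"].split())
        if len_a == len_b:
            continue
        counted += 1
        shorter = "a" if len_a < len_b else "b"
        if v["winner"] == shorter:
            shorter_wins += 1
    return shorter_wins / counted if counted else float("nan")


# Hypothetical votes for illustration only.
sample_votes = [
    {"critique_a": "Too vague.",
     "critique_b": "The answer is somewhat vague and could be improved.",
     "winner": "a"},
    {"critique_a": "Correct and complete answer overall.",
     "critique_b": "Correct.",
     "winner": "a"},
    {"critique_a": "Fine.", "critique_b": "Good.", "winner": "b"},
]
rate = shorter_critique_win_rate(sample_votes)
print(rate)  # 0.5
```

A rate well above 0.5 on real votes would be consistent with a preference for brevity, though it would not by itself separate length from critique quality.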