Contextualist / lone-arena

Self-hosted LLM chatbot arena, with yourself as the only judge
MIT License
36 stars, 2 forks

Really needs a "both wrong" "both ok" option #1

Open bjj opened 9 months ago

bjj commented 9 months ago

After running through some test prompts, there are many instances where there is nothing to separate the two answers (they're both wrong by exactly the same amount or in the same way). There's probably something more statistically valid than picking at random in those cases.

Contextualist commented 9 months ago

Thanks for the feedback! The concern is indeed valid. I will need to think about how ties should be handled, though.

I did not implement the "both wrong" and "both ok" options for two reasons:

  1. A tie conflicts with the elimination-based tournament process, which lets you first pick the better responses among those from each model, before comparing responses from different models.
  2. Personally, a two-way decision feels less mentally taxing than a three- or four-way one.

I need to think about how to handle ties in elimination matches, or whether to replace elimination with something else.
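
For illustration only, here is one way the "something else" could look: a rating-based scheme in which a tie is scored as half a win for each side, as in a standard Elo update. This is a minimal sketch and not how lone-arena works; the function names, the starting ratings, and the `k` factor are all assumptions, and "both wrong" and "both ok" are treated identically as a tie.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one judged comparison.

    score_a is the judge's verdict from A's point of view:
    1.0 = A is better, 0.0 = B is better, 0.5 = tie
    (hypothetical encoding; covers both "both wrong" and "both ok").
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two models that start level: a tie changes nothing.
    print(elo_update(1000.0, 1000.0, 0.5))
    # A tie between unequal ratings pulls the two ratings toward each other.
    print(elo_update(1200.0, 1000.0, 0.5))
```

With this kind of scoring a tie still carries information: it leaves equally rated models unchanged and nudges unequally rated models closer together, so the judge does not have to pick at random.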