Closed notasquid1938 closed 6 months ago
I like the idea of allowing model names to be hidden - at least in the immediate results view.
On the Elo comparisons, could you please expand?
How do you see that working in terms of the UI?
Would you track evals based on the model's name only, or also on the combination of parameters used in each experiment iteration?
This would require significant effort, so I'm mostly writing this out to show how I envision it:
Each configuration of model and parameters would be treated as a different competitor.
They are all given the same initial Elo rating. As you pick one model and configuration over another, their Elo ratings are adjusted after each vote. At the end, a leaderboard would be displayed showing the final Elo rating of each model and configuration.
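To make the rating adjustment concrete, here is a minimal sketch of the standard Elo update applied to one vote. The starting rating of 1000, the K-factor of 32, and the names `record_vote` and `leaderboard` are my own assumptions for illustration; nothing like this exists in the project today.

```python
from collections import defaultdict

INITIAL_RATING = 1000.0  # assumption: every competitor starts here
K_FACTOR = 32.0          # assumption: a common default, not a project setting


def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_vote(ratings, winner, loser, k=K_FACTOR):
    """Adjust both ratings in place after the user prefers `winner`."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def leaderboard(ratings):
    """Return (competitor, rating) pairs, highest rating first."""
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# A competitor is a (model, parameters) pair; unseen ones get INITIAL_RATING.
ratings = defaultdict(lambda: INITIAL_RATING)
record_vote(ratings, ("model-a", "temp=0.2"), ("model-b", "temp=0.8"))
```

Because the winner gains exactly what the loser gives up, the total across all competitors stays constant, so the leaderboard only reflects relative preference.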
This would work best if multiple prompts could be fed in, similar to how the https://chat.lmsys.org arena works. For instance, I would test 3 models with 2 parameter sets each, for 6 competitors. I write out a list of prompts; the number required should increase as the number of competitors increases. The program asks me to vote for the better response between two random competitors. It then repeats this process for every prompt, randomly selecting two competitors each time. Given 20-30 prompts, a clear trend should appear in the Elo ratings of the 6 competitors, showing which combination of model and parameters produces the preferred responses, as well as how far apart the combinations are.
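The whole loop described above could look roughly like the sketch below. The `run_arena` function, the `vote` callback, and the default rating and K-factor are all hypothetical; in a real UI, `vote` would show the two responses with the model names hidden and return whichever competitor the user picked.

```python
import itertools
import random


def run_arena(models, param_sets, prompts, vote, k=32.0, initial=1000.0, seed=None):
    """Run one pairwise vote per prompt and return the final leaderboard.

    `vote(prompt, a, b)` is a hypothetical callback that must return either
    `a` or `b`, whichever response the user preferred.
    """
    rng = random.Random(seed)
    # Every (model, parameter set) combination is a separate competitor.
    competitors = list(itertools.product(models, param_sets))
    ratings = {c: initial for c in competitors}

    for prompt in prompts:
        a, b = rng.sample(competitors, 2)  # two distinct random competitors
        winner = vote(prompt, a, b)
        loser = b if winner == a else a
        # Standard Elo update: the winner gains what the loser gives up.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta

    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Stand-in for a human vote: always prefer the alphabetically first pair.
    board = run_arena(
        models=["model-a", "model-b", "model-c"],
        param_sets=["temp=0.2", "temp=0.8"],
        prompts=[f"prompt {i}" for i in range(30)],
        vote=lambda prompt, a, b: min(a, b),
        seed=0,
    )
    for (model, params), rating in board:
        print(f"{model:10s} {params:10s} {rating:7.1f}")
```

Sampling pairs at random keeps the number of votes equal to the number of prompts, at the cost of some competitors meeting rarely; that is why more competitors need more prompts before the ratings settle.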
@notasquid1938 ,
Thank you for the clarification.
Although I can see some similarities, I agree that it would take considerable effort (especially since we don't have a database layer in place yet), and it would diverge from what I had in mind when I started the project.
It's an interesting idea and I'd be happy if someone could make a fork and work on it.
The option to hide model names can help eliminate personal bias, especially when comparing different models. Also, is there any plan to use this to make Elo comparisons, like a locally running personal https://chat.lmsys.org?