Closed gblazex closed 7 months ago
Great idea, I'll look into setting this up.
I've uploaded all the raw results files for the models I've benchmarked. Thanks for the suggestion.
I saw the benchmark correlation matrix you made -- very interesting. I'd love to see how EQ-Bench lines up with the others! Let me know if you want me to bench any models in particular if you don't have enough data to correlate.
It's amazing how fast you uploaded the outputs. It'll help a lot with further research into better LLM judges.
EQ v2 is showing much better correlations so far. What I like about it is that it isn't affected by response length (like Alpaca is), right?
I'll post the list of missing models; feel free to pick which ones you'd like to run. (I might be able to help with some of the closed-source outputs, like Claude.)
Which version is listed on EQ-bench? I'm guessing it's Gemini Pro (dev), right?
+1, and if I can make a personal suggestion: the amazing @mlabonne has a new model called OmniBeagle that improves a lot on NeuralBeagle (on both MT-Bench & the Nous suite) https://huggingface.co/mlabonne/OmniBeagle-7B/discussions/1
I think it's worth adding, since his was the best 7B so far.
Correct, it doesn't have a bias for length of output.
I'll try to get some more of these benched for you. I don't have API access to Claude, so if you can help with those it'd be great.
As to which Gemini Pro version I'm using... um, good question. I'll have to check into that. It's one of the API ones.
Sounds amazing, thank you! Can't wait to add yours to the leaderboard landscape.
I added these models that I was able to bench successfully:
WizardLM/WizardLM-70B-V1.0: 71.28
lmsys/vicuna-13b-v1.5: 67.39
allenai/tulu-2-dpo-70b: 76.63
WizardLM/WizardLM-13B-V1.2: 63.71
cognitivecomputations/dolphin-2.2.1-mistral-7b: 69.92
timdettmers/guanaco-33b-merged: 36.11
teknium/OpenHermes-2.5-Mistral-7B: 66.89
berkeley-nest/Starling-LM-7B-alpha: 73.9
lmsys/vicuna-33b-v1.3: 67.07
You were so fast! I really appreciate it.
Correlations looking really good
Spearman correlations:
EQ-Bench v2: 0.863
MT-Bench: 0.891
Alpaca v2: 0.899

Kendall's tau correlations:
EQ-Bench v2: 0.730
MT-Bench: 0.759
Alpaca v2: 0.759
(I only checked overlapping rows where models have results for all 3 benchmarks)
We have 31 EQ models matching Arena.
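The overlap filtering described above could be sketched like this. This is a hedged, illustrative sketch: all the scores below are made-up numbers (not real benchmark results), and the `spearman` helper assumes no tied scores.

```python
# Keep only models that have results on every benchmark, then correlate.
# Scores are made-up illustrative data, NOT real benchmark results.
eq_bench = {"model-a": 71.3, "model-b": 67.4, "model-c": 76.6}
mt_bench = {"model-a": 7.8, "model-b": 6.9}   # model-c has no MT-Bench result
arena    = {"model-a": 1110, "model-b": 1060}

common = sorted(set(eq_bench) & set(mt_bench) & set(arena))

def spearman(xs, ys):
    # Spearman's rho via the rank-difference formula:
    # 1 - 6 * sum(d^2) / (n * (n^2 - 1)); assumes no ties.
    n = len(xs)
    def ranks(v):
        order = sorted(range(n), key=lambda i: v[i])
        return {i: r for r, i in enumerate(order)}
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((rx[i] - ry[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))

rho = spearman([eq_bench[m] for m in common], [mt_bench[m] for m in common])
```

Only `model-a` and `model-b` survive the intersection here, since `model-c` is missing an MT-Bench score.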
I'll try to get your Claude results, and possibly Perplexity too. I can also easily run CodeLlama-34B-instruct through Together.ai.
Do you have twitter if I wanna mention you?
On both Twitter and Reddit I wanna let folks know how your benchmark is shaping up against Arena.
Also I saw 1 more that could be added from Arena:
That would make the picture even more complete.
Interesting results! I'll be very curious how the Claude & Perplexity models perform.
I've avoided using twitter thus far but maybe now is the time to dust off the account and join the rest of the AI community. Twitter handle is @sam_paech
I now have 1 following, it's you, lol.
Have you looked into other correlation analysis methods at all? I'm thinking these metrics that only take rank into account might only tell half the story: some benchmarks are better at discriminating the magnitude of performance differences, not just their relative rank. If we assume the relationship between any pair of benchmark scores ought to be linear, we could calculate the MSE/RMSE? It'd be interesting to see how that differs from the Spearman/Kendall's correlations.
That's a great idea, I'll look into it.
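The magnitude-aware comparison suggested above could look something like this: fit a least-squares line between two benchmarks' scores and take the RMSE of the residuals (low RMSE means the scores track each other's magnitudes, not just their ranks). A sketch only; the score pairs in the example are made-up.

```python
# RMSE of residuals after a least-squares linear fit ys ~ m*xs + c.
# A rank-based metric (Spearman/Kendall) would give the same value for any
# monotonic relationship; this one penalizes departures from linearity.
def rmse_after_linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # closed-form least-squares slope and intercept
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    c = my - m * mx
    return (sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / n) ** 0.5

# perfectly linear made-up scores -> RMSE near 0, regardless of slope
rmse_after_linear_fit([1, 2, 3, 4], [10, 20, 30, 40])
```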
What temperature have you used for the existing benchmarks? (Just so I can try to match the methodology.)
Reading the code, I found temp 0.01, increasing by 0.15 on a fail.
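That retry behaviour could be sketched roughly like this, assuming the numbers above (start at temperature 0.01, bump by 0.15 after each failed attempt). `run_inference` is a hypothetical stand-in for whatever calls the model and parses the answer, returning None on a failed parse; it is not a function from the benchmark's codebase.

```python
# Sketch of retry-with-increasing-temperature, per the values quoted above.
# `run_inference` is a hypothetical callable: (prompt, temperature) -> result
# on success, None on a failed parse.
def bench_with_retries(prompt, run_inference, max_retries=5):
    temp = 0.01
    for _ in range(max_retries):
        result = run_inference(prompt, temperature=temp)
        if result is not None:  # answer parsed successfully
            return result
        temp += 0.15  # raise temperature to escape a degenerate completion
    return None  # give up after max_retries attempts
```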
Hey! Great work on setting this up.
Can you publish the model outputs and judge annotations like AlpacaEval does? https://github.com/tatsu-lab/alpaca_eval/tree/main/results
It's really helpful for