EQ-bench / EQ-Bench

A benchmark for emotional intelligence in large language models

v2 outputs #4

Closed gblazex closed 7 months ago

gblazex commented 7 months ago

Hey! Great work on setting this up.

Can you publish the model outputs and judge annotations like alpaca eval does? https://github.com/tatsu-lab/alpaca_eval/tree/main/results

It's really helpful for researching better LLM judges.

sam-paech commented 7 months ago

Great idea, I'll look into setting this up.

sam-paech commented 7 months ago

I've uploaded all the raw results files for the models I've benchmarked. Thanks for the suggestion.

I saw the benchmark correlation matrix you made -- very interesting. I'd love to see how EQ-Bench lines up with the others! Let me know if you want me to bench any models in particular if you don't have enough data to correlate.

gblazex commented 7 months ago

It's amazing how fast you uploaded the outputs. It'll help a lot with further research into better LLM judges.

EQ v2 is showing much better correlations so far. What I like about it is that it's not going to be affected by response length (like Alpaca is), right?

I'll post the list of missing models below; feel free to pick which ones you'd like to run. (I might be able to help with some of the closed-source outputs, like Claude.)

List of missing models (order same as Arena)

Gemini

[screenshot, 2024-02-02]

Which version is listed on EQ-bench? I'm guessing it's Gemini Pro (dev), right?

gblazex commented 7 months ago

+1. If I can make a personal suggestion: the amazing @mlabonne has a new model called OmniBeagle, which improved a lot on NeuralBeagle (on both MT-bench and the Nous suite): https://huggingface.co/mlabonne/OmniBeagle-7B/discussions/1

I think it's worth adding, since his was the best 7B so far.

sam-paech commented 7 months ago

Correct, it doesn't have a bias for length of output.
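(For anyone who wants to sanity-check that, here's a minimal sketch of a length-bias check over the raw results; the file name and record structure below are assumptions, not the repo's actual format.)

```python
# Sketch: correlate response length with score to look for a length bias.
# "raw_results.json" and its record layout are hypothetical placeholders.
import json
from scipy.stats import pearsonr

with open("raw_results.json") as f:
    records = json.load(f)  # assumed: list of {"response": str, "score": float}

lengths = [len(r["response"]) for r in records]
scores = [r["score"] for r in records]

r, p = pearsonr(lengths, scores)
print(f"Length vs. score correlation: r={r:.3f} (p={p:.3g})")
```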

I'll try to get some more of these benched for you. I don't have API access to Claude, so if you can help with those that'd be great.

As for which Gemini Pro version I'm using: um... good question. I'll have to check on that. It's one of the API ones.

gblazex commented 7 months ago

Sounds amazing, thank you! Can't wait to add yours to the leaderboard landscape.

sam-paech commented 7 months ago

I added these models that I was able to bench successfully:

- WizardLM/WizardLM-70B-V1.0: 71.28
- lmsys/vicuna-13b-v1.5: 67.39
- allenai/tulu-2-dpo-70b: 76.63
- WizardLM/WizardLM-13B-V1.2: 63.71
- cognitivecomputations/dolphin-2.2.1-mistral-7b: 69.92
- timdettmers/guanaco-33b-merged: 36.11
- teknium/OpenHermes-2.5-Mistral-7B: 66.89
- berkeley-nest/Starling-LM-7B-alpha: 73.9
- lmsys/vicuna-33b-v1.3: 67.07

gblazex commented 7 months ago

You were so fast! I really appreciate it.

Correlations looking really good

Spearman correlations:
- EQ-bench v2: 0.863
- MT-bench: 0.891
- Alpaca v2: 0.899

Kendall's tau correlations:
- EQ-bench v2: 0.730
- MT-bench: 0.759
- Alpaca v2: 0.759

(I only checked overlapping rows where models have results for all 3 benchmarks)

We have 31 EQ models matching Arena.
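For reference, the overlap correlations can be computed roughly like this (a sketch only; the CSV filenames and column layout here are assumptions, not the actual data files):

```python
# Sketch: restrict to models present on every benchmark, then compute
# rank correlations against Arena. Filenames/columns are hypothetical.
import pandas as pd
from scipy.stats import spearmanr, kendalltau

arena = pd.read_csv("arena_elo.csv")        # assumed columns: model, score
eq = pd.read_csv("eq_bench_v2.csv")         # assumed columns: model, score
mt = pd.read_csv("mt_bench.csv")            # assumed columns: model, score
alpaca = pd.read_csv("alpaca_eval_v2.csv")  # assumed columns: model, score

# Inner merges keep only the overlapping rows (models scored everywhere).
merged = (
    arena.rename(columns={"score": "arena"})
    .merge(eq.rename(columns={"score": "eq_v2"}), on="model")
    .merge(mt.rename(columns={"score": "mt_bench"}), on="model")
    .merge(alpaca.rename(columns={"score": "alpaca_v2"}), on="model")
)

for bench in ["eq_v2", "mt_bench", "alpaca_v2"]:
    rho, _ = spearmanr(merged["arena"], merged[bench])
    tau, _ = kendalltau(merged["arena"], merged[bench])
    print(f"{bench}: Spearman={rho:.3f}  Kendall tau={tau:.3f}")
```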

I'll try to get your Claude results, and possibly Perplexity too. I can also easily run CodeLlama-34B-instruct through Together.ai.

gblazex commented 7 months ago

Do you have Twitter, in case I want to mention you?

On both Twitter and Reddit, I want to let folks know how your benchmark is shaping up relative to Arena.

Also, I saw one more that could be added from Arena:

That would make the picture even more complete.

sam-paech commented 7 months ago

Interesting results! I'll be very curious how the Claude & Perplexity models perform.

I've avoided using Twitter thus far, but maybe now is the time to dust off the account and join the rest of the AI community. My handle is @sam_paech.

I now have 1 follower; it's you, lol.

Have you looked into other correlation analysis methods at all? I'm thinking these metrics that only take rank into account might only tell half the story, in that some benchmarks are better at discriminating the magnitude of performance differences rather than just their relative rank. I guess we could assume a linear relationship between any pair of benchmark scores and calculate the MSE/RMSE of the fit? It'd be interesting to see how that differs from the Spearman/Kendall's correlations.
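Something like this could work as a first pass (a rough sketch; the score arrays below are placeholder values, not real benchmark data):

```python
# Sketch: fit a linear relationship between two benchmarks on overlapping
# models, then report RMSE alongside the rank-only metrics.
import numpy as np
from scipy.stats import spearmanr, kendalltau

arena_scores = np.array([1250.0, 1180.0, 1120.0, 1075.0, 1030.0])  # placeholder Elo values
eq_scores = np.array([82.0, 77.5, 74.0, 70.0, 66.5])               # placeholder EQ-Bench scores

# Least-squares linear fit: eq ~ a * arena + b
a, b = np.polyfit(arena_scores, eq_scores, deg=1)
predicted = a * arena_scores + b
rmse = np.sqrt(np.mean((eq_scores - predicted) ** 2))

rho, _ = spearmanr(arena_scores, eq_scores)
tau, _ = kendalltau(arena_scores, eq_scores)
print(f"Spearman={rho:.3f}  Kendall tau={tau:.3f}  RMSE of linear fit={rmse:.2f}")
```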

gblazex commented 7 months ago

That's a great idea, I'll look into it.

What temperature did you use for the existing benchmark runs? (Just so I can try to match the methodology.)

gblazex commented 7 months ago

Reading the code, I found temp 0.01, increasing by 0.15 on each failure.
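In other words, the retry behaviour looks roughly like this (a sketch with hypothetical `generate`/`parse` helpers, not the actual EQ-Bench source):

```python
# Sketch of the described retry logic: start near-deterministic and bump the
# temperature by 0.15 each time the answer fails to parse.
def query_with_retries(prompt, generate, parse, max_retries=5):
    """`generate(prompt, temperature)` and `parse(text)` are hypothetical
    stand-ins for the inference call and the answer parser."""
    temperature = 0.01
    for _ in range(max_retries):
        raw = generate(prompt, temperature=temperature)
        result = parse(raw)
        if result is not None:
            return result
        temperature += 0.15  # loosen sampling a bit and try again
    return None  # give up after max_retries failed parses
```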