EQ-Bench: a benchmark for emotional intelligence in large language models

Contributing with other judges #23

[Open] Krisseck opened this issue 6 months ago

Krisseck commented 6 months ago

I'd like to contribute the Creative Writing benchmark.

Since I live in the EU, I do not have access to the Claude API. I am currently running the benchmark with Mixtral 8x22B as the judge, via the Mistral API.
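For reference, this is roughly how the judge calls look on my end. It's a minimal sketch using the Mistral API's OpenAI-compatible endpoint; the prompt and the scoring criteria here are placeholders, not the actual EQ-Bench judge prompts:

```python
from openai import OpenAI

# The Mistral API exposes an OpenAI-compatible endpoint, so the standard
# openai client works with a different base_url. The judge prompt below
# is a placeholder -- the real benchmark supplies its own criteria.
client = OpenAI(
    base_url="https://api.mistral.ai/v1",
    api_key="YOUR_MISTRAL_API_KEY",  # assumption: your own Mistral key
)

story_text = "..."  # the model output being judged

response = client.chat.completions.create(
    model="open-mixtral-8x22b",
    temperature=0.0,  # keep judging as deterministic as possible
    messages=[
        {
            "role": "user",
            "content": (
                "Judge the following piece of creative writing on a 0-10 "
                "scale for each criterion, then give an overall score.\n\n"
                + story_text
            ),
        }
    ],
)
print(response.choices[0].message.content)
```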

Can I contribute those results? Also, is there a tutorial on how to share results, or do I just open a PR? 🙂

sam-paech commented 6 months ago

Hi Krisseck,

There isn't currently a formal submission process for the creative writing test. However, if you have models / results to share, I'll be happy to take a look and reproduce any that look interesting using claude-opus for inclusion on the leaderboard.

Krisseck commented 6 months ago

If it's alright, I'll share some benchmark results here. No big surprises though.

| Prompt format | Model | Score | Test |
| --- | --- | --- | --- |
| ChatML | TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF (Q8) | 52.74 | creative-writing |
| ChatML | NeverSleep/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-GGUF (Q3_K_M) | 58.4 | creative-writing |
| Alpaca | TheBloke/EstopianMaid-13B-GGUF (Q8) | 51.49 | creative-writing |
| Alpaca | TheBloke/MythoMax-L2-13B-GGUF (Q8) | 52.1 | creative-writing |
| Mistral | N8Programs/Coxcomb-GGUF (Q8) | 56.09 | creative-writing |

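In case anyone wants to reproduce these, the GGUF quants above run fine locally. A minimal sketch with llama-cpp-python (the model path, prompt, and sampling settings are illustrative, not my exact setup):

```python
from llama_cpp import Llama

# Load one of the GGUF quants from the table above (path is illustrative).
llm = Llama(
    model_path="./models/capybarahermes-2.5-mistral-7b.Q8_0.gguf",
    n_ctx=4096,  # enough context for the creative-writing prompts
)

# CapybaraHermes expects ChatML, so use the chat completion interface,
# which applies the chat template for us.
result = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "Write a short story about a lighthouse keeper.",
        }
    ],
    max_tokens=1024,
    temperature=0.7,
)
print(result["choices"][0]["message"]["content"])
```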
sam-paech commented 6 months ago

Thanks for sharing these results! I'll see how they fare with claude-opus as judge. I'm a bit surprised the numbers are so low, actually, considering that mixtral-8x22b-instruct typically scored models significantly higher than this in its judgemark results: https://eqbench.com/results/judgemark_test_model_scores/judgemark_score_ci_mistralai__Mixtral-8x22B-Instruct-v0.1.png

Were you using oobabooga as the inference engine for these results?