EQ-Bench: a benchmark for emotional intelligence in large language models

Contributing with other judges #23

[Open] Krisseck opened this issue 6 months ago

Krisseck commented 6 months ago

I'd like to contribute the Creative Writing benchmark.

Since I live in the EU, I do not have access to the Claude API. I am currently running the benchmark with Mixtral 8x22B as the judge, via the Mistral API.
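For reference, this is roughly how the judge calls look on my end. It's a minimal sketch using the Mistral API's OpenAI-compatible endpoint; the prompt and the scoring criteria here are placeholders, not the actual EQ-Bench judge prompts:

```python
from openai import OpenAI

# The Mistral API exposes an OpenAI-compatible endpoint, so the standard
# openai client works with a different base_url. The judge prompt below
# is a placeholder -- the real benchmark supplies its own criteria.
client = OpenAI(
    base_url="https://api.mistral.ai/v1",
    api_key="YOUR_MISTRAL_API_KEY",  # assumption: your own Mistral key
)

story_text = "..."  # the model output being judged

response = client.chat.completions.create(
    model="open-mixtral-8x22b",
    temperature=0.0,  # keep judging as deterministic as possible
    messages=[
        {
            "role": "user",
            "content": (
                "Judge the following piece of creative writing on a 0-10 "
                "scale for each criterion, then give an overall score.\n\n"
                + story_text
            ),
        }
    ],
)
print(response.choices[0].message.content)
```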

Can I contribute those results? Also, is there a tutorial on how to share results, or do I just open a PR? 🙂

sam-paech commented 6 months ago

Hi Krisseck,

There isn't currently a formal submission process for the creative writing test. However, if you have models / results to share, I'll be happy to take a look and reproduce any that look interesting using claude-opus for inclusion on the leaderboard.

Krisseck commented 6 months ago

If it's alright, I'll share some benchmark results here. No big surprises though.

| Prompt format | Model | Score | Test |
| --- | --- | --- | --- |
| ChatML | TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF (Q8) | 52.74 | creative-writing |
| ChatML | NeverSleep/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-GGUF (Q3_K_M) | 58.4 | creative-writing |
| Alpaca | TheBloke/EstopianMaid-13B-GGUF (Q8) | 51.49 | creative-writing |
| Alpaca | TheBloke/MythoMax-L2-13B-GGUF (Q8) | 52.1 | creative-writing |
| Mistral | N8Programs/Coxcomb-GGUF (Q8) | 56.09 | creative-writing |

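In case anyone wants to reproduce these, the GGUF quants above run fine locally. A minimal sketch with llama-cpp-python (the model path, prompt, and sampling settings are illustrative, not my exact setup):

```python
from llama_cpp import Llama

# Load one of the GGUF quants from the table above (path is illustrative).
llm = Llama(
    model_path="./models/capybarahermes-2.5-mistral-7b.Q8_0.gguf",
    n_ctx=4096,  # enough context for the creative-writing prompts
)

# CapybaraHermes expects ChatML, so use the chat completion interface,
# which applies the chat template for us.
result = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "Write a short story about a lighthouse keeper.",
        }
    ],
    max_tokens=1024,
    temperature=0.7,
)
print(result["choices"][0]["message"]["content"])
```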
sam-paech commented 6 months ago

Thanks for sharing these results! I'll see how they fare with claude-opus as judge. I'm a bit surprised the numbers are so low, actually, considering that mixtral-8x22b-instruct typically scored models significantly higher than this in its judgemark results: https://eqbench.com/results/judgemark_test_model_scores/judgemark_score_ci_mistralai__Mixtral-8x22B-Instruct-v0.1.png

Were you using oobabooga as the inference engine for these results?