carlini / yet-another-applied-llm-benchmark

A benchmark to evaluate language models on questions I've previously asked them to solve.
GNU General Public License v3.0

Benchmark for some open source models #16

Open — davideuler opened this issue 2 months ago

davideuler commented 2 months ago

It is amazing that the mixtral-8x7b-instruct-v0.1.Q6_K GGUF got a 25% pass rate.

[screenshot: benchmark results grid for mixtral-8x7b-instruct-v0.1.Q6_K]
carlini commented 2 months ago

Ah very nice. I should write some code that will merge together multiple independent datasets to make a larger matrix...
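Something like the following rough sketch, assuming each run's results are saved as a JSON dict of `{model: {test_name: passed}}` (the file format and the `merge_results` helper are hypothetical here, not the repo's actual code):

```python
# Rough sketch: merge several independently collected result files,
# each a JSON dict of {model: {test_name: passed}}, into one combined
# model-by-test matrix for side-by-side comparison.
import json
import sys

def merge_results(paths):
    merged = {}  # model -> {test_name: passed}
    for path in paths:
        with open(path) as f:
            results = json.load(f)
        for model, tests in results.items():
            # Later files overwrite earlier entries for the same test.
            merged.setdefault(model, {}).update(tests)
    return merged

if __name__ == "__main__":
    merged = merge_results(sys.argv[1:])
    # Union of all test names across every contributed dataset.
    all_tests = sorted({t for tests in merged.values() for t in tests})
    for model, results in sorted(merged.items()):
        passed = sum(bool(results.get(t)) for t in all_tests)
        print(f"{model}: {passed}/{len(all_tests)} ({passed / len(all_tests):.0%})")
```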

I guess we don't know what Mistral Medium is, but if it's some variant of Mixtral, it makes sense that their scores are similar-ish?

davideuler commented 2 months ago

I guess Mistral Medium may be a Mistral 70B instruct model, or some MoE like Mixtral 8x7B. If the independent datasets were merged into one large matrix, it would be perfect for checking the differences between models. I wonder which open model is currently the most capable at code generation.

davideuler commented 2 months ago

The result for deepseek-coder-33b-instruct is a big surprise.

[screenshot: benchmark results grid for deepseek-coder-33b-instruct]