OpenGenerativeAI / llm-colosseum

Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM
https://huggingface.co/spaces/junior-labs/llm-colosseum
MIT License
1.32k stars, 156 forks

ELO ranking score? #47

Open Tokkiu opened 5 months ago

Tokkiu commented 5 months ago
[Screenshot, 2024-04-10 11:34:05]

How is this ranking generated? If I add a new model, how can I reproduce this benchmark?

Tokkiu commented 5 months ago

My new model is implemented in this PR: https://github.com/OpenGenerativeAI/llm-colosseum/pull/45/files You can watch a video of my model vs. Mistral here: https://github.com/Tokkiu/llm-colosseum?tab=readme-ov-file#1-vs-1-mistral-vs-solar

shawokou123 commented 5 months ago

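(Quoting Tokkiu:) My new model is implemented in this PR: https://github.com/OpenGenerativeAI/llm-colosseum/pull/45/files You can watch a video of my model vs. Mistral here: https://github.com/Tokkiu/llm-colosseum?tab=readme-ov-file#1-vs-1-mistral-vs-solar

Hello 璟琦, I'm also very interested in this project. Could we get in touch to discuss it?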

taozhiyuai commented 5 months ago

I just ran 50 rounds between two models; the results show which of the two is better. At the moment, Gemma 7B is the best. v1.1 is worse.
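
For anyone who wants an actual rating rather than a raw win count, here is a minimal sketch of how head-to-head results like the 50 rounds above could be turned into an Elo-style ranking. It uses the textbook Elo update, not necessarily what this repo does to produce its leaderboard. The function names, the K-factor of 32, the starting rating of 1500, and the model names in the demo are all assumptions made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    winner: str,
    loser: str,
    k: float = 32.0,        # assumed K-factor, not taken from the repo
    initial: float = 1500.0,  # assumed starting rating for unseen models
) -> None:
    """Update ratings in place after one decided match (no draws)."""
    r_win = ratings.get(winner, initial)
    r_lose = ratings.get(loser, initial)
    e_win = expected_score(r_win, r_lose)
    # The winner scored 1 against an expectation of e_win; the loser
    # loses exactly what the winner gains, so the total is conserved.
    delta = k * (1.0 - e_win)
    ratings[winner] = r_win + delta
    ratings[loser] = r_lose - delta


def rank_models(
    results: Iterable[Tuple[str, str]],
    k: float = 32.0,
    initial: float = 1500.0,
) -> List[Tuple[str, float]]:
    """Replay (winner, loser) pairs in order; return models sorted by rating."""
    ratings: Dict[str, float] = {}
    for winner, loser in results:
        update_elo(ratings, winner, loser, k=k, initial=initial)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Made-up match log, purely illustrative: (winner, loser) per round.
    demo_results = [
        ("model_a", "model_b"),
        ("model_a", "model_b"),
        ("model_b", "model_a"),
    ]
    for name, rating in rank_models(demo_results):
        print(f"{name}: {rating:.1f}")
```

To add a new model under this scheme, you would log the winner and loser of each of its fights against the existing models and replay the full match list. Note that sequential Elo depends on match order, so a small number of rounds gives a noisy rating.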