carlini / yet-another-applied-llm-benchmark

A benchmark to evaluate language models on questions I've previously asked them to solve.

Would you want to make a leaderboard for this? #10

Open clefourrier opened 6 months ago

clefourrier commented 6 months ago

Hi!

Super cool work! I'm a researcher at HuggingFace working on evaluation and leaderboards.

I understand that this eval suite exists first and foremost to evaluate use cases you personally find interesting, and that it might/will change over time, which makes it less obvious to build a leaderboard from.

However, I think the community really lacks leaderboards for applied, very concrete tasks like your C++ evaluation or code-conversion tests. For non-devs, leaderboards are an interesting way to get a surface-level idea of model capabilities.

So would you be interested in pinning a version and making a leaderboard out of it? If yes, I'd love to give you a hand.

(Side note: we've got good ways to evaluate chat capabilities through Elo scores and arenas, plus of course a wide range of purely academic benchmarks, and we're now starting to see more leaderboards on applied datasets (like enterprise use cases), but there's a real shortage of practical benchmarks like yours, imo.)

carlini commented 6 months ago

Yeah maybe this isn't a bad idea... I did update my initial blog post with claude-3 and mistral large (https://nicholas.carlini.com/writing/2024/my-benchmark-for-large-language-models.html)

But maybe having an explicit leaderboard wouldn't be so bad. I've got a number of email requests for exactly this as well.

In order to do this there would need to be a few changes; if you'd be interested in making them, that would be great:

clefourrier commented 6 months ago

I'd keep your evals as is for now, and add tags later if they are requested by the community.

For adding new models, when you say it requires non-zero work, is it an issue of compute, of not having all the task descriptions grouped, ...?

For the last point, the actual leaderboard aspect, we've got templates which read from a dataset and automatically update the displayed results (here).
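(Not the actual template code, but roughly the shape of the idea: a small Gradio app that reads per-model scores from a Hugging Face dataset and renders them as a table, so the displayed results update whenever the dataset does. The dataset id and column names below are placeholders.)

```python
# Sketch of a dataset-backed leaderboard app; names are placeholders, not the real template.
import gradio as gr
import pandas as pd
from datasets import load_dataset

RESULTS_DATASET = "someuser/llm-benchmark-results"  # placeholder repo id

def load_results() -> pd.DataFrame:
    # Each row is assumed to hold a model name and its pass rate on the benchmark.
    ds = load_dataset(RESULTS_DATASET, split="train")
    return ds.to_pandas().sort_values("pass_rate", ascending=False)

with gr.Blocks() as demo:
    gr.Markdown("# Applied LLM Benchmark Leaderboard")
    table = gr.Dataframe(value=load_results())
    # Re-read the dataset on demand so new results show up without a redeploy.
    refresh = gr.Button("Refresh")
    refresh.click(fn=load_results, outputs=table)

if __name__ == "__main__":
    demo.launch()
```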

carlini commented 5 months ago

The work is just that you have to create a model/[llm].py file. I suppose for the case of HuggingFace models this should be trivial, as long as they have the chat interface tokenizer stuff set up. (I don't remember what this is actually called.)
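(The "chat interface tokenizer stuff" is presumably the tokenizer's chat template. A rough sketch of what a model/[llm].py wrapper for a HuggingFace model might look like is below; the class and method names are guesses at the interface the benchmark expects, not the repo's actual API.)

```python
# Hypothetical HuggingFace model wrapper; interface names are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class HuggingFaceModel:
    def __init__(self, name="mistralai/Mistral-7B-Instruct-v0.2"):
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.model = AutoModelForCausalLM.from_pretrained(
            name, torch_dtype=torch.float16, device_map="auto"
        )

    def make_request(self, conversation):
        # conversation: alternating user/assistant strings, user first.
        messages = [
            {"role": "user" if i % 2 == 0 else "assistant", "content": turn}
            for i, turn in enumerate(conversation)
        ]
        # apply_chat_template handles the model-specific prompt formatting.
        inputs = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(self.model.device)
        output = self.model.generate(inputs, max_new_tokens=1024)
        # Decode only the newly generated tokens, not the prompt.
        return self.tokenizer.decode(
            output[0][inputs.shape[1]:], skip_special_tokens=True
        )
```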

I've just pushed a commit (656a597d012dcc0688c9d6054b455b3bde38e3e9) that adds support for incremental builds, so re-running the benchmark should be (much) faster.
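(This is not the code from that commit, just a sketch of the general idea behind incremental runs: cache each (model, test) result on disk under a hash, and skip tests whose inputs haven't changed. Names and cache layout are assumptions.)

```python
# Sketch of result caching for incremental re-runs; not the repo's implementation.
import hashlib
import json
import os

CACHE_DIR = "results_cache"  # placeholder location

def cache_key(model_name, test_name, test_source):
    # If the test's source changes, the hash changes and the test is re-run.
    digest = hashlib.sha256(
        f"{model_name}:{test_name}:{test_source}".encode()
    ).hexdigest()
    return os.path.join(CACHE_DIR, f"{digest}.json")

def run_with_cache(model_name, test_name, test_source, run_fn):
    path = cache_key(model_name, test_name, test_source)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)  # reuse the previous result
    result = run_fn()  # actually run the benchmark test
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result
```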

I'll take a look at this huggingface page and see if I can get something uploaded there. I have my own personal server that I can run things on, but it's not beefy enough to run any of the largest models. (I can run 7B models, but not much more.) I may see how much work it would be to create a cloud image that would just clone this and run the models. It probably wouldn't end up being too expensive per run if I can make it work easily...