carlini / yet-another-applied-llm-benchmark

A benchmark to evaluate language models on questions I've previously asked them to solve.

Would you want to make a leaderboard for this? #10

Open clefourrier opened 6 months ago

clefourrier commented 6 months ago

Hi!

Super cool work! I'm a researcher at HuggingFace working on evaluation and leaderboards.

I understand that this eval suite exists first and foremost to evaluate use cases you personally find interesting, and that it might/will change over time, which makes it less obvious to build a leaderboard from.

However, I think the community really lacks leaderboards for applied, very concrete tasks like your C++ evaluation or code-conversion tests. For non-devs, leaderboards are an interesting way to get a surface-level idea of model capabilities.

So would you be interested in pinning a version and making a leaderboard out of it? If yes, I'd love to give you a hand.

(Side note: we've got good ways to evaluate chat capabilities through Elo scores and arenas, plus of course a wide range of purely academic benchmarks, and we're now starting to see more leaderboards on applied datasets (like enterprise use cases), but there's a real shortage of practical benchmarks like yours, imo.)

carlini commented 6 months ago

Yeah maybe this isn't a bad idea... I did update my initial blog post with claude-3 and mistral large (https://nicholas.carlini.com/writing/2024/my-benchmark-for-large-language-models.html)

But maybe having an explicit leaderboard wouldn't be so bad. I've got a number of email requests for exactly this as well.

In order to do this there would need to be a few changes; if you'd be interested in making them, that would be great:

clefourrier commented 6 months ago

I'd keep your evals as is for now, and add tags later if they are requested by the community.

For adding new models, when you say it requires non-zero work, is it an issue of compute, of not having all the task descriptions grouped, ...?

For the last point, the actual leaderboard aspect, we've got templates which read from a dataset and automatically update the displayed results (here).
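(Not the actual template code, but roughly the shape of the idea: a small Gradio app that reads per-model scores from a Hugging Face dataset and renders them as a table, so the displayed results update whenever the dataset does. The dataset id and column names below are placeholders.)

```python
# Sketch of a dataset-backed leaderboard app; names are placeholders, not the real template.
import gradio as gr
import pandas as pd
from datasets import load_dataset

RESULTS_DATASET = "someuser/llm-benchmark-results"  # placeholder repo id

def load_results() -> pd.DataFrame:
    # Each row is assumed to hold a model name and its pass rate on the benchmark.
    ds = load_dataset(RESULTS_DATASET, split="train")
    return ds.to_pandas().sort_values("pass_rate", ascending=False)

with gr.Blocks() as demo:
    gr.Markdown("# Applied LLM Benchmark Leaderboard")
    table = gr.Dataframe(value=load_results())
    # Re-read the dataset on demand so new results show up without a redeploy.
    refresh = gr.Button("Refresh")
    refresh.click(fn=load_results, outputs=table)

if __name__ == "__main__":
    demo.launch()
```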

carlini commented 5 months ago

The work is just that you have to create a model/[llm].py file. I suppose for the case of HuggingFace models this should be trivial, as long as they have the chat interface tokenizer stuff set up. (I don't remember what this is actually called.)
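(The "chat interface tokenizer stuff" is presumably the tokenizer's chat template. A rough sketch of what a model/[llm].py wrapper for a HuggingFace model might look like is below; the class and method names are guesses at the interface the benchmark expects, not the repo's actual API.)

```python
# Hypothetical HuggingFace model wrapper; interface names are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class HuggingFaceModel:
    def __init__(self, name="mistralai/Mistral-7B-Instruct-v0.2"):
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.model = AutoModelForCausalLM.from_pretrained(
            name, torch_dtype=torch.float16, device_map="auto"
        )

    def make_request(self, conversation):
        # conversation: alternating user/assistant strings, user first.
        messages = [
            {"role": "user" if i % 2 == 0 else "assistant", "content": turn}
            for i, turn in enumerate(conversation)
        ]
        # apply_chat_template handles the model-specific prompt formatting.
        inputs = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(self.model.device)
        output = self.model.generate(inputs, max_new_tokens=1024)
        # Decode only the newly generated tokens, not the prompt.
        return self.tokenizer.decode(
            output[0][inputs.shape[1]:], skip_special_tokens=True
        )
```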

I've just pushed a commit (656a597d012dcc0688c9d6054b455b3bde38e3e9) that adds support for incremental builds, so re-running the benchmark should be (much) faster.
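(This is not the code from that commit, just a sketch of the general idea behind incremental runs: cache each (model, test) result on disk under a hash, and skip tests whose inputs haven't changed. Names and cache layout are assumptions.)

```python
# Sketch of result caching for incremental re-runs; not the repo's implementation.
import hashlib
import json
import os

CACHE_DIR = "results_cache"  # placeholder location

def cache_key(model_name, test_name, test_source):
    # If the test's source changes, the hash changes and the test is re-run.
    digest = hashlib.sha256(
        f"{model_name}:{test_name}:{test_source}".encode()
    ).hexdigest()
    return os.path.join(CACHE_DIR, f"{digest}.json")

def run_with_cache(model_name, test_name, test_source, run_fn):
    path = cache_key(model_name, test_name, test_source)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)  # reuse the previous result
    result = run_fn()  # actually run the benchmark test
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result
```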

I'll take a look at this huggingface page and see if I can get something uploaded there. I have my own personal server that I can run things on, but it's not beefy enough to run any of the largest models. (I can run 7B models, but not much more.) I may see how much work it would be to create a cloud image that would just clone this and run the models. It probably wouldn't end up being too expensive per run if I can make it work easily...