The benchmarks in Inspect are not intended as a formal suite, but rather as example uses of the framework. We will generally check them on 2 or 3 models to ensure that the results are comparable to what is published. A formal suite, by contrast, would ideally be run regularly (and adapted as required) against many more models, and we'd also want to publish results. Not sure that this is something we will do in the short term.
Net: I think if there is a broader effort to build out a suite of Inspect evals comparable to OpenAI/EvalHarness it would at this stage be better done as a community effort. This is something we might do in the future but I have no way of handicapping likelihood or timetable.
Thanks for the reply! Makes sense. I would be happy to get this started for an initial set of benchmarks and encourage the community to add more. Do you think a separate package makes sense or would you recommend incorporating the registry into this repository?
I think it would ideally be in its own repository. Something like this: https://github.com/openai/simple-evals
Let us know here if/when you put up the repo! (and of course feel free to use the benchmarks we've already published as a starting point). Note that we have BoolQ (https://arxiv.org/abs/1905.10044) and PIQA (https://arxiv.org/abs/1911.11641) benchmarks coming in our next release (~ next few days)
@alexandraabbas we've had some additional internal discussion on this and do think it would be beneficial to make `benchmarks/` more than just examples. That would mean running them on some models and publishing the results in the README, as well as providing scripts to run them all against any model (generating the same output).
@sdtblckgov is going to spend some time working on this. We will post back to this issue once we have something more to share. In the meantime, if you are working on benchmarks, let us know the repo URL here and we can provide feedback, etc. (Eventually we hope to be able to merge these as PRs, but we are still working on the process for that.)
I think making `benchmarks/` more than just examples would be very beneficial!
No problem! Sounds good, I'll get back with the repo once I have something to show.
Hi @alexandraabbas, an update: we have now sorted things out for taking contributions into `benchmarks/`. If you are still interested in contributing here, please coordinate with @jwilles from Vector Institute, as they are leading this effort. Thanks!
Do I understand correctly that AISI prefers other benchmarks to be in a separate package?
Or do you expect users to contribute benchmarks under `benchmarks/`, similar to OpenAI's registry or lm-evaluation-harness's tasks?