UKGovernmentBEIS / inspect_ai

Inspect: A framework for large language model evaluations
https://inspect.ai-safety-institute.org.uk/

Should new benchmarks go under the benchmarks directory or in a separate package? #19

Closed by alexandraabbas 2 months ago

alexandraabbas commented 4 months ago

Do I understand correctly that AISI prefers other benchmarks to be in a separate package?

Or do you expect users to contribute benchmarks under benchmarks/ similar to OpenAI's registry or lm-evaluation-harness's tasks?
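
For concreteness, a contribution in either location would essentially be a module exposing an Inspect task. Here is a minimal sketch, assuming the `hf_dataset` / `multiple_choice` pieces of the inspect_ai API as used in the published examples; the dataset id, field names, and task name below are placeholders, not an actual benchmark:

```python
# Hypothetical benchmark module, e.g. benchmarks/my_benchmark.py
# (dataset id, field names, and task name are placeholders).
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice


def record_to_sample(record):
    # Map one raw dataset record onto Inspect's Sample fields.
    return Sample(
        input=record["question"],
        choices=record["choices"],
        target=record["answer"],  # e.g. "A"
    )


@task
def my_benchmark():
    return Task(
        dataset=hf_dataset(
            path="some-org/some-dataset",  # placeholder Hugging Face dataset id
            split="validation",
            sample_fields=record_to_sample,
        ),
        plan=[multiple_choice()],  # early releases use plan=; later ones call this solver=
        scorer=choice(),
    )
```

A module like this can then be run directly, e.g. `inspect eval benchmarks/my_benchmark.py --model openai/gpt-4`, regardless of which repository it lives in.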

aisi-inspect commented 4 months ago

The benchmarks in Inspect are not intended as a formal suite, but rather as example uses of the framework. We will generally check them on 2 or 3 models to ensure that the results are comparable to what is published. A formal suite, however, would ideally be run regularly (and adapted as required) across many more models, and we'd also want to publish the results. I'm not sure that this is something we will do in the short term.

Net: I think that if there is a broader effort to build out a suite of Inspect evals comparable to OpenAI/EvalHarness, it would at this stage be better done as a community effort. This is something we might take on in the future, but I can't estimate the likelihood or timetable.

alexandraabbas commented 4 months ago

Thanks for the reply! Makes sense. I would be happy to get this started for an initial set of benchmarks and encourage the community to add more. Do you think a separate package makes sense or would you recommend incorporating the registry into this repository?

jjallaire commented 4 months ago

I think it would ideally be in its own repository. Something like this: https://github.com/openai/simple-evals

jjallaire commented 4 months ago

Let us know here if/when you put up the repo! (And of course feel free to use the benchmarks we've already published as a starting point.) Note that we have BoolQ (https://arxiv.org/abs/1905.10044) and PIQA (https://arxiv.org/abs/1911.11641) benchmarks coming in our next release (~ next few days).

jjallaire commented 4 months ago

@alexandraabbas we've had some additional internal discussion on this and do think it would be beneficial to make benchmarks/ more than just examples. That would mean running them on some models and publishing the results in the README, as well as providing scripts to run them all against any model (generating the same output).
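
As a rough sketch of what such a script could look like, assuming inspect_ai's `eval()` API (the benchmark file list and the model passed on the command line are placeholders, not the actual script):

```python
# run_benchmarks.py: hypothetical sketch of a "run them all against any model"
# script built on inspect_ai's eval() API. The file list below is illustrative.
import sys

from inspect_ai import eval

BENCHMARKS = [
    "benchmarks/arc.py",
    "benchmarks/gsm8k.py",
    "benchmarks/mmlu.py",
]

if __name__ == "__main__":
    # Usage: python run_benchmarks.py openai/gpt-4
    model = sys.argv[1]
    # eval() accepts task files (or Task objects) plus a model identifier;
    # the resulting logs can then be summarised into a README results table.
    logs = eval(BENCHMARKS, model=model)
```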

@sdtblckgov is going to spend some time working on this. We will post back to this issue once we have something more to share. In the meantime, if you are working on benchmarks, let us know the repo URL here and we can provide feedback, etc. (Eventually we hope to be able to merge these as PRs, but we are still working on the process for that.)

alexandraabbas commented 4 months ago

I think making benchmarks/ more than just examples would be very beneficial!

No problem, sounds good! I'll post back here with the repo once I have something to show.

jjallaire commented 2 months ago

Hi @alexandraabbas, an update: we have now sorted out the process for taking contributions into benchmarks/. If you are still interested in contributing here, please coordinate with @jwilles from the Vector Institute, who is leading this effort. Thanks!