UKGovernmentBEIS / inspect_ai

Inspect: A framework for large language model evaluations
https://UKGovernmentBEIS.github.io/inspect_ai/
MIT License

Is there a recommended way to run a plan outside an eval? #33

Closed: voberoi closed this issue 3 weeks ago

voberoi commented 1 month ago

Let's say I've built a suite of evals for my task, and I've gotten it working well enough that I'd like to deploy a plan to production.

Is there a path today to go from a working plan/config in the evaluation suite to running that plan/config in production?

It seems like one way to do this would be to replicate this loop in my own code: https://github.com/UKGovernmentBEIS/inspect_ai/blob/b6b75a369eb702fca518c03d893f5067301151ec/src/inspect_ai/_eval/task/run.py#L258

... and then extract what I need from the TaskState structure to do whatever I want with the output.

Any reason that would be inadvisable?

aisi-inspect commented 1 month ago

While this isn't really the design center of Inspect (which is fundamentally a harness for testing and evaluation), there is no reason you can't do something like what you are proposing. You would just need to construct a TaskState, run the loop over the plan's solvers, and then inspect the TaskState at the end for the output.
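A minimal sketch of that approach, assuming solvers are async callables that take a state plus a generate function and return the updated state. `TaskState`, `fake_generate`, `add_system_message`, `call_generate`, and `run_plan` below are hypothetical, simplified stand-ins written for illustration; they are not the actual inspect_ai classes or functions, whose constructors and fields may differ.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List


# Hypothetical, simplified stand-in for Inspect's TaskState (not the real class).
@dataclass
class TaskState:
    input: str
    messages: List[str] = field(default_factory=list)
    output: str = ""


# A generate function takes the state and returns it with model output filled in.
Generate = Callable[[TaskState], Awaitable[TaskState]]
# A solver takes (state, generate) and returns a (possibly modified) state.
Solver = Callable[[TaskState, Generate], Awaitable[TaskState]]


async def fake_generate(state: TaskState) -> TaskState:
    # Stand-in for a real model call: "answers" by upper-casing the last message.
    state.output = state.messages[-1].upper()
    state.messages.append(state.output)
    return state


def add_system_message(text: str) -> Solver:
    # Hypothetical solver: prepends a system prompt to the conversation.
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        state.messages.insert(0, text)
        return state

    return solve


def call_generate() -> Solver:
    # Hypothetical solver: delegates to the generate function.
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        return await generate(state)

    return solve


async def run_plan(plan: List[Solver], user_input: str, generate: Generate) -> TaskState:
    # 1. Construct the initial state from the production input.
    state = TaskState(input=user_input, messages=[user_input])
    # 2. Run each solver in order, threading the state through the plan.
    for solver in plan:
        state = await solver(state, generate)
    # 3. Return the final state so the caller can read the output.
    return state


if __name__ == "__main__":
    plan = [add_system_message("You are helpful."), call_generate()]
    final = asyncio.run(run_plan(plan, "hello", fake_generate))
    print(final.output)  # HELLO
```

In production you would replace `fake_generate` with a real model call and build the state the way the real TaskState expects; the loop itself is the only part that mirrors the eval runner.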

voberoi commented 1 month ago

Thanks, that's helpful!

What can folks expect in terms of the stability of things like the solver interface? And is there a recommended way to instantiate generators?

jjallaire commented 3 weeks ago

The solver interface is stable.