UKGovernmentBEIS / inspect_ai

Inspect: A framework for large language model evaluations
https://UKGovernmentBEIS.github.io/inspect_ai/
MIT License

Would more sophisticated request scheduling mechanisms be valuable? #40

Open · schmatz opened this issue 3 weeks ago

schmatz commented 3 weeks ago

I recently built an eval with Inspect to learn the framework. I used my personal model provider API keys for inference, which have fairly low rate limits. While I was able to use max_connections to tune my eval, it got me wondering if more sophisticated controls than max_connections and exponential backoff would be useful.

Would it be valuable if users of Inspect could specify provider request/token limits, and Inspect then scheduled requests to the model providers using something like the token bucket algorithm? I don't know how big of a deal rate limits are for the primary users of the library, or how much time is lost to exponential backoff. If you think it would be valuable, let me know and I can explore a solution.
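To make that concrete, I'm imagining something along these lines (just a minimal sketch on my end; the names here are made up and not Inspect APIs):

```python
# Minimal illustrative token bucket (asyncio): acquire() is awaited before
# each provider request so calls are paced against a configured limit.
# Class/parameter names are hypothetical, not part of Inspect.
import asyncio
import time


class TokenBucket:
    def __init__(self, rate_per_minute: float, capacity: float) -> None:
        self.rate = rate_per_minute / 60.0   # tokens replenished per second
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self, cost: float = 1.0) -> None:
        """Wait until `cost` tokens are available, then consume them."""
        while True:
            async with self.lock:
                now = time.monotonic()
                elapsed = now - self.updated
                self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
                self.updated = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return
                wait = (cost - self.tokens) / self.rate
            await asyncio.sleep(wait)


# e.g. for a provider limit of 500 requests/minute:
# requests_bucket = TokenBucket(rate_per_minute=500, capacity=500)
# await requests_bucket.acquire()  # before each generate() call
```

A second bucket keyed on estimated prompt/completion tokens could cover tokens-per-minute limits in the same way.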

jjallaire commented 3 weeks ago

Rate limits are indeed a big deal across the board! There is currently exponential backoff built-in at the request level. You probably saw this, but here is our current guidance about tuning: https://ukgovernmentbeis.github.io/inspect_ai/eval-tuning.html#model-apis.
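Concretely, tuning mostly comes down to adjusting max_connections, e.g. something like this (the task and model names are placeholders):

```python
# Illustrative: lowering concurrent provider connections for an eval run.
from inspect_ai import eval

eval("my_eval.py", model="openai/gpt-4", max_connections=5)

# or equivalently from the CLI:
#   inspect eval my_eval.py --model openai/gpt-4 --max-connections 5
```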

My general guidance to people is that if they see a small number of rate-limit errors (e.g. roughly the total number of samples) they are probably in the right zone (as those first backoffs are only a couple of seconds), but if they see upwards of 3-4x the number of samples they are probably paying too much in backoff time.

One thing to note is that there are two layers of backoff: the built-in backoff that e.g. the openai and anthropic packages do (which takes advantage of special HTTP headers to try to get the backoff "just right"), and an outer backoff that we put in to prevent losing an entire eval to a rate limit error (it was considered better to wait than to discard a bunch of tokens/time). We've since added the ability to recover fully scored samples from failed runs (via inspect eval-retry or the eval_retry() function), so perhaps this outer backoff layer is not as critical anymore.
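To illustrate the shape of that outer layer (this is not our actual implementation; tenacity is used here purely for illustration):

```python
# Sketch of an "outer" retry layer: keep waiting with exponential backoff on
# rate limit errors rather than failing the whole eval. Illustrative only;
# RateLimitError and client.generate are placeholders, not Inspect's code.
from tenacity import retry, retry_if_exception_type, wait_exponential_jitter


class RateLimitError(Exception):
    """Placeholder for whatever rate limit exception a provider SDK raises."""


@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential_jitter(initial=1, max=60),
)
async def generate_with_backoff(client, prompt):
    # the provider SDK call typically performs its own (inner) retries using
    # Retry-After style headers before this outer layer ever sees an error
    return await client.generate(prompt)
```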

Your point is an excellent one though: if we had a token/request limit to schedule against we could potentially be much smarter! I haven't thought about how we'd design this, but I'd definitely be open to pushing on this a bit to see if we can make things better.

BTW, your deception under pressure eval is awesome! Is that something you'd consider letting us adapt as an example?

schmatz commented 3 weeks ago

> One thing to note is that there are two layers of backoff: the built-in backoff that e.g. the openai and anthropic packages do (which takes advantage of special HTTP headers to try to get the backoff "just right"), and an outer backoff that we put in to prevent losing an entire eval to a rate limit error (it was considered better to wait than to discard a bunch of tokens/time)

Ah, that's a good point; I hadn't considered how the model provider libraries do backoff with the Retry-After headers and such.

> Your point is an excellent one though: if we had a token/request limit to schedule against we could potentially be much smarter! I haven't thought about how we'd design this, but I'd definitely be open to pushing on this a bit to see if we can make things better.

Cool, if I have some spare time I'll have a go at modeling the current behavior to see how much there is to gain from more sophisticated scheduling. It's possible there wouldn't be enough benefit to offset the increase in complexity, though I have a hunch it would help in some circumstances.

> BTW, your deception under pressure eval is awesome! Is that something you'd consider letting us adapt as an example?

Thanks, glad you liked it! I'd be happy to have it adapted into an example. Is that something that you'd want your team to take care of? I can also mimic the existing style of the examples, write some docs, and make a PR if you'd like.

jjallaire commented 3 weeks ago

> Cool, if I have some spare time I'll have a go at modeling the current behavior to see how much there is to gain from more sophisticated scheduling. It's possible there wouldn't be enough benefit to offset the increase in complexity, though I have a hunch it would help in some circumstances.

One other thing to consider is organisations that have many researchers doing testing -- there the token limit is actually being applied across a bunch of disparate clients. In this case the limit is not known statically but needs to be discovered. The solution here would be to do the token bucket thing in a proxy server.
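Very roughly, such a proxy might look like the sketch below (entirely hypothetical: it uses simple request spacing rather than a full token bucket, hardcodes the limit rather than discovering it, and the upstream URL/port are made up):

```python
# Hypothetical forwarding proxy enforcing a shared provider limit: every
# researcher's client points its API base URL here, and requests are spaced
# so the organisation-wide limit is respected.
import asyncio
import time

from aiohttp import ClientSession, web

UPSTREAM = "https://api.openai.com"   # placeholder upstream provider
MIN_INTERVAL = 60.0 / 500             # e.g. a shared 500 requests/minute limit

_lock = asyncio.Lock()
_last_request = 0.0


async def forward(request: web.Request) -> web.Response:
    global _last_request
    async with _lock:  # one shared clock for every client of the proxy
        wait = MIN_INTERVAL - (time.monotonic() - _last_request)
        if wait > 0:
            await asyncio.sleep(wait)
        _last_request = time.monotonic()
    async with ClientSession() as session:
        async with session.post(
            UPSTREAM + request.rel_url.path_qs,
            data=await request.read(),
            headers={
                "Authorization": request.headers.get("Authorization", ""),
                "Content-Type": "application/json",
            },
        ) as upstream:
            return web.Response(status=upstream.status, body=await upstream.read())


app = web.Application()
app.add_routes([web.post("/{tail:.*}", forward)])
# web.run_app(app, port=8080)  # clients set their API base URL to this proxy
```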

> Thanks, glad you liked it! I'd be happy to have it adapted into an example. Is that something that you'd want your team to take care of? I can also mimic the existing style of the examples, write some docs, and make a PR if you'd like.

If you could create a PR that would be fantastic! I think it wants to live in its own directory and have a source file with the task definition, another source file that drives analysis/visualisation, and then of course a README.md. If you are able to do this we'd be incredibly grateful (and would feature your example prominently as an exemplar of more complex evals).

schmatz commented 2 weeks ago

As an update, I've refactored the eval into its own directory and am in the process of writing documentation. I hope to have a PR out around Wednesday.

schmatz commented 2 weeks ago

Apologies for the delay on this; currently targeting Sunday.

schmatz commented 1 week ago

I have a preliminary commit refactoring the eval into its own directory here (no documentation yet, apologies).

I have a few questions:

  1. I can either clone the deception paper's repo as a submodule or just include a subset of files. I know introducing submodules can be a minor pain sometimes for people using the repo - did you have a strong position either way on this? I'm considering just taking the prompt file and the repository license out of the original repo instead of including the entire thing.
  2. How do you want to handle the various dependencies of the eval? I've included a requirements.txt, but if there is another way you want me to handle it let me know.
  3. I've split up the eval into a few more files (task.py, solvers.py, scorer.py, dataset.py, etc.). Originally you had suggested putting it all in one file for the eval and one file for the visualization. I think it might be quite an intimidating file if those were combined, though I can refactor it into one file if you think that would be better (I can imagine the single-file approach might have some advantages).

I think after these questions and some testing I should be good from the code side and would just need to polish the documentation. Thank you for your help in getting this shipped!

aisi-inspect commented 1 week ago

> I have a preliminary commit refactoring the eval into its own directory here (no documentation yet, apologies).

Great to hear! Responses to your various queries are inline below:

> 1. I can either clone the deception paper's repo as a submodule or just include a subset of files. I know introducing submodules can be a minor pain sometimes for people using the repo - did you have a strong position either way on this? I'm considering just taking the prompt file and the repository license out of the original repo instead of including the entire thing.

Definitely not submodules (for the reasons you suggest). If the total size of the files is reasonable let's just embed them. In cases where datasets are large we will sometimes download them on demand in the read_dataset() function, e.g. see _ensure_data() here: https://github.com/UKGovernmentBEIS/inspect_ai/blob/main/examples/agents/intercode-ctf/dataset.py
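The pattern there is nothing fancy; simplified (and not the actual intercode-ctf code), it is roughly:

```python
# Simplified illustration of downloading data on demand the first time
# read_dataset() needs it (not the actual _ensure_data() implementation;
# the URL is a placeholder).
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = "https://example.com/eval-data.zip"  # placeholder
DATA_DIR = Path(__file__).parent / "data"


def _ensure_data() -> Path:
    if not DATA_DIR.exists():
        DATA_DIR.mkdir(parents=True)
        archive = DATA_DIR / "data.zip"
        urlretrieve(DATA_URL, archive)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(DATA_DIR)
        archive.unlink()
    return DATA_DIR
```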

> 2. How do you want to handle the various dependencies of the eval? I've included a requirements.txt, but if there is another way you want me to handle it let me know.

requirements.txt is the way we've been handling this, so yes, that sounds great.
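For example, something like this (the package list is hypothetical; include whatever the eval actually needs):

```
# requirements.txt (illustrative)
inspect_ai
openai
matplotlib
pandas
```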

> 3. I've split up the eval into a few more files (task.py, solvers.py, scorer.py, dataset.py, etc.). Originally you had suggested putting it all in one file for the eval and one file for the visualization. I think it might be quite an intimidating file if those were combined, though I can refactor it into one file if you think that would be better (I can imagine the single-file approach might have some advantages).

Agreed with your approach (all I really care about is that not everything ends up in one file; multiple files that make it less overwhelming and clearer what lives where are great).
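So something along these lines (names beyond the files you listed are just illustrative):

```
deception-under-pressure/
├── README.md
├── requirements.txt
├── dataset.py    # sample loading (downloading on demand if needed)
├── solvers.py    # solver definitions
├── scorer.py     # scorer definition
├── task.py       # the @task entry point
└── analysis.py   # analysis/visualisation of eval logs
```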