alan-turing-institute / prompto

An open source library for asynchronous querying of LLM endpoints
https://alan-turing-institute.github.io/prompto/
MIT License

Allow handling of different rate limits #35

Closed rchan26 closed 4 months ago

rchan26 commented 5 months ago

We can make the max_queries_per_minute argument the "default" limit and allow passing a dictionary whose keys are models/APIs and whose values are the corresponding rate limits
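A minimal sketch of that idea, assuming made-up limits and a hypothetical lookup helper (not prompto's actual API):

```python
# Hypothetical sketch: per-API/model rate limits with max_queries_per_minute as the fallback.
rate_limits = {
    "openai": 500,  # queries per minute for anything sent to the OpenAI API
    "gemini": 60,   # queries per minute for anything sent to the Gemini API
}
max_queries_per_minute = 10  # default used when an API/model has no entry above


def get_rate_limit(api_or_model: str) -> int:
    """Return the rate limit for an API/model, falling back to the default."""
    return rate_limits.get(api_or_model, max_queries_per_minute)


print(get_rate_limit("openai"))  # 500
print(get_rate_limit("ollama"))  # 10 (no entry, so the default applies)
```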

rchan26 commented 5 months ago

Maybe the easiest option is to allow an experiment file to have a corresponding experiment config file of some kind

rchan26 commented 4 months ago

I think we can do something similar to how we pass in judge 'settings' (see e.g. https://github.com/alan-turing-institute/prompto/blob/main/examples/data/data/judge/settings.json). We could allow passing a parallel rate limit settings JSON for the pipeline

Typically, the keys of the JSON/dictionary would just be the "api" name, each with a corresponding "rate_limit" key, but for #2 we might actually have settings where, for the same API (e.g. OpenAI), we want to hit different endpoints which can have different rate limits. At this point, the pipeline can handle querying different endpoints by first looking for environment variables with the model_name identifier appended (e.g. ENV_VAR_{model_name}).
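A rough sketch of that lookup pattern, where ENV_VAR is just a placeholder prefix as in the comment above:

```python
import os


# Sketch of the model-specific environment variable lookup described above;
# "ENV_VAR" stands in for whatever prefix the API in question actually uses.
def get_endpoint_env_var(prefix: str, model_name: str | None = None) -> str | None:
    """Prefer a model-specific variable (PREFIX_{model_name}), else fall back to PREFIX."""
    if model_name is not None:
        model_specific = os.environ.get(f"{prefix}_{model_name}")
        if model_specific is not None:
            return model_specific
    return os.environ.get(prefix)


# e.g. get_endpoint_env_var("ENV_VAR", "gpt3") checks ENV_VAR_gpt3 first, then ENV_VAR
```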

To fit both of these criteria, I think it should look like the judge settings, where we just have an identifier name that maps to a dictionary with "api", "model_name" (optional) and "rate_limit" keys. You can still pass in the max_queries_per_minute argument, but this would act as a default rate limit for anything not specified in the file (e.g. if the default rate limit is 10 and you have queries to send to both the Gemini and OpenAI APIs, but only specify OpenAI limits in the file, Gemini will just use the default 10).
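For illustration, a settings file along those lines might contain something like the following (group names and limits are made up):

```python
# Sketch of what the proposed rate limit settings JSON could contain,
# written as a Python dict for readability; group names and limits are arbitrary.
parallel_settings = {
    "openai-all": {"api": "openai", "rate_limit": 500},
    "gemini-specific-model": {"api": "gemini", "model_name": "model-a", "rate_limit": 60},
    # any API/model without an entry here falls back to max_queries_per_minute
}
```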

The "model_name" in key is optional if maybe you're just setting a rate limit for the whole API type. You might also be a setting where you want to set a default rate limit for the API type but then a different one for a specific model of that API type (e.g. {"azure-openai-gpt3.5-and-4": {"api": "azure-openai", "rate-limit": 20}, "azure-openai-gpt3": {"api": "azure-openai", "model-name": "gpt3", "rate-limit": 40}}). In this setting, we must have set the default azure-openai envrionment variables and model-specific environment variables with gpt3 tagged on the end to the environment variable names. If there were other model-specific env variables for azure-openai, they'd just be put in the same azure-openai queue. There would be queues generated for other APIs if the Settings.parallel attribute is set to True but they'd all use the default rate limit in this case.

In summary, the parallel rate limit settings JSON file specifies how we want to split up the APIs and models for parallel processing. If Settings.parallel is set to True, we would split the experiment file into different "api"s and process them in parallel async queues. Currently, the default parallel processing just splits the prompts by "api" type, but this would allow more granular control of the rate limits and of how to split the prompts within the same "api" type by their model names.
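The grouping and queuing could look roughly like this sketch, which is an assumption about the design rather than prompto's actual implementation (names and limits are made up):

```python
import asyncio

# Rough sketch, not prompto's actual implementation: assign each prompt to the most
# specific matching group, then run one rate-limited async worker per group.


def assign_group(prompt: dict, groups: dict) -> str | None:
    """An api + model_name match beats an api-only match; returns None if nothing matches."""
    best = None
    for name, cfg in groups.items():
        if cfg["api"] != prompt["api"]:
            continue
        if cfg.get("model_name") is None:
            best = best or name  # api-level group
        elif prompt.get("model_name") == cfg["model_name"]:
            return name  # model-specific group wins
    return best


async def run_group(name: str, prompts: list[dict], rate_limit: int) -> None:
    delay = 60 / rate_limit  # naive spacing to respect queries-per-minute
    for prompt in prompts:
        print(f"[{name}] sending {prompt['prompt']!r}")  # placeholder for the actual query
        await asyncio.sleep(delay)


async def main() -> None:
    groups = {
        "openai-default": {"api": "openai", "rate_limit": 10},
        "openai-gpt4": {"api": "openai", "model_name": "gpt4", "rate_limit": 5},
    }
    prompts = [
        {"api": "openai", "model_name": "gpt4", "prompt": "hello"},
        {"api": "openai", "model_name": "gpt3.5", "prompt": "hi"},
    ]
    buckets: dict[str, list[dict]] = {name: [] for name in groups}
    for p in prompts:
        buckets[assign_group(p, groups)].append(p)
    await asyncio.gather(
        *(run_group(name, ps, groups[name]["rate_limit"]) for name, ps in buckets.items())
    )


asyncio.run(main())
```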

rchan26 commented 4 months ago

It might actually make sense to allow the "model_name" key in this settings file to be a list of strings, to allow for a grouping of model names. In the case of Ollama, we do allow different endpoints to be declared by passing an env variable for the model name, e.g. OLLAMA_API_ENDPOINT_llama3. In this setting, you may want to send particular models to different endpoints, and to do this you could do something like {"ollama_endpoint_1": {"api": "ollama", "model_name": ["llama3", "llama2"], "rate_limit": 15}, "ollama_endpoint_2": {"api": "ollama", "model_name": ["gemma", "mistral"]}}

In terms of environment variables, you would need to ensure that the way you've set them corresponds to this grouping. In this setting, a mistake could lead to sending too much to the same endpoint. For example, if you specified OLLAMA_API_ENDPOINT and only OLLAMA_API_ENDPOINT_mistral, then you'd accidentally be sending the gemma queries to OLLAMA_API_ENDPOINT. You'd need to specify OLLAMA_API_ENDPOINT_gemma too (and set it equal to OLLAMA_API_ENDPOINT_mistral).
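To make that pitfall concrete, here's a small illustration of the fallback behaviour described above (the endpoint URLs are made up, and the helper is hypothetical rather than prompto's own code):

```python
import os

# gemma silently falls back to the generic endpoint because
# OLLAMA_API_ENDPOINT_gemma has not been set.
os.environ["OLLAMA_API_ENDPOINT"] = "http://host-1:11434"
os.environ["OLLAMA_API_ENDPOINT_mistral"] = "http://host-2:11434"


def resolve_ollama_endpoint(model_name: str) -> str | None:
    """Model-specific variable first, then the generic OLLAMA_API_ENDPOINT."""
    return os.environ.get(
        f"OLLAMA_API_ENDPOINT_{model_name}", os.environ.get("OLLAMA_API_ENDPOINT")
    )


print(resolve_ollama_endpoint("mistral"))  # http://host-2:11434
print(resolve_ollama_endpoint("gemma"))    # http://host-1:11434 -- not what was intended
```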

This could happen in cases where model_name is just a string too, so the user would need to ensure that the environment variables and the model names in the file are specified consistently, and that the way we construct the parallel queues matches up with how we want to process all the prompts in parallel.

rchan26 commented 4 months ago

Note that the drafting above is not what was eventually implemented. See the docs and the examples notebook on specifying rate limits for details of what was actually done.