Closed. rchan26 closed this 4 months ago.
Maybe the easiest approach is to allow an experiment file to have a corresponding experiment config file somehow.
I think we can do something similar to how we pass in judge 'settings' (see e.g. https://github.com/alan-turing-institute/prompto/blob/main/examples/data/data/judge/settings.json). We could allow passing a pipeline parallel rate-limit settings JSON.
Typically, we might just have the keys of the JSON/dictionary be the "api" name, with another "rate_limit" key. But for #2, we might actually have settings where, for the same API (e.g. OpenAI), we have different endpoints that we want to hit, which can have different rate limits. The pipeline can already handle querying different endpoints by first looking for environment variables with the model_name identifier (e.g. `ENV_VAR_{model_name}`).
In order to fit both of these criteria, I think it should look like the judge settings, where we just have an identifier name whose value is itself a dictionary with "api", "model_name" (optional) and "rate_limit" keys. You could still pass in the `max_queries_per_minute` argument, but it would act as a default rate limit when one isn't specified here (e.g. if the default rate limit is 10 and you have queries to send to both the Gemini and OpenAI APIs but only specify OpenAI API limits in the file, Gemini will just use the default 10).
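As a sketch of the fallback behaviour described above (the function name and settings shape are my own assumptions, not prompto's actual API):

```python
# Hypothetical sketch: resolve a rate limit for an API from the settings
# file, falling back to the max_queries_per_minute default when the API
# has no entry.

def resolve_rate_limit(api: str, settings: dict, default: int) -> int:
    """Return the rate limit for `api`, or `default` if unspecified."""
    for entry in settings.values():
        if entry.get("api") == api and "rate_limit" in entry:
            return entry["rate_limit"]
    return default

# Only the OpenAI API is given an explicit limit; Gemini falls back.
settings = {"openai-fast": {"api": "openai", "rate_limit": 50}}

resolve_rate_limit("openai", settings, default=10)  # -> 50
resolve_rate_limit("gemini", settings, default=10)  # -> 10 (default)
```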
The "model_name" key is optional if you're just setting a rate limit for the whole API type. You might also be in a setting where you want a default rate limit for the API type but a different one for a specific model of that type (e.g. `{"azure-openai-gpt3.5-and-4": {"api": "azure-openai", "rate-limit": 20}, "azure-openai-gpt3": {"api": "azure-openai", "model-name": "gpt3", "rate-limit": 40}}`). In this setting, we must have set the default `azure-openai` environment variables, plus model-specific environment variables with `gpt3` tagged onto the end of the variable names. If there were other model-specific env variables for `azure-openai`, they'd just be put in the same `azure-openai` queue. Queues would also be generated for other APIs if the `Settings.parallel` attribute is set to `True`, but they'd all use the default rate limit in this case.
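The "most specific entry wins" matching this implies could look something like the following (a sketch only; the entry names mirror the example above, and the matching logic is my assumption of how it might work):

```python
# Hypothetical sketch: route a prompt to a queue, preferring an entry
# matching both "api" and "model-name" over an API-level entry.

settings = {
    "azure-openai-gpt3.5-and-4": {"api": "azure-openai", "rate-limit": 20},
    "azure-openai-gpt3": {"api": "azure-openai", "model-name": "gpt3",
                          "rate-limit": 40},
}

def match_queue(api: str, model: str, settings: dict) -> str:
    api_level = None
    for name, entry in settings.items():
        if entry.get("api") != api:
            continue
        if entry.get("model-name") == model:
            return name          # model-specific match wins
        if "model-name" not in entry:
            api_level = name     # remember the API-level fallback
    return api_level

match_queue("azure-openai", "gpt3", settings)  # -> "azure-openai-gpt3"
match_queue("azure-openai", "gpt4", settings)  # -> "azure-openai-gpt3.5-and-4"
```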
In summary, the parallel rate-limit settings JSON file specifies how we want to split up the APIs and models for parallel processing. If `Settings.parallel` is set to `True`, we would split the experiment file by "api" and process each split in a parallel async queue. Currently, the default parallel processing just splits the prompts by "api" type; this would allow more granular control over their rate limits, and over how prompts within the same "api" type are split by their model names.
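A minimal sketch of the per-queue async processing described above (this is an illustration of the idea, not prompto's implementation; the function names and queue structure are assumptions):

```python
import asyncio

async def process_queue(name: str, prompts: list, rate_limit: int,
                        results: list):
    """Process one queue sequentially at its own queries-per-minute limit."""
    delay = 60 / rate_limit  # seconds between queries for this queue
    for prompt in prompts:
        results.append((name, prompt))  # stand-in for sending the query
        await asyncio.sleep(delay)

async def run_parallel(queues: dict) -> list:
    """Run all queues concurrently, each with its own rate limit."""
    results = []
    await asyncio.gather(
        *(process_queue(name, prompts, limit, results)
          for name, (prompts, limit) in queues.items())
    )
    return results

# Two queues processed concurrently, each with its own rate limit
# (unrealistically high limits here just to keep the demo fast).
queues = {
    "openai": (["p1", "p2"], 6000),
    "gemini": (["p3"], 6000),
}
results = asyncio.run(run_parallel(queues))
```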
It might actually make sense to allow the "model-name" key in this settings file to be a list of strings, to allow a grouping of model names. In the case of Ollama, we do allow different endpoints to be declared by passing an env variable for the model name, e.g. `OLLAMA_API_ENDPOINT_llama3`. In this setting, you may want to send particular models to different endpoints, and to do this you could do something like `{"ollama_endpoint_1": {"api": "ollama", "model_name": ["llama3", "llama2"], "rate-limit": 15}, "ollama_endpoint_2": {"api": "ollama", "model_name": ["gemma", "mistral"]}}`.
In terms of environment variables, you would need to ensure that the way you've set them corresponds to this, since a mistake could lead to sending too much to the same endpoint. For example, if you specified `OLLAMA_API_ENDPOINT` and only `OLLAMA_API_ENDPOINT_mistral`, you'd accidentally send the gemma prompts to `OLLAMA_API_ENDPOINT`. You'd need to specify `OLLAMA_API_ENDPOINT_gemma` too (and set it equal to `OLLAMA_API_ENDPOINT_mistral`). This could happen when `model-name` is just a string too, so the user would need to ensure that the environment variables and the model names in the file are specified consistently, and that the way we construct the parallel queues matches up with how we want to process all the prompts in parallel.
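The pitfall above comes from a model-specific lookup falling back to the generic env variable. A small sketch of that lookup (my assumption of the logic based on this thread, not prompto's actual code):

```python
import os

def resolve_endpoint(model: str):
    """Try OLLAMA_API_ENDPOINT_{model} first, then the generic fallback."""
    return os.environ.get(f"OLLAMA_API_ENDPOINT_{model}",
                          os.environ.get("OLLAMA_API_ENDPOINT"))

# Reproducing the mistake: only mistral gets a model-specific endpoint,
# so gemma silently falls back to the default endpoint.
os.environ["OLLAMA_API_ENDPOINT"] = "http://default:11434"
os.environ["OLLAMA_API_ENDPOINT_mistral"] = "http://endpoint2:11434"

resolve_endpoint("mistral")  # -> "http://endpoint2:11434"
resolve_endpoint("gemma")    # -> "http://default:11434" (accidental fallback)
```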
Note that the above drafting is not what was eventually implemented. See the docs and the examples notebook on specifying rate limits for details of what was actually done.
We can make the `max_queries_per_minute` argument the "default" limit and allow passing of a dictionary where keys are models/APIs and values are the corresponding rate limits.
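That simpler scheme reduces to a dictionary lookup with a default (one possible shape, sketched here as an assumption):

```python
# Hypothetical sketch: a flat dict mapping model/API names to rate limits,
# with max_queries_per_minute as the fallback default.

def get_rate_limit(key: str, rate_limits: dict,
                   max_queries_per_minute: int) -> int:
    return rate_limits.get(key, max_queries_per_minute)

rate_limits = {"openai": 50, "gemini": 20}
get_rate_limit("openai", rate_limits, max_queries_per_minute=10)  # -> 50
get_rate_limit("ollama", rate_limits, max_queries_per_minute=10)  # -> 10
```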