RyanMarten opened this issue 3 weeks ago
As an exercise: How much more headroom could we use before we hit the token limit?
953 + 299 = 1,252 tokens per request on average for answering questions in OpenHermes.
Rate limits for gpt-4o-mini: 30,000 RPM and 150,000,000 TPM.
Converted to seconds: 500 requests/sec and 2,500,000 tokens/sec.
Right now we are doing 500 QPS * 1,252 TPQ = 626,000 TPS
2,500,000 / 626,000 ≈ 3.99, so we could pack roughly 4 tasks into each request before hitting the token limit.
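The same arithmetic as a quick throwaway script (all numbers are the ones above):

TPQ = 953 + 299                 # avg tokens per request (OpenHermes)
RPS = 30_000 / 60               # 30,000 RPM -> 500 requests/sec
TPS_LIMIT = 150_000_000 / 60    # 150,000,000 TPM -> 2,500,000 tokens/sec

current_tps = RPS * TPQ         # 500 * 1,252 = 626,000 tokens/sec
print(f"headroom: {TPS_LIMIT / current_tps:.2f}x")  # headroom: 3.99x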
Sending 4 completions (generic requests) in a single API request could potentially speed up dataset generation by 4x.
Part of curator could be a "test" run that gathers stats like cost and tokens per query, then uses them to fine-tune this with suggested presets for maximum speed (and also checks that your parse_func is correct). Or, if we want to get really smart, update this automatically/dynamically at runtime. A rough sketch of the test-run idea is below.
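A minimal sketch, assuming tiktoken for token counting; estimate_tpq and the sample prompts are made-up names for illustration, not curator API:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the encoding gpt-4o-mini uses

def estimate_tpq(prompts, expected_completion_tokens=300):
    # average prompt tokens over a small sample, plus a completion-length guess
    avg_prompt = sum(len(enc.encode(p)) for p in prompts) / len(prompts)
    return avg_prompt + expected_completion_tokens

sample = ["Write me a poem", "Explain rate limits in one paragraph"]
tpq = estimate_tpq(sample)
print(f"~{tpq:.0f} tokens/query -> {500 * tpq:,.0f} TPS at 500 QPS")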
Correction - this works only with the /completions API, not the /chat/completions API.
For /chat/completions you can still get multiple responses, but you can only pass in a single prompt. This is done with the n parameter, so it is only useful when you have identical requests.
https://platform.openai.com/docs/api-reference/chat/create#chat-create-n
As a simple test, I ran the following:
import time

from openai import OpenAI

client = OpenAI()

for n in [1, 2, 4, 6, 8, 10, 20, 30, 40, 50, 60, 80, 100, 200]:
    start = time.time()
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": "Write me a poem",
            }
        ],
        model="gpt-4o-mini",
        n=n,  # number of completions to generate for this one prompt
    )
    end = time.time()
    assert len(chat_completion.choices) == n
    print(f"Time taken for {n} completions: {end - start} seconds")
Which results in:
Time taken for 1 completions: 3.4117958545684814 seconds
Time taken for 2 completions: 3.027738094329834 seconds
Time taken for 4 completions: 3.482429027557373 seconds
Time taken for 6 completions: 3.586395740509033 seconds
Time taken for 8 completions: 4.403304815292358 seconds
Time taken for 10 completions: 4.241966009140015 seconds
Time taken for 20 completions: 8.47451114654541 seconds
Time taken for 30 completions: 6.781132936477661 seconds
Time taken for 40 completions: 13.190962076187134 seconds
Time taken for 50 completions: 8.775229930877686 seconds
Time taken for 60 completions: 6.232266902923584 seconds
Time taken for 80 completions: 21.37916326522827 seconds
Time taken for 100 completions: 7.5753490924835205 seconds
And then it errors out at n=200, since the maximum is 128:
File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1058, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': "Invalid 'n': integer above maximum value. Expected a value <= 128, but got 200 instead.", 'type': 'invalid_request_error', 'param': 'n', 'code': 'integer_above_max_value'}}
This is different from the Batch API: these are still live requests, but they go to the /completions API with a list of prompts in each request (per the correction above, this does not work with /chat/completions).
Since we are hitting request limits instead of token limits, we can take advantage of this.
https://platform.openai.com/docs/guides/rate-limits#batching-requests
Example without batching:
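The guide's code didn't survive the copy, so here is a sketch in its spirit: one request per prompt. gpt-3.5-turbo-instruct is my stand-in for a model served by /completions:

from openai import OpenAI

client = OpenAI()

num_stories = 10
prompt = "Once upon a time,"

# one request per story: this burns 10 requests from the RPM budget
for _ in range(num_stories):
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=20,
    )
    print(prompt + response.choices[0].text)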
Example with batching:
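And a sketch of the batched version: /completions accepts a list of prompts in a single request, and each returned choice carries an index tying it back to its prompt (same stand-in model as above):

from openai import OpenAI

client = OpenAI()

num_stories = 10
prompts = ["Once upon a time,"] * num_stories

# one request for all 10 stories: a list of prompts in a single call
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompts,
    max_tokens=20,
)

# match completions back to their prompts via choice.index
stories = [""] * len(prompts)
for choice in response.choices:
    stories[choice.index] = prompts[choice.index] + choice.text

for story in stories:
    print(story)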