RyanMarten opened this issue 3 weeks ago
As an exercise: How much more headroom could we use before we hit the token limit?
953 + 299 = 1,252 tokens per request on average for answering questions in OpenHermes.
Rate limits for gpt-4o-mini: 30,000 RPM and 150,000,000 TPM.
Converted to seconds: 500 requests/sec and 2,500,000 tokens/sec.
Right now we are doing 500 QPS * 1,252 TPQ = 626,000 TPS
2,500,000 / 626,000 ≈ 3.99, so we could pack roughly 4 tasks into each request before hitting the token limit.
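The same arithmetic as a quick throwaway script (all numbers are the ones above):

TPQ = 953 + 299                 # avg tokens per request (OpenHermes)
RPS = 30_000 / 60               # 30,000 RPM -> 500 requests/sec
TPS_LIMIT = 150_000_000 / 60    # 150,000,000 TPM -> 2,500,000 tokens/sec

current_tps = RPS * TPQ         # 500 * 1,252 = 626,000 tokens/sec
print(f"headroom: {TPS_LIMIT / current_tps:.2f}x")  # headroom: 3.99x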
Sending 4 completions (generic requests) in a single API request could potentially speed up dataset generation by 4x.
Part of curator could be a "test" run that gathers stats like cost and tokens per query, then uses them to fine-tune this with suggested presets for maximum speed (and also checks that your parse_func is correct). Or, if we want to get really smart, update this automatically/dynamically at runtime. A rough sketch of the test-run idea is below.
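A minimal sketch, assuming tiktoken for token counting; estimate_tpq and the sample prompts are made-up names for illustration, not curator API:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the encoding gpt-4o-mini uses

def estimate_tpq(prompts, expected_completion_tokens=300):
    # average prompt tokens over a small sample, plus a completion-length guess
    avg_prompt = sum(len(enc.encode(p)) for p in prompts) / len(prompts)
    return avg_prompt + expected_completion_tokens

sample = ["Write me a poem", "Explain rate limits in one paragraph"]
tpq = estimate_tpq(sample)
print(f"~{tpq:.0f} tokens/query -> {500 * tpq:,.0f} TPS at 500 QPS")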
Correction - this works only with the /completions API, not the /chat/completions API.
For /chat/completions you can still get multiple responses, but you can only pass in a single prompt. This is done with the n parameter, so it is only useful when you have identical requests.
https://platform.openai.com/docs/api-reference/chat/create#chat-create-n
As a simple test, I ran the following:
import time

from openai import OpenAI

client = OpenAI()

for n in [1, 2, 4, 6, 8, 10, 20, 30, 40, 50, 60, 80, 100, 200]:
    start = time.time()
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": "Write me a poem",
            }
        ],
        model="gpt-4o-mini",
        n=n,  # number of completions to generate for this one prompt
    )
    end = time.time()
    assert len(chat_completion.choices) == n
    print(f"Time taken for {n} completions: {end - start} seconds")
Which results in:
Time taken for 1 completions: 3.4117958545684814 seconds
Time taken for 2 completions: 3.027738094329834 seconds
Time taken for 4 completions: 3.482429027557373 seconds
Time taken for 6 completions: 3.586395740509033 seconds
Time taken for 8 completions: 4.403304815292358 seconds
Time taken for 10 completions: 4.241966009140015 seconds
Time taken for 20 completions: 8.47451114654541 seconds
Time taken for 30 completions: 6.781132936477661 seconds
Time taken for 40 completions: 13.190962076187134 seconds
Time taken for 50 completions: 8.775229930877686 seconds
Time taken for 60 completions: 6.232266902923584 seconds
Time taken for 80 completions: 21.37916326522827 seconds
Time taken for 100 completions: 7.5753490924835205 seconds
And then it errors out at n=200, since the maximum is 128:
File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1058, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': "Invalid 'n': integer above maximum value. Expected a value <= 128, but got 200 instead.", 'type': 'invalid_request_error', 'param': 'n', 'code': 'integer_above_max_value'}}
This is different from the Batch API: these are still live requests, but they go to the /completions API with a list of prompts in each request (per the correction above, this does not work with /chat/completions).
Since we are hitting request limits instead of token limits, we can take advantage of this.
https://platform.openai.com/docs/guides/rate-limits#batching-requests
Example without batching:
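The guide's code didn't survive the copy, so here is a sketch in its spirit: one request per prompt. gpt-3.5-turbo-instruct is my stand-in for a model served by /completions:

from openai import OpenAI

client = OpenAI()

num_stories = 10
prompt = "Once upon a time,"

# one request per story: this burns 10 requests from the RPM budget
for _ in range(num_stories):
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=20,
    )
    print(prompt + response.choices[0].text)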
Example with batching:
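And a sketch of the batched version: /completions accepts a list of prompts in a single request, and each returned choice carries an index tying it back to its prompt (same stand-in model as above):

from openai import OpenAI

client = OpenAI()

num_stories = 10
prompts = ["Once upon a time,"] * num_stories

# one request for all 10 stories: a list of prompts in a single call
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompts,
    max_tokens=20,
)

# match completions back to their prompts via choice.index
stories = [""] * len(prompts)
for choice in response.choices:
    stories[choice.index] = prompts[choice.index] + choice.text

for story in stories:
    print(story)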