Pythagora-io / gpt-pilot

The first real AI developer

Rate limit reached for gpt-4 persists after waiting for a long time. #269

Open · Scratch-project opened this issue 11 months ago

Scratch-project commented 11 months ago

When it says "try again in 6ms", why doesn't it work when Pilot requests it again? I have waited for a couple of minutes and it is still not accepting the request. Actually, I've just opened Pilot again after 16 hours of not using the API at all, and this is the first thing I got back...

I thought maybe it's because when I launch Pilot with an existing program it loads it, and the program might be big enough to hit the rate limit just by loading it, but then I waited for more than 5 minutes and it still refuses. My MAX_TOKENS is also set to the default.

[Screenshot: the rate limit error returned by the API, 2023-10-22 17:54]

debug.log is already past line 4k; this same message started around line 2k.

UPDATE: after watching the logs I saw that it's making 4 requests to gpt-4 in 2 seconds, which will hit the limit every time it tries. What should I do? How can I make it do only one call, and why is it doing multiple calls at the same time? Maybe you should increase the wait time before the next request to 1 minute after a couple of failed tries? Since the rate limit for tokens is per minute, that would mean just one request per minute. Like this:

import openai
import backoff
from openai.error import RateLimitError

@backoff.on_exception(backoff.expo, RateLimitError)
def completions_with_backoff(**kwargs):
    response = openai.Completion.create(**kwargs)
    return response

This code is extracted from: "How can I solve 429: 'Too Many Requests' errors?"

I could try to solve it myself, but I can't really find the code responsible for this.
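For example, something like this would allow at most one attempt per minute after a rate-limit error (just a rough sketch I adapted from the snippet above, not gpt-pilot's actual code):

import openai
import backoff
from openai.error import RateLimitError

# Wait a fixed 60 seconds between retries, since the token limit is per minute.
@backoff.on_exception(backoff.constant, RateLimitError, interval=60, max_tries=5)
def completion_once_per_minute(**kwargs):
    return openai.ChatCompletion.create(**kwargs)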

[Screenshot: 2023-10-22 21:36]
Umpire2018 commented 11 months ago
  1. The log indicates that the error occurs at line 344 of pilot/utils/llm_connection.py.

In there you can see a custom decorator, retry_on_exception, that handles the retry; that's where the core retry statement lives (a generic sketch of the pattern is below, after this list).

I will leave the rest to you. Happy debugging :)

  2. The problem you have now is that the TPM (tokens per minute) limit has been reached; I think that's why wait-and-retry doesn't work. Maybe in this case we need to reduce the context length to solve it.

  3. Useful links: https://stackoverflow.com/questions/75859074/getting-ratelimiterror-while-implementing-openai-gpt-with-python and https://platform.openai.com/docs/guides/rate-limits/overview
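To illustrate the pattern only (a generic sketch, not the actual implementation in llm_connection.py):

import time
from functools import wraps
from openai.error import RateLimitError

def retry_on_rate_limit(func):
    # Generic sketch of a retry decorator: catch the rate-limit error,
    # sleep, and call the wrapped function again with a growing delay.
    @wraps(func)
    def wrapper(*args, **kwargs):
        delay = 1
        while True:
            try:
                return func(*args, **kwargs)
            except RateLimitError:
                time.sleep(delay)
                delay = min(delay * 2, 60)  # cap the wait at one minute
    return wrapper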

Scratch-project commented 11 months ago

Thank you for pointing out the problem. When you say "reduce context length", do you mean the MAX_TOKENS variable in the .env file? That is for the allowed output, but I don't know how to change the input length. I mean, this happens during the second development step, which means I can't control what is entering the model, can I?

I've tried increasing the sleep time to one minute after every failed request and reducing MAX_TOKENS to 1000, but the error is still exactly the same. Does that mean it is hitting the limit just from its input? Can I set a maximum for the allowed input tokens?
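From what I understand of the OpenAI rate limits (this is my assumption, not something I verified in gpt-pilot's code), MAX_TOKENS only caps the completion; the prompt tokens come on top of it, and both count toward the per-minute budget, roughly like this:

import tiktoken

MAX_TOKENS = 1000  # the completion cap from .env
messages = [{"role": "user", "content": "example prompt"}]  # whatever Pilot actually sends

enc = tiktoken.get_encoding("cl100k_base")
prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)

# Roughly what a single request counts against the tokens-per-minute limit:
request_tokens = prompt_tokens + MAX_TOKENS
print(request_tokens)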

Scratch-project commented 11 months ago

I'm trying to find the input size of tokens. In this function

from typing import List

import tiktoken

# Note: despite the List[str] annotation, each message is a dict with a 'content' key.
def get_tokens_in_messages(messages: List[str]) -> int:
    tokenizer = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer
    tokenized_messages = [tokenizer.encode(message['content']) for message in messages]
    return sum(len(tokens) for tokens in tokenized_messages)

what should messages be? When calling it, I don't know what to pass in.
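My assumption (not verified) is that it's the same chat-style list that gets sent to the API, i.e. dicts with "role" and "content" keys, so a call would look something like this:

# Uses get_tokens_in_messages from the function quoted above.
messages = [
    {"role": "system", "content": "You are a full-stack developer."},
    {"role": "user", "content": "Create the project skeleton."},
]
print(get_tokens_in_messages(messages))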

Scratch-project commented 11 months ago

> hey @Scratch-project, I'm having similar 429 issues using autopilot: fjrdomingues/autopilot#187 (comment)
>
> I have the same circumstances as you, as I haven't ever used the API, but I'm receiving rate-limiting messages. The same as you, you should not have sent any requests recently that would put you over the limit.
>
> Wonder if this is a wider issue? Something on openai's end?

I don't really know... I'm still struggling with it. I would say maybe the input message is too long?

Scratch-project commented 11 months ago

UPDATE:

By setting USE_GPTPILOT_FOLDER=true in my configuration and reviewing the output of each individual development step, I identified the root cause of the error. The final step was excessively long and ambitious, attempting to encapsulate the entire interface of the application in a single development step. The initial step simply created empty files, while the second step aimed to implement the whole interface at once, resulting in an extraordinarily large token count.

This raises a crucial question: How can I instruct the system to break down such extensive development steps into smaller, more manageable ones? When confronted with substantial tasks like these, what strategies can be employed to ensure a smoother development process?

chris-wickens commented 11 months ago

I'm seeing the rate_limit_exceeded error with a Flutter project. It looks like it attempts to send up summaries of a bunch of mostly boilerplate platform/build/icon files that aren't relevant to the project. Worth noting that since the error only appears in the logs and not in the console, this can lead to a lot of GPT credit being used up; I lost a dollar assuming it was just loading.

The18thWarrior commented 11 months ago

Has there been any movement on this issue? I'm finding that the pilot is attempting to send ~200k tokens to GPT for a single dev step. Could an error like this be detected and the current task then broken into smaller steps? I'm open to trying to implement it myself if I could get pointed in a direction.
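As a naive starting point (just a sketch of the general idea, not based on gpt-pilot's internals), the messages could be checked and split before the request ever goes out:

import tiktoken

TOKEN_BUDGET = 8000  # assumed per-request budget; the real number depends on the model

def split_messages(messages, budget=TOKEN_BUDGET):
    # Naive sketch: group messages into batches that each stay under the budget.
    enc = tiktoken.get_encoding("cl100k_base")
    batches, current, used = [], [], 0
    for msg in messages:
        n = len(enc.encode(msg["content"]))
        if current and used + n > budget:
            batches.append(current)
            current, used = [], 0
        current.append(msg)
        used += n
    if current:
        batches.append(current)
    return batches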

ishaan-jaff commented 10 months ago

@The18thWarrior @Scratch-project @chris-wickens @doougal @Umpire2018

I'm the maintainer of LiteLLM. We let you maximize your throughput and increase your effective rate limits by load balancing across multiple deployments (Azure, OpenAI). I believe LiteLLM can be helpful here, and I'd love your feedback if we're missing something.

Here's how to use it. Docs: https://docs.litellm.ai/docs/routing

import os

from litellm import Router

model_list = [{ # list of model deployments 
    "model_name": "gpt-3.5-turbo", # model alias 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-functioncalling", 
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ", 
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(model="gpt-3.5-turbo", 
                messages=[{"role": "user", "content": "Hey, how's it going?"}])

print(response)
jaredcat commented 8 months ago

Running into this myself. It's giving the context of the entire code base to GPT. I also wasted like $2 thinking it was loading, but it just keeps sending the same huge prompt over and over.