BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Bug]: randomly logs blow up with `router_cooldown_event_callback but _deployment is None for deployment_id=abc123. Doing nothing` #5933

Open jamesbraza opened 2 days ago

jamesbraza commented 2 days ago

What happened?

Randomly with litellm==1.48.2, my entire terminal gets filled with LiteLLM warnings about router_cooldown_event_callback.

I am not sure why it pops up, but it's undesirable because it prints pages and pages of the same warning.

Relevant log output

2024-09-27 00:40:56,140 - LiteLLM - WARNING - in router_cooldown_event_callback but _deployment is None for deployment_id=abc123. Doing nothing


krrishdholakia commented 2 days ago

@jamesbraza this is a warning indicating models aren't being put in cooldown.

Are you setting id for your models?

abc123

This doesn't look like a known model id from the model list; a repro of the error would help.
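
e.g. you should be able to set an explicit id via model_info so the deployment_id in those logs is recognizable. Something like this (sketch only; the "gpt-4o-primary" id is just an example value):

# Sketch: pin a human-readable deployment id so cooldown logs are attributable.
# Assumes Router keeps a caller-supplied model_info["id"] instead of generating a hash.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4o-2024-08-06",
            "litellm_params": {"model": "gpt-4o-2024-08-06", "temperature": 0.0},
            "model_info": {"id": "gpt-4o-primary"},  # intended to appear as deployment_id in logs
        }
    ]
)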

jamesbraza commented 2 days ago

Oh sorry, I redacted the UUID and replaced it with abc123; it was a huge UUID, not a specific model name. Should the deployment ID be a big UUID or a specific model name?

My Router instantiation is like so; let me know if I am missing something:

from litellm import Router

router = Router(
    model_list=[{
        "model_name": "gpt-4o-2024-08-06",
        "litellm_params": {"model": "gpt-4o-2024-08-06", "temperature": 0.0},
    }]
)
krrishdholakia commented 2 days ago

it was a huge UUID,

Yes, this is correct.

krrishdholakia commented 2 days ago

The warning is raised because the model with that uuid couldn't be found:

https://github.com/BerriAI/litellm/blob/25edb4ed652f10c4048ff6e6ae54812ea7506d2a/litellm/router_utils/cooldown_callbacks.py#L32
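
For reference, the check in that callback is roughly the following (paraphrased sketch; only the get_deployment lookup and the warning text come from the source/thread, the surrounding names are approximate):

import logging

logger = logging.getLogger("LiteLLM")  # stand-in for litellm's own logger

def sketch_of_cooldown_callback(litellm_router_instance, deployment_id):
    # Paraphrase of the check at the linked line: if the router can't find a
    # deployment with this id, it logs the warning and skips cooldown bookkeeping.
    _deployment = litellm_router_instance.get_deployment(model_id=deployment_id)
    if _deployment is None:
        logger.warning(
            "in router_cooldown_event_callback but _deployment is None for "
            f"deployment_id={deployment_id}. Doing nothing"
        )
        return
    # ...otherwise the deployment is put into cooldown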

jamesbraza commented 2 days ago

Okay, I hit this warning again just now. Directly above it in my logs was:

Traceback (most recent call last):
  File "/path/to/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 72, in map_httpcore_exceptions
    yield
  File "/path/to/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 377, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 216, in handle_async_request
    raise exc from None
  File "/path/to/.venv/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 196, in handle_async_request
    response = await connection.handle_async_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/httpcore/_async/connection.py", line 101, in handle_async_request
    return await self._connection.handle_async_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/httpcore/_async/http11.py", line 143, in handle_async_request
    raise exc
  File "/path/to/.venv/lib/python3.12/site-packages/httpcore/_async/http11.py", line 113, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/httpcore/_async/http11.py", line 186, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/httpcore/_async/http11.py", line 238, in _receive_event
    raise RemoteProtocolError(msg)
httpcore.RemoteProtocolError: Server disconnected without sending a response.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/path/to/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1554, in _request
    response = await self._client.send(
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1674, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1702, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1739, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1776, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 376, in handle_async_request
    with map_httpcore_exceptions():
  File "/path/to/.pyenv/versions/3.12.5/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/path/to/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 89, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.RemoteProtocolError: Server disconnected without sending a response.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/path/to/.venv/lib/python3.12/site-packages/litellm/llms/OpenAI/openai.py", line 944, in acompletion
    headers, response = await self.make_openai_chat_completion_request(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/litellm/llms/OpenAI/openai.py", line 639, in make_openai_chat_completion_request
    raise e
  File "/path/to/.venv/lib/python3.12/site-packages/litellm/llms/OpenAI/openai.py", line 627, in make_openai_chat_completion_request
    await openai_aclient.chat.completions.with_raw_response.create(
  File "/path/to/.venv/lib/python3.12/site-packages/openai/_legacy_response.py", line 370, in wrapped
    return cast(LegacyAPIResponse[R], await func(*args, **kwargs))
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/openai/resources/chat/completions.py", line 1412, in create
    return await self._post(
           ^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1821, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1515, in request
    return await self._request(
           ^^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1588, in _request
    raise APIConnectionError(request=request) from err
openai.APIConnectionError: Connection error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/.venv/lib/python3.12/site-packages/litellm/main.py", line 430, in acompletion
    response = await init_response
               ^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.12/site-packages/litellm/llms/OpenAI/openai.py", line 995, in acompletion
    raise OpenAIError(
litellm.llms.OpenAI.openai.OpenAIError: Connection error.

I don't understand how I am getting a different deployment ID; could there be a control flow issue within LiteLLM's Router here?

krrishdholakia commented 1 day ago

Our ids are a stable hash, so this would imply that something in the model init / params is changing.

https://github.com/BerriAI/litellm/blob/789ce6b7476cd5e92da81ed240005aca205c8d20/litellm/router.py#L3927

Are you calling set_model_list() in your code?

The only time the id is set is here:

https://github.com/BerriAI/litellm/blob/789ce6b7476cd5e92da81ed240005aca205c8d20/litellm/router.py#L4026
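
Conceptually the id is a hash of the model group plus its params, so if anything in litellm_params differs between what you configured and what is live at runtime, the id changes with it. Illustrative sketch only (not the actual litellm code; names are made up):

# Illustration of a "stable hash" id: deterministic for a fixed config,
# different as soon as any param differs. Not litellm's real implementation.
import hashlib
import json

def illustrative_model_id(model_name: str, litellm_params: dict) -> str:
    payload = json.dumps({"model_name": model_name, **litellm_params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()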

jamesbraza commented 1 day ago

No, I don't call set_model_list anywhere in my source.

I can also confirm that my Router's model_list makes sense:

In [1]: import litellm

In [2]: router = litellm.Router(
   ...:     model_list=[{
   ...:         "model_name": "gpt-4o-2024-08-06",
   ...:         "litellm_params": {"model": "gpt-4o-2024-08-06", "temperature": 0.0},
   ...:     }]
   ...: )

In [3]: router.model_list
Out[3]:
[{'model_name': 'gpt-4o-2024-08-06',
  'litellm_params': {'model': 'gpt-4o-2024-08-06', 'temperature': 0.0},
  'model_info': {'id': 'b622c96ddd11ae0ab2d5badac10abf2bb7977f0b5de7d310a0dc617035bb4e25',
   'db_model': False}}]

In [4]: router = litellm.Router(
   ...:     model_list=[{
   ...:         "model_name": "gpt-4-turbo-2024-04-09",
   ...:         "litellm_params": {"model": "gpt-4-turbo-2024-04-09", "temperature": 0.0},
   ...:     }]
   ...: )

In [5]: router.model_list
Out[5]:
[{'model_name': 'gpt-4-turbo-2024-04-09',
  'litellm_params': {'model': 'gpt-4-turbo-2024-04-09', 'temperature': 0.0},
  'model_info': {'id': '4334f1af9ccd959d655dfa5645b1dfa72e376eee3b9ec388e23ed70814a08f6f',
   'db_model': False}}]
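
As a sanity check on the "stable hash" point, rebuilding an identical Router should reproduce the same id if the hash depends only on the config (quick sketch; deepcopy is just to rule out shared-dict mutation):

# Sketch: rebuild the same config several times and collect the generated ids.
# A single-element set is expected if the id is a pure function of the config.
import copy

import litellm

config = {
    "model_name": "gpt-4o-2024-08-06",
    "litellm_params": {"model": "gpt-4o-2024-08-06", "temperature": 0.0},
}
ids = {
    litellm.Router(model_list=[copy.deepcopy(config)]).model_list[0]["model_info"]["id"]
    for _ in range(3)
}
print(ids)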

For context, gpt-4o-2024-08-06 is the inference model and gpt-4-turbo-2024-04-09 is a grader model (invoked after the inference model).

I suspect there is some race condition related to Router cooldown, though I am not sure whether it's in my own code or in LiteLLM's. The lookup that comes back empty is:

_deployment = litellm_router_instance.get_deployment(model_id=deployment_id)

Regardless, I don't understand why the deployment wouldn't exist: this error happened about 20 minutes into a run, so both models should have been invoked by then.
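
In the meantime, a stopgap I'm considering so the warning doesn't flood the terminal while I debug this (assuming the messages come through the standard "LiteLLM" logger, as the "LiteLLM - WARNING" prefix above suggests):

# Sketch: throttle the repeated cooldown warning via a logging filter,
# letting one through per interval so the signal isn't lost entirely.
import logging
import time

class CooldownWarningThrottle(logging.Filter):
    def __init__(self, interval_s: float = 60.0) -> None:
        super().__init__()
        self.interval_s = interval_s
        self._last_emit = 0.0

    def filter(self, record: logging.LogRecord) -> bool:
        if "router_cooldown_event_callback" not in record.getMessage():
            return True  # unrelated records pass through
        now = time.monotonic()
        if now - self._last_emit >= self.interval_s:
            self._last_emit = now
            return True
        return False

logging.getLogger("LiteLLM").addFilter(CooldownWarningThrottle())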