BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Bug]: vertex ai gemini 1.5 flash LiteLLM Proxy: Response cut off mid-sentence (sometimes in the middle of a word) #5964

Open JamDon2 opened 1 day ago

JamDon2 commented 1 day ago

What happened?

When using LiteLLM Proxy with streaming, the response often (around 20% of the time) gets cut off. The model was going to use a tool in that response, but the stream was cut off before it could.

I am using Vertex AI with Gemini 1.5 Flash. There is nothing in the logs, and no errors.

Relevant log output

No response

Twitter / LinkedIn details

No response

JamDon2 commented 1 day ago

(screenshot attached) Here is an example: I repeated the same prompt to see whether it would be cut off in the same place. The temperature is 0 for reproducibility, but the same thing happens with other values.

krrishdholakia commented 1 day ago

There is nothing in the logs, and no errors.

Did the stream just end? Can you try sharing an example with --detailed_debug enabled @JamDon2

iirc their stream sometimes changes and returns partial JSONs - https://github.com/BerriAI/litellm/blob/0d0f46a826c42f52db56bfdc4e0dbf6913652671/litellm/tests/test_streaming.py#L865

Perhaps this is related to that?
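
[Editor's note] A minimal sketch for reproducing this through the proxy, assuming the proxy is started with `litellm --config config.yaml --detailed_debug` (the flag mentioned above), runs at http://localhost:4000, exposes a model alias "gemini-1.5-flash", and accepts a placeholder key. Printing every chunk's delta and finish_reason client-side shows exactly where the stream terminates:

```python
# Hedged sketch: hostname, model alias, and API key are assumptions, not
# values from this issue. Stream a request via the OpenAI-compatible
# endpoint of the LiteLLM proxy and log each chunk as it arrives.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-placeholder")

stream = client.chat.completions.create(
    model="gemini-1.5-flash",
    messages=[{"role": "user", "content": "Write three sentences about LiteLLM."}],
    temperature=0,
    stream=True,
)

for chunk in stream:
    choice = chunk.choices[0]
    # A premature finish_reason="stop" right after a short delta would
    # confirm the truncation seen in the proxy's streaming output.
    print(repr(choice.delta.content), choice.finish_reason)
```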

JamDon2 commented 14 hours ago

I'm currently looking through the logs, and I see this error sometimes: ValueError: User doesn't exist in db. 'user_id'=admin. Create user via /user/new call.

It appears randomly, not when making a request, and the UI is not open.

This looks like the relevant part. So does this mean that the Vertex AI endpoint returned "I" and then stopped the completion?

INFO:     172.18.0.1:41794 - "POST /v1/chat/completions HTTP/1.1" 200 OK
10:26:04 - LiteLLM Proxy:DEBUG: proxy_server.py:2579 - async_data_generator: received streaming chunk - ModelResponse(id='chatcmpl-ID_REDACTED', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content='I', role='assistant', function_call=None, tool_calls=None), logprobs=None)], created=1727605564, model='gemini-1.5-flash', object='chat.completion.chunk', system_fingerprint=None)
10:26:04 - LiteLLM Proxy:DEBUG: proxy_server.py:2579 - async_data_generator: received streaming chunk - ModelResponse(id='chatcmpl-ID_REDACTED', choices=[StreamingChoices(finish_reason='stop', index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727605564, model='gemini-1.5-flash', object='chat.completion.chunk', system_fingerprint=None)
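
[Editor's note] One way to isolate which layer truncates the stream is to call the same model directly through the LiteLLM Python SDK, bypassing the proxy. This is a sketch under assumptions: Vertex AI credentials are already configured in the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS plus vertex project/location settings), and the prompt is a stand-in. If the direct call also ends after one short chunk with finish_reason "stop", the truncation originates upstream in Vertex AI rather than in the proxy's stream handling.

```python
# Hedged sketch: credentials and prompt are assumptions. Stream directly
# from Vertex AI via the litellm SDK and log every chunk's delta and
# finish_reason for comparison with the proxy logs above.
import litellm

response = litellm.completion(
    model="vertex_ai/gemini-1.5-flash",
    messages=[{"role": "user", "content": "Write three sentences about LiteLLM."}],
    temperature=0,
    stream=True,
)

for chunk in response:
    choice = chunk.choices[0]
    print(repr(choice.delta.content), choice.finish_reason)
```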
krrishdholakia commented 12 hours ago

Hmm, none of this explains why a stream would stop. Can you email me (krrish@berri.ai) the complete logs, or we can debug over a call? https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat