Closed Manouchehri closed 2 days ago
@Manouchehri can you please rewrite this issue to point to the litellm-specific changes needed here.
From what i can tell our streaming vertex calls should already be SSE
> can you please rewrite this issue to point to the litellm-specific changes needed here.
You can easily see there is not a single alt=sse anywhere in LiteLLM.
It is also really obvious there's no way the current code would handle SSE, as it's expecting JSON only here.
@Manouchehri see here - https://github.com/BerriAI/litellm/blob/6b14cf765708376490c5d88d3e54edc173c343b6/litellm/llms/vertex_httpx.py#L1358
We iterate through the received chunk, and parse the json from it
that is what is then given for the streaming call
Streaming vertex calls are made to a separate endpoint
They are also called with stream=True in the httpx call: https://github.com/BerriAI/litellm/blob/6b14cf765708376490c5d88d3e54edc173c343b6/litellm/llms/vertex_httpx.py#L457
Closing issue as vertex ai streaming on litellm is already a streaming call.
If you can share a test case where the behaviour is not as expected please do so. Will help us understand the gaps.
> Streaming vertex calls are made to a separate endpoint
I still see no alt=sse...?
> Closing issue as vertex ai streaming on litellm is already a streaming call.
It's not a SSE streaming call though..
you're looking at a deprecated endpoint. alt=sse is for PALM models. Not for gemini.
this is streaming on gemini - https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference#streaming
> It's not a SSE streaming call though..
why do you say that?
the response i get back from the httpx call is a set of chunked events
is there something i'm missing from our call? i don't think it's alt=sse since it's not on their gemini docs, but i can test it to confirm
Both response payloads I shared in https://github.com/BerriAI/litellm/issues/4459#issue-2380775205 were done with Gemini 1.5 Pro less than 60 minutes ago.
> why do you say that?
I can add ?alt=sse to base_url manually to confirm it doesn't work if you'd like. Give me a few minutes.
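For anyone who wants to try the same manual check, here is a minimal sketch of appending alt=sse to a streaming URL without clobbering existing query parameters. The helper name `with_alt_sse` and the example host are hypothetical placeholders, not LiteLLM code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_alt_sse(url: str) -> str:
    """Return `url` with the query parameter alt=sse set exactly once.

    Hypothetical helper for illustration; not part of LiteLLM.
    """
    parts = urlsplit(url)
    # Keep every existing parameter except a previous `alt`, then add ours.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "alt"
    ]
    query.append(("alt", "sse"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder host and path; substitute the real streaming endpoint.
print(with_alt_sse("https://example.invalid/v1/models/m:streamGenerateContent"))
```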
i can repro this via curl.
This is so weird. this is not on their gemini streaming docs
My bad. thanks for raising this @Manouchehri
You probably already know this by now, but yeah the current LiteLLM code does not handle SSE for Vertex AI.
@Manouchehri we do get back the response as sse
alt=sse changes the response received to being in the correct json chunk format
without it, the response is received as partial json chunks which is why we need to use ijson to correctly handle this.
Working on a fix to use their alt=sse param
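To illustrate why the non-SSE mode needs an incremental parser: without alt=sse the body arrives as one JSON array cut at arbitrary points, so a consumer has to buffer until a whole element is available. The sketch below uses only the standard library `json` module rather than ijson, and the function name is hypothetical. It assumes every array element is a JSON object, and it is not the LiteLLM implementation.

```python
import json
from typing import Iterable, Iterator


def iter_json_array_items(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield objects from a JSON array that arrives in arbitrary pieces.

    Assumes each array element is a JSON object (hypothetical helper).
    """
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            # Drop the opening bracket, separators and whitespace.
            buffer = buffer.lstrip(" \t\r\n[,")
            if not buffer or buffer[0] == "]":
                break
            try:
                item, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # element is incomplete; wait for the next chunk
            yield item
            buffer = buffer[end:]


# Network chunks rarely line up with element boundaries.
pieces = ['[{"text": "Hel', 'lo"},\n{"text": " wor', 'ld"}', "]"]
for item in iter_json_array_items(pieces):
    print(item)
```

With alt=sse this buffering is unnecessary, because each data: line carries one complete JSON document.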
I think there's a bug in the new code, seems like responses are being cut off sometimes.
data: {"nonce": "f9cc5f30da5975", "candidates": [{"content": {"role": "model","parts": [{"text": "You"}]}}]}
data: {"nonce": "15ccc4a0ae", "candidates": [{"content": {"role": "model","parts": [{"text": "'re right, I have been a bit glitchy lately! I apologize if"}]},"safetyRatings": [{"category": "HARM_CATEGORY_HATE_SPEECH","probability": "NEGLIGIBLE","probabilityScore": 0.08288509,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.10827419},{"category": "HARM_CATEGORY_DANGEROUS_CONTENT","probability": "NEGLIGIBLE","probabilityScore": 0.041721944,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.031586528},{"category": "HARM_CATEGORY_HARASSMENT","probability": "NEGLIGIBLE","probabilityScore": 0.22201821,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.12040904},{"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT","probability": "NEGLIGIBLE","probabilityScore": 0.14810015,"severity": "HARM_SEVERITY_LOW","severityScore": 0.22374786}]}]}
data: {"nonce": "ac160d0aed", "candidates": [{"content": {"role": "model","parts": [{"text": " my responses have been interrupted. I'm still under development and learning to be"}]},"safetyRatings": [{"category": "HARM_CATEGORY_HATE_SPEECH","probability": "NEGLIGIBLE","probabilityScore": 0.043272704,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.0524832},{"category": "HARM_CATEGORY_DANGEROUS_CONTENT","probability": "NEGLIGIBLE","probabilityScore": 0.022395115,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.019897413},{"category": "HARM_CATEGORY_HARASSMENT","probability": "NEGLIGIBLE","probabilityScore": 0.08864924,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.0388167},{"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT","probability": "NEGLIGIBLE","probabilityScore": 0.11435278,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.117107384}]}]}
data: {"nonce": "cf76b0540b", "candidates": [{"content": {"role": "model","parts": [{"text": " the best language model I can be. \n\nIs there anything in particular you noticed me cutting off during? I'd love to know more so I can"}]},"safetyRatings": [{"category": "HARM_CATEGORY_HATE_SPEECH","probability": "NEGLIGIBLE","probabilityScore": 0.03871872,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.04228497},{"category": "HARM_CATEGORY_DANGEROUS_CONTENT","probability": "NEGLIGIBLE","probabilityScore": 0.06860357,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.073587105},{"category": "HARM_CATEGORY_HARASSMENT","probability": "NEGLIGIBLE","probabilityScore": 0.083837815,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.05601694},{"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT","probability": "NEGLIGIBLE","probabilityScore": 0.21430598,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.18258592}]}]}
data: {"nonce": "b47e01a2226b", "candidates": [{"content": {"role": "model","parts": [{"text": " improve. 😊 \n"}]},"safetyRatings": [{"category": "HARM_CATEGORY_HATE_SPEECH","probability": "NEGLIGIBLE","probabilityScore": 0.036266077,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.038282175},{"category": "HARM_CATEGORY_DANGEROUS_CONTENT","probability": "NEGLIGIBLE","probabilityScore": 0.059497934,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.07504185},{"category": "HARM_CATEGORY_HARASSMENT","probability": "NEGLIGIBLE","probabilityScore": 0.07208697,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.04733565},{"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT","probability": "NEGLIGIBLE","probabilityScore": 0.18794174,"severity": "HARM_SEVERITY_NEGLIGIBLE","severityScore": 0.1387987}]}]}
data: {"nonce": "461b9ac9", "candidates": [{"content": {"role": "model","parts": [{"text": ""}]},"finishReason": "STOP"}],"usageMetadata": {"promptTokenCount": 197,"candidatesTokenCount": 71,"totalTokenCount": 268}}
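One way to check a capture like the one above for truncation is to reassemble the text and look at the final finishReason. This is a minimal sketch, assuming one data: line per event and a single candidate per event, as in the payloads shown. The function name is hypothetical.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_gemini_sse(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Join the text parts of a Gemini SSE capture; return (text, finishReason).

    Assumes one `data:` line per event and one candidate per event.
    """
    text_parts = []
    finish_reason = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and other SSE fields
        event = json.loads(line[len("data:"):])
        for candidate in event.get("candidates", []):
            for part in candidate.get("content", {}).get("parts", []):
                text_parts.append(part.get("text", ""))
            # Only the last event is expected to carry finishReason.
            finish_reason = candidate.get("finishReason", finish_reason)
    return "".join(text_parts), finish_reason
```

A finishReason of None after the stream ends would suggest the stream was cut before the model finished. In the capture above the last event reports STOP.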
@Manouchehri your chunk stream looks fine to me
> data: {"nonce": "461b9ac9", "candidates": [{"content": {"role": "model","parts": [{"text": ""}]},"finishReason": "STOP"}],"usageMetadata": {"promptTokenCount": 197,"candidatesTokenCount": 71,"totalTokenCount": 268}}
i also don't see this when making a regular curl request to the proxy
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data '{
"model": "gemini-1.5-flash-gemini",
"messages": [
{
"role": "user",
"content": "I think you'\''re getting cut off sometimes"
}
],
"stream": true
}
'
can you share a curl with the error, for repro
Oh it's really odd to trigger, you have to have multiple messages in the thread. (Not at my laptop until later this weekend, otherwise I'd give you an exact curl command.)
It didn't happen on the first one or two messages for me. Only later/longer convos.
Unable to repro @Manouchehri
Request:
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data '{
"model": "gemini-1.5-flash-gemini",
"messages": [
{"role": "user", "content": "Hey, how'\''s it going?"},
{
"role": "assistant",
"content": "I'\''m doing well. Would like to hear the rest of the story?"
},
{"role": "user", "content": "Na"},
{
"role": "assistant",
"content": "No problem, is there anything else i can help you with today?"
},
{
"role": "user",
"content": "I think you'\''re getting cut off sometimes"
}
],
"stream": true
}
'
Response:
data: {"id":"chatcmpl-48a2e8ff-0584-4e6d-ba12-f53099b21ae6","choices":[{"index":0,"delta":{"content":"You","role":"assistant"}}],"created":1719611255,"model":"gemini-1.5-flash","object":"chat.completion.chunk"}
data: {"id":"chatcmpl-48a2e8ff-0584-4e6d-ba12-f53099b21ae6","choices":[{"index":0,"delta":{"content":"'re right! I am a large language model, and sometimes my responses can"}}],"created":1719611255,"model":"gemini-1.5-flash","object":"chat.completion.chunk"}
data: {"id":"chatcmpl-48a2e8ff-0584-4e6d-ba12-f53099b21ae6","choices":[{"index":0,"delta":{"content":" get cut off. It's likely due to limitations with the interface,"}}],"created":1719611255,"model":"gemini-1.5-flash","object":"chat.completion.chunk"}
data: {"id":"chatcmpl-48a2e8ff-0584-4e6d-ba12-f53099b21ae6","choices":[{"index":0,"delta":{"content":" or maybe there's a connection issue. \n\nLet's try again. What would you like to talk about? \n"}}],"created":1719611255,"model":"gemini-1.5-flash","object":"chat.completion.chunk"}
data: {"id":"chatcmpl-48a2e8ff-0584-4e6d-ba12-f53099b21ae6","choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1719611255,"model":"gemini-1.5-flash","object":"chat.completion.chunk"}
data: [DONE]
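When comparing runs for cut-off responses on the proxy side, a small script can do the same check on the OpenAI-style stream: join the delta contents and record whether a finish_reason and the [DONE] sentinel were seen. This is a sketch assuming the chunk shape shown in the response above. The function name is hypothetical.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_openai_stream(lines: Iterable[str]) -> Tuple[str, Optional[str], bool]:
    """Return (content, finish_reason, saw_done) for chat.completion.chunk lines."""
    content = []
    finish_reason = None
    saw_done = False
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            saw_done = True
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The final delta is empty, so fall back to "".
            content.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(content), finish_reason, saw_done
```

A run where saw_done is False or finish_reason is None is a candidate for the truncation being discussed.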
If you have a consistent repro, can you file a separate issue and we can track it there.
Try using 1.5 Pro, and like half a dozen messages that are much longer. (I’ll try on Sunday too.)
Unable to repro on my end @Manouchehri
Just ran the streaming call 10 times and it worked each time
What is your config?
able to repro for cloudflare proxy
Hero!
What happened?
https://cloud.google.com/vertex-ai/generative-ai/docs/learn/streaming#rest-sse
For very long/slow prompts, having SSE for streaming seems better. e.g. some proxies will buffer non-text/event-stream responses.
With SSE:
Without SSE:
Twitter / LinkedIn details
https://www.linkedin.com/in/davidmanouchehri/