BerriAI / litellm

Call all LLM APIs using the OpenAI format. Use Bedrock, Azure, OpenAI, Cohere, Anthropic, Ollama, Sagemaker, HuggingFace, Replicate (100+ LLMs)
https://docs.litellm.ai/docs/

[Bug]: Gemini streaming chunks can only receive one chunk #4339

Closed · jh10001 closed this 6 days ago

jh10001 commented 1 week ago

What happened?

I'm trying to use the Gemini API, but when I use stream=True I only receive one content chunk followed by one blank chunk, and then the stream ends.

from litellm import completion

messages = [{'role': 'user', 'content': '\n\n\n--------\nWrite a story'}]
resp = completion(model='gemini/gemini-1.5-flash', messages=messages, stream=True)
for chunk in resp:
    print(chunk)

If I use stream=False, everything works correctly.
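For comparison, here is a minimal non-streaming sketch of the call that works (same model and messages; only stream is changed):

from litellm import completion

messages = [{'role': 'user', 'content': '\n\n\n--------\nWrite a story'}]

# Non-streaming: the whole completion arrives in a single ModelResponse
resp = completion(model='gemini/gemini-1.5-flash', messages=messages, stream=False)
print(resp.choices[0].message.content)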

Relevant log output

Request to litellm:
litellm.completion(messages=[{'role': 'user', 'content': '\n\n\n--------\nWrite a story'}], stream=True, model='gemini/gemini-1.5-flash')

SYNC kwargs[caching]: False; litellm.cache: None; kwargs.get('cache')['no-cache']: False
Final returned optional params: {'stream': True}

POST Request Sent from LiteLLM:
curl -X POST \
https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:streamGenerateContent?key=*********************** \
-H 'Content-Type: *****' \
-d '{'contents': [{'role': 'user', 'parts': [{'text': '\n\n\n--------\nWrite a story'}]}], 'generationConfig': {}}'

06:03:05 - LiteLLM:WARNING: utils.py:338 - `litellm.set_verbose` is deprecated. Please set `os.environ['LITELLM_LOG'] = 'DEBUG'` for debug logs.
PROCESSED CHUNK PRE CHUNK CREATOR: {'text': 'The', 'tool_use': None, 'is_finished': True, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 7, 'completion_tokens': 1, 'total_tokens': 8}, 'index': 0}; custom_llm_provider: vertex_ai_beta
model_response finish reason 3: stop; response_obj={'text': 'The', 'tool_use': None, 'is_finished': True, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 7, 'completion_tokens': 1, 'total_tokens': 8}, 'index': 0}
model_response.choices[0].delta: Delta(content=None, role=None, function_call=None, tool_calls=None); completion_obj: {'content': 'The'}
self.sent_first_chunk: False
hold - False, model_response_str - The
returning model_response: ModelResponse(id='chatcmpl-42ffe201-f30e-46c7-bf42-889332cad079', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content='The', role='assistant', function_call=None, tool_calls=None), logprobs=None)], created=1719007386, model='gemini-1.5-flash', object='chat.completion.chunk', system_fingerprint=None)
PROCESSED CHUNK POST CHUNK CREATOR: ModelResponse(id='chatcmpl-42ffe201-f30e-46c7-bf42-889332cad079', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content='The', role='assistant', function_call=None, tool_calls=None), logprobs=None)], created=1719007386, model='gemini-1.5-flash', object='chat.completion.chunk', system_fingerprint=None)
Logging Details LiteLLM-Async Success Call
PROCESSED CHUNK PRE CHUNK CREATOR: {'text': ' attic. Amelia, her fingers tracing the worn spines of forgotten books, shivered.', 'tool_use': None, 'is_finished': True, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 7, 'completion_tokens': 34, 'total_tokens': 41}, 'index': 0}; custom_llm_provider: vertex_ai_beta
Goes into checking if chunk has hiddden created at param
Chunks have a created at hidden param
Chunks sorted
token_counter messages received: [{'role': 'user', 'content': '\n\n\n--------\nWrite a story'}]
Token Counter - using generic token counter, for model=gemini-1.5-flash
LiteLLM: Utils - Counting tokens for OpenAI model=gpt-3.5-turbo
Token Counter - using generic token counter, for model=gemini-1.5-flash
LiteLLM: Utils - Counting tokens for OpenAI model=gpt-3.5-turbo
Looking up model=gemini/gemini-1.5-flash in model_cost_map
final cost: 5.25e-06; prompt_tokens_cost_usd_dollar: 4.2e-06; completion_tokens_cost_usd_dollar: 1.05e-06
PROCESSED CHUNK PRE CHUNK CREATOR: {'text': ' a treasure in the attic, Amelia, waiting for the right person to find it."\n\nNow, with a faded photograph of her grandmother clutched in her hand, Amelia felt an inexplicable urge to discover the truth. The photograph, taken in a', 'tool_use': None, 'is_finished': True, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 7, 'completion_tokens': 146, 'total_tokens': 153}, 'index': 0}; custom_llm_provider: vertex_ai_beta


hbqdev commented 1 week ago

I'm running into this issue as well with open-webui.

hbqdev commented 1 week ago

I tried setting stream: False in the config.yaml, but that didn't fix the issue either.
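For reference, the proxy config I mean is roughly this shape (a sketch; the model_name alias and environment-variable name are placeholders, and I'm assuming stream under litellm_params is just forwarded to the underlying completion call):

model_list:
  - model_name: gemini-flash                 # placeholder alias
    litellm_params:
      model: gemini/gemini-1.5-flash
      stream: False                          # the setting that didn't help
      api_key: os.environ/GEMINI_API_KEY     # assumed env var name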

Manouchehri commented 1 week ago

Oh, that's really interesting... I ran into this as well with Vertex AI, but I assumed it was because I was using Cloudflare AI Gateway.

Manouchehri commented 1 week ago

Here's an example response:

[
  {
    "candidates": [
      {
        "content": {
          "role": "model",
          "parts": [
            {
              "text": "Not"
            }
          ]
        }
      }
    ]
  },
  {
    "candidates": [
      {
        "content": {
          "role": "model",
          "parts": [
            {
              "text": " much, just hanging out in the digital world, waiting for someone to ask me"
            }
          ]
        },
        "safetyRatings": [
          {
            "category": "HARM_CATEGORY_HATE_SPEECH",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.037538007,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.020606477
          },
          {
            "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.12995382,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.059210256
          },
          {
            "category": "HARM_CATEGORY_HARASSMENT",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.08093671,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.025323061
          },
          {
            "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.4208971,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.12410682
          }
        ]
      }
    ]
  },
  {
    "candidates": [
      {
        "content": {
          "role": "model",
          "parts": [
            {
              "text": " a question or give me a task! What about you? What's going"
            }
          ]
        },
        "safetyRatings": [
          {
            "category": "HARM_CATEGORY_HATE_SPEECH",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.053107906,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.027014788
          },
          {
            "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.11757213,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.053899158
          },
          {
            "category": "HARM_CATEGORY_HARASSMENT",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.10266401,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.03567855
          },
          {
            "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.27825677,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.10743748
          }
        ]
      }
    ]
  },
  {
    "candidates": [
      {
        "content": {
          "role": "model",
          "parts": [
            {
              "text": " on in your world today? 😊 \n"
            }
          ]
        },
        "safetyRatings": [
          {
            "category": "HARM_CATEGORY_HATE_SPEECH",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.04216654,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.027066175
          },
          {
            "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.086632065,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.07172113
          },
          {
            "category": "HARM_CATEGORY_HARASSMENT",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.07696084,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.03581319
          },
          {
            "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
            "probability": "NEGLIGIBLE",
            "probabilityScore": 0.26246357,
            "severity": "HARM_SEVERITY_NEGLIGIBLE",
            "severityScore": 0.083441176
          }
        ]
      }
    ]
  },
  {
    "candidates": [
      {
        "content": {
          "role": "model",
          "parts": [
            {
              "text": ""
            }
          ]
        },
        "finishReason": "STOP"
      }
    ],
    "usageMetadata": {
      "promptTokenCount": 6,
      "candidatesTokenCount": 42,
      "totalTokenCount": 48
    }
  }
]
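For anyone who wants to inspect the raw provider stream themselves, a sketch like the following mirrors the request LiteLLM logs above (the GEMINI_API_KEY env var name and the alt=sse parameter are assumptions on my part; without alt=sse the endpoint streams one JSON array like the example above):

import json, os, requests

url = ('https://generativelanguage.googleapis.com/v1beta/models/'
       'gemini-1.5-flash:streamGenerateContent')
body = {'contents': [{'role': 'user', 'parts': [{'text': 'Write a story'}]}],
        'generationConfig': {}}

# Ask for server-sent events so each chunk arrives as its own 'data:' line
resp = requests.post(url, params={'key': os.environ['GEMINI_API_KEY'], 'alt': 'sse'},
                     json=body, stream=True)
for line in resp.iter_lines():
    if line.startswith(b'data: '):
        chunk = json.loads(line[len(b'data: '):])
        candidate = chunk['candidates'][0]
        parts = candidate.get('content', {}).get('parts', [])
        text = ''.join(p.get('text', '') for p in parts)
        print(repr(text), candidate.get('finishReason'))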

krrishdholakia commented 6 days ago

looking into this.

thanks @jh10001

krrishdholakia commented 6 days ago

able to repro. That's really weird.

krrishdholakia commented 6 days ago

So it looks like Google returns a finishReason on every chunk:

processed_chunk: {'candidates': [{'content': {'parts': [{'text': 'A'}], 'role': 'model'}, 'finishReason': 'STOP', 'index': 0}], 'usageMetadata': {'promptTokenCount': 18, 'candidatesTokenCount': 1, 'totalTokenCount': 19}}
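That would explain the early stop: if the stream wrapper treats the finish_reason on the very first chunk as "the stream is done", it closes after one piece of text. A minimal sketch of the workaround idea (not LiteLLM's actual implementation; the chunk dicts follow the processed_chunk shape above) is to hold the finish reason back until the provider iterator is exhausted:

def remap_gemini_stream(chunks):
    # chunks: iterable of parsed Gemini chunks shaped like processed_chunk above.
    # Gemini can set finishReason on every chunk, so never end the stream on it;
    # only surface it once the underlying iterator has actually stopped.
    last_finish_reason = None
    for chunk in chunks:
        candidate = chunk['candidates'][0]
        parts = candidate.get('content', {}).get('parts', [])
        text = ''.join(part.get('text', '') for part in parts)
        last_finish_reason = candidate.get('finishReason', last_finish_reason)
        if text:
            yield {'delta': text, 'finish_reason': None}
    # Emit the finish reason only after the provider has stopped sending chunks.
    yield {'delta': '', 'finish_reason': (last_finish_reason or 'STOP').lower()}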