BerriAI / litellm

Python SDK, Proxy Server to call 100+ LLM APIs using the OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Bug]: Strange Behavior in HuggingChat (Chat-UI) #1222

Open gururise opened 8 months ago

gururise commented 8 months ago

What happened?

When using LiteLLM as a proxy for Together.ai with Mistral-7B-Instruct-v0.2, two strange issues occur during inference when using Hugging Face's Chat-UI frontend:

  1. The text is displayed in large chunks rather than streamed to the UI word by word
  2. The formatting is not respected: newlines are ignored (see attached screenshot).

As seen in the attached screenshot, when using gpt-3.5-turbo the formatting is fine and word-by-word streaming works as expected.

Here is the MODELS entry from .env.local for the LiteLLM proxy:

MODELS=`[
    {
      "name": "mistral-7b",
      "displayName": "mistralai/Mistral-7B-Instruct-v0.2",
      "description": "Mistral 7B is a new Apache 2.0 model, released by Mistral AI that outperforms Llama2 13B in benchmarks.",
      "websiteUrl": "https://mistral.ai/news/announcing-mistral-7b/",
      "preprompt": "",
      "chatPromptTemplate" : "<s>{{#each messages}}{{#ifUser}}[INST] {{#if @first}}{{#if @root.preprompt}}{{@root.preprompt}}\n{{/if}}{{/if}}{{content}} [/INST]{{/ifUser}}{{#ifAssistant}}{{content}}</s>{{/ifAssistant}}{{/each}}",
      "parameters": {
        "temperature": 0.1,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "top_k": 50,
        "truncate": 3072,
        "max_new_tokens": 1024,
        "stop": ["</s>"]
      },
      "endpoints": [{
        "type" : "openai",
        "baseURL": "http://localhost:8000/v1"
      }],
      "promptExamples": [
        {
          "title": "Write an email from bullet list",
          "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
        }, {
          "title": "Code a snake game",
          "prompt": "Code a basic snake game in python, give explanations for each step."
        }, {
          "title": "Assist in a task",
          "prompt": "How do I make a delicious lemon cheesecake?"
        }
      ]
    }
]`

To set up the gpt-3.5-turbo model:

MODELS=`[
    {
      "name": "gpt-3.5-turbo",
      "displayName": "GPT 3.5 Turbo",
      "endpoints" : [{
        "type": "openai"
      }],
      "promptExamples": [
        {
          "title": "Write an email from bullet list",
          "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
        }, {
          "title": "Code a snake game",
          "prompt": "Code a basic snake game in python, give explanations for each step."
        }, {
          "title": "Assist in a task",
          "prompt": "How do I make a delicious lemon cheesecake?"
        }
      ]
    }
]`

Relevant log output

No response


krrishdholakia commented 8 months ago

Hey @gururise do we know if the large chunk yielding is happening on together ai's side?

Re: newline, what's a fix for this? I believe this is part of the string being returned by togetherai

gururise commented 8 months ago

EDIT: I think I've confirmed there is something wrong (or at least different) with the together_ai implementation. If I use openai as the LLM provider behind the LiteLLM proxy, the application works as expected, but if I switch the provider to together_ai, it does not.

> Hey @gururise do we know if the large chunk yielding is happening on together ai's side?

When I run litellm in debug mode, I can see the tokens being streamed individually.

> Re: newline, what's a fix for this? I believe this is part of the string being returned by togetherai

Looking at the debug log when using together_ai, the newlines are escaped. Any ideas why? Is this something LiteLLM is doing?

Here is a snippet of the debug log when I am using together_ai (notice the newline towards the end is escaped):

_reason=None, index=0, delta=Delta(content=' Number', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
success callbacks: [<litellm.proxy.hooks.parallel_request_limiter.MaxParallelRequestsHandler object at 0x7f552a323fd0>]
returned chunk: ModelResponse(id='chatcmpl-dcffe220-a020-4eab-80df-df0214839ccb', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' Number', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
value of chunk: b'' 
value of chunk: b'data: {"choices":[{"text":"]"}],"id":"8399e7011ff32f77-LAX","token":{"id":28793,"text":"]","logprob":-0.0066871643,"special":false},"generated_text":null,"details":null,"stats":null,"usage":null}' 
PROCESSED CHUNK PRE CHUNK CREATOR: b'data: {"choices":[{"text":"]"}],"id":"8399e7011ff32f77-LAX","token":{"id":28793,"text":"]","logprob":-0.0066871643,"special":false},"generated_text":null,"details":null,"stats":null,"usage":null}'
model_response: ModelResponse(id='chatcmpl-b4739587-4105-4f86-9e8b-dabafae4645b', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage()); completion_obj: {'content': ']'}
model_response finish reason 3: None
hold - False, model_response_str - ]
model_response: ModelResponse(id='chatcmpl-b4739587-4105-4f86-9e8b-dabafae4645b', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=']', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
PROCESSED CHUNK POST CHUNK CREATOR: ModelResponse(id='chatcmpl-b4739587-4105-4f86-9e8b-dabafae4645b', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=']', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
Logging Details LiteLLM-Success Call
success callbacks: [<litellm.proxy.hooks.parallel_request_limiter.MaxParallelRequestsHandler object at 0x7f552a323fd0>]
line in async streaming: ModelResponse(id='chatcmpl-b4739587-4105-4f86-9e8b-dabafae4645b', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=']', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
returned chunk: ModelResponse(id='chatcmpl-b4739587-4105-4f86-9e8b-dabafae4645b', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=']', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
value of chunk: b'' 
value of chunk: b'data: {"choices":[{"text":"\\n"}],"id":"8399e7011ff32f77-LAX","token":{"id":13,"text":"\\n","logprob":-0.000019788742,"special":false},"generated_text":null,"details":null,"stats":null,"usage":null}' 
PROCESSED CHUNK PRE CHUNK CREATOR: b'data: {"choices":[{"text":"\\n"}],"id":"8399e7011ff32f77-LAX","token":{"id":13,"text":"\\n","logprob":-0.000019788742,"special":false},"generated_text":null,"details":null,"stats":null,"usage":null}'
model_response: ModelResponse(id='chatcmpl-5745b053-586c-4bb2-be7c-9de42c721c31', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage()); completion_obj: {'content': '\\n'}
model_response finish reason 3: None
hold - False, model_response_str - \n
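
To make the escaping concrete, here is a minimal sketch (not LiteLLM's actual parsing code; the payload is trimmed from the log above). If the SSE data line is picked apart with string operations instead of being JSON-decoded, the \n escape sequence survives as two literal characters, which matches the completion_obj content '\\n' shown in the log:

```python
import json

# Raw SSE line as it appears in the debug log (trimmed); the bytes contain a
# backslash followed by "n", i.e. the JSON escape sequence for a newline.
raw = b'data: {"choices":[{"text":"\\n"}],"id":"8399e7011ff32f77-LAX"}'
payload = raw.decode().removeprefix("data: ")

# Hypothetical buggy handling: slicing the text field out of the raw string
# keeps the escape sequence, so downstream clients receive "\" + "n".
start = payload.index('"text":"') + len('"text":"')
end = payload.index('"', start)
print(repr(payload[start:end]))                         # '\\n'  (two characters)

# Correct handling: json.loads decodes the escape into a real newline.
print(repr(json.loads(payload)["choices"][0]["text"]))  # '\n'   (one newline char)
```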

Alright, I think I have confirmed it has something to do with the way LiteLLM is handling TogetherAI. When I continue to use the LiteLLM proxy but switch the provider to openai (gpt-3.5-turbo), everything works exactly as expected: streaming occurs token by token and the output is parsed correctly.

Testing LiteLLM proxy using OpenAI (gpt-3.5-turbo): (screenshot attached)

Snippet of the debug log with openai as the provider:

completion obj content: Restaurant
model_response: ModelResponse(id='chatcmpl-a0cb7cb8-2691-4760-9ccb-3f7f438e2cfe', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None))], created=1703264355, model='gpt-3.5-turbo', object='chat.completion.chunk', system_fingerprint=None, usage=Usage()); completion_obj: {'content': 'Restaurant'}
model_response finish reason 3: None
hold - False, model_response_str - Restaurant
model_response: ModelResponse(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(tool_calls=None, function_call=None, content='Restaurant', role=None))], created=1703264355, model='gpt-3.5-turbo', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
PROCESSED ASYNC CHUNK POST CHUNK CREATOR: ModelResponse(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(tool_calls=None, function_call=None, content='Restaurant', role=None))], created=1703264355, model='gpt-3.5-turbo', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
Logging Details LiteLLM-Success Call
line in async streaming: ModelResponse(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(tool_calls=None, function_call=None, content='Restaurant', role=None))], created=1703264355, model='gpt-3.5-turbo', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
success callbacks: [<litellm.proxy.hooks.parallel_request_limiter.MaxParallelRequestsHandler object at 0x7f40ab127f10>]
returned chunk: ModelResponse(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(tool_calls=None, function_call=None, content='Restaurant', role=None))], created=1703264355, model='gpt-3.5-turbo', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
INSIDE ASYNC STREAMING!!!
value of async completion stream: <openai.AsyncStream object at 0x7f40a7d0feb0>
value of async chunk: ChatCompletionChunk(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[Choice(delta=ChoiceDelta(content=' Name', function_call=None, role=None, tool_calls=None), finish_reason=None, index=0, logprobs=None)], created=1703264348, model='gpt-3.5-turbo-0613', object='chat.completion.chunk', system_fingerprint=None)
PROCESSED ASYNC CHUNK PRE CHUNK CREATOR: ChatCompletionChunk(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[Choice(delta=ChoiceDelta(content=' Name', function_call=None, role=None, tool_calls=None), finish_reason=None, index=0, logprobs=None)], created=1703264348, model='gpt-3.5-turbo-0613', object='chat.completion.chunk', system_fingerprint=None)
krrishdholakia commented 8 months ago

Acknowledging this - will work on it today. Thank you for the debugging so far, @gururise.

krrishdholakia commented 8 months ago

Looking at the raw TogetherAI call, it doesn't look like they're streaming in chunks.

Screenshot 2023-12-25 at 6 45 55 AM
krrishdholakia commented 8 months ago

Running with together_ai/mistralai/Mistral-7B-Instruct-v0.2, I'm unable to repro with a trivial example.

Screenshot 2023-12-25 at 7 25 30 AM Screenshot 2023-12-25 at 7 25 18 AM
krrishdholakia commented 8 months ago

Testing with this curl request:

curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data '{
  "model": "tgai-mistral",
  "messages": [
        {
          "role": "user",
          "content": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
        }
      ],
  "stream": true,
  "temperature": 0.1,
  "top_p": 0.95,
  "repetition_penalty": 1.2,
  "top_k": 50,
  "truncate": 3072,
  "max_new_tokens": 1024,
  "stop": ["</s>"]
}'

I'm unable to repro the large chunk problem (see "content" for each line)

Screenshot 2023-12-25 at 7 29 46 AM

cc: @gururise do you know what the exact call received by LiteLLM is?

krrishdholakia commented 8 months ago

@gururise bump on this.

nigh8w0lf commented 7 months ago

Seeing the same issue with formatting: TogetherAI with Mixtral-8x7B-Instruct-v0.1.

The output is not formatted, as reported above by the OP. I'm using the LiteLLM proxy server. I tried both Hugging Face Chat-UI and LibreChat; both had the same problem with formatting.

nigh8w0lf commented 7 months ago

I can see the tokens streamed individually as well, but like the OP mentioned, they are displayed a chunk at a time, as if the response is first cached until it hits some sort of limit and is then displayed in Chat-UI. Will test on LibreChat to see if it's the same behaviour.

nigh8w0lf commented 7 months ago

Same behaviour in LibreChat as well, so it looks like it's an issue with the proxy when using TogetherAI, and it happens with any model on TogetherAI.

nigh8w0lf commented 7 months ago

@gururise have you found a workaround for this issue, or are you not using Together's APIs?

gururise commented 7 months ago

> @gururise have you found a workaround for this issue, or are you not using Together's APIs?

Unfortunately, I have found no workaround in LiteLLM. I haven't had time to look further into this issue; perhaps if you have time to provide some more debugging information, @krrishdholakia can fix it.

krrishdholakia commented 7 months ago

I'll do some further testing here and try to repro this. I'm not seeing it when I just test the proxy chat completion endpoint with TogetherAI and streaming in Postman.

nigh8w0lf commented 7 months ago

Thanks @gururise @krrishdholakia - happy to help with debugging info.

krrishdholakia commented 7 months ago

@nigh8w0lf can you let me know if you're seeing this issue when making a normal curl request to the proxy endpoint?

And also the version of LiteLLM being used?

nigh8w0lf commented 7 months ago

@krrishdholakia I can see that the tokens are streamed when using curl or when running the proxy in debug mode; the chunking seems to happen when the tokens are displayed in HF Chat-UI and LibreChat. The formatting issue also only appears when the tokens are displayed in HF Chat-UI and LibreChat.
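
One way to narrow down where the buffering happens is to time each delta as it reaches a plain OpenAI client pointed at the proxy. This is just a sketch; the base URL, API key, and model alias are assumptions taken from the configs posted earlier in this thread:

```python
import time

from openai import OpenAI

# Standard OpenAI client pointed at the LiteLLM proxy (values from the thread's config).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-1234")

stream = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Write a short bulleted list."}],
    stream=True,
)

last = time.monotonic()
for chunk in stream:
    now = time.monotonic()
    delta = chunk.choices[0].delta.content or ""
    # Small, steady gaps with one token per delta mean the proxy streams fine
    # and any batching is happening in the UI client instead.
    print(f"+{now - last:.3f}s {delta!r}")
    last = now
```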

nigh8w0lf commented 7 months ago

Sorry, forgot to mention the LiteLLM version - I'm using 1.17.5.

nigh8w0lf commented 7 months ago

Updated to LiteLLM 1.17.14, still the same issue. Wondering if the chunking is because the API is too fast 😆

krrishdholakia commented 7 months ago

Is this then a client-side issue with LibreChat / HF Chat-UI?

cc: @Manouchehri I believe you're also using us with LibreChat - are you seeing similar buffering?

krrishdholakia commented 7 months ago

@nigh8w0lf do you see this buffering happening for a regular openai call via the proxy?

I remember trying LibreChat with Bedrock and that seemed to work fine.

Manouchehri commented 7 months ago

I've been using Azure OpenAI, Bedrock, and Cohere. None of them had this issue from what I remember. =)

nigh8w0lf commented 7 months ago

@krrishdholakia it doesn't happen with any other API, only with TogetherAI

gururise commented 7 months ago

@krrishdholakia Just to add, I tried HF Chat-UI with LiteLLM (OpenAI API) and it worked as expected. As @nigh8w0lf says, this issue only occurs when using LiteLLM with TogetherAI.

EDIT: If you look at the debug log I attached to an earlier comment, you can see that LiteLLM is returning escaped newline characters when used with TogetherAI.

ishaan-jaff commented 7 months ago

Related PR (I saw this with SageMaker as well): https://github.com/BerriAI/litellm/pull/1569

ishaan-jaff commented 7 months ago

Pushed a fix for TogetherAI; it will be live in 1.18.13.

https://github.com/BerriAI/litellm/commit/2d26875eb0b9bd347e9b9d8c7d6fced739d9d5be

@gururise @nigh8w0lf can I get your help confirming the issue is fixed on 1.18.3+ ?
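
For anyone verifying, a quick check with the Python SDK (a sketch; it assumes TOGETHERAI_API_KEY is set and reuses the model name from earlier comments) is to print the repr of each streamed delta: a literal backslash + n means the escaping bug is still there, while a real newline character means the fix works.

```python
import litellm

# Assumes TOGETHERAI_API_KEY is set in the environment; the model name is
# reused from earlier comments in this thread.
response = litellm.completion(
    model="together_ai/mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write two short lines separated by a blank line."}],
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content or ""
    # repr makes the difference visible: '\n' is a real newline, '\\n' is the bug.
    print(repr(delta))
```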

nigh8w0lf commented 7 months ago

The chunking issue seems to be fixed, I can see the response being streamed correctly.

The formatting issue is still present for TogetherAI API.

Seeing some new behavior after this update: HF Chat-UI thinks the response is incomplete. The Continue button appears after the response has been streamed completely. I have not seen this before, and it's happening with all APIs.

(screenshot attached) @gururise does it happen for you?

nigh8w0lf commented 7 months ago

Don't see the "Continue" button issue when using HF Chat-UI without the proxy.

krrishdholakia commented 7 months ago

Do you see this when calling openai directly? @nigh8w0lf

nigh8w0lf commented 7 months ago

> Do you see this when calling openai directly? @nigh8w0lf

No, I don't see it when using openai directly; I see it only when using the proxy.

nigh8w0lf commented 7 months ago

When I say directly, I mean using HF Chat-UI without LiteLLM.

nigh8w0lf commented 6 months ago

I have switched to LibreChat as the frontend. The "Continue" issue is no longer a concern, but the formatting issue still exists on v1.20.0.

ishaan-jaff commented 6 months ago

@nigh8w0lf can we track the formatting bug in a new issue, since this issue was about the TogetherAI chunks hanging?

nigh8w0lf commented 6 months ago

@ishaan-jaff Sure, I can log a new bug report. The initial bug report above mentions both issues, by the way, which is why I was continuing here.

nigh8w0lf commented 6 months ago

Logged the new issue here: https://github.com/BerriAI/litellm/issues/1792 @ishaan-jaff