BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Bug]: Anthropic Function Streaming doesn't pass "stream" parameter. #3728

Closed EricWF closed 5 months ago

EricWF commented 5 months ago

What happened?

It appears the library fails to pass the "stream" parameter to Anthropic when creating streaming messages with tooling.

The attached log comes from running the test_acompletion_claude_3_function_call_with_streaming function.

Notice that the stream parameter is not passed in the cURL command. As a result, the response isn't a streaming response, and so streaming tool calls do not work.

Manually re-running the curl command with '"stream": true' appended to the end of the payload corrects the issue.

A little digging suggests that litellm is not set up to handle streaming function calls at all: the strings "input_json_delta" and "partial_json" are present in the Anthropic streaming output but are found nowhere in the litellm source code.

In the meantime, it might be worth updating the documentation to reflect the lack of support, and making the test_acompletion_claude_3_function_call_with_streaming test fail when the response does not actually stream.
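For reference, a minimal async repro sketch matching the logged request below (this assumes ANTHROPIC_API_KEY is set; the model and tool schema are copied straight from the log):

```python
# Minimal repro sketch matching the logged request below.
# Assumes ANTHROPIC_API_KEY is set in the environment.
import asyncio
import litellm

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

async def main():
    response = await litellm.acompletion(
        model="claude-3-opus-20240229",
        messages=[{"role": "user", "content": "What's the weather like in Boston today in fahrenheit?"}],
        tools=tools,
        tool_choice="auto",
        stream=True,
    )
    # With the bug present, the chunks only arrive after the whole
    # response has completed, rather than streaming incrementally.
    async for chunk in response:
        print(chunk)

asyncio.run(main())
```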

Relevant log output

Request to litellm:
litellm.acompletion(model='claude-3-opus-20240229', messages=[{'role': 'user', 'content': "What's the weather like in Boston today in fahrenheit?"}], tools=[{'type': 'function', 'function': {'name': 'get_current_weather', 'description': 'Get the current weather in a given location', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 'unit': {'type': 'string', 'enum': ['celsius', 'fahrenheit']}}, 'required': ['location']}}}], tool_choice='auto', stream=True)

self.optional_params: {}
ASYNC kwargs[caching]: False; litellm.cache: None; kwargs.get('cache'): None
Final returned optional params: {'stream': True, 'tools': [{'type': 'function', 'function': {'name': 'get_current_weather', 'description': 'Get the current weather in a given location', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 'unit': {'type': 'string', 'enum': ['celsius', 'fahrenheit']}}, 'required': ['location']}}}]}
self.optional_params: {'stream': True, 'tools': [{'type': 'function', 'function': {'name': 'get_current_weather', 'description': 'Get the current weather in a given location', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 'unit': {'type': 'string', 'enum': ['celsius', 'fahrenheit']}}, 'required': ['location']}}}]}

POST Request Sent from LiteLLM:
curl -X POST \
https://api.anthropic.com/v1/messages \
-H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -H 'x-api-key: no-way-jose-********************' -H 'anthropic-beta: tools-2024-05-16' \
-d '{'model': 'claude-3-opus-20240229', 'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': "What's the weather like in Boston today in fahrenheit?"}]}], 'tools': [{'name': 'get_current_weather', 'description': 'Get the current weather in a given location', 'input_schema': {'type': 'object', 'properties': {'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 'unit': {'type': 'string', 'enum': ['celsius', 'fahrenheit']}}, 'required': ['location']}}], 'max_tokens': 4096}'


krrishdholakia commented 5 months ago

we don't make a stream request for tooling - we make a regular call, translate it, and return it in a streamed response.

The reason is that it's hard to translate tool-calling streaming responses across formats.

Open to suggestions on how we can improve here. If you have a version of this that does work across formats, we'd welcome a PR! @EricWF
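To spell out the behavior being described (a rough sketch; all helper names here are hypothetical, not litellm internals):

```python
from typing import AsyncIterator

async def call_anthropic_blocking(request: dict) -> dict:
    ...  # placeholder: one ordinary (non-streaming) HTTP call

def translate_to_openai_chunks(full_response: dict) -> list[dict]:
    ...  # placeholder: map the complete tool call into delta chunks

async def acompletion_with_tools(request: dict) -> AsyncIterator[dict]:
    # No "stream": true is sent upstream, so Anthropic returns one
    # complete response...
    full_response = await call_anthropic_blocking(request)
    # ...which is then re-chunked, so the caller still iterates over a
    # "stream", but only after the whole generation has finished.
    for chunk in translate_to_openai_chunks(full_response):
        yield chunk
```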

EricWF commented 5 months ago

I'm very surprised to learn this isn't a supported use case, after seeing a number of commits about "anthropic streaming tool calls".

It's not difficult to translate between the two. I did some digging into the litellm code, and it seems viable to implement.

Here's an OpenAI tool call chunk:

ChatCompletionChunk(
    ...
    choices=[
        Choice(
            delta=ChoiceDelta(
                ...
                tool_calls=[
                    ChoiceDeltaToolCall(
                        index=0,
                        id=None,
                        function=ChoiceDeltaToolCallFunction(
                            arguments='ASTIC',
                            name=None
                        ),
                        type=None
                    )
                ]
            ),
            ...
        )
    ],
   ...
)

And here's the equivalent for Anthropic:

ToolsBetaContentBlockDeltaEvent(
    delta=
        InputJsonDelta(
            partial_json='ge": "HELLO', 
            type='input_json_delta'
        ), 
    index=1,
    type='content_block_delta'
)

The partial_json translates directly to arguments in the tool call.

The introduction of the tool call also happens in almost exactly the same manner. Again, here's OpenAI:

ChoiceDeltaToolCall(
    index=0,
    id='call_qhj5Sb80ZOruV5bbS8uCvPwg',
    function=
        ChoiceDeltaToolCallFunction(
            arguments='',
            name='yell_really_really_really_loudly'
        ), 
    type='function'
)

And from Anthropic:

ToolsBetaContentBlockStartEvent(
    content_block=
        ToolUseBlock(
            id='toolu_015DeyCWQQbvdgzLdZrVyvnH',
            input={}, 
            name='yell_really_really_really_loudly',
            type='tool_use'
        ), 
   index=1,
   type='content_block_start'
)

As you can see, there's a pretty direct mapping from the streaming responses of anthropic to the streaming responses of openai.
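Concretely, a sketch of that mapping on plain dicts (illustrative only; the field shapes are taken from the examples above, not from litellm internals):

```python
# Illustrative translation of Anthropic streaming tool-call events into
# OpenAI-style tool-call deltas, using the event shapes shown above.
def anthropic_event_to_openai_delta(event: dict) -> dict | None:
    if event["type"] == "content_block_start" and event["content_block"]["type"] == "tool_use":
        # Introduction of the tool call: carries the id and name, with
        # empty arguments, just like OpenAI's first delta.
        block = event["content_block"]
        return {
            "index": 0,
            "id": block["id"],
            "type": "function",
            "function": {"name": block["name"], "arguments": ""},
        }
    if event["type"] == "content_block_delta" and event["delta"]["type"] == "input_json_delta":
        # partial_json maps directly onto the arguments fragment.
        return {
            "index": 0,
            "id": None,
            "type": None,
            "function": {"name": None, "arguments": event["delta"]["partial_json"]},
        }
    return None  # other events (text deltas, message stops, ...) handled elsewhere
```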

At the moment, however, I don't have the cycles to implement it myself, at least not on top of the existing complexity in litellm. I picked up litellm so I wouldn't have to implement it myself :-(

krrishdholakia commented 5 months ago

That's a fair point @EricWF - i remember looking at this back when anthropic was returning xml and choosing to wait for the complete response. Worth revisiting with their new format.

Curious - Aren't you still rebuilding the chunks to form a complete tool call?

Why doesn't the existing implementation solve your problem?

EricWF commented 5 months ago

Ah, so I've built a chat TUI, and for certain tool calls (save_file, for example) I stream the contents of the file, with syntax highlighting, as they arrive. I have to do some funky tricks to invent valid JSON from the partial response, but it works rather nicely.
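Roughly this kind of trick (a simplified sketch, not my exact code; real inputs need more care around escapes and where the truncation lands):

```python
import json

def complete_partial_json(fragment: str) -> str:
    """Close any open strings/objects/arrays so a JSON prefix parses.

    Simplified sketch: tracks open brackets and quoting, then appends
    the missing closers. Breaks on e.g. truncation mid-escape or a key
    cut off before its value.
    """
    stack, in_string, escaped = [], False, False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            stack.pop()
    return fragment + ('"' if in_string else "") + "".join(reversed(stack))

# e.g. json.loads(complete_partial_json('{"message": "HELLO'))
# -> {'message': 'HELLO'}
```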

Even when I can't pretty-print, it's still a better experience to see the JSON arrive. In the end I need to reconstruct the full tool call to actually call it, but the responsiveness of live printing is worth the effort.

Additionally, because I'm passing the tools with every message, the limitation prevents the streaming of non-tool messages too (IIRC; I may be wrong about that).

Thanks for taking the time to discuss this further.

azgo14 commented 5 months ago

I was also debugging this discrepancy today and was surprised that the Anthropic + tools use case was not streaming.

My use case is to construct pydantic objects from the tool response in a streaming manner (like how it's done in https://github.com/jxnl/instructor) and stream that response from the API to my frontend.
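Conceptually something like this (the model here is hypothetical; instructor does it far more robustly):

```python
from pydantic import BaseModel, ValidationError

# Re-validate the accumulated tool-call arguments on every chunk,
# yielding a progressively complete object once enough JSON has arrived.
class Weather(BaseModel):
    location: str
    unit: str | None = None

def try_parse(buffer: str) -> Weather | None:
    try:
        return Weather.model_validate_json(buffer)
    except ValidationError:
        return None  # accumulated arguments JSON isn't complete yet
```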

krrishdholakia commented 5 months ago

Got it - i'm assuming both of you show the response to your users as it's coming in? @EricWF @azgo14

azgo14 commented 5 months ago

yup

EricWF commented 5 months ago

@krrishdholakia Indeed.

bachya commented 5 months ago

Not sure if it's related, but I can confirm that in 1.40.0, even "regular" (non-tooling) streaming requests don't work with Anthropic like they used to: litellm appears to wait for the entire response, then send all the chunks back at once. Nothing more to add to @EricWF's message, other than to say that if I, too, manually adjust the cURL request being made to include stream: true, everything works as expected.

krrishdholakia commented 5 months ago

how do you adjust the curl? @bachya

krrishdholakia commented 5 months ago

i'll try to repro with a fix by tomorrow

bachya commented 5 months ago

> how do you adjust the curl? @bachya

Same as @EricWF: I'm taking the cURL request from the logs and manually re-running it with that parameter in place.

krrishdholakia commented 5 months ago

@bachya found the issue - it wasn't passing the 'stream' param in the async call to the httpx client.

Fixed it - https://github.com/BerriAI/litellm/commit/5e12307a48bb21c5ec308899a87247bd6a4a78cd

should be live soon in v1.40.1
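For anyone following along: Anthropic's Messages API only streams when the request body contains "stream": true. A sketch of what a corrected request looks like (hypothetical helper; see the linked commit for litellm's actual change):

```python
import httpx

async def stream_messages(payload: dict, api_key: str):
    payload = {**payload, "stream": True}  # the flag the buggy path omitted
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            "https://api.anthropic.com/v1/messages",
            headers={
                "x-api-key": api_key,
                "anthropic-version": "2023-06-01",
                "content-type": "application/json",
            },
            json=payload,
        ) as response:
            async for line in response.aiter_lines():
                yield line  # server-sent events, one per line
```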

bachya commented 5 months ago

> @bachya found the issue - it wasn't passing the 'stream' param in the async call to the httpx client.
>
> Fixed it - https://github.com/BerriAI/litellm/commit/5e12307a48bb21c5ec308899a87247bd6a4a78cd
>
> should be live soon in v1.40.1

Appreciate you, @krrishdholakia!

bachya commented 5 months ago

@krrishdholakia Any timing on when the new release will be cut?

bachya commented 5 months ago

@krrishdholakia I've noticed we're now on 1.40.3 and https://github.com/BerriAI/litellm/commit/5e12307a48bb21c5ec308899a87247bd6a4a78cd doesn't appear to be included; intended?

krrishdholakia commented 5 months ago

hey @bachya i see it included - see the tag

[screenshot: 2024-06-05, showing the commit with its release tag]

i also see it live on main https://github.com/BerriAI/litellm/blob/94e42dd06342b7e8a8669621ab6d1bd171cb478d/litellm/llms/anthropic.py#L164

are you still seeing this?

bachya commented 5 months ago

@krrishdholakia Ahh, missed that: GitHub hid the little release breadcrumb bar. Just tested 1.40.1 and it worked great; thank you!

krrishdholakia commented 5 months ago

Great! closing ticket then