Long delay when using streaming + tools

holdenmatt commented 2 months ago

(Sorry if this isn't the right place to report this, I wasn't sure).

I'm trying to switch from gpt-4o to claude-3.5-sonnet in an app I'm building, but high streaming tool latency is preventing me from doing so. Looks like this was discussed in #454 but wondering how I should proceed?

The total latency of Claude vs gpt-4o is pretty similar, and I think fine.

The issue is that Claude waits a long time before any content is streamed (I often see ~5s delays vs ~500ms for gpt-4o). This is a poor user experience in my app, because users get no feedback that any generation is happening. This will prevent me from switching, even though I much prefer Claude's output quality!

Do you have any plans to fix this? Or do you recommend not using tools + streaming with Claude?

Example timing and test code below, if helpful.

Timing comparison

claude-3-5-sonnet: Stream created at 0ms First content received: 4645ms Streaming time: 46ms Total time: 4691ms

gpt-4o: Stream created at 343ms First content received: 368ms Streaming time: 2100ms Total time: 2468ms

Test code:

import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

const openai = new OpenAI();
const anthropic = new Anthropic();

const provider: "anthropic" | "openai" = "anthropic";

export async function POST() {
  const startTime = performance.now();
  let streamCreated: number | undefined = undefined;
  let firstContentReceived: number | undefined = undefined;

  const messages = [
    {
      role: "user" as const,
      content: `Write a poem about pirates.`,
    },
  ];

  const schema = {
    type: "object" as const,
    properties: {
      poem: { type: "string", description: "The poem" },
    },
    required: ["poem"],
  };

  if (provider === "openai") {
    const stream = await openai.chat.completions.create({
      model: "gpt-4o",
      stream: true,
      messages,
      tools: [
        {
          type: "function",
          function: {
            name: "poem",
            description: "Generate a poem",
            parameters: schema,
          },
        },
      ],
    });

    streamCreated = performance.now();

    for await (const chunk of stream) {
      console.log(JSON.stringify(chunk.choices[0]?.delta?.tool_calls, null, 2));
      if (firstContentReceived === undefined) {
        firstContentReceived = performance.now();
      }
    }
  } else if (provider === "anthropic") {
    const stream = anthropic.messages
      .stream({
        model: "claude-3-5-sonnet-20240620",
        max_tokens: 2000,
        messages,
        tools: [
          {
            name: "poem",
            description: "Generate a poem",
            input_schema: schema,
          },
        ],
      })
      // When a JSON content block delta is encountered this
      // event will be fired with the delta and the currently accumulated object
      .on("inputJson", (delta, snapshot) => {
        console.log(`delta: ${delta}`);
        if (firstContentReceived === undefined) {
          firstContentReceived = performance.now();
        }
      });

    streamCreated = performance.now();
    await stream.done();
  }

  const endTime = performance.now();
  if (streamCreated) {
    console.log(`Stream created at ${Math.round(streamCreated - startTime)}ms`);
  }
  if (firstContentReceived) {
    console.log(
      `First content received: ${Math.round(firstContentReceived - startTime)}ms`,
    );
    console.log(`Streaming time: ${Math.round(endTime - firstContentReceived)}ms`);
  }
  console.log(`Total time: ${Math.round(endTime - startTime)}ms`);
}

samj-anthropic commented 2 months ago

Hi @holdenmatt, unfortunately this is a model limitation (same issue noted in https://github.com/anthropics/anthropic-sdk-typescript/issues/454#issuecomment-2221073472). We're planning on improving this with future models.

holdenmatt commented 2 months ago

I see, thanks. If I want faster streaming, would you recommend I move away from tools and try to coax a JSON schema via the system prompt instead?

samj-anthropic commented 2 months ago

Hi @holdenmatt -- one clarification to the above: we stream out each key/value pair together, so long values will result in buffering (the delays you're seeing). In the example you provided, Claude is producing a poem (a long string) as a value, which is why you're seeing the delay. However, a large object with many smaller keys/values wouldn't have this issue.

If I want faster streaming, would you recommend I move away from tools and try to coax a JSON schema via the system prompt instead?

That could work, this delay you're seeing should only be happening in that specific kind of tool use (where Claude is producing long keys/values).

holdenmatt commented 2 months ago

Ah, that would explain why I run into this but other folks I talk to haven't seen it.

The specific use case for me is generating LaTeX code from text prompts for https://texsandbox.com/

The latex output could be long, depending on the prompt. The reason I use function calling instead of text completion is I want to allow the model to "branch" between the good "latex" case and an "error" case if it doesn't know what to do, or eg the input prompt doesn't make sense.

I could avoid tools here if that would improve streaming, but I'd need some other way to signal "this is valid code" vs "this is an error message"

holdenmatt commented 1 month ago

fyi - I fixed this by moving away from tool calling, and streaming now feels fast again.

I hacked my own poor man's function calling on text generation, by prompting the model to write latex or error on the first line, followed by code or an error message.

This works fine (so you can close this if you like) but it was the biggest issue I ran into switching from gpt-4o to claude-3.5-sonnet. I quite often use functions/tools with long JSON values, so feature request to improve this in the future. Thanks!

ZECTBynmo commented 1 month ago

Is there an issue we can track for improvements to streaming + tool use, or do you plan to post updates here?

Kitenite commented 3 weeks ago

Hey team, is there a planned date for fixing this? This is a big limiter for our user experience for code-gen. Since the result is returned as a stream anyway, is there a way to get those delta earlier?

darylsew commented 2 weeks ago

+1, think this basically makes tool use not viable for our use case - not limited to the typescript API, also a problem in python

Kitenite commented 2 weeks ago

+1, think this basically makes tool use not viable for our use case - not limited to the typescript API, also a problem in python

If this helps, there's a hacky workaround similar to the solution mentioned above that's currently working for me and someone else by streaming raw text and forcing a JSON format. Then progressively resolve the text into the partial object. It's surprisingly reliable so far.

https://github.com/vercel/ai/issues/3422#issuecomment-2450459211

anthropics / anthropic-sdk-typescript

Long delay when using streaming + tools #529

Timing comparison

Test code: