[Multi Actions] handle cases where we have content and calls functions in the same message

fontanierh commented 5 months ago

It looks a bit like a gpt4 bug. In this case, we likely want to ignore the generation.

Here's an example run where it happened: https://dust.tt/w/78bda07b39/a/0e9889c787/runs/3243be92381110215723b48a155aa1a6d8f7ea3790d38acb322a3d52bc0b50ac

Right now, we handle it poorly since we show the generation being streamed, and then disappear when the function call message is complete, and then start streaming again with the "real" generation.

fontanierh commented 4 months ago

This is not very urgent, it happens quite rarely

PopDaph commented 4 months ago

Ignore generation or next event chain of thought TBD

PopDaph commented 4 months ago

(un-assigning since you created a bunch of more important tasks to handle for release)

fontanierh commented 4 months ago

Updated the card name to reflect that this is also (and now mostly) about Claude's "" stuff, i.e the chain of thought that comes with claude function calling.

We likely want to "productize" it.

fontanierh commented 4 months ago

@spolu As per IRL, there's no proper way to detect that the model is chain-of-thought-ing before streaming is complete. What we're going to do:

let it stream in the agent message [current behaviour]
let it disappear once the real generation starts [current behaviour]
For Anthropic, try to detect (in front) the "" tags so we can avoid showing them
Add a content field on every action table
save the chain of thought on the action (note: not really that perfect, because you can have several concurrent executions, so unclear where we'd store that ? maybe a new table 😬 )
display the chain of thought in the "view actions details" modal
a little animation to show the chain of thought "leave the agent message and fly into the side modal" would be awesome (cc @Duncid)

We think only 1 , 2 (already done) and 3 (todo) are really prio for now

spolu commented 4 months ago

Aligned

On 3 I guess we'd want to show it differently rather than "hide" it for Anthropic?

Good question on where to store it...

Maybe it's an Agent message thing after all and would have to be an array?

fontanierh commented 4 months ago

On 3 I guess we'd want to show it differently rather than "hide" it for Anthropic?

Yeah def, I meant hide the actual xml tag. The content should be displayed and likely differently indeed.

Maybe it's an Agent message thing after all and would have to be an array?

Not a massive fan of the array. I thiiiink a separate table linked to an agent message wouldn't be too horrible ?

fontanierh commented 4 months ago

So this reminds me -- how are we going to deal with the new action chips / drawer UI @flvndvd ? With the pulsing tag thing and "view actions details", we need a way to understand that the "action phase" is over and that we're now generating the real tokens 🤔

fontanierh commented 4 months ago

flvndvd commented 4 months ago

IMHO displaying CoT is a great addition and really fills a gap. Without overfitting too much on Anthropic, making the chain-of-thought messages look good could really help our users. I'd push to display it in the conversation UI without hiding it in the drawer, at least while the agent message isn't over. I'm pretty sure we can figure something out with @Duncid.

fontanierh commented 4 months ago

But still, we'd want to stop having the pulsating chip and show the "view action details" button if we're done running actions but still generating the actual message no ? I don't think the current design allows to do that, because you have no way to know in the frontend that the tokens being streamed are not just CoT tokens (you can only know a posteriori).

Maybe what we need is to always have the "view details" button available, and have the chip go brrr while the action is being executed ?

flvndvd commented 4 months ago

But still, we'd want to stop having the pulsating chip and show the "view action details" button if we're done running actions but still generating the actual message no ? I don't think the current design allows to do that, because you have no way to know in the frontend that the tokens being streamed are not just CoT tokens (you can only know a posteriori).

Maybe what we need is to always have the "view details" button available, and have the chip go brrr while the action is being executed ?

I missed this point in my earlier comment, but we definitely need an event that shows all actions are done. This way, we can ditch the pulsating chip and show the "View action details" CTA instead. Is this just happening with Anthropic, or are we seeing CoT messages with other models too?

We could take a straightforward approach and just filter out the "thinking" XML tag. Our UI tends to prioritize the message content, which usually works out fine. In the worst case, we might show some text briefly before the pulsating chip reappears. If we can spot this text, maybe this could be a decent workaround for now, right?

fontanierh commented 4 months ago

missed this point in my earlier comment, but we definitely need an event that shows all actions are done.

My point is that we can't really have such an event. There's no way to know that we're not going to get a function call after some streamed tokens. You can only know reliably what was CoT and what was actual generation after the whole run is complete (except in some very specific cases where the assistant exhausts all of his action steps)

So for most cases on Claude Opus, we can indeed rely on <thinking> tags, but that solution won't work for the smaller Claudes (that do CoT without <thinking>) and for the occasional GPT CoT. IMO we want a UI design that is more resilient to not really knowing if we're done running actions for this run.

flvndvd commented 4 months ago

💯Agree!

Duncid commented 4 months ago

Hey! Catching up here.

My understanding:

CoT events can happen at multiple moments during a multiple action sequence (every step?)
We can't make the difference between normal message and CoT message at this point

We can't make the difference between normal message and CoT message at this point

If that is correct, it's going to be impossible to do something good UI wise. I would work in the direction of being able to make the difference with certainty.

If we manage to track those with certainty, we have plenty of ways we can use them in the UI, in relation with the status chip.

Do we have examples of conversations with CoT?

spolu commented 4 months ago

We wont' have certainty due to the nature of models :/

@fontarnierh is that accurate that Opus will ~ always use the tag? Is that the case for smaller models?

fontanierh commented 4 months ago

@spolu

@fontarnierh is that accurate that Opus will ~ always use the tag? Is that the case for smaller models?

Opus does it most of the time. Not always. Smaller models seem to rarely use the tag from what I can tell.

@Duncid

I would work in the direction of being able to make the difference with certainty.

I believe it's a limitation of the current technology, there's not really anything we can do IMO. I think there might be ways to make it better no ? Let the text disappear from the agent message with some nice animation, have it go somewhere else (eg the drawer) etc ?

fontanierh commented 4 months ago

@Duncid

Do we have examples of conversations with CoT?

Try to make a simple assistant with >= 1 action using Claude 3 Opus. You'll get CoT almost everytime.

fontanierh commented 4 months ago

You can try with @chainOfGod @Duncid

Duncid commented 4 months ago

@fontanierh can you help me make sense of what I'm seeing?

https://dust.tt/w/0ec9852c2f/assistant/0e0c65e06a

<thinking> is clearly CoT, what about the other <...>? Sometimes, <thinking> stays, sometimes, it disappear before final msg generation.

fontanierh commented 4 months ago

Sometimes, <thinking> stays

So typically, there is some of that <thinking> before the actions and before the "real" message of the agent (that's the chain of tought). Then, in the real message, we often have more of that <thinking> stuff, it's kind of CoT but it is part of the real generation. Claude Opus also often output some other XML stuff such as <result> or <search_score> or things like that... usually in the generation

However, the smaller claude models, eg sonnet, (and sometimes GPT4) do CoT without any tags or specific markers.

So there are 2 separate things you are seeing:

CoT, which is content that is streamed before the real generation, and often before the model executes any action (and also sometimes a little bit in between actions or after the actions). That content then disappears and is replaced by the "real" generation
Claude Opus XML tags. It likes to do this. We can decide to hide the XML tags (or productize them in some way), but this is very specific to Opus, and hard to rely on because they are not consistent (<thinking> is very common but still not a 100% thing)

Does that make sense ? We can chat about it if needed

fontanierh commented 4 months ago

for 1., it would be cool to have a way UI-wise to make the "disappearing" part not too bad (maybe with some fade-out ? and maybe we can see it in the Actions Details)

Duncid commented 4 months ago

Principles

The "agent message" is the assistant's answer. The user reads it while it streams and considers it high-value info. Therefore, anything written in that space should be stable (stay in place) and never disappear.

XML stuff

XML should not appear in messages unless the user directly asks for it.

That is tricky because we'll have difficulty making the difference.
Are we sure Claude does not answer with XML-like tags because of the way we prompt it? I don't think Claude AI in the Claude AI chat does that. It reminds me of my soupinou making JSON yesterday.

So maybe solution here is to investigate why these XML stuff are here and how does claude deal with it in its own integration.

CoT

Everything clearly identified as CoT could be used in the UI, in connection with the status chip and logged in the Actions Details.

However, it's difficult to form an opinion:

It feels like it's making the process woefully slow (stream the CoT, then stream an answer)
Messages can be very long:

The user has asked me to search for information about Glean. I don't have any prior context about what Glean is, so I will need to do a broad search to try to understand the key facts about it, such as:
What type of product or service is Glean? Is it a company, software, app, or something else?
What does Glean do? What problem does it aim to solve or what benefits does it provide?
Who are the target users or customers for Glean?
If Glean is a company, who founded it and what is their background? Where is the company located?
How does Glean work or what are its key features?
How is Glean different from or better than alternatives or competitors?
Has Glean received any funding, awards, press coverage or other notable recognition?

Sometimes the thinking is redundant with the message:

<thinking> The user is asking what I can do for them. To answer this, I should explain my capabilities at a high level, including:
I have access to web browsing and search tools that allow me to find information online to help answer questions and assist with tasks
I can engage in general conversation, answer questions, and provide explanations on a wide range of topics
I can help break down complex problems, provide analysis and insights, and offer suggestions or advice
I can assist with writing, proofreading and editing text
I can help with basic math, calculations, and data analysis
I don't think I need to use any specific tools to answer this question, as the user is just looking for a general overview of my capabilities. I can provide that explanation directly without needing to search for or browse any external information.
</thinking>
There are a number of ways I can assist you:
I have access to web browsing and search capabilities, so I can help find information and answer questions on a wide variety of topics. If there's something you're curious about or a question you need answered, I can search the web for relevant information to provide an explanation or response.
I can engage in open-ended conversation and provide a friendly ear if you just want to chat. I'm happy to discuss your interests, hobbies, current events, or whatever is on your mind.
If you have a complex problem or situation you're dealing with, I can help break it down, offer analysis and insights, and provide suggestions or advice. I can be a sounding board to talk through challenges with.
For any writing tasks, whether it's essays, articles, stories, scripts, etc., I can assist with drafting, proofreading, editing and providing feedback to help improve your writing.
I can also help out with any basic math you need - calculations, conversions, solving equations, data analysis and visualization, probability and statistics problems, etc.
Let me know if any of those types of assistance would be helpful, or if there are other ways I can lend a hand! I'm here to help in whatever way I can.

So IMO the problem is that there are different philosophy on handling MultiAction:

Our and OpenAI philosophy: We don't strongly narrate the operations and reasoning of the AI. AI decides to execute n action, then decides to execute a new step of n action or not, but no text is streamed to explain why
Anthropic approach is more AutoGPT like and comments at every step why it understands of the demand and rational behind further actions

We could decide to take the Anthropic way and make steps more part of message, in which case CoT would be integrated, each step would have its status chip.

I believe more in the OpenAI approach and I would just let CoT in the "ActionDetails", We could work in making Action Details streamable—User can open the drawer and follow actions. CoT are streamed there.

fontanierh commented 4 months ago

@Duncid

I believe more in the OpenAI approach and I would just let CoT in the "ActionDetails", We could work in making Action Details streamable—User can open the drawer and follow actions. CoT are streamed there.

Agreed, but the thing is we cannot understand that tokens being streamed are CoT in a lot of cases. We only know that a posteriori. So we are forced to let those tokens stream in the agent message.

The question is more what we do with those tokens (that are already in the agent message body in the UI) once we realize they were just CoT tokens.

fontanierh commented 4 months ago

It feels like it's making the process woefully slow (stream the CoT, then stream an answer)

Def agreed as well. I prefer not having CoT. But we support Claude, and with Claude we cannot opt out of CoT. Even if you insist very hard that you don't want it, it does it.

fontanierh commented 4 months ago

IMO as long as we're "multi model", we don't really get to pick if we prefer to follow the OpenAI or the Anthropic solution, we need a UI that is resilient to both setups.

spolu commented 4 months ago

We are all in agreement on what we believe is the best outcome from a product standpoint.

Now we have constraints related to each model behavior:

gpt4: rare case of generation along with tool use
anthropic: systematic generation along with tool use

The real problem is therefore Anthropic.

I think we want to find a UI trick to make it OK and lean hard on them to provide more information about this case. Really they just need to re-align their model to emit a token to flag CoT tokens as such before moving to generation or tool use.

Duncid commented 4 months ago

Why don't we just ignore Anthropic CoT?

fontanierh commented 4 months ago

Again @Duncid, we don't know it's CoT until the whole agent message is done streaming. We cannot ignore it.

Duncid commented 4 months ago

Well, there is no magic here.

If we don't know how to spot CoT, we can't handle them in UI!

fontanierh commented 4 months ago

We know they are CoT after. They disappear from the agent message. We need to make that not look too bad.

fontanierh commented 4 months ago

The requirements are:

We need to have CoT tokens stream into the agent message (we cannot avoid that)
At some point, we understand the tokens streamed so far were actually CoT (because we get some event that indicates another generation is beginning -- could be either the "real generation" or another CoT)
When this happens, we want to do something with those CoT tokens that are in the agent message. Maybe some kind of animation where they disappear and go into the actions details ?

fontanierh commented 4 months ago

We can think of those CoT tokens as some kind of spinner / loading state in the agent message that we are forced to have

spolu commented 4 months ago

Also with Opus with ~always have a <thinking> hint that this is CoT. But less often for smaller models

fontanierh commented 4 months ago

Also with Opus with ~always have a hint that this is CoT. But less often for smaller models

Always-ish. We sometimes don't have it. Sometimes it's a different xml tag

fontanierh commented 4 months ago

Almost never with the smaller models in my XP. And 100% never when it's gpt4 (but rare)

dust-tt / dust

[Multi Actions] handle cases where we have content and calls functions in the same message #5159

Principles

XML stuff

CoT