Closed fontanierh closed 4 months ago
This is not very urgent, it happens quite rarely
Ignore generation or next event chain of thought TBD
(un-assigning since you created a bunch of more important tasks to handle for release)
Updated the card name to reflect that this is also (and now mostly) about Claude's "
We likely want to "productize" it.
@spolu As per IRL, there's no proper way to detect that the model is chain-of-thought-ing before streaming is complete. What we're going to do:
content
field on every action tableWe think only 1 , 2 (already done) and 3 (todo) are really prio for now
Aligned
On 3 I guess we'd want to show it differently rather than "hide" it for Anthropic?
Good question on where to store it...
Maybe it's an Agent message thing after all and would have to be an array?
On 3 I guess we'd want to show it differently rather than "hide" it for Anthropic?
Yeah def, I meant hide the actual xml tag. The content should be displayed and likely differently indeed.
Maybe it's an Agent message thing after all and would have to be an array?
Not a massive fan of the array. I thiiiink a separate table linked to an agent message wouldn't be too horrible ?
So this reminds me -- how are we going to deal with the new action chips / drawer UI @flvndvd ? With the pulsing tag thing and "view actions details", we need a way to understand that the "action phase" is over and that we're now generating the real tokens 🤔
related to https://github.com/dust-tt/dust/issues/5350
IMHO displaying CoT is a great addition and really fills a gap. Without overfitting too much on Anthropic, making the chain-of-thought messages look good could really help our users. I'd push to display it in the conversation UI without hiding it in the drawer, at least while the agent message isn't over. I'm pretty sure we can figure something out with @Duncid.
But still, we'd want to stop having the pulsating chip and show the "view action details" button if we're done running actions but still generating the actual message no ? I don't think the current design allows to do that, because you have no way to know in the frontend that the tokens being streamed are not just CoT tokens (you can only know a posteriori).
Maybe what we need is to always have the "view details" button available, and have the chip go brrr while the action is being executed ?
But still, we'd want to stop having the pulsating chip and show the "view action details" button if we're done running actions but still generating the actual message no ? I don't think the current design allows to do that, because you have no way to know in the frontend that the tokens being streamed are not just CoT tokens (you can only know a posteriori).
Maybe what we need is to always have the "view details" button available, and have the chip go brrr while the action is being executed ?
I missed this point in my earlier comment, but we definitely need an event that shows all actions are done. This way, we can ditch the pulsating chip and show the "View action details" CTA instead. Is this just happening with Anthropic, or are we seeing CoT messages with other models too?
We could take a straightforward approach and just filter out the "thinking" XML tag. Our UI tends to prioritize the message content, which usually works out fine. In the worst case, we might show some text briefly before the pulsating chip reappears. If we can spot this text, maybe this could be a decent workaround for now, right?
missed this point in my earlier comment, but we definitely need an event that shows all actions are done.
My point is that we can't really have such an event. There's no way to know that we're not going to get a function call after some streamed tokens. You can only know reliably what was CoT and what was actual generation after the whole run is complete (except in some very specific cases where the assistant exhausts all of his action steps)
So for most cases on Claude Opus, we can indeed rely on <thinking>
tags, but that solution won't work for the smaller Claudes (that do CoT without <thinking>
) and for the occasional GPT CoT.
IMO we want a UI design that is more resilient to not really knowing if we're done running actions for this run.
💯Agree!
Hey! Catching up here.
My understanding:
We can't make the difference between normal message and CoT message at this point
If that is correct, it's going to be impossible to do something good UI wise. I would work in the direction of being able to make the difference with certainty.
If we manage to track those with certainty, we have plenty of ways we can use them in the UI, in relation with the status chip.
Do we have examples of conversations with CoT?
We wont' have certainty due to the nature of models :/
@fontarnierh is that accurate that Opus will ~ always use the
@spolu
@fontarnierh is that accurate that Opus will ~ always use the tag? Is that the case for smaller models?
Opus does it most of the time. Not always. Smaller models seem to rarely use the tag from what I can tell.
@Duncid
I would work in the direction of being able to make the difference with certainty.
I believe it's a limitation of the current technology, there's not really anything we can do IMO. I think there might be ways to make it better no ? Let the text disappear from the agent message with some nice animation, have it go somewhere else (eg the drawer) etc ?
@Duncid
Do we have examples of conversations with CoT?
Try to make a simple assistant with >= 1 action using Claude 3 Opus. You'll get CoT almost everytime.
You can try with @chainOfGod
@Duncid
@fontanierh can you help me make sense of what I'm seeing?
https://dust.tt/w/0ec9852c2f/assistant/0e0c65e06a
<thinking>
is clearly CoT, what about the other <...>
?
Sometimes, <thinking>
stays, sometimes, it disappear before final msg generation.
Sometimes,
<thinking>
stays
So typically, there is some of that <thinking>
before the actions and before the "real" message of the agent (that's the chain of tought).
Then, in the real message, we often have more of that <thinking>
stuff, it's kind of CoT but it is part of the real generation.
Claude Opus also often output some other XML stuff such as <result>
or <search_score>
or things like that... usually in the generation
However, the smaller claude models, eg sonnet, (and sometimes GPT4) do CoT without any tags or specific markers.
So there are 2 separate things you are seeing:
<thinking>
is very common but still not a 100% thing)Does that make sense ? We can chat about it if needed
for 1., it would be cool to have a way UI-wise to make the "disappearing" part not too bad (maybe with some fade-out ? and maybe we can see it in the Actions Details)
The "agent message" is the assistant's answer. The user reads it while it streams and considers it high-value info. Therefore, anything written in that space should be stable (stay in place) and never disappear.
XML should not appear in messages unless the user directly asks for it.
So maybe solution here is to investigate why these XML stuff are here and how does claude deal with it in its own integration.
Everything clearly identified as CoT could be used in the UI, in connection with the status chip and logged in the Actions Details.
However, it's difficult to form an opinion:
The user has asked me to search for information about Glean. I don't have any prior context about what Glean is, so I will need to do a broad search to try to understand the key facts about it, such as:
What type of product or service is Glean? Is it a company, software, app, or something else?
What does Glean do? What problem does it aim to solve or what benefits does it provide?
Who are the target users or customers for Glean?
If Glean is a company, who founded it and what is their background? Where is the company located?
How does Glean work or what are its key features?
How is Glean different from or better than alternatives or competitors?
Has Glean received any funding, awards, press coverage or other notable recognition?
<thinking> The user is asking what I can do for them. To answer this, I should explain my capabilities at a high level, including:
I have access to web browsing and search tools that allow me to find information online to help answer questions and assist with tasks
I can engage in general conversation, answer questions, and provide explanations on a wide range of topics
I can help break down complex problems, provide analysis and insights, and offer suggestions or advice
I can assist with writing, proofreading and editing text
I can help with basic math, calculations, and data analysis
I don't think I need to use any specific tools to answer this question, as the user is just looking for a general overview of my capabilities. I can provide that explanation directly without needing to search for or browse any external information.
</thinking>
There are a number of ways I can assist you:
I have access to web browsing and search capabilities, so I can help find information and answer questions on a wide variety of topics. If there's something you're curious about or a question you need answered, I can search the web for relevant information to provide an explanation or response.
I can engage in open-ended conversation and provide a friendly ear if you just want to chat. I'm happy to discuss your interests, hobbies, current events, or whatever is on your mind.
If you have a complex problem or situation you're dealing with, I can help break it down, offer analysis and insights, and provide suggestions or advice. I can be a sounding board to talk through challenges with.
For any writing tasks, whether it's essays, articles, stories, scripts, etc., I can assist with drafting, proofreading, editing and providing feedback to help improve your writing.
I can also help out with any basic math you need - calculations, conversions, solving equations, data analysis and visualization, probability and statistics problems, etc.
Let me know if any of those types of assistance would be helpful, or if there are other ways I can lend a hand! I'm here to help in whatever way I can.
So IMO the problem is that there are different philosophy on handling MultiAction:
We could decide to take the Anthropic way and make steps more part of message, in which case CoT would be integrated, each step would have its status chip.
I believe more in the OpenAI approach and I would just let CoT in the "ActionDetails", We could work in making Action Details streamable—User can open the drawer and follow actions. CoT are streamed there.
@Duncid
I believe more in the OpenAI approach and I would just let CoT in the "ActionDetails", We could work in making Action Details streamable—User can open the drawer and follow actions. CoT are streamed there.
Agreed, but the thing is we cannot understand that tokens being streamed are CoT in a lot of cases. We only know that a posteriori. So we are forced to let those tokens stream in the agent message.
The question is more what we do with those tokens (that are already in the agent message body in the UI) once we realize they were just CoT tokens.
It feels like it's making the process woefully slow (stream the CoT, then stream an answer)
Def agreed as well. I prefer not having CoT. But we support Claude, and with Claude we cannot opt out of CoT. Even if you insist very hard that you don't want it, it does it.
IMO as long as we're "multi model", we don't really get to pick if we prefer to follow the OpenAI or the Anthropic solution, we need a UI that is resilient to both setups.
We are all in agreement on what we believe is the best outcome from a product standpoint.
Now we have constraints related to each model behavior:
The real problem is therefore Anthropic.
I think we want to find a UI trick to make it OK and lean hard on them to provide more information about this case. Really they just need to re-align their model to emit a token to flag CoT tokens as such before moving to generation or tool use.
Why don't we just ignore Anthropic CoT?
Again @Duncid, we don't know it's CoT until the whole agent message is done streaming. We cannot ignore it.
Well, there is no magic here.
If we don't know how to spot CoT, we can't handle them in UI!
We know they are CoT after. They disappear from the agent message. We need to make that not look too bad.
The requirements are:
We can think of those CoT tokens as some kind of spinner / loading state in the agent message that we are forced to have
Also with Opus with ~always have a <thinking>
hint that this is CoT. But less often for smaller models
Also with Opus with ~always have a hint that this is CoT. But less often for smaller models
Always-ish. We sometimes don't have it. Sometimes it's a different xml tag
Almost never with the smaller models in my XP. And 100% never when it's gpt4 (but rare)
It looks a bit like a gpt4 bug. In this case, we likely want to ignore the generation.
Here's an example run where it happened: https://dust.tt/w/78bda07b39/a/0e9889c787/runs/3243be92381110215723b48a155aa1a6d8f7ea3790d38acb322a3d52bc0b50ac
Right now, we handle it poorly since we show the generation being streamed, and then disappear when the function call message is complete, and then start streaming again with the "real" generation.