Closed rezzie-rich closed 1 month ago
Is it possible to create a MemGPT feature and make it available to all the agents, rather than having a separate agent like the one discussed in #530?
Can you elaborate on what this would entail?
Currently, there is a problem with the LLM's context window. Models with small context windows stand no chance against models with 128k+ windows. There is already a short- and long-term memory system integrated. However, I don't think that memory is encoded, and despite the current memory structure, the context window is still an issue. I did see an attempt to add a separate MemGPT agent, as discussed in #530. It would be great if the MemGPT features could be integrated into all the agents so that the context window of an LLM becomes irrelevant.
Theoretically, output generated by a 4k-window model can also be stored in short-term memory, so when the model continues generating a response after hitting its max token limit, it can pick up where it left off without getting lost.
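To make that concrete, here is a minimal sketch of the "pick up where it left off" loop, assuming a generic `llm` callable that returns the generated text plus a finish reason (illustrative only, not OpenDevin's or MemGPT's actual API):

```python
from typing import Callable, List, Tuple

# Hypothetical LLM call: takes a message list, returns (text, finish_reason).
LLMCall = Callable[[List[dict]], Tuple[str, str]]

def generate_with_continuation(llm: LLMCall, messages: List[dict], max_rounds: int = 5) -> str:
    """Keep asking the model to continue until it stops for a reason other than 'length'."""
    short_term_memory: List[str] = []   # partial outputs accumulate here
    convo = list(messages)
    for _ in range(max_rounds):
        text, finish_reason = llm(convo)
        short_term_memory.append(text)
        if finish_reason != "length":   # the model finished on its own
            break
        # Feed the partial output back so the model can resume where it stopped.
        convo.append({"role": "assistant", "content": text})
        convo.append({"role": "user", "content": "Continue exactly where you left off."})
    return "".join(short_term_memory)
```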
Yeah, I think this is a better way to implement MemGPT. I think I'll actually close the previous issue because having a "MemGPT Agent" doesn't really make sense.
I would like to note that we already have a PR open for MemGPT-like functionality, the memory condenser.
However, MemGPT was built specifically for this purpose, so it may be worth incorporating MemGPT as an alternative version of the memory condenser that could be swapped in for #2021. I think the architecture could be largely similar.
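As a sketch of what "swapping in" an alternative condenser could look like (the class and method names below are hypothetical, not the actual interfaces from #2021 or MemGPT):

```python
from abc import ABC, abstractmethod
from typing import List

class Condenser(ABC):
    """Anything that can shrink the message history to fit a budget."""

    @abstractmethod
    def condense(self, messages: List[str], budget_chars: int) -> List[str]:
        ...

class TruncatingCondenser(Condenser):
    """Toy baseline: drop the oldest messages until the history fits."""

    def condense(self, messages: List[str], budget_chars: int) -> List[str]:
        kept: List[str] = []
        used = 0
        for msg in reversed(messages):           # keep the most recent messages first
            if used + len(msg) > budget_chars:
                break
            kept.append(msg)
            used += len(msg)
        return list(reversed(kept))

class MemGPTStyleCondenser(Condenser):
    """Sketch: page old messages out to 'archival' storage instead of dropping them."""

    def __init__(self) -> None:
        self.archive: List[str] = []             # stands in for MemGPT's external memory

    def condense(self, messages: List[str], budget_chars: int) -> List[str]:
        kept = TruncatingCondenser().condense(messages, budget_chars)
        self.archive.extend(m for m in messages if m not in kept)
        return kept
```

The agent would only ever see the `Condenser` interface, so either backend could be benchmarked against the other.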
Currently, MemGPT is integrated into AutoGen to overcome the context window limitation.
https://microsoft.github.io/autogen/docs/ecosystem/memgpt/
@xingyaoww I know there is already a PR open, with over a month of contributions, for better memory management. However, MemGPT is an overall better implementation for the following reasons:
1- If I'm correct, #2021 only targets CodeActAgent, but MemGPT can be used as the default memory for all the agents, which will improve overall quality from planning to execution.
2- Currently #2021 is being tested with a 32k context window, but MemGPT can allow even 4k and 8k window models to be as capable as 128k window models. Two of the leading models, Llama 3 and Phi-3, have 8k and 4k windows, and those models seem to break or perform worse when their context window is extended.
3- MemGPT can make the context window irrelevant.
Hey @rezzie-rich ! Just to clarify, I think we'd be very excited to have a contribution implementing MemGPT in OpenDevin, and we would like to benchmark it against other approaches. I think #2021 actually creates some general-purpose infrastructure for short-term and long-term memory, and MemGPT could be substituted in for the memory condenser that is in there.
@neubig awesome! Eagerly waiting to see the results after the merge. I wish I could contribute some heavy lifting as well. However, I did come across this article about an agent framework that uses an additional "working memory" besides the short- and long-term memory. It might be worth checking out.
https://towardsdatascience.com/ai-agent-capabilities-engineering-34c7785f413e
Big fan of the team 🫶 lol
Hey @neubig ! I have started working on this issue.
@khushvind @xingyaoww @enyst
If there were a way to flag certain messages as important, so that when the memory is condensed and given to the agents they know what to prioritize, that would be great, serving as a guard rail.
Oftentimes, when working with an LLM on a very long-context task, at some point it starts producing responses that disregard some crucial context. Having flagged messages would help it follow through, like bullet points.
Yeah, we should also add that. I think giving more weight to user inputs can be a good start. The current implementation, https://github.com/OpenDevin/OpenDevin/pull/2937, gives equal importance to all the messages, and it can lose the context of the actual task given by the user after a few summarizations.
I'm not very familiar with it; is there any prior work or study on methods to classify the importance of agent messages?
One way I can think of doing it is through a dedicated agent. Flagging a message as important requires an understanding of the context, and LLMs are good at identifying key points. Maybe a particular agent that analyzes the event stream/messages and flags them before condensation could work. Otherwise, it's kind of like a vector DB, where information is ranked and stored in a high-dimensional space. But a vector DB is mostly statistics; what we need is a deeper understanding of the context and prioritization.
Or a parallel priority stream (an event stream that only stores the weighted messages), which would also require an agent to analyze the messages.
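A rough sketch of that flagging pass feeding a parallel priority stream (all names here are hypothetical, and the real classifier would be an LLM prompt rather than the trivial keyword check used below):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Message:
    role: str
    content: str
    important: bool = False  # set by the flagging pass before any condensation

@dataclass
class MemoryStreams:
    event_stream: List[Message] = field(default_factory=list)
    priority_stream: List[Message] = field(default_factory=list)  # flagged messages only

def flag_and_route(msg: Message, is_important: Callable[[str], bool], streams: MemoryStreams) -> None:
    """Run the 'flagging agent' on each new message before it can ever be condensed."""
    msg.important = is_important(msg.content)
    streams.event_stream.append(msg)
    if msg.important:
        streams.priority_stream.append(msg)

def toy_classifier(text: str) -> bool:
    # Stand-in for an LLM-based importance check.
    return any(keyword in text.lower() for keyword in ("must", "do not", "requirement"))
```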
This is a very good point. One thing we were trying was to keep user messages for a longer time, to make sure we don't lose the task with whatever the user may have specified. That was rather rough, though, and it doesn't cover agent messages.
The current implementation has a `condensable` flag on messages, which would in theory allow us to decide to keep some of them, but at the moment it doesn't choose to keep anything other than the system message and the like.
I've done some work on prompts that would make the LLM, during summarization, choose in a more appropriate manner, that is, weighing the importance of messages as a function of both time and content, but it wasn't systematic, it wasn't clearly better, and in any case it was for the old agent (not CodeAct). Maybe you're right that it would work better to separate the two: have an LLM make a pass and assign weights, then perform the summarization. Or maybe a better step-by-step prompt.
I think right now we just need to make sure that the summaries are detailed enough that they keep the information, such that the agent is guided forward helpfully. I have yet to test @khushvind's prompts on the SWE-bench tasks, but I'll get to that. If you wish to try it, it's PR #2937.
If we could have a separate 'memory-monitor agent' that analyzes the current event stream and messages (maybe both the user input and the system messages, if those include new important findings related to the user's request) before they are summarized, and then highlights and stores the key context in a parallel priority stream (a separate event stream of prioritized messages), the priority stream could be used as a guard rail alongside the event stream to guide the agents. In this case, agents would access two memories: 1) the event stream (current memory) for task completion, and 2) the priority stream to validate the task's accuracy and appropriateness.
Having multiple streams of messages can work better for long-context workloads. Each memory stream serves a different purpose, such as validating the outcome versus regular CoT. This way, if the current memory stream loses or overlooks some crucial context, the separate memory stream can inject it back into the workflow.
There could also be another memory stream that stores the routes or options to avoid, i.e., the faulty approaches. This could help prevent loops. However, it may require some additional safeguarding to avoid storing valid approaches and causing complications.
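As a sketch, assembling the agent's prompt from several streams could look something like this, assuming the priority and avoid streams stay small enough to always fit (all names are made up for illustration):

```python
from typing import List

def build_prompt(event_stream: List[str], priority_stream: List[str],
                 avoid_stream: List[str], recent_k: int = 20) -> str:
    """Assemble the agent's context from several memory streams.

    The priority and avoid streams are small, so they are always included;
    the main event stream contributes only its most recent entries.
    """
    parts: List[str] = []
    if priority_stream:
        parts.append("Key context to respect:\n" + "\n".join(f"- {m}" for m in priority_stream))
    if avoid_stream:
        parts.append("Approaches already tried and known to fail:\n" +
                     "\n".join(f"- {m}" for m in avoid_stream))
    parts.append("Recent events:\n" + "\n".join(event_stream[-recent_k:]))
    return "\n\n".join(parts)
```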
It's been a while since I was thinking deeply in this space, but I remember, ages back, seeing some stuff described by Dave Shapiro around memory: using a sort of tree-like structure, where it abstracted to higher-level summaries as it went up.
Trying to find the blog/video/whatever it was, but currently failing. I believe the idea was to use the higher-level summaries to get the RAG-like aspect into the right 'area', and then 'drill down' for more specific memory context if needed. It might have been this one on 'Temporal Hierarchical Memories':
There was also some stuff I remember him describing about 'sparse priming representations', which from what I remember was basically about compressing the prompt/etc. into a form that the LLM can still recover the same context from, but without using as many tokens; essentially 'getting it into the right state of mind', so to speak.
I wonder if these concepts might be useful here, probably alongside the 'multiple streams of thought' and 'memory curator' aspects described above (at face value, they sound like a pretty good approach to me).
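A toy sketch of that tree-like, drill-down structure (purely illustrative; `relevant` would be an embedding or LLM relevance check in practice):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class MemoryNode:
    """One level of the tree: a summary of everything stored beneath it."""
    summary: str
    children: List["MemoryNode"] = field(default_factory=list)
    raw_text: Optional[str] = None  # only leaf nodes keep the full original text

def drill_down(node: MemoryNode, relevant: Callable[[str], bool]) -> List[str]:
    """Walk from coarse summaries down to the specific memories that matter."""
    if node.raw_text is not None:       # reached a leaf: return the detailed memory
        return [node.raw_text]
    hits: List[str] = []
    for child in node.children:
        if relevant(child.summary):     # only descend into branches that look relevant
            hits.extend(drill_down(child, relevant))
    return hits
```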
I don't think this was the blog I was looking for; but it's also about memory management:
And this one might be interesting/useful too:
Regarding the 'sparse priming representations' part: this describes a vector database, where texts are converted into numbers and stored in a relational high-dimensional space.
I think MemGPT is a great boilerplate for memory management, since that project was built only for that purpose. Since @enyst already worked on a 'memory monitor'/'memory curator' agent, with some modification it should integrate easily, as it would run standalone with MemGPT instead of CodeAct (I guess). @khushvind, I'm hoping adding multiple memory streams wouldn't be that complicated, since the project already uses a memory stream. However, regardless of the approach, highlighting the crucial context and the faulty approaches will increase the output quality and needs to be included.
Can the event stream be embedded? If there's no loss of content, more information could be fed to the agents within the same context window.
We did that in the past, in the first agent (monologue): we were embedding every event as soon as it happened, and prompting the LLM so it knew how to retrieve it if necessary. Currently there is no such mechanism for CodeAct.
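For reference, the general shape of "embed every event as it happens, retrieve the most similar ones later" is roughly the following; the hashing "embedding" is just a self-contained stand-in for a real embedding model, and none of this is the monologue agent's actual code:

```python
import hashlib
import math
from typing import List, Tuple

def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Stand-in for a real embedding model: hash words into a fixed-size vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class EventIndex:
    """Embed every event as it happens; the original text is stored unchanged."""

    def __init__(self) -> None:
        self._store: List[Tuple[List[float], str]] = []

    def add(self, event_text: str) -> None:
        self._store.append((toy_embed(event_text), event_text))

    def search(self, query: str, k: int = 3) -> List[str]:
        q = toy_embed(query)
        scored = sorted(self._store,
                        key=lambda item: -sum(a * b for a, b in zip(q, item[0])))
        return [text for _, text in scored[:k]]
```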
How did it work? Or did you never test the monologue agent without embedding?
Main issue with a single memory stream:
- It contains all the memories in one place. That's good for CoT, but it risks exceeding the context window, and with summarization key context can be lost or de-emphasized.

Solution:
- Multiple memory streams with dedicated purposes can avoid losing key context by storing it separately. These additional memory streams can be used through recall to recalibrate the agent's actions. Since these additional streams store only a limited set of memories, they will be much smaller than the main memory stream and can be included in the context window as additional guidance.
Added some slight explanation. @khushvind
@enyst, if embedding in the monologue agent didn't result in a loss of context, it might be beneficial to add that alongside MemGPT.
I don't think the monologue agent has been tested without embeddings, but FWIW there's no reason it would result in a loss of context (it just embedded things in addition to everything else). Instead, it's possible it didn't solve losses from summarization well enough. FWIW, in my opinion it wasn't making a significant difference in performance, but it helped with user experience: longer sessions!
Embedding helps the AI understand natural language better. I wonder: if the context were embedded first and then summarized, would that increase the quality of the summarized context?
NVM, embedding first can obscure some of the finer details and nuances of the original text, making the subsequent summarization less precise. So if embedding is done, it should be done after summarizing, and the performance needs to be tested with and without embedding.
I don't follow, sorry; I think there is a misunderstanding here. Embedding text doesn't change the text, it just creates a vector, which is then stored in a vector database to be retrieved later.
Maybe I should have used a different agent; tutorme just dragged it out for no reason 😅
@rezzie-rich What the LLM seems to be saying is a non-issue here, don't worry about it. 😄
You may want to keep in mind that LLMs are trained to 1) say the words that best follow the words of the user, and 2) give the user what they want. If you prompt it appropriately, it'll happily answer the opposite. 😅
Back to the issue at hand, we will add both mechanisms eventually. Please feel free to test the linked PR when it's stable, to see how it works for you, your feedback would be appreciated!
That LLMs "give the user what they want" and will "happily answer the opposite" is exactly why I'm emphasizing contextual understanding. It can only answer well when it knows what it needs to know.
Embedding loses the subtle meanings in natural language, like metaphors and references. Maybe not for system messages, but for user inputs the agents must understand even the subtle expressions. It will mostly impact the browser during RAG.
Summarizing already poses the risk of losing context; embedding before summarizing just makes that risk even higher.
Contextual embeddings may capture more than regular word embeddings, but they still risk missing the subtle nuances.
The LLM is instruction-tuned to better understand natural language and its subtleties. Summarizing using the model's ability to understand language will capture the meaning closest to the original text. After summarization, the result can be contextually embedded, which still needs to be tested with and without embedding.
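The ordering argued for here (summarize with the LLM first, embed second, keep the original around) could be sketched like this, with `summarize` and `embed` as hypothetical callables rather than anything already in the codebase:

```python
from typing import Callable, Dict, List

Summarize = Callable[[str], str]        # an LLM summarization call
Embed = Callable[[str], List[float]]    # an embedding model call

def archive_chunk(chunk: str, summarize: Summarize, embed: Embed,
                  store: List[Dict]) -> None:
    """Summarize first (language-level compression), embed the summary second.

    The original chunk is kept alongside the summary, so nothing is lost if we
    ever need to drill back down to the full text.
    """
    summary = summarize(chunk)
    vector = embed(summary)
    store.append({"vector": vector, "summary": summary, "original": chunk})
```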
RE: https://github.com/OpenDevin/OpenDevin/issues/2487#issuecomment-2254833846
@rezzie-rich That's not what the 'Sparse Priming Representation' stuff is about. It's more about abstracting/summarizing the key points from the supplied text, while removing any 'irrelevant boilerplate' / wasted tokens / etc. You can see the methodology in the prompts:
Obviously that may still not be useful/relevant here, but I want to make sure we're at least not discounting it based on false assumptions.
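For anyone skimming, an SPR-style compression prompt has roughly the following shape; this is a loose paraphrase for illustration, not the actual prompt from the linked repository:

```python
# Loose paraphrase of an SPR-style system prompt (hypothetical wording).
SPR_COMPRESS_PROMPT = (
    "You are a Sparse Priming Representation writer. Distill the following text "
    "into a short list of assertions, associations, and concepts, written for a "
    "future LLM to reconstruct the original meaning from, not for a human reader. "
    "Omit boilerplate and filler; keep only what is needed to prime the model."
)

def spr_compress(text: str, llm) -> str:
    """`llm` is a hypothetical callable taking (system_prompt, user_text) -> str."""
    return llm(SPR_COMPRESS_PROMPT, text)
```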
@enyst there may have been a misunderstanding on my side regarding embedding. I overestimated the capabilities of GPT-4o and Sonnet 3.5 when it comes to finding the needle in long-context tasks, lol. It turned out worse than I expected. The following articles give some good insight into this.
Though the article talks about non-contextual word embeddings like Word2Vec and GloVe, the best approach for embedding, IMO, would be contextual embedding models like ELMo, BERT, and RoBERTa, which embed not only the word but also its surrounding context.
Previously, I experimented with Phi-3-medium as a suitable model for local OD. However, the 128k variant performs worse than the 4k one, and 4k is just too small for the task. I'm currently looking into InternLM2.5-chat, which has a 1M context window version. The 1M version actually performs better than GPT-4o 128k (the needle benchmark is depressing for GPT), and InternLM has a near-100% score up to 200k, which is kind of unlike any other model, so I'm trying to train a specialized model with a 200k limit. I hope that once it's ready, it can be tested with OD.
The experiment in the above link also uses 'comprehender' and 'prompt engineering' steps like the ones mentioned in issue #3151. I believe it's worth looking into and adapting some form of it.
@enyst @khushvind
Correct me if I'm wrong. Currently, the summarization is done with a fixed number of words/tokens, like a 200-word limit. However, instead of fixed values, IMO it should use a percentage of the model's context window (say, somewhere around 25-80%). That way, whether a 32k, 64k, 128k, or even 1M context window LLM is used, the memory management is based on a fraction of that model's window, making it dynamic.
I think there is already an effort to include max output tokens as a variable. IMO, that is great and should be used everywhere. If the agent knows the limit beforehand, it can modularize the task more accurately. So instead of using fixed values for the summarized context and similar settings, using a percentage of the current model's limits will make memory management dynamic and allow maximum performance.
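A small sketch of deriving the budgets from the model's own limits rather than hard-coding them (the fraction and field names are arbitrary examples, not existing config options):

```python
def memory_budgets(context_window: int,
                   max_output_tokens: int,
                   summary_fraction: float = 0.25) -> dict:
    """Derive token budgets from the model's own limits instead of fixed numbers.

    With a 4k model the summary target is a few hundred tokens; with a 128k model
    it scales up automatically, so the same condenser settings work across models.
    """
    input_budget = context_window - max_output_tokens
    return {
        "input_budget": input_budget,
        "summary_target": int(input_budget * summary_fraction),
        "recent_history": int(input_budget * (1.0 - summary_fraction)),
    }

# Example: a 32k-context model that reserves 4k tokens for output.
print(memory_budgets(context_window=32_000, max_output_tokens=4_000))
```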
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been stalled for over 30 days with no activity.