Not sure if we are supposed to comment here or not (sorry if this is the wrong place, feel free to delete/move my comment as needed):
Longevity: context length management
I ran into this when I tinkered with recursively splitting up plans using a "compile_plan" command to come up with atomic to-do lists that would eventually end up resembling BIFs: #4107
My idea at the time was to artificially constrain each step to map to a built-in function AND fit into the context window, and if not, to divide & conquer. Each atomic step would then be executed recursively by a sub-agent with few or no dependencies on outer context, resembling a call stack, as mentioned here: https://github.com/Significant-Gravitas/AutoGPT/issues/3933#issuecomment-1538470999
From that perspective, tackling "task management" like this also means managing our context window better, by identifying independent sub-tasks and delegating them to sub-agents: #70
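A rough sketch of that recursive divide & conquer idea, for illustration only; all four helpers are placeholder stubs (in practice each would be backed by an LLM call), not AutoGPT APIs:

```python
def maps_to_builtin(task: str) -> bool:
    return " and " not in task                       # placeholder heuristic

def fits_in_context(task: str) -> bool:
    return len(task) < 200                           # placeholder heuristic

def split_into_subtasks(task: str) -> list[str]:
    return [t.strip() for t in task.split(" and ")]  # placeholder

def run_sub_agent(step: str) -> str:
    return f"done: {step}"                           # placeholder

def compile_plan(task: str) -> list[str]:
    """Recursively split a task until every step maps to a built-in
    function AND fits into the context window."""
    if maps_to_builtin(task) and fits_in_context(task):
        return [task]                                # atomic step
    subtasks = split_into_subtasks(task)
    if subtasks == [task]:                           # cannot split further
        return [task]
    plan: list[str] = []
    for subtask in subtasks:                         # divide & conquer
        plan.extend(compile_plan(subtask))
    return plan

def execute(task: str) -> list[str]:
    # Each atomic step runs in its own sub-agent with little or no
    # dependency on outer context, resembling a call stack.
    return [run_sub_agent(step) for step in compile_plan(task)]
```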
This issue has been superseded by the following roadmap item:
Performance: what makes or breaks it?
Q: What makes a generalist agent such as Auto-GPT perform or fail? A: all of the below, and more.
Task processing ⚙️
Comprehension
First of all, the agent has to understand the task it is given. Otherwise, there is no way it can be executed correctly, regardless of how well the rest of the application works.
Conversion
Once the task is understood, the agent may convert it to a format that is usable in the rest of the program flow. Examples are a high-level plan, a to-do list, or just a clarified task description. (A minimal conversion sketch follows the related issues below.)
#495
#2409
#3593
#3850
#4107
#5077
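For illustration, a minimal sketch of the "to-do list" conversion, assuming only a generic `complete(prompt) -> str` callable for the LLM call rather than any specific API:

```python
import json

CONVERSION_PROMPT = (
    "Convert the following task into a JSON array of short, concrete "
    "to-do items. Respond with JSON only.\n\nTask: {task}"
)

def task_to_todo_list(task: str, complete) -> list[str]:
    """`complete` is any callable that sends a prompt to an LLM and
    returns the completion text."""
    raw = complete(CONVERSION_PROMPT.format(task=task))
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        # The model ignored the format; fall back to treating the
        # completion as a single clarified task description.
        return [raw.strip()]
    if not isinstance(items, list):
        return [raw.strip()]
    return [str(item) for item in items]
```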
Adherence
It is paramount that the agent sticks to the given task and its scope, and does not alter or expand them without the user's involvement, both during setup and over the course of a session.
#789
#934
#4129
#4242
#4619
Self-correction
When the execution of the task does not go according to plan, the agent should recognize this and deal with it appropriately. It is not uncommon for an agent to miss that an action did not have the intended result and to continue executing as if all is fine, which can lead to hallucinations. (A minimal failure-check loop is sketched after the related issue below.)
#4450
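A sketch of such a failure check; the `ActionResult` shape and the `execute`/`replan` callables are assumptions for the example, not AutoGPT internals:

```python
from dataclasses import dataclass

@dataclass
class ActionResult:
    ok: bool
    output: str  # result on success, error description on failure

def run_step(step: str, execute, replan, max_retries: int = 2) -> str:
    """Execute one step; on failure, feed the error back into planning
    instead of continuing as if all is fine."""
    for _ in range(max_retries + 1):
        result = execute(step)
        if result.ok:
            return result.output
        # Surface the failure to the planner; silently continuing here
        # is where hallucinations tend to creep in.
        step = replan(step, result.output)
    raise RuntimeError(f"Step kept failing after retries: {step!r}")
```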
Prompting 💬
There are many factors that influence the efficacy of a prompt, but it should be noted that LLMs are not deterministic, linear or time-invariant: changing one word may have unpredictable and seemingly unrelated effects, and LLMs may return different completions when prompted multiple times with the same prompt.
Any prompt must be evaluated for its average performance, and the system as a whole must be designed to correct for negative deviations.
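As a sketch of what "average performance" means in practice: sample the same prompt several times and score the fraction of acceptable completions. The `complete` and `is_success` callables are placeholders for an LLM call and a task-specific check.

```python
def average_success_rate(prompt: str, complete, is_success, n: int = 20) -> float:
    """Completions are non-deterministic, so a single sample says little;
    score the prompt over n samples instead."""
    successes = sum(1 for _ in range(n) if is_success(complete(prompt)))
    return successes / n

# Comparing two candidate prompts on the same task:
# better = max(prompt_a, prompt_b,
#              key=lambda p: average_success_rate(p, complete, is_success))
```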
For a guide on how to write good prompts for OpenAI's LLMs, see their GPT best practices.
A few basic principles:
Related:
#1166
#1289
#3954
#4053
Longevity: context length management
Whereas LLMs have a limit to their context length, the basic agent concept does not. This calls for solutions to manage the variable-length parts of the prompt, such as the execution history. The simplest approach is to compress and/or truncate the execution history to fit more into the prompt. Another is to use a semantic document store and to select and inject items based on their current relevance. (A minimal truncation sketch follows the related issues below.)
#3536
#5182
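A minimal sketch of the truncation approach, assuming a `count_tokens` helper (e.g. backed by a tokenizer such as tiktoken). Relevance-based selection would replace the recency rule below with an embedding-similarity ranking over the document store.

```python
def fit_history(messages: list[str], count_tokens, budget: int) -> list[str]:
    """Keep the most recent messages that fit within `budget` tokens;
    anything older is dropped (or, better, summarized first)."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):   # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))      # restore chronological order
```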
Tools 🛠️
Suitable tools should be available to the agent
Obviously, without the right tools, an agent won't be able to do the job. With good task processing, it can sometimes get close, though, by using the tools that are available to achieve a partial solution.
The available tools must be suitable for use with an LLM-powered agent
The input, output, and side effects of a tool must be very well defined for the LLM, e.g. in the system prompt or in the output message of an action with side effects. Also, when a tool fails in some way, the error message should allow the agent to understand the issue, so that it can be dealt with appropriately.
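An illustrative command along these lines; the signature style and the `list_files` hint are made up for the example, not AutoGPT's actual command registry:

```python
def read_file(path: str) -> str:
    """read_file(path: str) -> str: returns the full contents of a text
    file. This docstring doubles as the tool description shown to the LLM."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        # An actionable error the agent can reason about, rather than a
        # bare traceback:
        return f"Error: file '{path}' does not exist. Use list_files to see available files."
    except UnicodeDecodeError:
        return f"Error: '{path}' is not a UTF-8 text file and cannot be read as text."
```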
Cost / Speed / Performance 📊
Aside from an agent's capacity to fulfil tasks, its efficiency in terms of time and money should also be considered part of its total performance. This comes down to efficient use of resources, and a proper choice of LLMs for the different internal processes of the agent.
Example considerations:
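One such consideration, sketched below with placeholder model names: route high-volume, low-stakes internal calls to a cheaper model and reserve the expensive one for quality-critical steps.

```python
# Placeholder model names; the per-step mapping, not the names, is the point.
MODEL_FOR_STEP = {
    "plan":      "large-expensive-model",  # quality-critical, runs rarely
    "summarize": "small-cheap-model",      # high-volume, low-stakes
    "classify":  "small-cheap-model",
}

def pick_model(step_kind: str) -> str:
    return MODEL_FOR_STEP.get(step_kind, "small-cheap-model")
```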
Measuring performance 📈
To measure the impact of changes and intended improvements on performance, we use our Benchmark. This benchmark is also used and recognized by various other agent developers (see the README).
Notes:
Example of verified improvement over a number of revisions: