Significant-Gravitas / AutoGPT

AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
https://agpt.co

Auto-GPT Performance 📈 #5190

Closed Pwuts closed 4 months ago

Pwuts commented 1 year ago

Note: This issue is a work in progress. It will be expanded and elaborated further based on advancing insight and questions (so feel free to ask!).

Performance: what makes or breaks it?

Q: What makes a generalist agent such as Auto-GPT perform or fail?
A: All of the below, and more.

Task processing ⚙️

Comprehension

First of all, the agent has to understand the task it is given. Otherwise, the task cannot be executed correctly, regardless of how well the rest of the application works.

Conversion

Once the task is understood, the agent may convert it to a format that is usable in the rest of the program flow. Examples are a high-level plan, a to-do list, or just a clarified task description.
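As an illustration of such a conversion step, the sketch below turns a free-form task into a structured to-do list by parsing a numbered-steps completion. The `Plan` structure, the prompt wording, and the `llm` callable are all illustrative assumptions, not Auto-GPT's actual internals:

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """A clarified task plus an ordered to-do list (hypothetical structure)."""
    task: str
    steps: list[str] = field(default_factory=list)


def convert_task(task: str, llm) -> Plan:
    """Ask the model to break the task into numbered steps and parse them.

    `llm` is any callable that takes a prompt string and returns a
    completion string (assumption for the sketch).
    """
    completion = llm(f"Break this task into short numbered steps:\n{task}")
    steps = []
    for line in completion.splitlines():
        line = line.strip()
        # Keep only lines that look like "1. do something"
        if line and line[0].isdigit() and "." in line:
            steps.append(line.split(".", 1)[1].strip())
    return Plan(task=task, steps=steps)
```

The point is that a structured `Plan` is easier to track, check off, and revise downstream than a raw task description.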

Adherence

It is paramount that the agent sticks to the given task and its scope, and does not alter or expand them without the user's involvement, both during setup and over the course of a session.

Self-correction

When the execution of the task does not go according to plan, this should be recognized by the agent and dealt with appropriately. It is not uncommon for agents to fail to recognize that an action did not have the intended result and to continue executing as if all is fine, which can lead to hallucinations.
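One minimal shape for this kind of self-correction is to verify each action's result before moving on, retrying (or escalating back to the planner) on failure instead of silently continuing. All names here are illustrative placeholders, not Auto-GPT's actual API:

```python
def execute_with_check(command, expect_ok, max_retries=2):
    """Run a command, verify its result, and retry on failure.

    `command` is a zero-argument callable performing the action;
    `expect_ok` is a predicate deciding whether the result matches
    expectations (both are sketch placeholders).
    """
    for _attempt in range(max_retries + 1):
        result = command()
        if expect_ok(result):
            return result
    # Failing loudly here lets an outer planning loop re-plan,
    # rather than hallucinating success.
    raise RuntimeError("action failed after retries; escalate to planner")
```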

Prompting 💬

There are many factors that influence the efficacy of a prompt, but it should be noted that LLMs are not deterministic, linear or time-invariant: changing one word may have unpredictable and seemingly unrelated effects, and LLMs may return different completions when prompted multiple times with the same prompt.

Any prompt must be evaluated for its average performance, and the system as a whole must be designed to correct for negative deviations.
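Because single completions are noisy, one way to compare prompts is to sample each one repeatedly and look at the score distribution rather than a single run. A minimal sketch, where `llm` and `score` are illustrative callables (not part of Auto-GPT):

```python
import statistics


def evaluate_prompt(prompt, llm, score, n=10):
    """Sample the same prompt n times and report mean and spread of a
    task-specific score, since LLM output varies between runs.

    `llm` maps a prompt to a completion; `score` maps a completion to a
    float in [0, 1] (both assumptions for the sketch).
    """
    scores = [score(llm(prompt)) for _ in range(n)]
    return statistics.mean(scores), statistics.pstdev(scores)
```

A high standard deviation is itself a signal: the system around such a prompt needs to be designed to catch and correct the bad completions.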

For a guide on how to write good prompts for OpenAI's LLMs, see their GPT best practices.

A few basic principles:

Related:

Longevity: context length management

Whereas LLMs have a limit to their context length, the basic agent concept does not. This calls for solutions to manage the variable-length parts of the prompt, such as the execution history. The simplest approach is to compress and/or truncate the execution history in order to fit more in the prompt. Another is to use a semantic document store and to select and inject items based on their current relevance.
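The truncation approach mentioned above can be sketched as keeping only the most recent history entries that fit a token budget. Token counting is approximated by word count here (an assumption); a real agent would use the model's tokenizer:

```python
def fit_history(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages that fit within max_tokens.

    Walks the history newest-first so that recent context survives,
    then restores chronological order. `count_tokens` defaults to a
    crude word count (sketch assumption).
    """
    kept, total = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if total + cost > max_tokens:
            break  # older messages no longer fit
        kept.append(msg)
        total += cost
    return list(reversed(kept))
```

The semantic-store alternative replaces the "newest-first" rule with a relevance ranking, but the budgeting logic is the same.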

Tools 🛠️

Cost / Speed / Performance 📊

Aside from an agent's capacity to fulfil tasks, its efficiency in terms of time and money should also be considered as a part of its total performance. This comes down to efficient use of resources, and proper choice of LLMs to use for different internal processes of the agent.
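One concrete form of this is routing each internal step to an appropriately sized model. The model names and per-token prices below are hypothetical placeholders; real prices vary by model and over time:

```python
# Hypothetical model catalogue; names and costs are made up for the sketch.
MODELS = {
    "fast":  {"name": "small-model", "cost_per_1k_tokens": 0.0005},
    "smart": {"name": "large-model", "cost_per_1k_tokens": 0.03},
}


def pick_model(step_kind):
    """Reserve the expensive model for planning/reasoning steps and route
    cheap internal steps (summarizing, formatting) to a small one."""
    tier = "smart" if step_kind in {"plan", "reason"} else "fast"
    return MODELS[tier]
```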

Example considerations:

Measuring performance 📈

To measure the impact of changes and intended improvements on performance, we use our Benchmark. This benchmark is also used and recognized by various other agent developers (see the README).

Notes:

Example of verified improvement over a number of revisions: [benchmark results image]

Boostrix commented 1 year ago

Not sure if we are supposed to comment here or not (sorry if this is the wrong place, feel free to delete/move my comment as needed):

Longevity: context length management

I ran into this when I tinkered with recursively splitting up plans using a "compile_plan" command to come up with atomic todo lists that would eventually end up resembling BIFs #4107

My idea at the time was to artificially constrain each step so it would be mappable to a built-in function AND fit into the context window, and if not, divide & conquer. Each step would then be recursively executed by a sub-agent with few or no dependencies on outer context, resembling a call stack, as mentioned here: https://github.com/Significant-Gravitas/AutoGPT/issues/3933#issuecomment-1538470999

From that perspective, tackling "task management" like this also means managing our context window better, by identifying independent sub-tasks to delegate those to sub-agents: #70
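The recursive divide & conquer idea described above can be sketched roughly like this, where `fits_context`, `split`, and `execute` are illustrative placeholders for the context check, the plan-splitting step, and the sub-agent run:

```python
def solve(task, fits_context, split, execute):
    """Recursively split a task until each piece fits the context window,
    then run each piece with a sub-agent; the recursion mirrors a call
    stack (sketch; all three callables are hypothetical placeholders).
    """
    if fits_context(task):
        return [execute(task)]
    results = []
    for subtask in split(task):
        results.extend(solve(subtask, fits_context, split, execute))
    return results
```

Because each leaf task is independent of outer context, the sub-agent runs can also be delegated or parallelized.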

Pwuts commented 4 months ago

This issue has been superseded by the following roadmap item: