Significant-Gravitas / AutoGPT

AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
https://agpt.co

Auto-GPT Performance 📈 #5190

Closed Pwuts closed 4 months ago

Pwuts commented 1 year ago

Note: This issue is a work in progress. It will be expanded and elaborated further based on advancing insight and questions (so feel free to ask!).

Performance: what makes or breaks it?

Q: What makes a generalist agent such as Auto-GPT perform or fail?
A: All of the below, and more.

Task processing ⚙️

Comprehension

First of all, the agent has to understand the task it is given. Otherwise, the task cannot be executed correctly, regardless of how well the rest of the application works.

Conversion

Once the task is understood, the agent may convert it to a format that is usable in the rest of the program flow. Examples are a high-level plan, a to-do list, or just a clarified task description.
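As an illustration of such a conversion step, the sketch below turns a free-form task into a structured to-do list by parsing a numbered-steps completion. The `Plan` structure, the prompt wording, and the `llm` callable are all illustrative assumptions, not Auto-GPT's actual internals:

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """A clarified task plus an ordered to-do list (hypothetical structure)."""
    task: str
    steps: list[str] = field(default_factory=list)


def convert_task(task: str, llm) -> Plan:
    """Ask the model to break the task into numbered steps and parse them.

    `llm` is any callable that takes a prompt string and returns a
    completion string (assumption for the sketch).
    """
    completion = llm(f"Break this task into short numbered steps:\n{task}")
    steps = []
    for line in completion.splitlines():
        line = line.strip()
        # Keep only lines that look like "1. do something"
        if line and line[0].isdigit() and "." in line:
            steps.append(line.split(".", 1)[1].strip())
    return Plan(task=task, steps=steps)
```

The point is that a structured `Plan` is easier to track, check off, and revise downstream than a raw task description.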

Adherence

It is paramount that the agent sticks to the given task and its scope, and does not alter or expand them without the user's involvement, both during setup and over the course of a session.

Self-correction

When the execution of the task does not go according to plan, this should be recognized by the agent and dealt with appropriately. It is not uncommon for agents to fail to recognize that an action did not have the intended result and to continue executing as if all is fine, which can lead to hallucinations.
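One minimal shape for this kind of self-correction is to verify each action's result before moving on, retrying (or escalating back to the planner) on failure instead of silently continuing. All names here are illustrative placeholders, not Auto-GPT's actual API:

```python
def execute_with_check(command, expect_ok, max_retries=2):
    """Run a command, verify its result, and retry on failure.

    `command` is a zero-argument callable performing the action;
    `expect_ok` is a predicate deciding whether the result matches
    expectations (both are sketch placeholders).
    """
    for _attempt in range(max_retries + 1):
        result = command()
        if expect_ok(result):
            return result
    # Failing loudly here lets an outer planning loop re-plan,
    # rather than hallucinating success.
    raise RuntimeError("action failed after retries; escalate to planner")
```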

Prompting 💬

There are many factors that influence the efficacy of a prompt, but it should be noted that LLMs are not deterministic, linear or time-invariant: changing one word may have unpredictable and seemingly unrelated effects, and LLMs may return different completions when prompted multiple times with the same prompt.

Any prompt must be evaluated for its average performance, and the system as a whole must be designed to correct for negative deviations.
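Because single completions are noisy, one way to compare prompts is to sample each one repeatedly and look at the score distribution rather than a single run. A minimal sketch, where `llm` and `score` are illustrative callables (not part of Auto-GPT):

```python
import statistics


def evaluate_prompt(prompt, llm, score, n=10):
    """Sample the same prompt n times and report mean and spread of a
    task-specific score, since LLM output varies between runs.

    `llm` maps a prompt to a completion; `score` maps a completion to a
    float in [0, 1] (both assumptions for the sketch).
    """
    scores = [score(llm(prompt)) for _ in range(n)]
    return statistics.mean(scores), statistics.pstdev(scores)
```

A high standard deviation is itself a signal: the system around such a prompt needs to be designed to catch and correct the bad completions.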

For a guide on how to write good prompts for OpenAI's LLMs, see their GPT best practices.

A few basic principles:

Related:

Longevity: context length management

Whereas LLMs have a limit to their context length, the basic agent concept does not. This calls for solutions to manage the variable-length parts of the prompt, such as the execution history. The simplest approach is to compress and/or truncate the execution history in order to fit more in the prompt. Another is to use a semantic document store and to select and inject items based on their current relevance.
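The truncation approach mentioned above can be sketched as keeping only the most recent history entries that fit a token budget. Token counting is approximated by word count here (an assumption); a real agent would use the model's tokenizer:

```python
def fit_history(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages that fit within max_tokens.

    Walks the history newest-first so that recent context survives,
    then restores chronological order. `count_tokens` defaults to a
    crude word count (sketch assumption).
    """
    kept, total = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if total + cost > max_tokens:
            break  # older messages no longer fit
        kept.append(msg)
        total += cost
    return list(reversed(kept))
```

The semantic-store alternative replaces the "newest-first" rule with a relevance ranking, but the budgeting logic is the same.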

Tools 🛠️

Cost / Speed / Performance 📊

Aside from an agent's capacity to fulfil tasks, its efficiency in terms of time and money should also be considered as a part of its total performance. This comes down to efficient use of resources, and proper choice of LLMs to use for different internal processes of the agent.
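One concrete form of this is routing each internal step to an appropriately sized model. The model names and per-token prices below are hypothetical placeholders; real prices vary by model and over time:

```python
# Hypothetical model catalogue; names and costs are made up for the sketch.
MODELS = {
    "fast":  {"name": "small-model", "cost_per_1k_tokens": 0.0005},
    "smart": {"name": "large-model", "cost_per_1k_tokens": 0.03},
}


def pick_model(step_kind):
    """Reserve the expensive model for planning/reasoning steps and route
    cheap internal steps (summarizing, formatting) to a small one."""
    tier = "smart" if step_kind in {"plan", "reason"} else "fast"
    return MODELS[tier]
```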

Example considerations:

Measuring performance 📈

To measure the impact of changes and intended improvements on performance, we use our Benchmark. This benchmark is also used and recognized by various other agent developers (see the README).

Notes:

Example of verified improvement over a number of revisions: [benchmark results image]

Boostrix commented 1 year ago

Not sure if we are supposed to comment here or not (sorry if this is the wrong place, feel free to delete/move my comment as needed):

Longevity: context length management

I ran into this when I tinkered with recursively splitting up plans using a "compile_plan" command to come up with atomic todo lists that would eventually end up resembling BIFs #4107

My idea at the time was to artificially constrain each step so it would be mappable to a built-in function AND fit into the context window, and if not, divide & conquer. Each step would then be recursively executed by a sub-agent with few or no dependencies on outer context, resembling a call stack, as mentioned here: https://github.com/Significant-Gravitas/AutoGPT/issues/3933#issuecomment-1538470999

From that perspective, tackling "task management" like this also means managing our context window better, by identifying independent sub-tasks to delegate those to sub-agents: #70
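The recursive divide & conquer idea described above can be sketched roughly like this, where `fits_context`, `split`, and `execute` are illustrative placeholders for the context check, the plan-splitting step, and the sub-agent run:

```python
def solve(task, fits_context, split, execute):
    """Recursively split a task until each piece fits the context window,
    then run each piece with a sub-agent; the recursion mirrors a call
    stack (sketch; all three callables are hypothetical placeholders).
    """
    if fits_context(task):
        return [execute(task)]
    results = []
    for subtask in split(task):
        results.extend(solve(subtask, fits_context, split, execute))
    return results
```

Because each leaf task is independent of outer context, the sub-agent runs can also be delegated or parallelized.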

Pwuts commented 4 months ago

This issue has been superseded by the following roadmap item: