Examples for benchmarking

p-i- commented 1 year ago

Duplicates

[X] I have searched the existing issues

Summary 💡

In order to improve AutoGPT, we need examples of simple tasks which it should be able to ace but DOESN'T.

If you've found a good candidate task, please post the contents of your ai_settings.yaml in a comment.

Examples 🌈

ai_goals:
- Get one PR number for the Auto-GPT project that is conflict-free and can be merged
ai_name: Manager-GPT
ai_role: an AI designed to look for GitHub pull requests that are free of conflicts

With these settings, the agent gets lost doing Google searches.

Motivation 🔦

If we have decent benchmarks we can figure out where the technology is weakest, and apply focus at the key points.

Boostrix commented 1 year ago

As I mentioned elsewhere, I firmly believe the project should consider adding a new directory to the repository where all such ai_settings.yaml files could be committed and maintained (to have history, etc.). The point is that any sort of AI inevitably needs some form of benchmark and training data.

Thus, if the ~120k people currently using this project shared some of their dysfunctional ai_settings.yaml files, this benchmark suite could grow rapidly over a couple of weeks. It would then be a piece of cake to run the suite every once in a while to see which improvements to AutoGPT help the agent complete more tasks, or complete them more correctly.

Thus, please consider having a benchmark suite of YAML files: ideally, one category of files that are known to work well (for regression testing purposes) and another for YAML files that "break" Auto-GPT (endless loops, repeated actions, etc.).

Once this is in place, this could help improve the project rather rapidly.
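To make the two-category idea concrete, here is a minimal sketch of how such a suite could be discovered and scored. The directory names `known_good` and `known_bad`, the `BenchmarkCase` type, and the `run_case` callback are all hypothetical and not part of the Auto-GPT repository; `run_case` stands in for whatever actually launches the agent with a given settings file and decides whether the task succeeded.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical layout (an assumption, not an existing Auto-GPT directory):
#   <root>/known_good/*.yaml  -> settings files expected to succeed
#   <root>/known_bad/*.yaml   -> settings files that currently break the agent
CATEGORIES = ("known_good", "known_bad")


@dataclass(frozen=True)
class BenchmarkCase:
    category: str
    path: Path


def discover_cases(root: Path) -> List[BenchmarkCase]:
    """Collect every .yaml file under the two category directories."""
    cases = []
    for category in CATEGORIES:
        for path in sorted((root / category).glob("*.yaml")):
            cases.append(BenchmarkCase(category, path))
    return cases


def run_suite(
    cases: List[BenchmarkCase],
    run_case: Callable[[BenchmarkCase], bool],
) -> Dict[str, List[str]]:
    """Run each case and report only the surprises.

    A known-good case that fails is a regression; a known-bad case
    that passes has been fixed and could be moved to known_good.
    """
    report: Dict[str, List[str]] = {"regressions": [], "fixed": []}
    for case in cases:
        passed = run_case(case)
        if case.category == "known_good" and not passed:
            report["regressions"].append(case.path.name)
        elif case.category == "known_bad" and passed:
            report["fixed"].append(case.path.name)
    return report
```

Keeping the runner as a callback means the same bookkeeping works whether a case is executed against the real agent or against a stub in a unit test.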

For future reference:

https://github.com/Significant-Gravitas/Auto-GPT/issues/15#issuecomment-1498370501

Basically, this whole GitHub issue revolves around tests: we'll need to run benchmarks in GitHub Actions to validate that it's not losing capability at every pull request. The benchmark has to use the same version of GPT every time and has to test the whole spectrum of what AutoGPT can do:

- write text
- browse the internet
- execute commands
- etc.
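One way to check that a pull request is not losing capability is to compare per-capability pass rates against a baseline run. The sketch below assumes results are already available as lists of pass/fail booleans keyed by capability; the capability names, the `pass_rates` and `find_regressions` helpers, and the `tolerance` parameter are illustrative assumptions, not an existing AutoGPT API.

```python
from typing import Dict, List

# Results map a capability name (e.g. "write_text", "browse_internet",
# "execute_commands"; hypothetical labels) to one boolean per benchmark run.
Results = Dict[str, List[bool]]


def pass_rates(results: Results) -> Dict[str, float]:
    """Fraction of passing runs per capability; empty run lists are skipped."""
    return {cap: sum(runs) / len(runs) for cap, runs in results.items() if runs}


def find_regressions(
    baseline: Results, candidate: Results, tolerance: float = 0.0
) -> List[str]:
    """Return capabilities whose pass rate dropped by more than `tolerance`.

    A capability missing from the candidate run counts as a pass rate of 0.0.
    """
    base = pass_rates(baseline)
    cand = pass_rates(candidate)
    return sorted(
        cap for cap, rate in base.items() if cand.get(cap, 0.0) < rate - tolerance
    )
```

In a CI job, a non-empty result from `find_regressions` could be turned into a failing exit code. Because model output is not deterministic, a small non-zero `tolerance` and several runs per task are probably needed to avoid flaky failures, in addition to pinning the GPT version as the comment suggests.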

github-actions[bot] commented 1 year ago

This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.

github-actions[bot] commented 1 year ago

This issue was closed automatically because it has been stale for 10 days with no activity.
