entropy-research / Devon

Devon: An open-source pair programmer
GNU Affero General Public License v3.0

Agent regression tests #40

Closed: akiradev0x closed this issue 4 months ago

akiradev0x commented 4 months ago

In order to make it easy for people to contribute, we need a good system for checking regressions.

At the moment, I propose using a simple file-editing task in which the model has to search for, create, and edit files. We could run this with each supported model on every PR, or on some similar trigger. I'm unsure of the right frequency as of now.
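A minimal sketch of what such a check could look like. Everything here is hypothetical and not Devon's actual API: the `Agent` callable, the task text, the seeded file names, and the helper functions are placeholders chosen for illustration. The idea is to seed a throwaway workspace, hand the agent one task that forces a search, an edit, and a file creation, then inspect the resulting files.

```python
import tempfile
from pathlib import Path
from typing import Callable, List

# Hypothetical agent interface (an assumption, not Devon's API):
# the agent receives a task description and a workspace directory.
Agent = Callable[[str, Path], None]

# Illustrative task: the agent must search for a function, edit it, and create a file.
TASK = (
    "Find the file that defines `greet`, change the greeting from 'Hello' to 'Hi', "
    "and create NOTES.md describing the change."
)


def seed_workspace(root: Path) -> None:
    """Create the starting files the agent has to search through."""
    (root / "src").mkdir()
    (root / "src" / "greeting.py").write_text(
        "def greet(name):\n    return f'Hello, {name}'\n"
    )
    (root / "src" / "unrelated.py").write_text("VALUE = 1\n")


def check_workspace(root: Path) -> List[str]:
    """Return a list of failure messages; an empty list means the run passed."""
    failures = []
    greeting = (root / "src" / "greeting.py").read_text()
    if "Hi, " not in greeting or "Hello" in greeting:
        failures.append("greeting.py was not edited as requested")
    if not (root / "NOTES.md").is_file():
        failures.append("NOTES.md was not created")
    if (root / "src" / "unrelated.py").read_text() != "VALUE = 1\n":
        failures.append("unrelated.py was modified")
    return failures


def run_regression(agent: Agent) -> List[str]:
    """Run one agent against a fresh temporary workspace and check the result."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        seed_workspace(root)
        agent(TASK, root)
        return check_workspace(root)
```

Running `run_regression` once per supported model, with each model wrapped as an `Agent`, would give one pass/fail list per model that a PR check could report.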

cyborgmarina commented 4 months ago

As of now, Devon can't be run in headless mode, or am I wrong here? So wouldn't a requirement for this issue be a headless mode for Devon?

Idea: spin off this issue into a "Regression Testing Spec document" where we can draft a TESTING.md file with a step-by-step guide following your file editing task (search, create, edit files).

akiradev0x commented 4 months ago

Really good idea, let's get a headless mode working. This shouldn't be too hard: basically we just remove the user tools and force the task to be some task set via the CLI in __main__.py
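A rough sketch of that idea, under stated assumptions: the `--headless` and `--task` flags, the `devon-sketch` program name, and the tool registry below are all hypothetical and are not taken from Devon's actual __main__.py. It only illustrates the two steps described above: take the task from the command line, and drop any tool that needs a human in the loop.

```python
import argparse

# Hypothetical tool registry (names are illustrative, not Devon's real tools).
# Each tool is tagged by whether it needs a human ("user") or not ("agent").
ALL_TOOLS = {
    "search_files": "agent",
    "create_file": "agent",
    "edit_file": "agent",
    "ask_user": "user",
}


def parse_args(argv=None):
    """Parse the hypothetical headless flags from the command line."""
    parser = argparse.ArgumentParser(prog="devon-sketch")
    parser.add_argument(
        "--headless",
        action="store_true",
        help="run without any user interaction",
    )
    parser.add_argument("--task", help="task to run when in headless mode")
    args = parser.parse_args(argv)
    # In headless mode nobody can supply the task interactively, so require it.
    if args.headless and not args.task:
        parser.error("--headless requires --task")
    return args


def select_tools(headless: bool):
    """Return the tool names to register, omitting user tools when headless."""
    if not headless:
        return sorted(ALL_TOOLS)
    return sorted(name for name, kind in ALL_TOOLS.items() if kind != "user")
```

For example, `parse_args(["--headless", "--task", "edit the greeting"])` followed by `select_tools(True)` would yield only the three non-interactive tools.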

akiradev0x commented 4 months ago

Just added headless mode.

akiradev0x commented 4 months ago

Closing this issue and creating another for step-by-step testing workflows.