Since the SWE-Bench team has developed a more stable containerized evaluation harness, this project will no longer be maintained. However, to continue promoting easier evaluation runs, I've set up a hosted solution for running evaluations that I'm trying out right now.
Note: This is a test solution to help streamline the evaluation process. Please send any feedback to albert@moatless.ai.
This is a Dockerfile-based version of the SWE-bench evaluation framework.
The solution is designed so that each "testbed" for testing a version of a repository is built in a separate Docker image. Each test is then run in its own Docker container. This approach ensures more stable test results because the environment is completely isolated and is reset for each test. Since the Docker container can be recreated each time, there's no need for reinstallation, speeding up the benchmark process.
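The isolation described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the image name and the test command are hypothetical, and the only Docker feature relied on is `docker run --rm`, which removes the container once it exits so that no state carries over to the next test.

```python
import subprocess


def docker_run_command(image: str, test_command: list) -> list:
    """Build a `docker run` invocation that starts a fresh, throwaway container."""
    # --rm deletes the container on exit, so every test starts from the image state.
    return ["docker", "run", "--rm", image, *test_command]


def run_isolated(image: str, test_command: list, timeout: int = 900) -> int:
    """Run one test in its own container and return the exit code."""
    completed = subprocess.run(docker_run_command(image, test_command), timeout=timeout)
    return completed.returncode


# Hypothetical image name and test command, for illustration only.
cmd = docker_run_command("example/testbed-image:latest", ["pytest", "tests/test_example.py"])
print(" ".join(cmd))
```

Because the environment is baked into the image, starting a new container replaces the reinstall step that a shared environment would need between tests.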
Docker images for testbeds used in the SWE-Bench_Lite dataset have been built and tested on gold predictions.
2 benchmark instances are currently failing.
See results in the evaluations/SWE-bench_Lite_golden folder.
Docker images for testbeds used in the SWE-Bench dataset have been built and tested on the check-harness predictions published by SWE-bench.
10 benchmark instances are currently failing.
See results in the evaluations/SWE-bench_check_harness folder.
I have tested running Docker benchmarks on SWE-Agent's GPT-4 benchmark and AutoCodeRover's first benchmark run.
The SWE-Agent GPT-4 predictions yield exactly the same result, 18% (54) resolved issues, as SWE-Agent's own evaluation, which suggests that the Docker image approach is equally accurate.
However, the Docker benchmark provides better results for AutoCodeRover. In AutoCodeRover's own benchmarks, they achieve 16.00% (48), 15.67% (47), and 16.67% (50) resolved issues. In swe-bench-docker, the same predictions result in 18.00% (54), 19.00% (57) and 19.00% (57) resolved issues. Combined across the three runs, this gives a pass@3 of 26.00% (78) compared to the 22.33% (67) reported in the AutoCodeRover paper. This suggests that other agents' benchmarks may report lower results than the agents actually achieve, because it is difficult to run evaluations that produce completely accurate results.
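The percentages above can be checked with a few lines of arithmetic. This sketch assumes a total of 300 task instances, which is the size of SWE-bench Lite and is consistent with every count and percentage quoted in the text (for example, 54 / 300 = 18%).

```python
TOTAL_INSTANCES = 300  # assumption: size of SWE-bench Lite, consistent with the quoted figures


def resolved_rate(resolved: int, total: int = TOTAL_INSTANCES) -> float:
    """Return the share of resolved instances as a percentage with two decimals."""
    return round(100 * resolved / total, 2)


own_runs = [48, 47, 50]     # AutoCodeRover's own three runs
docker_runs = [54, 57, 57]  # the same predictions evaluated in swe-bench-docker

print([resolved_rate(n) for n in own_runs])     # [16.0, 15.67, 16.67]
print([resolved_rate(n) for n in docker_runs])  # [18.0, 19.0, 19.0]
print(resolved_rate(67), resolved_rate(78))     # 22.33 26.0
```

Note that a pass@3 count is the number of distinct instances resolved in at least one of the three runs, so it is smaller than the sum of the per-run counts.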
There are currently three different types of Docker images for running benchmarks.
Testbeds are set up in a Conda environment similar to the original SWE-bench environment.
Since each benchmark is tested in its own container, using Conda may be overkill. Testbeds are set up with only the correct Python version installed via Pyenv. This approach has been shown to result in fewer erroneous benchmark instances in repositories where it has been tested, and the image becomes smaller. Currently, django, psf/requests and scikit-learn use this type of Docker image. Hopefully, more repositories can be run this way.
In scikit-learn, some benchmarks seem to fail because Cython code isn't compiled. To avoid building the project before each test, an image is built for each benchmark instance.
Run run_evaluation.py to evaluate a predictions file. A log for each test is written to log_dir in the same format as in the SWE-bench evaluation tools, and the same tooling can then be used to generate a report.
Each prediction is passed to the Docker image in a base64-encoded environment variable. This might fail if the prediction is too large. To avoid this, the environment variable SWEBENCH_DOCKER_FORK_DIR can be set so that the prediction is provided as a file in a mounted volume instead.
git clone https://github.com/aorwall/SWE-bench-docker.git
export SWEBENCH_DOCKER_FORK_DIR=/path/to/SWE-bench-docker
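The two ways of handing a prediction to the container can be sketched like this. It is an illustration under stated assumptions, not the project's implementation: the variable name `PREDICTION_B64` and the file name `prediction.patch` are hypothetical, and only the choice between an encoded environment variable and a file in a shared directory comes from the description above. Base64 encoding increases the size by roughly a third, which is one reason large patches are more likely to hit environment size limits.

```python
import base64
import tempfile
from pathlib import Path
from typing import Optional


def encode_prediction(patch: str) -> str:
    """Base64-encode a patch so it fits in a single environment variable value."""
    return base64.b64encode(patch.encode("utf-8")).decode("ascii")


def decode_prediction(encoded: str) -> str:
    """Reverse of encode_prediction, as the container side would do."""
    return base64.b64decode(encoded).decode("utf-8")


def prepare_prediction(patch: str, fork_dir: Optional[str] = None) -> dict:
    """Pick the transport: a file in fork_dir if given, else an environment variable."""
    if fork_dir:
        path = Path(fork_dir) / "prediction.patch"  # hypothetical file name
        path.write_text(patch, encoding="utf-8")
        return {"file": str(path)}
    return {"env": {"PREDICTION_B64": encode_prediction(patch)}}  # hypothetical variable name


patch = "diff --git a/example.py b/example.py\n"
as_env = prepare_prediction(patch)
with tempfile.TemporaryDirectory() as tmp:
    as_file = prepare_prediction(patch, fork_dir=tmp)
    file_contents = Path(as_file["file"]).read_text(encoding="utf-8")
```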
Run evaluation
python run_evaluation.py
--predictions_path [Required] Path to the predictions file
--log_dir [Required] Path to directory to save evaluation log files
--swe_bench_tasks [Required] Path to SWE-bench task instances file or dataset
--namespace [Optional] Namespace of the Docker repository
--log_suffix [Optional] Suffix to append to log file names
--skip_existing [Optional] Skip evaluating task instances with logs that already exist
--timeout [Optional] Timeout for installation + test script execution
--num_processes [Optional] Number of processes to run in parallel (-1 for unlimited)
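As a rough sketch of how these options fit together, the snippet below writes a small predictions file and assembles a matching command line. The flags are the ones listed above. The predictions layout (a JSON list with `instance_id`, `model_name_or_path` and `model_patch`) is an assumption based on the SWE-bench predictions format, and the instance ID, model name and patch are placeholders.

```python
import json
import tempfile
from pathlib import Path


def write_predictions(path: Path, predictions: list) -> None:
    """Write predictions as a JSON list (assumed SWE-bench predictions layout)."""
    path.write_text(json.dumps(predictions, indent=2), encoding="utf-8")


def evaluation_command(predictions_path, log_dir, tasks: str, num_processes: int = 1) -> list:
    """Assemble the run_evaluation.py call from the required options plus --num_processes."""
    return [
        "python", "run_evaluation.py",
        "--predictions_path", str(predictions_path),
        "--log_dir", str(log_dir),
        "--swe_bench_tasks", tasks,
        "--num_processes", str(num_processes),
    ]


# Placeholder values, for illustration only.
predictions = [{
    "instance_id": "example__repo-123",
    "model_name_or_path": "example-model",
    "model_patch": "diff --git a/example.py b/example.py\n",
}]

with tempfile.TemporaryDirectory() as tmp:
    predictions_file = Path(tmp) / "predictions.json"
    write_predictions(predictions_file, predictions)
    loaded = json.loads(predictions_file.read_text(encoding="utf-8"))
    cmd = evaluation_command(predictions_file, Path(tmp) / "logs", "princeton-nlp/SWE-bench_Lite", 4)

print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` from the repository root once the Docker images are available.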
It might be worth pulling all images before running the script to achieve more consistent timing in the evaluation.
scripts/pull_docker_images.sh [Dockerfiles directory] [Namespace]
Generates Dockerfiles for all testbeds in a SWE-Bench benchmark dataset. These can then be used to build Docker images.
python run_dockerfile_generator.py
--swe_bench_tasks [Required] Path to SWE-bench task instances file or dataset
--namespace [Required] Namespace of the Docker repository
--docker_dir [Required] Path to the directory where the Dockerfiles will be saved
This script builds Docker images from all Dockerfiles.
scripts/build_docker_images.sh [Dockerfiles directory] [Namespace]
This script pushes the Docker images built from all Dockerfiles.
scripts/push_docker_images.sh [Dockerfiles directory] [Namespace]
Run a single instance and print logs to stdout.
python run_single_instance.py
--instance_id [Required] Instance ID of the task to run
--swe_bench_tasks [Optional] Path to SWE-bench task instances file or dataset (default is princeton-nlp/SWE-bench_Lite)
--namespace [Optional] Namespace of the Docker repository
--predictions_path [Optional] Path to the predictions file, if not set the golden patch will be used
Run any or all tests in an instance repo and print logs to stdout.
python run_instance_tests.py
--instance_id [Required] Instance ID of the task to run
--swe_bench_tasks [Optional] Path to SWE-bench task instances file or dataset (default is princeton-nlp/SWE-bench_Lite)
--namespace [Optional] Namespace of the Docker repository
--predictions_path [Optional] Path to the predictions file, if not set the golden patch will be used
--test_directives [Optional] List of tests to run, e.g. "path/to/test.py::test1 path/to/test.py::test2". If empty, run all tests.
--test_output_dir [Optional] Path to directory to save test output
scripts/build_docker_images.sh [Namespace] [Testbed directory]