granawkins closed this pull request 3 months ago.
MENTAT CODE REVIEW IN ACTIVE DEVELOPMENT. Only in use on mentat and internal repos. Please reply with feedback.
This pull request introduces several important updates: it integrates SWE-Bench's test results into the benchmark runner, adds new fields to the sampler for better alignment with SWE-Bench, and improves the documentation. Together, these changes enhance the benchmarking capabilities and ensure better compatibility with SWE-Bench standards. It's also commendable to see the effort to keep the documentation in sync with the code changes.
This pull request makes significant strides in integrating SWE-Bench's test results into the benchmark runner and enhancing the sampler's compatibility with SWE-Bench. The effort to update documentation alongside these changes is commendable. Some suggestions have been made to further improve error handling, code readability, and user experience.
This pull request introduces several enhancements and new features to improve integration with SWE-Bench and the overall benchmarking process. The changes are well thought out and aim to provide a more detailed and accurate benchmarking experience. Some additional suggestions have been made to further improve the robustness, usability, and documentation of these features.
Current status:
`python3 benchmarks/benchmark_runner.py --swe_bench`
now generates our standard results.html with a new Tests Passed field (a bool for each benchmark), giving us a SWE-Bench pass rate as a percentage. It currently generates too much data, which needs to be streamlined.
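For reference, here's a minimal sketch of how a per-benchmark Tests Passed bool could roll up into that pass rate. The `BenchmarkResult` shape and the `tests_passed` field name here are assumptions for illustration, not the PR's actual types:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkResult:
    """Stand-in for the runner's result type; field names are assumptions."""
    name: str
    tests_passed: Optional[bool] = None  # None = no SWE-Bench tests for this sample

def swe_bench_pass_rate(results: list[BenchmarkResult]) -> float:
    """Percent of scored benchmarks whose SWE-Bench tests passed."""
    scored = [r for r in results if r.tests_passed is not None]
    if not scored:
        return 0.0
    return 100 * sum(r.tests_passed for r in scored) / len(scored)

print(swe_bench_pass_rate([
    BenchmarkResult("astropy__astropy-12907", tests_passed=True),
    BenchmarkResult("django__django-11099", tests_passed=False),
]))  # 50.0
```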
`python3 benchmarks/context_benchmark.py --swe_bench`

compares (A) edited paths vs. (B) included paths and gives precision/recall. Will add this to the benchmark_runner the same way.
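To make the metric concrete, here's a sketch of the comparison, assuming both sides are sets of file paths; `path_precision_recall` is a hypothetical name, not the script's actual API:

```python
def path_precision_recall(edited: set[str], included: set[str]) -> tuple[float, float]:
    """Precision: how much of the included context was actually edited.
    Recall: how much of the edited code made it into context."""
    # Hypothetical helper; the script's actual interface may differ.
    hits = edited & included
    precision = len(hits) / len(included) if included else 0.0
    recall = len(hits) / len(edited) if edited else 0.0
    return precision, recall

edited = {"astropy/wcs/wcsapi/fitswcs.py"}
included = {"astropy/wcs/wcsapi/fitswcs.py", "astropy/wcs/wcs.py"}
print(path_precision_recall(edited, included))  # (0.5, 1.0)
```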
New arguments to benchmark_runner.py:

- `--swe_bench`: downloads, saves, validates, caches, then runs SWE-Bench samples from Hugging Face (see the sketch after this list)
- `--auto_context_tokens`: sets the config value for the benchmark runs
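For orientation, a minimal sketch of the download step with the `datasets` library, using the public `princeton-nlp/SWE-bench` dataset id; the PR's script may point at a mirror instead (see the caveat below about `summoning-the-shoggoth/swe_bench`), and `iter_swe_bench_samples` is a hypothetical helper, not the runner's code:

```python
from datasets import load_dataset  # pip install datasets

def iter_swe_bench_samples(split: str = "dev"):
    """Pull SWE-Bench samples from the Hugging Face hub; load_dataset
    caches the download locally, so repeat runs reuse the saved copy."""
    # Dataset id from the public SWE-bench release; swap in a mirror if needed.
    for sample in load_dataset("princeton-nlp/SWE-bench", split=split):
        # Fields in the public release include repo, base_commit,
        # problem_statement, and the gold patch.
        yield sample

for sample in iter_swe_bench_samples():
    print(sample["instance_id"])
    break
```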
New fields in BenchmarkResult:
Some caveats:
`summoning-the-shoggoth/swe_bench` and the script will look for it there.

Ideas for next:
Pull Request Checklist