AbanteAI / mentat

Mentat - The AI Coding Assistant
https://mentat.ai
Apache License 2.0

SWE-Bench results #549

Closed · granawkins closed 3 months ago

granawkins commented 3 months ago

Pull Request Checklist

mentatbot[bot] commented 3 months ago

MENTAT CODE REVIEW IN ACTIVE DEVELOPMENT. Only in use on mentat and internal repos. Please Reply with feedback.

This pull request introduces several important updates to integrate SWE-Bench's test results into the benchmark runner, enhance the sampler with new fields for better alignment with SWE-Bench, and improve the documentation. These changes collectively enhance the benchmarking capabilities and ensure better compatibility with SWE-Bench standards. It's also commendable to see the effort to keep the documentation in sync with the code changes.

mentatbot[bot] commented 3 months ago

MENTAT CODE REVIEW IN ACTIVE DEVELOPMENT. Only in use on mentat and internal repos. Please Reply with feedback.

This pull request makes significant strides in integrating SWE-Bench's test results into the benchmark runner and enhancing the sampler's compatibility with SWE-Bench. The effort to update documentation alongside these changes is commendable. Some suggestions have been made to further improve error handling, code readability, and user experience.

mentatbot[bot] commented 3 months ago

MENTAT CODE REVIEW IN ACTIVE DEVELOPMENT. Only in use on mentat and internal repos. Please Reply with feedback.

This pull request introduces several enhancements and new features to improve integration with SWE-Bench and the overall benchmarking process. The changes are well thought out and aim to provide a more detailed and accurate benchmarking experience. Some additional suggestions have been made to further improve the robustness, usability, and documentation of these features.

granawkins commented 3 months ago

Current status:

granawkins commented 3 months ago

New arguments to benchmark_runner.py:

New fields in BenchmarkResult:

Some caveats:

Ideas for next:

  1. Make repo-specific test runners (like we did for Exercism) to get all 225 samples working (see the sketch after this list)
  2. Compute sample.context and context_benchmark with respect to lines instead of files
  3. Run it on a server instead of my laptop
  4. Benchmark Claude Opus vs. GPT-4 vs. Gemini
  5. Hook up to Butler? Agent mode?
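
For idea (1), one possible shape is a small per-repo dispatch table in the benchmark runner. The sketch below is illustrative only: `REPO_TEST_COMMANDS`, `run_repo_tests`, and the example commands are assumptions, not code from this PR or the mentat codebase.

```python
"""Hypothetical sketch of repo-specific test runners (idea 1 above)."""
import subprocess
from pathlib import Path

# Assumed default for repos without a custom entry.
DEFAULT_COMMAND = ["python", "-m", "pytest", "-x"]

# Map SWE-Bench repo names to the command that runs their test suite.
# Real entries would also need per-repo argument handling (e.g. Django's
# runner takes test labels rather than file paths).
REPO_TEST_COMMANDS: dict[str, list[str]] = {
    "django/django": ["python", "tests/runtests.py"],  # Django ships its own runner
    "sympy/sympy": ["python", "-m", "pytest", "-x"],
}


def run_repo_tests(repo: str, workdir: Path, test_args: list[str]) -> bool:
    """Run the repo's test suite with the given arguments; True if it passes."""
    command = REPO_TEST_COMMANDS.get(repo, DEFAULT_COMMAND) + test_args
    result = subprocess.run(command, cwd=workdir, capture_output=True, text=True)
    return result.returncode == 0
```

A dispatch table like this would keep the SWE-Bench grading loop unchanged while letting each repo supply its own invocation, which is what getting all 225 samples working would seem to require.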