Closed: granawkins closed this pull request 3 months ago
MENTAT CODE REVIEW IN ACTIVE DEVELOPMENT. Only in use on mentat and internal repos. Please reply with feedback.
Overall, the integration with SWE-bench is a valuable addition, extending the benchmarking capabilities with a large set of coding tasks. The changes are well-structured, and the new fields on the `Sample` class align with the requirements of SWE-bench. However, a few areas could be made more robust, especially error handling and URL construction. A more deliberate version-migration strategy would also help maintain backward compatibility as the project evolves.
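For example, URL construction and download errors could be handled defensively along these lines (a sketch only; the function and parameter names are illustrative, not taken from the diff):

```python
from urllib.parse import quote, urlunsplit
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch_resource(host: str, path: str, timeout: float = 30.0) -> bytes:
    """Fetch raw bytes, building the URL safely instead of concatenating strings."""
    # quote() escapes path segments so unusual characters can't break the URL
    url = urlunsplit(("https", host, quote(path), "", ""))
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except HTTPError as e:  # catch before URLError, its parent class
        raise RuntimeError(f"Server returned {e.code} for {url}") from e
    except URLError as e:
        raise RuntimeError(f"Could not reach {url}: {e.reason}") from e
```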
Out of curiosity, how does Mentat perform on the benchmark?
SWE-bench is a public dataset of difficult coding tasks, generated from open-source GitHub commits.
Their schema overlaps with our `Sample` schema by about 90%, so we can run all 22,000+ of their benchmarks with a few small tweaks.

Pull Request Checklist

- Update `Sample` with new fields from SWE-bench and add a `from_swe_bench` classmethod (see the sketch after this list)
- Add `benchmarks/swe_bench_runner.py` to download from huggingface and save as samples, then run the samples
- Add `test_command` and `PASS_TO_PASS` fields to `BenchmarkResults` so they can be saved and viewed
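As a rough illustration of that overlap, a `from_swe_bench` classmethod might map dataset columns onto our fields. The SWE-bench column names used here (`instance_id`, `problem_statement`, `repo`, `base_commit`, `PASS_TO_PASS`) come from the public dataset; the `Sample` fields shown are assumptions, not the actual class:

```python
import json
from dataclasses import dataclass, field

@dataclass
class Sample:
    # Field names here are illustrative; the real Sample class differs.
    title: str = ""
    description: str = ""
    repo: str = ""
    commit: str = ""
    test_command: str = ""
    PASS_TO_PASS: list[str] = field(default_factory=list)

    @classmethod
    def from_swe_bench(cls, row: dict) -> "Sample":
        # PASS_TO_PASS is stored as a JSON-encoded string in the dataset
        p2p = row.get("PASS_TO_PASS") or "[]"
        return cls(
            title=row["instance_id"],
            description=row["problem_statement"],
            repo=row["repo"],
            commit=row["base_commit"],
            PASS_TO_PASS=json.loads(p2p) if isinstance(p2p, str) else p2p,
        )
```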
Run with the new `swe_bench_runner` script (a sketch of the flow follows below). This will download the `dev` split (225 benchmarks), save them locally as samples, then run them with our existing `benchmark_runner` system and produce `results.json` and `results.html`.
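A minimal sketch of that flow, assuming the huggingface dataset id and the helper names shown here (`load_dataset` and its `split` argument are the `datasets` library's real API; the import path for `Sample` and the `run_benchmarks` hand-off are assumptions):

```python
import json
from dataclasses import asdict
from pathlib import Path

from datasets import load_dataset

from benchmarks.samples import Sample  # assumed location of the Sample class


def download_dev_samples(out_dir: str = "benchmarks/samples") -> list[Sample]:
    """Download the SWE-bench dev split and save each row as a local sample."""
    rows = load_dataset("princeton-nlp/SWE-bench", split="dev")  # 225 rows
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    samples = []
    for row in rows:
        sample = Sample.from_swe_bench(row)  # see the sketch above
        (out / f"{sample.title}.json").write_text(json.dumps(asdict(sample)))
        samples.append(sample)
    return samples


if __name__ == "__main__":
    samples = download_dev_samples()
    # Hand off to the existing benchmark_runner system (name assumed),
    # which would write results.json and results.html:
    # run_benchmarks(samples)
```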