Closed: granawkins closed this pull request 3 months ago
MENTAT CODE REVIEW IN ACTIVE DEVELOPMENT. Only in use on mentat and internal repos. Please reply with feedback.
Overall, the integration with SWE-bench is a valuable addition, extending the benchmarking capabilities with a large set of coding tasks. The changes are well-structured, and the new fields on the `Sample` class align with the requirements of SWE-bench. However, a few areas could be made more robust, especially error handling and URL construction. A more deliberate version-migration strategy would also help maintain backward compatibility as the project evolves.
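For example, URL construction and download errors could be handled defensively along these lines (a sketch only; the function and parameter names are illustrative, not taken from the diff):

```python
from urllib.parse import quote, urlunsplit
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch_resource(host: str, path: str, timeout: float = 30.0) -> bytes:
    """Fetch raw bytes, building the URL safely instead of concatenating strings."""
    # quote() escapes path segments so unusual characters can't break the URL
    url = urlunsplit(("https", host, quote(path), "", ""))
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except HTTPError as e:  # catch before URLError, its parent class
        raise RuntimeError(f"Server returned {e.code} for {url}") from e
    except URLError as e:
        raise RuntimeError(f"Could not reach {url}: {e.reason}") from e
```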
Out of curiosity, how does Mentat perform on the benchmark?
SWE-bench is a public dataset of difficult coding tasks, generated from open-source GitHub commits.
Their schema overlaps with our `Sample` schema by about 90%, so we can run all 22,000+ of their benchmarks with a few small tweaks.

Pull Request Checklist

- Update `Sample` with new fields from SWE-bench and add a `from_swe_bench` classmethod (see the sketch after this list)
- Add `benchmarks/swe_bench_runner.py` to download from huggingface and save as samples, then run the samples
- Add `test_command` and `PASS_TO_PASS` fields to `BenchmarkResults` so they can be saved and viewed
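As a rough illustration of that overlap, a `from_swe_bench` classmethod might map dataset columns onto our fields. The SWE-bench column names used here (`instance_id`, `problem_statement`, `repo`, `base_commit`, `PASS_TO_PASS`) come from the public dataset; the `Sample` fields shown are assumptions, not the actual class:

```python
import json
from dataclasses import dataclass, field

@dataclass
class Sample:
    # Field names here are illustrative; the real Sample class differs.
    title: str = ""
    description: str = ""
    repo: str = ""
    commit: str = ""
    test_command: str = ""
    PASS_TO_PASS: list[str] = field(default_factory=list)

    @classmethod
    def from_swe_bench(cls, row: dict) -> "Sample":
        # PASS_TO_PASS is stored as a JSON-encoded string in the dataset
        p2p = row.get("PASS_TO_PASS") or "[]"
        return cls(
            title=row["instance_id"],
            description=row["problem_statement"],
            repo=row["repo"],
            commit=row["base_commit"],
            PASS_TO_PASS=json.loads(p2p) if isinstance(p2p, str) else p2p,
        )
```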
Run with the new `swe_bench_runner` script (a sketch of the flow follows below). This will download the `dev` split (225 benchmarks), save them locally as samples, then run them with our existing `benchmark_runner` system and produce `results.json` and `results.html`.
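A minimal sketch of that flow, assuming the huggingface dataset id and the helper names shown here (`load_dataset` and its `split` argument are the `datasets` library's real API; the import path for `Sample` and the `run_benchmarks` hand-off are assumptions):

```python
import json
from dataclasses import asdict
from pathlib import Path

from datasets import load_dataset

from benchmarks.samples import Sample  # assumed location of the Sample class


def download_dev_samples(out_dir: str = "benchmarks/samples") -> list[Sample]:
    """Download the SWE-bench dev split and save each row as a local sample."""
    rows = load_dataset("princeton-nlp/SWE-bench", split="dev")  # 225 rows
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    samples = []
    for row in rows:
        sample = Sample.from_swe_bench(row)  # see the sketch above
        (out / f"{sample.title}.json").write_text(json.dumps(asdict(sample)))
        samples.append(sample)
    return samples


if __name__ == "__main__":
    samples = download_dev_samples()
    # Hand off to the existing benchmark_runner system (name assumed),
    # which would write results.json and results.html:
    # run_benchmarks(samples)
```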