AbanteAI / mentat

Mentat - The AI Coding Assistant
https://mentat.ai

SWE-Benchmarks (from Huggingface) #544

Closed granawkins closed 3 months ago

granawkins commented 3 months ago

SWE-bench is a public dataset of difficult coding tasks, generated from open-source GitHub commits.

Their schema overlaps roughly 90% with our Sample schema, so with a few small tweaks we can run all 22,000+ of their benchmarks.
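To see the overlap for yourself, here is a minimal sketch of loading the dataset and inspecting its columns. It assumes the Hugging Face `datasets` library and the public `princeton-nlp/SWE-bench` dataset id; it is only illustrative, not part of this PR.

```python
# Sketch: pull the SWE-bench dev split from Hugging Face and inspect its schema.
# Assumes `pip install datasets` and the public princeton-nlp/SWE-bench dataset.
from datasets import load_dataset

dev = load_dataset("princeton-nlp/SWE-bench", split="dev")
print(len(dev))           # number of tasks in the dev split
print(dev.column_names)   # repo, instance_id, base_commit, patch, test_patch,
                          # problem_statement, FAIL_TO_PASS, PASS_TO_PASS, ...
```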

Pull Request Checklist

Latest

Run with:

python benchmarks/benchmark_runner.py --swe_bench --max_benchmarks 5

This will download the dev split (225 benchmarks), save them locally as samples, then run them with our existing benchmark_runner system, producing results.json and results.html.
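For reference, a rough sketch of what the download-and-convert step amounts to. The column names on the SWE-bench side are the dataset's own; the local sample fields and output path below are illustrative stand-ins, not the repo's actual Sample schema or file layout.

```python
# Sketch of converting SWE-bench dev rows into local sample files.
# The sample field names and output directory are illustrative only.
import json
from pathlib import Path
from datasets import load_dataset

out_dir = Path("benchmarks/samples/swe_bench")  # hypothetical location
out_dir.mkdir(parents=True, exist_ok=True)

for row in load_dataset("princeton-nlp/SWE-bench", split="dev"):
    sample = {
        "id": row["instance_id"],            # e.g. "django__django-12345"
        "repo": row["repo"],                 # source repository
        "base_commit": row["base_commit"],   # commit to check out before the task
        "prompt": row["problem_statement"],  # issue text given to the model
        "gold_patch": row["patch"],          # reference solution diff
        "test_patch": row["test_patch"],     # tests used to verify the fix
    }
    (out_dir / f"{sample['id']}.json").write_text(json.dumps(sample, indent=2))
```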

mentatbot[bot] commented 3 months ago

MENTAT CODE REVIEW IN ACTIVE DEVELOPMENT. Only in use on mentat and internal repos. Please reply with feedback.

Overall, the integration with SWE-bench is a valuable addition, enhancing the benchmarking capabilities with a large set of coding tasks. The changes are well-structured, and the addition of new fields to the Sample class aligns with the requirements of SWE-bench. However, there are a few areas where the code could be improved for robustness, especially regarding error handling and URL construction. Additionally, considering a more robust version migration strategy could help maintain backward compatibility as the project evolves.

swayducky commented 1 month ago

Out of curiosity, how does Mentat perform on the benchmark?