AbanteAI / mentat

Mentat - The AI Coding Assistant
https://mentat.ai
Apache License 2.0

SWE-Bench results #549

Closed · granawkins closed 3 months ago

granawkins commented 3 months ago

Pull Request Checklist

mentatbot[bot] commented 3 months ago

MENTAT CODE REVIEW IN ACTIVE DEVELOPMENT. Only in use on mentat and internal repos. Please Reply with feedback.

This pull request introduces several important updates to integrate SWE-Bench's test results into the benchmark runner, enhance the sampler with new fields for better alignment with SWE-Bench, and improve the documentation. These changes collectively enhance the benchmarking capabilities and ensure better compatibility with SWE-Bench standards. It's also commendable to see the effort to keep the documentation in sync with the code changes.

mentatbot[bot] commented 3 months ago

MENTAT CODE REVIEW IN ACTIVE DEVELOPMENT. Only in use on mentat and internal repos. Please Reply with feedback.

This pull request makes significant strides in integrating SWE-Bench's test results into the benchmark runner and enhancing the sampler's compatibility with SWE-Bench. The effort to update documentation alongside these changes is commendable. Some suggestions have been made to further improve error handling, code readability, and user experience.

mentatbot[bot] commented 3 months ago

MENTAT CODE REVIEW IN ACTIVE DEVELOPMENT. Only in use on mentat and internal repos. Please Reply with feedback.

This pull request introduces several enhancements and new features to improve integration with SWE-Bench and the overall benchmarking process. The changes are well thought out and aim to provide a more detailed and accurate benchmarking experience. Some additional suggestions have been made to further improve the robustness, usability, and documentation of these features.

granawkins commented 3 months ago

Current status:

granawkins commented 3 months ago

New arguments to benchmark_runner.py:

New fields in BenchmarkResult:

Some caveats:

Ideas for next:

  1. Make repo-specific test runners (like we did for Exercism) to get all 225 samples working (see the sketch after this list)
  2. Compute sample.context and context_benchmark with respect to lines instead of files
  3. Run it on a server instead of my laptop
  4. Benchmark Claude Opus vs. GPT-4 vs. Gemini
  5. Hook up to Butler? Agent mode?
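
For idea (1), one possible shape is a small per-repo dispatch table in the benchmark runner. The sketch below is illustrative only: `REPO_TEST_COMMANDS`, `run_repo_tests`, and the example commands are assumptions, not code from this PR or the mentat codebase.

```python
"""Hypothetical sketch of repo-specific test runners (idea 1 above)."""
import subprocess
from pathlib import Path

# Assumed default for repos without a custom entry.
DEFAULT_COMMAND = ["python", "-m", "pytest", "-x"]

# Map SWE-Bench repo names to the command that runs their test suite.
# Real entries would also need per-repo argument handling (e.g. Django's
# runner takes test labels rather than file paths).
REPO_TEST_COMMANDS: dict[str, list[str]] = {
    "django/django": ["python", "tests/runtests.py"],  # Django ships its own runner
    "sympy/sympy": ["python", "-m", "pytest", "-x"],
}


def run_repo_tests(repo: str, workdir: Path, test_args: list[str]) -> bool:
    """Run the repo's test suite with the given arguments; True if it passes."""
    command = REPO_TEST_COMMANDS.get(repo, DEFAULT_COMMAND) + test_args
    result = subprocess.run(command, cwd=workdir, capture_output=True, text=True)
    return result.returncode == 0
```

A dispatch table like this would keep the SWE-Bench grading loop unchanged while letting each repo supply its own invocation, which is what getting all 225 samples working would seem to require.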