Loved to see Moatless Tools used to set a new SoTA on SWE-Bench Lite by using multi-shot (active search).
Read the paper
From a related article on Medium:
“Impressively, when running DeepSeek-V2-coder, a small language model with multiple sampling, the model outperformed state-of-the-art models like GPT-4o or Claude 3.5 Sonnet, achieving a new state-of-the-art 56% in SWE-Bench Lite (a benchmark that evaluates a model’s capacity to solve GitHub issues), while these two models, combined, achieved 43% (Mixed models).”
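The "multiple sampling" approach described in the quote is essentially a best-of-N loop: sample several candidate patches from a cheaper model at non-zero temperature, then keep the one that passes an automatic selection step. Below is a minimal Python sketch of that idea; `generate_patch` and `patch_is_valid` are hypothetical placeholders for an LLM call and a validation step (e.g. "does the patch apply and pass the tests?"), not the actual Moatless Tools API.

```python
from typing import Callable, Optional


def best_of_n(
    issue: str,
    generate_patch: Callable[[str, float], str],  # hypothetical: returns a candidate patch for the issue
    patch_is_valid: Callable[[str], bool],        # hypothetical: applies the patch and runs a validation check
    n_samples: int = 5,
    temperature: float = 0.8,
) -> Optional[str]:
    """Sample several candidate patches and return the first one that validates.

    Generic best-of-N sketch, not the Moatless Tools implementation.
    """
    for _ in range(n_samples):
        candidate = generate_patch(issue, temperature)
        if patch_is_valid(candidate):
            return candidate
    return None  # no candidate validated; the caller can fall back to another strategy
```

The selection step is what makes the extra samples pay off: a small model sampled several times and filtered automatically can end up resolving more issues than a single pass from a larger model.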