OpenAutoCoder / Agentless

Agentless🐱: an agentless approach to automatically solve software development problems

Reproducing swebench-lite results #25

Open justinchiu-cohere opened 3 weeks ago

justinchiu-cohere commented 3 weeks ago

Thanks for releasing the repo, as well as the trajectories for swebench-lite! I am trying to reproduce the results with gpt-4o, but I am seeing a fix rate of 59/300 (about 19.7%), as opposed to the reported 27.33% (about 82/300).

  1. Other than the --plausible flag in rerank, are there any other possible causes for this?
  2. Did you notice a large amount of variance between runs?
  3. I changed the prompts slightly, adding a sentence before # Examples to clarify that we are giving output examples. Could this lead to large changes in resolution?
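For concreteness, a minimal sketch of the comparison above. It assumes a hypothetical newline-delimited file (called resolved_ids.txt here) with one resolved SWE-bench Lite instance ID per line, produced by whatever evaluation harness you use; the file name and format are my assumptions, not something the repo provides.

```python
# Sketch: compare a local fix rate against the reported number.
# Assumes a hypothetical file `resolved_ids.txt` with one resolved
# SWE-bench Lite instance ID per line; adjust to your harness output.
from pathlib import Path

TOTAL_INSTANCES = 300   # SWE-bench Lite size
REPORTED_RATE = 27.33   # % reported for Agentless + gpt-4o

resolved = {
    line.strip()
    for line in Path("resolved_ids.txt").read_text().splitlines()
    if line.strip()
}
local_rate = 100 * len(resolved) / TOTAL_INSTANCES
print(f"local:    {len(resolved)}/{TOTAL_INSTANCES} = {local_rate:.2f}%")
print(f"reported: {REPORTED_RATE:.2f}% "
      f"(~{round(REPORTED_RATE / 100 * TOTAL_INSTANCES)}/{TOTAL_INSTANCES})")
```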
brutalsavage commented 3 weeks ago

Hi Justin,

We have run Agentless multiple times ourselves, and while the results have some variance, it should not be as large as dropping to 59/300. I would expect a range from the ~70s/300 to the ~80s/300 (even without the plausible flag). For reference, OpenAI recently ran our configuration and got 24.3%, and it appears they only generated 1 sample per bug.
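As a crude back-of-envelope check, treating each of the 300 instances as an independent Bernoulli trial at roughly the reported rate (a simplifying assumption, not how the variance was actually measured) gives a sense of the expected spread:

```python
# Back-of-envelope only: model each SWE-bench Lite instance as an
# independent Bernoulli trial with p ~= the reported resolve rate.
# Real run-to-run variance also depends on temperature, sample count, etc.
import math

n, p = 300, 0.2733
mean = n * p                      # ~82 resolved instances
std = math.sqrt(n * p * (1 - p))  # ~7.7 instances
print(f"expected: {mean:.1f} +/- {std:.1f} resolved (1 sigma)")
print(f"59/300 is {(mean - 59) / std:.1f} sigma below the mean")
```

Under that (very rough) model, 59/300 sits about three standard deviations below the reported mean, which is why it looks like a configuration difference rather than ordinary sampling noise.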

Please check that your configuration is correct; you can refer to the README_swebenchlite.md file to fully reproduce our experimental settings.

Thanks

justinchiu-cohere commented 3 weeks ago

I tried a fresh clone and ran the commands in README_swebenchlite.md again. However, after the repair step, wc -l results/repair_run_1/output.jsonl gives 284, as opposed to the 300 expected from the v0.1.0 release. Oddly, the same is true for results/repair_run_2/output.jsonl.

I'll debug a bit more, try evaluating the locations, and also try feeding the locations from the v0.1.0 release into the repair step instead. Could OpenAI GPT outages cause some of the prompts to fail midway but be labeled as completed, and thus not be run again on a subsequent repair call?
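As a rough sketch of the check I have in mind, the following compares output.jsonl against the official SWE-bench Lite instance list and flags records whose generation looks empty. The generation field names (raw_output, etc.) are guesses about the output schema rather than something guaranteed by the repo, so they may need adjusting.

```python
# Sketch: find SWE-bench Lite instances missing from the repair output,
# or present but with an empty generation (a possible symptom of an API
# outage being recorded as "completed").
import json
from pathlib import Path

from datasets import load_dataset  # pip install datasets

# Official SWE-bench Lite test split (300 instances).
expected = {
    row["instance_id"]
    for row in load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
}

seen, empty = set(), set()
for line in Path("results/repair_run_1/output.jsonl").read_text().splitlines():
    record = json.loads(line)
    seen.add(record["instance_id"])
    # Field names below are guesses about the output schema; adjust as needed.
    gen = record.get("raw_output") or record.get("git_diffs") or record.get("output")
    if not gen:
        empty.add(record["instance_id"])

print(f"missing from output.jsonl: {len(expected - seen)}")
print(sorted(expected - seen))
print(f"present but with empty generation: {sorted(empty)}")
```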

GCVulnerability commented 3 weeks ago

I noticed you added a lot of new models, like DeepSeek and gpt-4o-mini. Do you have reference evaluation results for these models? Similar to Justin, I can't seem to fully reproduce the reported performance, but the high cost prevents me from running gpt-4o multiple times.