justinchiu-cohere opened 3 weeks ago
Hi Justin
We have run Agentless multiple times ourselves, and while the results have some variance, it should not drop as low as 59/300. I would expect a range of roughly 70s/300 to 80s/300 (even without the plausible flag). As a reference, OpenAI recently ran our configuration and got 24.3%, and it seems they only generated 1 sample per bug.
Please check that your configuration is correct; you can refer to the README_swebenchlite.md file to fully recover our experimental settings.
Thanks
I tried a fresh clone and ran the commands in README_swebenchlite.md again. However, after the repair step, `wc -l results/repair_run_1/output.jsonl` gives 284, as opposed to the expected 300 in the v0.1.0 release. Oddly, the same is true for results/repair_run_2/output.jsonl.
I'll debug a bit more, try evaluating the locations, and also try feeding the locations from the v0.1.0 release into the repair step instead. Could OpenAI GPT outages cause some prompts to fail midway, yet be labeled as completed and thus not be re-run on a subsequent repair call?
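To narrow down which of the 300 instances never made it into output.jsonl, something like the sketch below can diff the completed IDs against the expected set. This assumes each line of the file is a JSON object with an `instance_id` key, which is typical for SWE-bench tooling but should be checked against your Agentless version; the IDs in the demo are made up.

```python
import json
import tempfile

def missing_instances(output_path, expected_ids):
    """Return expected instance IDs that have no line in the JSONL output."""
    done = set()
    with open(output_path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines from interrupted writes
                done.add(json.loads(line).get("instance_id"))
    return sorted(set(expected_ids) - done)

# Toy demo: three expected IDs, but only two completed records on disk.
expected = ["repo__bug-1", "repo__bug-2", "repo__bug-3"]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write(json.dumps({"instance_id": "repo__bug-1"}) + "\n")
    f.write(json.dumps({"instance_id": "repo__bug-3"}) + "\n")
    path = f.name

print(missing_instances(path, expected))  # → ['repo__bug-2']
```

If the missing IDs are the same across repair_run_1 and repair_run_2, that would point at something deterministic (e.g. a prompt that always errors) rather than transient API outages.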
I noticed you added a lot of new models like DeepSeek and gpt-4o-mini; do you have reference evaluation results for these models? Similar to Justin, I can't fully reproduce the reported performance, but the high cost prevents me from running gpt-4o multiple times.
Thanks for releasing the repo, as well as the trajectories for SWE-bench Lite! I am trying to reproduce the results with gpt-4o, but am seeing a fix rate of 59/300, as opposed to the reported 27.33% (82/300).
Other than the `--plausible` flag in rerank, are there any other possible causes for this? There is also the `# Examples` change, made to clarify that we are giving output examples. Could this lead to large changes in resolution?