codestoryai / swe_bench_traces

Contains the model patches and the eval logs from the passing SWE-bench Lite run.

Where are the rest of the runs, and how do you get your accuracy numbers? #1

Open Naqu6 opened 2 weeks ago

Naqu6 commented 2 weeks ago

Hi, cool project :)

I took a look at the evals and noticed that there are only 127 eval files. Further, only 107 of them seem to pass the tests.

Would it be possible for you to post the rest of the eval files?

If not, a list of instances that you resolved would be great.

Thanks!

theskcd commented 2 weeks ago

Hey @Naqu6 ! Thanks for reaching out. Yeah, I still need to publish the results on the benchmark, clean up the list, and publish the data.

Regarding the failures: I cross-checked them against other implementations as well and noticed that those tests were broken (every implementation solving SWE-bench counts them as successes). Using the docker setup we can also reproduce whether a patch passes the test command or not, which is how I calculated the number and confirmed that those patches were really passing. I didn't want to just count them as successes for the metric!
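Roughly, the repro loop looks like this — a minimal sketch, not our exact setup; the per-instance image name, mount path, and helper are placeholders:

```python
import subprocess

def patch_passes(instance_id: str, patch_path: str, test_cmd: str) -> bool:
    """Apply a model patch inside the instance's container and run its test command."""
    image = f"swe-bench/{instance_id}"  # hypothetical per-instance image naming scheme
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            # bind-mount the patch read-only; patch_path must be an absolute host path
            "-v", f"{patch_path}:/tmp/model.patch:ro",
            image,
            "bash", "-c",
            # apply the patch on top of the base commit, then run the tests
            f"git apply /tmp/model.patch && {test_cmd}",
        ],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

An instance counts as resolved only if the patch applies cleanly and the test command exits 0.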

My plan is to upload everything (every eval file and generated patch) to the repo soon, along with doing another deep dive into the generated patches.

Naqu6 commented 2 weeks ago

Nice to meet you @theskcd :)

> Regarding the failures: I cross-checked them against other implementations as well and noticed that those tests were broken (every implementation solving SWE-bench counts them as successes). Using the docker setup we can also reproduce whether a patch passes the test command or not, which is how I calculated the number and confirmed that those patches were really passing. I didn't want to just count them as successes for the metric! My plan is to upload everything (every eval file and generated patch) to the repo soon, along with doing another deep dive into the generated patches.

I'm aware of this too and am talking with John (a SWE-bench author) about it tomorrow. Happy to chat about what I've discovered; if you want to set up a meeting, my email is in the LinkedIn on my profile.

The following instances have flaky tests:

{'django__django-13315',
 'django__django-13447',
 'django__django-13590',
 'django__django-13710',
 'django__django-13757',
 'django__django-13933',
 'django__django-13964',
 'django__django-14017',
 'django__django-14238',
 'django__django-14382',
 'django__django-14608',
 'django__django-14672',
 'django__django-14752',
 'django__django-14997',
 'django__django-14999',
 'django__django-15320',
 'django__django-15738',
 'django__django-15814',
 'django__django-15819',
 'django__django-16229',
 'django__django-16379',
 'django__django-16400',
 'django__django-17051',
 'matplotlib__matplotlib-23987',
 'psf__requests-2317',
 'psf__requests-2674',
 'psf__requests-3362',
 'psf__requests-863',
 'sympy__sympy-13146',
 'sympy__sympy-13177'}

That is, the golden patches sometimes flake and fail the tests instead of passing.

sympy__sympy-16988 and django__django-15790 also flake sometimes, but in the other direction: incorrect patches are occasionally marked as correct.
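For anyone who wants to check this themselves, the simplest detector is to rerun each instance's gold patch several times and flag any instance whose outcome isn't stable — a minimal sketch, where run_test is whatever harness entry point you have (e.g. something like the patch_passes sketch above):

```python
from typing import Callable

def find_flaky(
    instances: dict[str, tuple[str, str]],
    run_test: Callable[[str, str, str], bool],
    runs: int = 5,
) -> set[str]:
    """instances maps instance_id -> (gold_patch_path, test_cmd);
    run_test(instance_id, patch_path, test_cmd) returns whether the tests passed."""
    flaky = set()
    for instance_id, (gold_patch, test_cmd) in instances.items():
        # collect the distinct outcomes across repeated runs of the gold patch
        outcomes = {run_test(instance_id, gold_patch, test_cmd) for _ in range(runs)}
        if len(outcomes) > 1:  # saw both pass and fail -> flaky
            flaky.add(instance_id)
    return flaky
```

Five runs is an arbitrary budget; the rarer the flake, the more reruns you need to catch it.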

If there's any chance you could upload your passes/fails for SWE-bench Lite outside of these 32 instances, that would be great! Also, if you've discovered anything else that flakes, please let me know :)
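For what it's worth, the adjusted number I'd compute looks like this — a minimal sketch, assuming a hypothetical resolved.json listing the instance IDs a run resolved and a flaky_instances.json holding the 32 IDs above:

```python
import json

# Hypothetical inputs: flaky_instances.json holds the 32 flaky IDs above,
# resolved.json holds the instance IDs this run resolved.
with open("flaky_instances.json") as f:
    flaky = set(json.load(f))
with open("resolved.json") as f:
    resolved = set(json.load(f))

TOTAL_LITE = 300  # SWE-bench Lite has 300 instances in total

# Drop the flaky instances from both numerator and denominator.
stable_resolved = resolved - flaky
denominator = TOTAL_LITE - len(flaky)
print(f"adjusted resolve rate: {len(stable_resolved)}/{denominator} "
      f"= {len(stable_resolved) / denominator:.1%}")
```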