All-Hands-AI / OpenHands


Run evaluation on full SWE-Bench #1693

Open rezzie-rich opened 4 months ago

rezzie-rich commented 4 months ago

Love the progress so far!

Will you guys test and publish results on the full SWE-bench and the 25% subset, besides just SWE-bench Lite?

On the auto-code-rover repo, it says 22% on SWE-bench Lite and 16% on the full SWE-bench. However, you guys have ACR at 16% on SWE-bench Lite. Is that a result you got yourselves, or a typo?

rbren commented 4 months ago

Thanks for pointing that out! Not sure if it was a typo, or if we were using an old result of theirs.

Let's remove the graph until we can generate a better one

neubig commented 4 months ago

This number is from the most recent version of the AutoCodeRover paper! I think we should clarify this in the graph.

frankxu2004 commented 4 months ago

From the ACR paper: [screenshot of the results table, showing the ACR-avg and ACR-all scores]

Note that ACR-avg is the comparable number here, as it's the average of 3 runs (i.e., the pass@1 rate). I see that the 22.33% number in their repo is ACR-all, which is the union of the 3 runs (i.e., the pass@3 rate). So it's still a valid comparison: for pass@1, ACR is at 16%.
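To make the distinction concrete, here is a minimal sketch with made-up per-run results (not ACR's actual numbers): pass@1 averages the resolve rate over runs, while the "all"/union number counts an instance as solved if any run solves it.

```python
# Minimal sketch of pass@1 (average of runs) vs. pass@3 (union of runs).
# The per-run results and benchmark size below are made up for illustration only.

runs = [
    {"task-1", "task-2"},   # instances resolved in run 1
    {"task-2", "task-3"},   # instances resolved in run 2
    {"task-2"},             # instances resolved in run 3
]
total_instances = 10        # hypothetical benchmark size

# pass@1: mean resolve rate over independent runs (what ACR-avg reports)
pass_at_1 = sum(len(r) / total_instances for r in runs) / len(runs)

# pass@3: an instance counts if at least one of the 3 runs solved it
# (what ACR-all reports)
union_resolved = set().union(*runs)
pass_at_3 = len(union_resolved) / total_instances

print(f"pass@1 = {pass_at_1:.1%}, pass@3 (union) = {pass_at_3:.1%}")
```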

rbren commented 4 months ago

Ahh thank you for the clarification! I will remove my PR

neubig commented 4 months ago

So I think the AutoCodeRover number is fine as-is, but I agree we should still run on all of SWE-bench. The main bottleneck is time and cost: it costs about $6,000 to run on all of SWE-bench with GPT-4.
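As a rough back-of-envelope (the instance count is the full test split; the per-instance token counts and prices are assumptions, not measured numbers), the cost scales linearly with the number of instances and the tokens each agent run consumes:

```python
# Back-of-envelope cost estimate for a full SWE-bench run.
# Per-instance token counts and prices are placeholder assumptions.

num_instances = 2294                     # size of the full SWE-bench test split
prompt_tokens_per_instance = 75_000      # assumed average agent context usage
completion_tokens_per_instance = 6_000   # assumed average generated tokens

# Assumed GPT-4-class pricing in USD per 1K tokens (check current rates).
price_prompt_per_1k = 0.03
price_completion_per_1k = 0.06

cost_per_instance = (
    prompt_tokens_per_instance / 1000 * price_prompt_per_1k
    + completion_tokens_per_instance / 1000 * price_completion_per_1k
)
total_cost = num_instances * cost_per_instance
print(f"~${cost_per_instance:.2f} per instance, ~${total_cost:,.0f} total")
```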

libowen2121 commented 4 months ago

@rezzie-rich Thank you for the question! As @frankxu2004 clarified, we only report pass@1 results in the graph. Our evaluation containerization only supports SWE-bench Lite for now, and we will extend it to support the full test set!

rezzie-rich commented 4 months ago

GPT-4 is expensive. I think it would be cool if you guys could run the full bench using Llama-3 70B and 8B, as it would give a unique and realistic expectation of running with an open LLM.

It's hard to compare SWE-bench with humans, but as a rule of thumb, an average jr. developer should be able to complete 10-25% while an average sr. developer can complete 20-40%.

If we can have OpenDevin complete 25%+ using an open LLM (preferably with fewer than 34B parameters), it's a game changer!
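For reference, pointing an agent at a locally served Llama-3 through an Ollama server should only take something like the sketch below (the model name, endpoint, and prompt are just illustrative; I believe the project uses litellm under the hood, so this is the general shape):

```python
# Sketch: calling a locally served Llama-3 through litellm + Ollama.
# Model name, endpoint, and prompt are illustrative; adjust to your local setup.
from litellm import completion

response = completion(
    model="ollama/llama3",                 # any model pulled into Ollama
    api_base="http://localhost:11434",     # default Ollama endpoint
    messages=[
        {"role": "user", "content": "Write a unit test for a binary search."}
    ],
)
print(response.choices[0].message.content)
```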

rezzie-rich commented 4 months ago

https://evalplus.github.io/leaderboard.html

The leaderboard scene for open code LLMs is kind of diluted. However, I found this up-to-date leaderboard that seems pretty legit.

It has CodeQwen-1.5-7B-Chat listed above Claude-3-Opus, right next to GPT-4. A small LLM like this should be able to run the bench faster and a lot cheaper than GPT-4.

If the leaderboard is accurate, that makes CodeQwen a valid replacement for GPT-4.

If OpenDevin can complete 20-25% of the full SWE-bench using a 7B model, that would prove the practicality and real-world use case of AI agents in software development.

My thoughts: testing the agents on smaller models will also be good for marketing and user satisfaction, as well as for improving the agents' quality. Most people will try OpenDevin after seeing the GPT-4 results but then use it with a local model for budget reasons, which creates an unsatisfying experience. Instead, they could see the local models' scores and replicate those results, which is more satisfying, and it leaves room for further performance gains once a closed LLM is used. It's better to under-promise and over-deliver.

rezzie-rich commented 3 months ago

https://chat.lmsys.org/?leaderboard

Llama-3-70B-Instruct is performing better than half of the GPT-4 versions. I think it would be great to have benchmarks done using Llama-3, in the spirit of the open-source community, while keeping the usage practical.

I know quantized models degrade in performance; however, Q8 models are almost indistinguishable from fp16. A modern high-performance CPU with 128 GB of RAM can easily handle it while keeping costs relatively low.
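For example, something like this with llama-cpp-python should be enough to serve a Q8 GGUF quant on CPU (the model path, context size, and thread count are placeholders):

```python
# Sketch: running a Q8-quantized GGUF model on CPU with llama-cpp-python.
# The model path, context size, and thread count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3-70b-instruct.Q8_0.gguf",  # hypothetical local file
    n_ctx=8192,        # context window to allocate
    n_threads=32,      # tune to your CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what SWE-bench measures."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```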

xingyaoww commented 3 months ago

@rezzie-rich Good point -- however, Llama-3 only has an 8k context window, which means it is hardly useful in our agent use cases. I just tested the recent DeepSeek-V2 MoE -- check the results here: https://huggingface.co/spaces/OpenDevin/evaluation

It got ~5% on SWE-bench Lite, and from what I can tell qualitatively, a lot of the error cases (~70%) are due to the limited context window (32k) of their API. I can only imagine this being way worse on Llama-3 due to its 8k window.
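To make the context pressure concrete, a quick sanity check on whether an agent's accumulated prompt fits a given window looks something like this (tiktoken's tokenizer is only an approximation for non-OpenAI models, and the message history here is a placeholder):

```python
# Sketch: checking whether an agent's accumulated context fits a model window.
# tiktoken's cl100k_base is only an approximation for non-OpenAI models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_window(messages: list[str], context_window: int, reserve: int = 1024) -> bool:
    """Return True if the concatenated messages leave `reserve` tokens for output."""
    used = sum(len(enc.encode(m)) for m in messages)
    return used + reserve <= context_window

history = ["<system prompt>", "<repo snippets>", "<test output>"]  # placeholder turns
print(fits_in_window(history, context_window=8_192))    # tight for Llama-3
print(fits_in_window(history, context_window=32_768))   # the DeepSeek API window
```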

rezzie-rich commented 3 months ago

https://huggingface.co/gradientai/Llama-3-70B-Instruct-Gradient-1048k

This version has a one-million-token context window.

Btw, LOVE the new huggingface space!

xingyaoww commented 3 months ago

@rezzie-rich Thanks a ton for sharing!!! Will try to get some GPU and test it right away!!!

BradKML commented 3 months ago

Seconding this, and not just switching models as @rezzie-rich suggests (great idea BTW if it can be included in Ollama or some other tool): are there also alternative benchmarks for seeing how well the agents can solve competitive coding problems (or data science problems), as a confirmation of quality beyond the bare LLM? Maybe mixing big and small LLMs (e.g. a Qwen + LLaMA combo) for added acceleration?
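One simple interpretation of the big-plus-small idea is a cascade that tries the small model first and escalates only when a cheap check fails; a rough sketch (model names, the litellm call, and the plausibility check are all placeholders, not anything the project actually implements):

```python
# Sketch: try a small local model first, escalate to a large model if a
# cheap check fails. Model names and the plausibility check are placeholders.
from litellm import completion

def looks_plausible(patch: str) -> bool:
    # Placeholder check; a real one might apply the patch and run the tests.
    return patch.strip().startswith("diff --git")

def solve(task_prompt: str) -> str:
    patch = ""
    for model in ("ollama/codeqwen", "gpt-4"):  # small first, big as fallback
        resp = completion(
            model=model,
            messages=[{"role": "user", "content": task_prompt}],
        )
        patch = resp.choices[0].message.content
        if looks_plausible(patch):
            return patch
    return patch  # fall back to the last attempt
```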

yuntongzhang commented 2 months ago

Hi, I'm late to the discussion, but I would like to give an update on the pass@1 score in the original AutoCodeRover paper.

It turns out that the SWE-bench evaluation environment used in our original experiments gives underestimated scores due to missing system-level dependencies. Some correct patches were deemed wrong after running the SWE-bench acceptance tests in that environment.

Thanks to the SWE-bench-docker project, our original patches were re-evaluated, and the actual pass@1 score is 19% instead of 16%. More details can be found here. The 19% pass@1 score is also reflected on the SWE-bench leaderboard.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

0xdevalias commented 1 month ago

Shouldn't be stale IMO

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

xingyaoww commented 2 weeks ago

Some updates: we are making some progress on the infrastructure side - hopefully, we can resolve this in ~2 weeks!

Running OpenHands with 2,000 Docker containers efficiently is not an easy task 😓
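For context, the general shape of the problem is bounding how many evaluation containers run at once; a toy sketch with docker-py and a thread pool (the image name, command, and concurrency limit are placeholders, not our actual eval infrastructure):

```python
# Toy sketch: run many evaluation containers with bounded concurrency.
# Image name, command, and concurrency limit are placeholders; this is
# not the actual OpenHands evaluation infrastructure.
from concurrent.futures import ThreadPoolExecutor

import docker

client = docker.from_env()

def run_one(instance_id: str) -> str:
    # Each instance gets its own container; remove=True cleans it up afterwards.
    logs = client.containers.run(
        image="swe-eval:latest",                     # hypothetical eval image
        command=["python", "eval.py", instance_id],  # hypothetical entrypoint
        remove=True,
    )
    return logs.decode()

instance_ids = [f"instance-{i}" for i in range(2000)]
with ThreadPoolExecutor(max_workers=32) as pool:     # cap concurrent containers
    results = list(pool.map(run_one, instance_ids))
```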