Closed: therealtimex closed this issue 3 months ago
Thanks for filing this issue. I saw their announcements and reviewed their SWE Bench submission. They didn't provide much detail beyond hand-waving descriptions, and they refused to show their trajectories to the SWE Bench team. So it's pretty hard to have confidence in their results or conclude much about their approach.
I also noticed that they didn't include Aider in the benchmark.
I'm going to close this issue for now, but feel free to add a comment here and I will re-open it, or you can file a new issue any time.
Issue
I came across https://cosine.sh/. Their approach to AI software engineering looks promising. I find it inspiring, and I think Aider could learn from it.
Here's a summary of Genie:
Genie is an AI software engineering model developed by Cosine, a human reasoning lab. Cosine reports strong performance on industry-standard benchmarks such as SWE Bench.
Unique Approach
Cosine trained Genie on data that codifies human reasoning, derived from real examples of software engineers at work.
Demonstration
In a demo, Genie solved a real GitHub issue in 84 seconds.
Genie represents a significant advancement in AI-driven software development, demonstrating human-like problem solving and efficiency on complex coding tasks. It could be an interesting comparison point or source of inspiration for the Aider project.
Version and model info
No response