Aider-AI / aider

aider is AI pair programming in your terminal
https://aider.chat/
Apache License 2.0

Improving Aider benchmark #2411

Open RalphMao opened 3 days ago

RalphMao commented 3 days ago

Issue

Sharing some thoughts on how to make the Aider benchmark closer to real development and how to detect potential overfitting. During the trials, the aider agent should not have access to the unit tests; instead, it should be asked to write its own tests and iterate on its solution. The "oracle" unit tests are used only for final scoring.

  1. Closer to real development - for new features, there is no guarantee that "oracle" tests are available. Especially for complex problems, defining the test boundary is as challenging as implementing the features.
  2. Detect potential overfitting - we can ask an LLM judge, similar to MT-bench, to evaluate the similarity between the agent-written tests and the official unit tests. Higher similarity implies a higher possibility of overfitting.
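
To make the proposal concrete, here is a rough sketch of what one trial could look like. It is not Aider's benchmark code: `run_trial`, `TrialResult`, and `toy_agent` are hypothetical names, tests are modelled as plain callables, and `difflib.SequenceMatcher` is only a cheap textual stand-in for the LLM judge described in point 2.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Hypothetical modelling choice: a test is a callable that receives the
# candidate solution and returns True when the solution passes.
OracleTest = Callable[[Callable], bool]
# Hypothetical agent interface: task description in, (solution, test source) out.
Agent = Callable[[str], Tuple[Callable, str]]


@dataclass
class TrialResult:
    passed_oracle: int      # oracle tests the final solution passed
    total_oracle: int       # number of oracle tests
    test_similarity: float  # 0.0-1.0, agent-written tests vs. oracle tests


def run_trial(agent: Agent, task: str, oracle_tests: List[OracleTest],
              oracle_test_source: str) -> TrialResult:
    # The agent sees only the task description, never the oracle tests.
    solution, written_test_source = agent(task)

    # Oracle tests are used only here, for final scoring.
    passed = 0
    for test in oracle_tests:
        try:
            if test(solution):
                passed += 1
        except Exception:
            pass  # a crashing solution counts as a failed test

    # Stand-in for the LLM judge: plain textual similarity of the test sources.
    similarity = SequenceMatcher(
        None, written_test_source, oracle_test_source
    ).ratio()
    return TrialResult(passed, len(oracle_tests), similarity)


def toy_agent(task: str) -> Tuple[Callable, str]:
    # Placeholder for a real agent loop that writes tests and iterates.
    def add(a, b):
        return a + b
    return add, "assert add(1, 2) == 3\n"


oracle_source = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
oracle = [lambda f: f(1, 2) == 3, lambda f: f(-1, 1) == 0]
result = run_trial(toy_agent, "Implement add(a, b).", oracle, oracle_source)
print(result)
```

A real implementation would replace the similarity call with a judge prompt and would need a threshold, or a comparison across models, to decide when similarity is suspicious.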

What are your thoughts on this idea? I can go ahead and make PRs if other folks support it.

Version and model info

No response

Kreijstal commented 20 hours ago

I like this idea, but it would be a different benchmark, not the same one.