Open RalphMao opened 3 days ago
Issue
Sharing some thoughts on how to make the Aider benchmark closer to real development and how to detect potential overfitting. During trials, the aider agent should not have access to the unit tests; instead, it should be asked to write its own tests and iterate on its solution. The "oracle" unit tests would be used only for final scoring.
1. Closer to real development - for new features, there is no guarantee that "oracle" tests are available. Especially for complex problems, setting the test boundary is as challenging as implementing the feature.
2. Detect potential overfitting - we can ask an LLM judge, similar to MT-bench, to evaluate the similarity between the agent-written tests and the official unit tests. Higher similarity implies a higher possibility of overfitting.
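To make the proposal concrete, here is a rough sketch of what one trial could look like: the agent only ever sees the results of its own tests, the oracle tests are run once at the end, and a judge scores how similar the two test sets are. Every name below (`run_trial`, `run_tests`, `lexical_similarity`, `TrialResult`, the `agent` callable) is hypothetical and not part of Aider's benchmark code. The `difflib` ratio is only a cheap stand-in for the LLM judge, and a real harness would run the tests in a sandbox rather than with `exec`.

```python
import difflib
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical agent interface: (task, feedback) -> (solution source, test source).
Agent = Callable[[str, str], Tuple[str, str]]


@dataclass
class TrialResult:
    passed_oracle: bool      # did the final solution pass the hidden oracle tests?
    test_similarity: float   # 0.0-1.0, agent-written tests vs. oracle tests
    iterations: int          # how many attempts the agent used


def run_tests(solution_src: str, tests_src: str) -> bool:
    """Run assert-style tests against a solution; True if nothing raises.

    Assumption: tests are plain `assert` statements. Not sandboxed.
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)
        exec(tests_src, namespace)
    except Exception:
        return False
    return True


def lexical_similarity(agent_tests: str, oracle_tests: str) -> float:
    """Placeholder for the LLM judge: plain text similarity via difflib."""
    return difflib.SequenceMatcher(None, agent_tests, oracle_tests).ratio()


def run_trial(
    task: str,
    agent: Agent,
    oracle_tests_src: str,
    judge: Callable[[str, str], float] = lexical_similarity,
    max_iters: int = 3,
) -> TrialResult:
    feedback = ""
    solution_src, agent_tests_src = "", ""
    iterations = 0
    for _ in range(max_iters):
        iterations += 1
        # The agent never receives oracle_tests_src, only feedback on its own tests.
        solution_src, agent_tests_src = agent(task, feedback)
        if run_tests(solution_src, agent_tests_src):
            break
        feedback = "self-written tests failed"
    # Oracle tests are used exactly once, for final scoring.
    return TrialResult(
        passed_oracle=run_tests(solution_src, oracle_tests_src),
        test_similarity=judge(agent_tests_src, oracle_tests_src),
        iterations=iterations,
    )
```

A high `test_similarity` combined with a pass would be the signal worth inspecting for overfitting; the threshold for "high" would need to be calibrated on real runs.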
What are your thoughts on this idea? I can go ahead and open PRs if other folks are on board.
Version and model info
No response
I like this idea, but it would indeed be a different benchmark, not the same one.