WecoAI / aideml

AIDE: the state-of-the-art machine learning engineer agent, generating machine learning solution code from natural language descriptions.
https://www.weco.ai
MIT License
588 stars, 65 forks

Private leaderboard percentiles for each individual competition #19

Open boranhan opened 2 weeks ago

boranhan commented 2 weeks ago

Hello, thank you for your team's great work.

I'm wondering if you can provide the private leaderboard percentiles for each individual competition?

Thanks in advance!

Boran

AnirudhDagar commented 2 weeks ago

Yes, could you please share the scores or the private leaderboard percentile ranks that were used for the claim of outperforming 50% of humans? Are the Python files in https://github.com/WecoAI/aideml/tree/main/sample_results enough to reproduce those numbers? I understand that these numbers are an average over 12 submissions.

@ZhengyaoJiang this was also requested earlier in https://github.com/WecoAI/aideml/issues/4#issuecomment-2138567087

Even sharing some raw results would help in understanding the performance of WecoAI. I checked OpenAI's MLE-bench, but that doesn't seem to report any numbers either.

dexhunter commented 2 weeks ago

I think you can run the aide application and then submit the best solution to Kaggle to confirm the performance/leaderboard percentile.

AnirudhDagar commented 2 weeks ago

I tried submitting based on the provided code, and it gets me extremely bad results. Running the aide application for all the competitions would also require compute resources, so rather than having us reproduce the results, it would be easier and best if you could share either the code or the scores/percentiles from your experiments.

For example, for the competition https://www.kaggle.com/competitions/tabular-playground-series-apr-2021/ I used the code to make this submission (https://www.kaggle.com/code/anirudhdagar/aide-solution-tabular-playground-series-apr-2021) and got a private LB score of 0.70524, which places me 1192/1250 on the leaderboard, i.e. ahead of only 4.64% of entries.
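
For reference, the 4.64% figure follows directly from the rank and the leaderboard size. Below is a minimal sketch of that arithmetic; the function name `percentile_beaten` is hypothetical, and it assumes rank 1 is the best entry and that ties are ignored. It is not necessarily how the aide authors computed their reported percentiles.

```python
def percentile_beaten(rank: int, total: int) -> float:
    """Percentage of leaderboard entries ranked strictly below `rank`.

    Assumes rank 1 is the best entry and ignores ties (illustrative helper,
    not part of aide or the Kaggle API).
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * (total - rank) / total


# Rank 1192 out of 1250, as in the submission above.
print(f"{percentile_beaten(1192, 1250):.2f}%")
```

With rank 1192 of 1250, 58 entries sit below the submission, and 58/1250 is 4.64%.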

dexhunter commented 2 weeks ago

> I tried submitting based on the provided code

This is an example output, so it might differ from the final submission to Kaggle. You can also try different large language models and other metrics in the config.