beyretb / AnimalAI-Olympics

Code repository for the Animal AI Olympics competition
Apache License 2.0

overtrained? #80

Closed dan9thsense closed 4 years ago

dan9thsense commented 4 years ago

I find that small changes in code produce big drops in the evaluation score, even when they produce only small changes in scores with local testing. This, plus noting that the leaderboard has stalled out with scores around 40, makes me think we are overtrained on the trials used to generate the evaluation score.

Is it possible to provide some other configurations for evaluation? Maybe something as simple as changing the seed value. Otherwise, when the actual evaluation happens, we may find that the objective of creating robust AI was defeated due to inadvertently overtraining on the same dataset for several months.
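
For reference, here is roughly how I judge how much of a leaderboard swing is just noise: run the same agent over a fixed set of local arena configs several times and look at the spread of the totals. This is only a minimal sketch; `run_episode` is a stand-in for however you roll out your agent on one config file and return its reward, not part of the animalai API:

```python
import statistics
from typing import Callable, List, Sequence


def estimate_score_spread(
    run_episode: Callable[[str, int], float],  # hypothetical: (config_path, seed) -> episode reward
    config_paths: Sequence[str],
    n_runs: int = 10,
) -> dict:
    """Repeatedly evaluate one agent on a fixed set of local arena configs
    and report the mean total score and its spread."""
    totals: List[float] = []
    for seed in range(n_runs):
        # Sum per-config scores for one full pass, mimicking how an
        # evaluation aggregates many test arenas into a single number.
        totals.append(sum(run_episode(path, seed) for path in config_paths))
    return {
        "mean": statistics.mean(totals),
        "stdev": statistics.stdev(totals) if len(totals) > 1 else 0.0,
        "min": min(totals),
        "max": max(totals),
    }
```

When the leaderboard change from a small code tweak falls inside that local spread, I can't tell whether it is a real improvement or noise.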

ironbar commented 4 years ago

I have found exactly the same behaviour: the LB score is very volatile. I have been trying for more than a month to improve my score without success, so I assume my current LB score was just luck.

However, it has been said that the final evaluation will use a larger number of tests, so the volatility should be lower.

mdcrosby commented 4 years ago

Hi,

> This, plus noting that the leaderboard has stalled out with scores around 40, makes me think we are overtrained on the trials used to generate the evaluation score.

This is to be expected (and an unfortunate issue with any test-set setup, even when details are hidden as much as possible). As ironbar noted, the final test set is larger. This is an attempt to reward the agent that can robustly solve similar tasks rather than the one most overfitted to the test set you already have access to.

> we may find that the objective of creating robust AI was defeated due to inadvertently overtraining on the same dataset for several months.

We will be releasing details on how to submit the final agent very soon. You can submit a different agent to be used in the final evaluation (one your personal testing has shown to be more capable but that perhaps doesn't overfit as much to the currently available tests). If you don't, we will automatically take your best agent on the current leaderboard (which could potentially be overfitted). We strongly recommend the former option if your offline testing suggests that agent is more robust.

dan9thsense commented 4 years ago

Without additional tests to try, there is really no reliable way to know whether to submit the agent that scored best on the leaderboard or the one that seems better locally. By offering a second set with simply a different seed, you would let us settle that question without revealing much about the final evaluation. It would be disappointing to have several months of work come down to a guessing game.
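
In the meantime, the best I can do locally is run the candidates head-to-head on my own held-out configs. Another rough sketch, with `run_episode_a` and `run_episode_b` again standing in for my own rollout code rather than anything in the animalai package:

```python
from typing import Callable, Sequence


def head_to_head(
    run_episode_a: Callable[[str, int], float],  # hypothetical: (config_path, seed) -> episode reward
    run_episode_b: Callable[[str, int], float],
    config_paths: Sequence[str],
    n_runs: int = 10,
) -> float:
    """Return the fraction of (config, seed) pairs on which agent A
    outscores agent B, as a rough signal of which one to submit."""
    wins, trials = 0, 0
    for seed in range(n_runs):
        for path in config_paths:
            wins += run_episode_a(path, seed) > run_episode_b(path, seed)
            trials += 1
    return wins / trials if trials else 0.0
```

But that only tells me which agent is better on the tests I can write myself, not which one the hidden evaluation will favour, which is exactly the problem.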

mdcrosby commented 4 years ago

Unfortunately, the way the tests are designed, it's not possible to just change the random seed (very few configurations include randomisation). Even if it were possible, a second test set could itself be unrepresentative and lead to the 'wrong' agent being submitted. Given that this competition revolves around hidden tests, there's always going to be some unknown in the final entry and results.

dan9thsense commented 4 years ago

Ah, got it. In that case, would you consider allowing teams to submit two entries? One would be their best leaderboard submission and the other would be whatever agent they think is actually best. Their score would then be the better of the two. If I only get a single entry, it will be hard to go with anything other than the one with the best leaderboard score, and if most people make the same choice, you will likely not get the best entries.

ironbar commented 4 years ago

I agree, that would be perfect. Allowing two final submissions is standard practice on Kaggle.

beyretb commented 4 years ago

We can definitely do that, yes. We'll soon send out an email detailing guidelines for the final submission.

dan9thsense commented 4 years ago

ok