Will evaluation order be randomized?

beyretb / AnimalAI-Olympics

Code repository for the Animal AI Olympics competition

Apache License 2.0

573 stars 84 forks source link

From what I can tell, it would currently be possible for an agent to know (based on the number of arena resets) exactly which category is being tested at the moment. This information can provide an unnatural advantage for evaluation. A simple strategy to (ab)use this is the following:

Train many different agent models and submit each one for (non-final) evaluation.
For each category, identify which model performs best.
Submit an agent that loads a different model every 30 (or 90†) arenas, thereby using different, cherry-picked "AIs" for each category.

^{†For the final evaluation, the number of tests per category will supposedly be higher than 30. But even if the agent for the final submissions has to also pass the 30-per-category somehow, inferring whether the 31st test is in category 1 or 2 could probably be accomplished relatively easily.}

Depending on the variations in agent performance in each category, this could yield a sizeable score boost and it would be sub-optimal not to do (or at least attempt) it. And I don't see how counting the number of resets would violate the competition rules either.

To avoid a Prisoner's Dilemma scenario of everyone having to invest effort into this strategy, the tests should probably not be carried out sequentially for each category as appears to be the case right now. To avoid unfairness in the evaluation procedure in the face of timeouts, interleaving the test categories in an arbitrary (but fixed!) order could be a good solution. (Attempting to extract information about this order through overly clever submissions - such as timing out on each test in sequence and noting which categories get score increments - would be quite clearly violating the competition rules.) This change might however slightly alter/worsen the scores of submissions that do not finish all tests in time (they might end up skipping some easier tests instead of only the later, harder ones).

Some level of hand-inference will of course always be possible (such as "we are not in the first category because the agent is standing on a solid colored object" or "we are in the generalization category because there are several teal pixels in view"), and even trained agents themselves may "overfit" on the test category differences (to the extent that hidden tests allow this).

But currently the category information is trivial to supply to the agent and there is an obvious incentive to do so. This should be alleviated in my opinion.

Hey,

Thanks for raising this. These are all very valid points and we've been discussing these issues for a long time.

We decided that we will run the final tests sequentially, but we are not releasing details about the number of additional tasks (overall time will be increased proportionally to the number we add) so a simple counting and switching strategy will fail. The reasons to keep sequential execution are:

There's a large grey area between a highly modular approach and one that runs a separate agent for each category. We don't want to penalise highly modular approaches (especially any that are brain-inspired given the WBA prize). We do want to avoid completely separate agents being used with minor optimisations, but I think this is already covered by not releasing the exact number of tasks in the final evaluation.
Some people have expressed that they are trying approaches with online learning between tasks and using the fact that the task categories have been ordered to slowly introduce more aspects of the environment. Obviously, it would be a huge engineering challenge to get this to work even slightly (especially given the execution time limit), and we don't want to discourage anyone that works in this area.
As you pointed out it does not stop someone switching to different agents based on certain properties of their inputs anyway. An agent should react differently based on its inputs so again its a grey area just how much doing so is exploiting external information about the competition and how much is within the spirit of the competition.
If someone spends their time working out ways to use category information to optimise their score using the current submission process then there is no guarantee that this will help them in the final submission so is probably not a good use of time. It might be good for getting a few percent on the current leaderboard and we chose a flat structure for the AWS credits prizes partially for this reason. At the moment we can see on the leaderboard that the top submissions are bunching up in the mid to high 30s. I don't think it will be able to meaningfully separate from these without creating an agent capable of solving a new class of problems (hopefully by understanding an aspect of the environment) which is exactly what we want to see!

beyretb / AnimalAI-Olympics

Will evaluation order be randomized? #62