AI-ON / Multitask-and-Transfer-Learning

Benchmark and build RL architectures that can do multitask and transfer learning.

Multitask and Transfer Learning

  1. Create a benchmark for transfer learning and multitask learning.
    • Should measure improvement in learning that is directly attributable to knowledge transfer between games.
    • Should also be able to measure performance by a single agent on multiple games.
    • Should use cross-validation to mitigate the effect of having only a small number of games to test on (a leave-one-game-out sketch follows this list).
  2. Design and implement deep reinforcement learning architectures that do well on the benchmark.
    • For methodological reasons, we think it's important to design the ideal benchmark before getting too attached to a particular architecture.
    • It's important that we're sure the benchmark is measuring the crux of the transfer and multi-task problem rather than measuring something our architecture is good at.
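
To make the cross-validation idea in point 1 concrete, here is a minimal Python sketch of a leave-one-game-out split. The game list and the helper function are hypothetical placeholders for illustration, not part of any existing benchmark code.

```python
# Hypothetical leave-one-game-out splits for the cross-validation step.
# The game list is only illustrative; the real benchmark would fix its own set.
GAMES = ["Pong", "Breakout", "SpaceInvaders", "Seaquest", "Qbert"]

def leave_one_game_out(games):
    """Yield (training games, held-out game) pairs, one per game."""
    for held_out in games:
        yield [g for g in games if g != held_out], held_out

for train_games, test_game in leave_one_game_out(GAMES):
    print(f"pre-train on {train_games}, then measure transfer on {test_game}")
```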

Contributing

We have a few different "threads" going on right now, so there are several different ways you can get involved if you're interested:

A few notes on contributing

Project Status:

See detailed status on the project tracker

Why this problem matters:

Generalizing across tasks is a crucial component of human intelligence. Current deep RL architectures become less effective as the number of tasks they are trained on grows, whereas for humans, diversity of experience is a strength that improves performance on new tasks. Overcoming catastrophic forgetting and achieving one-shot learning are abilities that should fall out naturally if this task is solved convincingly.

At a more meta level, this problem is out of reach of current reinforcement learning architectures, yet it seems reasonably within reach within a year or two. Much like ImageNet spurred innovation by creating a common target for researchers to aim for, this project could provide a common idea of success for multitask and transfer learning. Many papers on multi-task and transfer learning in Atari take ad-hoc approaches, cherry-picking the games on which their methods do well.

How to measure success:

Success is in degrees, since an architecture could in principle surpass human ability in multi-task Atari, both achieving higher scores on all games and picking up new games faster than a human does. Ideally, a good waterline would be human-level performance on the benchmark, but creating a robust dataset on human performance is beyond the scope of this project.

The fundamental benchmark will therefore consist of two measures:

  1. Transfer Learning: How much a given architecture improves on an unseen game when trained from scratch versus when it has first been trained on other games. This is measured as the ratio of total score with pre-training to total score without it, averaged across cross-validation folds, since only a small number of games is available and raw scores are not comparable across games (see the sketch after this list).
  2. Multitask Learning: How well a single architecture, with a single set of weights, does across all games. Rather than an aggregate, this result will be a vector of the top scores achieved on each game.
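
As a minimal sketch of how these two measures could be computed: per-game scores are assumed to be available already, and the function names, dictionaries, and numbers below are illustrative placeholders, not part of the benchmark.

```python
from statistics import mean

def transfer_ratio(pretrained_scores, untrained_scores):
    """Average ratio of total score with pre-training vs. from scratch.

    Both arguments map held-out game name -> total score on that game.
    Ratios (not raw scores) are averaged across cross-validation folds
    because scores are not comparable between games.
    """
    return mean(pretrained_scores[g] / untrained_scores[g] for g in untrained_scores)

def multitask_result(per_game_scores):
    """Report the multitask measure as a vector of top scores, one per game."""
    return {game: max(scores) for game, scores in per_game_scores.items()}

# Illustrative usage with made-up numbers:
print(transfer_ratio({"Breakout": 120.0, "Seaquest": 900.0},
                     {"Breakout": 40.0, "Seaquest": 300.0}))    # -> 3.0
print(multitask_result({"Breakout": [90.0, 120.0], "Seaquest": [600.0, 900.0]}))
```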

In addition to the scores, the benchmark will also make some strict demands on the architecture itself due to the testing/training regime:

Datasets:

There are currently no datasets, but the dataset being created at atarigrandchallenge.com may be a useful comparison once it is available. Measuring human performance needs to be done with a large sample size, both to control for pre-training (some people have played Atari games, or other video games, before) and to control for individual skill levels (which could be seen as pre-training on non-Atari games, generalization from real life, natural ability, etc.).

The closest analogue to a dataset will be the benchmark framework itself: since this is a reinforcement learning problem, the testing environment provides the data rather than a static dataset.
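
As an illustration of that point, interacting with an Atari game through OpenAI Gym looks roughly like the loop below. This assumes the classic gym step API (pre-0.26) and an example environment id; the project has not committed to any particular framework.

```python
import gym  # assumes classic gym with the Atari extras installed

# The environment, not a static dataset, supplies observations and rewards.
env = gym.make("PongNoFrameskip-v4")   # environment id is only an example
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()          # placeholder for a learned policy
    obs, reward, done, info = env.step(action)  # classic 4-tuple step API
    total_reward += reward
env.close()
print("episode score:", total_reward)
```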

Relevant/Related Work

Since the original Mnih paper, the Atari 2600 environment has been a popular target for testing RL architectures.

Note: more related work will be added; for now, check the chat for the latest references.