Open ctrlnomad opened 4 years ago
Looking back on the idea of having many different tasks as settings that the environment can implement, let's denote a task `T` as being composed of `(classes, pct_examples, data_order)`. Some tasks would then be more difficult than others, e.g. distinguishing between 'small mammals' and 'aquatic mammals' is harder than between `train` and `plane`. The trouble is that determining a difficulty metric for each of these tasks is not trivial; some attempts have been made using the SVM margin as a metric. We also know from the same paper that training on easy examples first and then increasing the difficulty as training progresses improves the convergence rate.
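To make the task definition concrete, here is a minimal sketch of `T` as a record type. The `Task` dataclass, its field types, and the two example instances are my assumptions, not anything fixed in the discussion above:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Task:
    """One task T = (classes, pct_examples, data_order) -- hypothetical layout."""
    classes: Tuple[str, ...]      # labels the task discriminates between
    pct_examples: float           # fraction of the available examples to use
    data_order: Tuple[int, ...]   # presentation order of example indices

# an "easy" task vs. a "hard" fine-grained one, per the examples above
easy = Task(classes=("train", "plane"), pct_examples=0.5, data_order=(0, 1, 2))
hard = Task(classes=("small mammals", "aquatic mammals"),
            pct_examples=0.5, data_order=(0, 1, 2))
```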
We can then say that a curriculum `C` is composed of `[T_0, ..., T_k]` tasks. We can define a task distribution `p(T)` (or let it be random) from which we sample `(classes, pct_examples, data_order) ~ p(T)` `k` times to get a curriculum on which we can train a neural network. However, a sensible curriculum `C` also needs the ability to schedule tasks at different time steps `t`. At every timestep we will have `K` gradient updates, at which point the AutoTrain agent will have to either select one gradient update (discrete) or a linear combination of all `K` updates (continuous; not sure how much sense this makes). At the end we will have a curriculum `C` whose schedule is determined by the AutoTrain agent.
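The sampling and update-combination steps above can be sketched as follows. Everything here is an assumption for illustration: `p(T)` is just a uniform/random draw, `Task` is a hypothetical record, and gradients are plain lists of floats rather than real parameter tensors:

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

@dataclass(frozen=True)
class Task:
    classes: Tuple[str, ...]
    pct_examples: float
    data_order: Tuple[int, ...]

def sample_task(all_classes: Sequence[str], n_examples: int,
                rng: random.Random) -> Task:
    # (classes, pct_examples, data_order) ~ p(T); here p(T) is purely random
    classes = tuple(rng.sample(list(all_classes), 2))
    pct = rng.uniform(0.1, 1.0)
    order = tuple(rng.sample(range(n_examples), n_examples))
    return Task(classes, pct, order)

def sample_curriculum(all_classes: Sequence[str], n_examples: int,
                      k: int, seed: int = 0) -> List[Task]:
    # C = [T_0, ..., T_k]
    rng = random.Random(seed)
    return [sample_task(all_classes, n_examples, rng) for _ in range(k + 1)]

def combine_updates(updates: List[List[float]],
                    weights: List[float]) -> List[float]:
    # continuous variant: the agent emits weights over the K candidate updates;
    # the discrete variant is the special case of a one-hot weight vector
    return [sum(w * u[i] for w, u in zip(weights, updates))
            for i in range(len(updates[0]))]

C = sample_curriculum(["train", "plane", "cat", "dog"], n_examples=8, k=3)
step = combine_updates([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])  # [2.0, 3.0]
```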
In classification problems, we have some reasonable heuristics to follow when thinking about how hard a task is. For example, CIFAR-100 is a hierarchical dataset with 100 classes and 20 superclasses, each containing 5 member classes. Distinguishing between superclasses is thus easier than distinguishing between member classes (of the same superclass). So the tasks could progress from distinguishing between 2 superclasses all the way to classifying the entire dataset.
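That coarse-to-fine progression can be sketched directly from the class hierarchy. The two superclasses below are real CIFAR-100 groupings, but the helper names and the flat dict encoding are my own choices for illustration:

```python
from typing import Dict, List, Tuple

# a slice of the CIFAR-100 hierarchy: superclass -> member classes
HIERARCHY: Dict[str, List[str]] = {
    "aquatic mammals": ["beaver", "dolphin", "otter", "seal", "whale"],
    "small mammals": ["hamster", "mouse", "rabbit", "shrew", "squirrel"],
}

def coarse_task() -> Tuple[str, ...]:
    # easy: distinguish between whole superclasses
    return tuple(HIERARCHY.keys())

def fine_task(superclass: str) -> Tuple[str, ...]:
    # hard: distinguish member classes within a single superclass
    return tuple(HIERARCHY[superclass])

# progress from the coarse task to the fine-grained ones
progression = [coarse_task()] + [fine_task(s) for s in HIERARCHY]
```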
One point to mention here is that, with this definition, the dynamics of producing a learning-rate schedule (and other hyperparameter schedules) is not so clear; perhaps we can utilize a second model or a different neural network for that. Once we figure out how to include hyperparameter scheduling, we can fold that into the curriculum as well.
An interesting motivation behind the final reward would be reproducibility --> train different model architectures; the tighter the distribution of final scores `phi`, the higher the reward. Alternatively, just use `max(scores)`. The reward could also include the number of timesteps taken to produce `C`, the `phi` value itself, and a comparison against a 'vanilla' baseline training run.
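One hypothetical way to combine those terms: reward the mean of the `phi` scores relative to a vanilla baseline, and penalize their spread across architectures. The functional form, the `lam` penalty weight, and the baseline handling are all assumptions, not anything decided above:

```python
from statistics import mean, pstdev
from typing import Sequence

def reward(phi_scores: Sequence[float],
           baseline: float = 0.0, lam: float = 1.0) -> float:
    """Hypothetical reward: high mean phi across architectures, minus the
    vanilla-baseline score, minus a penalty for a loose phi distribution."""
    return mean(phi_scores) - baseline - lam * pstdev(phi_scores)

# identical scores across architectures -> no spread penalty
r_tight = reward([0.8, 0.8, 0.8], baseline=0.7)
# same mean but spread out -> lower reward
r_loose = reward([0.6, 0.8, 1.0], baseline=0.7)
```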
- reward
- better competition environment
Interesting Resources:
Things to think about: