ctrlnomad opened this issue 4 years ago
go_back allows us to use the weights from the previous timestep, and roughly works like this:
```python
if go_back:
    # re-optimise starting from the previous timestep's weights
    W[t + 1] = optim(W[t - 1], data)
else:
    W[t + 1] = optim(W[t], data)
```
Because the transition from t to t+1 is defined by performing an optimisation step on the weights even when the agent selects a high probability for go_back, the environment does not allow true rewinding: even if the agent decides to go back one step, the environment still performs one optimisation step when it goes t -> t+1, so the agent can never go back, say, 3 steps.
A better approach would be to introduce a new variable H and give the agent the choice to go up to H-1 steps back. We will ask the agent to produce an H-sized vector that is a proper probability distribution. If the agent puts a high probability at index 2, say, the environment rewinds the weights two steps back.
It follows that if we sample index 0 we keep the weights.
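A minimal sketch of how the environment could implement this H-step rewind, assuming a checkpoint buffer of the last H weight snapshots and an agent-produced H-dimensional probability vector (the names `checkpoints`, `record`, `rewind`, and `rewind_probs` are illustrative, not from the codebase):

```python
import copy
from collections import deque

import numpy as np

H = 4  # illustrative history length

# Buffer of the last H weight snapshots; checkpoints[0] holds the current W_t.
checkpoints = deque(maxlen=H)

def record(model):
    """Push a copy of the current weights to the front of the buffer."""
    checkpoints.appendleft(copy.deepcopy(model.state_dict()))

def rewind(model, rewind_probs):
    """Sample an index from the agent's distribution and restore those weights.

    Index 0 keeps the current weights; index i restores the weights from i steps back.
    """
    p = np.asarray(rewind_probs[: len(checkpoints)], dtype=float)
    p = p / p.sum()  # renormalise in case fewer than H checkpoints exist yet
    idx = np.random.choice(len(checkpoints), p=p)
    model.load_state_dict(checkpoints[idx])
    return idx
```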
There is one problem with this formulation: we are counting on the agent to remember what the state of the environment was like 2 time-steps ago. Thus we need to change the observation from the env. at each time-step to also include information from at least (H-1) steps back. If X_t is a matrix that consists of the current learning rate, loss vector and generalization scores of the model with weights=W_t, then O_t = [X_t, X_t-1, ..., X_t-(H-1)].
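One way the environment could assemble such a stacked observation, assuming each X_t is flattened into a fixed-size feature vector (H, the feature layout, and the `observe` helper are illustrative):

```python
from collections import deque

import numpy as np

H = 4
NUM_FEATURES = 3  # e.g. learning rate, loss, generalization score

# Keep the H most recent feature vectors; history[0] is X_t.
history = deque(maxlen=H)

def observe(lr, loss, gen_score):
    """Append the newest X_t and return O_t as an (H, NUM_FEATURES) matrix, zero-padded."""
    history.appendleft(np.array([lr, loss, gen_score], dtype=np.float32))
    obs = np.zeros((H, NUM_FEATURES), dtype=np.float32)
    for i, x in enumerate(history):
        obs[i] = x
    return obs
```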
Let's consider the observations from the environment at t_0. At first it is tempting to initialise the weights randomly, then perform one optimisation step to get X_t and fill the rest of the O_t matrix with zeros. It would be more interesting to initialise the weights H-1 times, perhaps with different init methods like Kaiming, Xavier, etc., and include those as observations too.
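For reference, producing several differently initialised weight sets in PyTorch could look roughly like this (a sketch with an illustrative model, not the project's actual init code):

```python
import copy

import torch.nn as nn

def init_weights(model, scheme):
    """Re-initialise all Linear/Conv2d layers of `model` in place with the given scheme."""
    def _apply(m):
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            if scheme == "kaiming":
                nn.init.kaiming_normal_(m.weight)
            elif scheme == "xavier":
                nn.init.xavier_uniform_(m.weight)
            else:  # plain normal init as a fallback
                nn.init.normal_(m.weight, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    model.apply(_apply)

# H - 1 candidate initialisations, each of which could be evaluated once
# to fill the initial rows of O_t.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
candidates = []
for scheme in ["kaiming", "xavier", "normal"]:
    init_weights(model, scheme)
    candidates.append(copy.deepcopy(model.state_dict()))
```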
TODO: add re-init button
Also note that H != K (the observation sequence length used below).
More thoughts:
The only things that change when we go back are the SGD noise and the LR; would that be enough? What options can we explore?
Some thoughts and diagrams regarding this are captured here.
Excellent description of the problem and the minimum working prototype!
I am working on the agent implementation in this branch: agent_proto.
Thanks for merging the agent code in bae38a80b2e3fa5535facf24184193bf93058b97 !
Seems like most of the design discussed here is implemented, which is great. Is there anything that we are missing or something that needs to be updated?
Full working prototype here: 7a72f0e5ea5f850e084c94037f90db1d3c9046d6 by @Sultan-IH
The first milestone we reached was 100 episodes of agent-environment interaction without run-time exceptions! 🥳
Below you can see the MDP logs from the first experiment run with a classifier trained on 20% of the MNIST dataset.
From this MDP log we deduced that the agent had learned to exploit a bug in the environment that incorrectly calculated the step reward; this was fixed in [min_work_proto 3de982d]. I've also added an 'lr' parameter to the MDP log so that it is easier to plot, and added a final_reward_scale parameter to the environment as suggested by @jccaicedo. Since then I've set two experiments running on 20% of the MNIST and CIFAR10 training data with identical architectures.
Work done is on branch [risa].
Reduce the actions available to the agent to 're-init', 'increase lr by 80%', 'increase lr by 20%', 'decrease lr by 20%', 'decrease lr by 80%' (see the sketch after the reward formula below).
We want to incentivise short training times and high accuracy:
```
step_reward = sign(delta) * 1 - len(clf.optim.steps)
```
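A rough sketch of how this discrete action set and step reward could fit together; `delta` is assumed to be the change in the tracked metric and `num_optim_steps` the number of optimiser steps taken so far (both illustrative names):

```python
import numpy as np

# Discrete action set from the list above: a learning-rate multiplier per action,
# with re-init handled separately by the environment.
ACTIONS = {
    0: ("re-init", None),
    1: ("increase lr by 80%", 1.8),
    2: ("increase lr by 20%", 1.2),
    3: ("decrease lr by 20%", 0.8),
    4: ("decrease lr by 80%", 0.2),
}

def apply_action(action_id, lr):
    """Return the new learning rate; re-init leaves the learning rate unchanged."""
    _, scale = ACTIONS[action_id]
    return lr if scale is None else lr * scale

def step_reward(delta, num_optim_steps):
    """Reward improvement in the tracked metric and penalise long training runs."""
    return float(np.sign(delta)) - num_optim_steps
```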
Brief
TODO: mention auto experiment setting
With AutoTrain, we hope to solve the problem of autonomous neural network optimization by teaching an agent to recognise the relationship between hyperparameters (e.g. learning rate, training time) and performance metrics in a reinforcement learning problem setting. We hope that the agent learns to navigate the loss landscape and find minima that generalize well. We formulate an MDP and implement it with OpenAI's Gym and PyTorch frameworks. More specifically, we see this project being used for image classification problems. This issue is here to discuss all the implementation details required to produce a working prototype/proof of concept.
Details & Problem Statement
The work done for this issue is going to be deposited into the min_work_proto branch. The following is a more detailed formulation of the problem setting/MDP:
State Space S:
The state space for the environment has the following components:
- model: what functions to perform on the weights W
- opt: includes the learning rate and the weights W
- dataset: (x, y) pairs; an image classification dataset with C categories
- phi: the metric we are trying to optimise for; can be accuracy or f1_score, computed with the use of Thresholdout (see here)
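As a sketch, these components could be grouped roughly like this (an illustrative dataclass, not the repo's actual structure):

```python
from dataclasses import dataclass
from typing import Callable

import torch
import torch.nn as nn
from torch.utils.data import Dataset

@dataclass
class EnvState:
    model: nn.Module                   # holds the weights W
    opt: torch.optim.Optimizer         # learning rate and per-weight optimiser state
    dataset: Dataset                   # (x, y) image classification pairs with C categories
    phi: Callable[[nn.Module], float]  # metric to optimise (accuracy / f1), e.g. via Thresholdout
```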
Action Space A:
The agent is going to have access to two actions: scale the learning rate and rewind the weights. Thus we introduce a control vector for the agent to produce:
[learning_rate_scale, go_back, stop]
Inside the environment the state changes for the next time step like so:
lr_t+1 = lr_t * learning_rate_scale, with lr_0 = lr_init # provided at the env initialisation
- go_back is a scalar in [0, 1]; if go_back > thresh then we reset the weights to W_t = W_t-1
- stop is a scalar in [0, 1]; if stop > thresh then the episode is finished and the environment resets.
The two modifications happen simultaneously.
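A minimal sketch of how a Gym-style step could apply this control vector, under the assumptions above (the threshold value, SGD optimiser, and class name are illustrative):

```python
import copy

import torch

THRESH = 0.5  # illustrative threshold for go_back / stop

class AutoTrainEnvSketch:
    def __init__(self, model, lr_init):
        self.model = model
        self.lr = lr_init
        self.optimizer = torch.optim.SGD(model.parameters(), lr=lr_init)
        self.prev_weights = copy.deepcopy(model.state_dict())

    def step(self, action):
        learning_rate_scale, go_back, stop = action

        # Both modifications are applied before the next optimisation step.
        self.lr = self.lr * learning_rate_scale
        for group in self.optimizer.param_groups:
            group["lr"] = self.lr

        if go_back > THRESH:
            self.model.load_state_dict(self.prev_weights)  # W_t = W_t-1
        self.prev_weights = copy.deepcopy(self.model.state_dict())

        # ... perform one optimisation step and compute the observation and reward here ...

        done = stop > THRESH
        return done
```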
Observations O:
Each of the observations mentioned below is presented to the agent as a sequence of length K, so the observation matrix is K x num_components. Components:
lr_t
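For concreteness, the observation space for such a K x num_components matrix could be declared in Gym roughly like this (the values of K and num_components are illustrative):

```python
import numpy as np
from gym import spaces

K = 8               # length of the observation sequence (illustrative)
NUM_COMPONENTS = 3  # e.g. lr_t plus the other per-step components (illustrative)

# Each observation is a K x num_components matrix of real-valued components.
observation_space = spaces.Box(
    low=-np.inf,
    high=np.inf,
    shape=(K, NUM_COMPONENTS),
    dtype=np.float32,
)
```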
Reward R:
If phi improves, then issue a +0.05 reward; if not, -0.05.
More notes:
Flow Chart
Check out this chart for a diagrammatical depiction of the MDP dynamics.