ctrlnomad opened this issue 4 years ago
go_back allows us to use the weights from the previous timestep, and roughly works like this:
```python
if go_back:
    # re-optimise starting from the previous timestep's weights
    W[t + 1] = optim(W[t - 1], data)
else:
    W[t + 1] = optim(W[t], data)
```
Because the transition from t to t+1 is defined by performing an optimisation step on the weights even when the agent selects a high probability for go_back, the environment does not allow true rewinding: even if the agent decides to go back one step, the environment still performs one optimisation step when it goes t -> t+1, so the agent can never go back, say, 3 steps.
A better approach would be to introduce a new variable H and give the agent the choice to go up to H-1 steps back. We will ask the agent to produce an H-sized vector that is a proper probability distribution. If the agent puts a high probability at index 2, say, the environment rewinds the weights two steps back.
It follows that if we sample index 0 we keep the weights.
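A minimal sketch of how the environment could implement this H-step rewind, assuming a checkpoint buffer of the last H weight snapshots and an agent-produced H-dimensional probability vector (the names `checkpoints`, `record`, `rewind`, and `rewind_probs` are illustrative, not from the codebase):

```python
import copy
from collections import deque

import numpy as np

H = 4  # illustrative history length

# Buffer of the last H weight snapshots; checkpoints[0] holds the current W_t.
checkpoints = deque(maxlen=H)

def record(model):
    """Push a copy of the current weights to the front of the buffer."""
    checkpoints.appendleft(copy.deepcopy(model.state_dict()))

def rewind(model, rewind_probs):
    """Sample an index from the agent's distribution and restore those weights.

    Index 0 keeps the current weights; index i restores the weights from i steps back.
    """
    p = np.asarray(rewind_probs[: len(checkpoints)], dtype=float)
    p = p / p.sum()  # renormalise in case fewer than H checkpoints exist yet
    idx = np.random.choice(len(checkpoints), p=p)
    model.load_state_dict(checkpoints[idx])
    return idx
```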
There is one problem with this formulation: we are counting on the agent to remember what the state of the environment was like 2 time-steps ago. Thus we need to change the observation from the env. at each time-step to also include information from at least (H-1) steps back. If X_t is a matrix that consists of the current learning rate, loss vector and generalization scores of the model with weights=W_t, then O_t = [X_t, X_t-1, ..., X_t-(H-1)].
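One way the environment could assemble such a stacked observation, assuming each X_t is flattened into a fixed-size feature vector (H, the feature layout, and the `observe` helper are illustrative):

```python
from collections import deque

import numpy as np

H = 4
NUM_FEATURES = 3  # e.g. learning rate, loss, generalization score

# Keep the H most recent feature vectors; history[0] is X_t.
history = deque(maxlen=H)

def observe(lr, loss, gen_score):
    """Append the newest X_t and return O_t as an (H, NUM_FEATURES) matrix, zero-padded."""
    history.appendleft(np.array([lr, loss, gen_score], dtype=np.float32))
    obs = np.zeros((H, NUM_FEATURES), dtype=np.float32)
    for i, x in enumerate(history):
        obs[i] = x
    return obs
```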
Let's consider the observations from the environment at t_0. At first it is tempting to initialise the weights randomly, then perform one optimisation step to get X_t and fill the rest of the O_t matrix with zeros. It would be more interesting to initialise the weights H-1 times, perhaps with different init methods like Kaiming, Xavier, etc., and include those as observations too.
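For reference, producing several differently initialised weight sets in PyTorch could look roughly like this (a sketch with an illustrative model, not the project's actual init code):

```python
import copy

import torch.nn as nn

def init_weights(model, scheme):
    """Re-initialise all Linear/Conv2d layers of `model` in place with the given scheme."""
    def _apply(m):
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            if scheme == "kaiming":
                nn.init.kaiming_normal_(m.weight)
            elif scheme == "xavier":
                nn.init.xavier_uniform_(m.weight)
            else:  # plain normal init as a fallback
                nn.init.normal_(m.weight, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    model.apply(_apply)

# H - 1 candidate initialisations, each of which could be evaluated once
# to fill the initial rows of O_t.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
candidates = []
for scheme in ["kaiming", "xavier", "normal"]:
    init_weights(model, scheme)
    candidates.append(copy.deepcopy(model.state_dict()))
```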
TODO: add re-init button
Also note that H != K (the observation sequence length used below).
More thoughts:
The only things that change when we go back are the SGD noise and the LR; would that be enough? What options can we explore?
Some thoughts and diagrams regarding this are captured here.
Excellent description of the problem and the minimum working prototype!
I am working on the agent implementation in this branch: agent_proto.
Thanks for merging the agent code in bae38a80b2e3fa5535facf24184193bf93058b97 !
Seems like most of the design discussed here is implemented, which is great. Is there anything that we are missing or something that needs to be updated?
Full working prototype here: 7a72f0e5ea5f850e084c94037f90db1d3c9046d6 by @Sultan-IH
The first milestone we reached was 100 episodes of agent-environment interaction without run-time exceptions! 🥳
Below you can see the MDP logs from the first experiment run with a classifier trained on 20% of the MNIST dataset.
From this MDP log we deduced that the agent had learned to exploit a bug in the environment that incorrectly calculated the step reward; this was fixed in [min_work_proto 3de982d]. I've also added an 'lr' parameter to the MDP log so that it is easier to plot, and added a final_reward_scale parameter to the environment as suggested by @jccaicedo. Since then I've set two experiments running on 20% of the MNIST and CIFAR10 training data with identical architectures.
Work done is on branch [risa].
Reduce the actions available to the agent to 're-init', 'increase lr by 80%', 'increase lr by 20%', 'decrease lr by 20%', 'decrease lr by 80%' (see the sketch after the reward formula below).
We want to incentivise short training times and high accuracy:
```
step_reward = sign(delta) * 1 - len(clf.optim.steps)
```
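A rough sketch of how this discrete action set and step reward could fit together; `delta` is assumed to be the change in the tracked metric and `num_optim_steps` the number of optimiser steps taken so far (both illustrative names):

```python
import numpy as np

# Discrete action set from the list above: a learning-rate multiplier per action,
# with re-init handled separately by the environment.
ACTIONS = {
    0: ("re-init", None),
    1: ("increase lr by 80%", 1.8),
    2: ("increase lr by 20%", 1.2),
    3: ("decrease lr by 20%", 0.8),
    4: ("decrease lr by 80%", 0.2),
}

def apply_action(action_id, lr):
    """Return the new learning rate; re-init leaves the learning rate unchanged."""
    _, scale = ACTIONS[action_id]
    return lr if scale is None else lr * scale

def step_reward(delta, num_optim_steps):
    """Reward improvement in the tracked metric and penalise long training runs."""
    return float(np.sign(delta)) - num_optim_steps
```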
Brief
TODO: mention auto experiment setting
With AutoTrain, we hope to solve the problem of autonomous neural network optimization by teaching an agent to recognise the relationship between hyperparameters (e.g. learning rate, training time) and performance metrics in a reinforcement learning problem setting. We hope that the agent learns to navigate the loss landscape and find minima that generalize well. We formulate an MDP and implement it with OpenAI's Gym and PyTorch frameworks. More specifically, we see this project being used for image classification problems. This issue is here to discuss all the implementation details required to produce a working prototype/proof of concept.
Details & Problem Statement
The work done for this issue is going to be deposited into the min_work_proto branch. The following is a more detailed formulation of the problem setting/MDP:
State Space S:
The state space for the environment has the following components:
- model: what functions to perform on the weights W
- opt: includes the learning rate and the weights W
- dataset: (x, y) pairs; an image classification dataset with C categories
- phi: the metric we are trying to optimise for; can be accuracy or f1_score, computed with the use of Thresholdout (see here)
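As a sketch, these components could be grouped roughly like this (an illustrative dataclass, not the repo's actual structure):

```python
from dataclasses import dataclass
from typing import Callable

import torch
import torch.nn as nn
from torch.utils.data import Dataset

@dataclass
class EnvState:
    model: nn.Module                   # holds the weights W
    opt: torch.optim.Optimizer         # learning rate and per-weight optimiser state
    dataset: Dataset                   # (x, y) image classification pairs with C categories
    phi: Callable[[nn.Module], float]  # metric to optimise (accuracy / f1), e.g. via Thresholdout
```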
Action Space A:
The agent is going to have access to two actions: scale the learning rate and rewind the weights. Thus we introduce a control vector for the agent to produce:
[learning_rate_scale, go_back, stop]
Inside the environment the state changes for the next time step like so:
lr_t+1 = lr_t * learning_rate_scale, with lr_0 = lr_init # provided at the env initialisation
- go_back is a scalar in [0, 1]; if go_back > thresh then we reset the weights to W_t = W_t-1
- stop is a scalar in [0, 1]; if stop > thresh then the episode is finished and the environment resets.
The two modifications happen simultaneously.
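A minimal sketch of how a Gym-style step could apply this control vector, under the assumptions above (the threshold value, SGD optimiser, and class name are illustrative):

```python
import copy

import torch

THRESH = 0.5  # illustrative threshold for go_back / stop

class AutoTrainEnvSketch:
    def __init__(self, model, lr_init):
        self.model = model
        self.lr = lr_init
        self.optimizer = torch.optim.SGD(model.parameters(), lr=lr_init)
        self.prev_weights = copy.deepcopy(model.state_dict())

    def step(self, action):
        learning_rate_scale, go_back, stop = action

        # Both modifications are applied before the next optimisation step.
        self.lr = self.lr * learning_rate_scale
        for group in self.optimizer.param_groups:
            group["lr"] = self.lr

        if go_back > THRESH:
            self.model.load_state_dict(self.prev_weights)  # W_t = W_t-1
        self.prev_weights = copy.deepcopy(self.model.state_dict())

        # ... perform one optimisation step and compute the observation and reward here ...

        done = stop > THRESH
        return done
```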
Observations O:
Each of the observations mentioned below is presented to the agent as a sequence of length K, so the observation matrix is K x num_components. Components:
lr_t
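For concreteness, the observation space for such a K x num_components matrix could be declared in Gym roughly like this (the values of K and num_components are illustrative):

```python
import numpy as np
from gym import spaces

K = 8               # length of the observation sequence (illustrative)
NUM_COMPONENTS = 3  # e.g. lr_t plus the other per-step components (illustrative)

# Each observation is a K x num_components matrix of real-valued components.
observation_space = spaces.Box(
    low=-np.inf,
    high=np.inf,
    shape=(K, NUM_COMPONENTS),
    dtype=np.float32,
)
```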
Reward R:
If phi improves, then issue a +0.05 reward; if not, -0.05.
More notes:
Flow Chart
Check out this chart for a diagrammatical depiction of the MDP dynamics.