Trialling hyper-parameter optimization

dsyme commented 3 years ago

@gbaydin This is a question about something I thought might "just work". Don't spend time on it, I'm just curious if there's something obvious I'm doing wrong.

I took a tentative attempt at hyper-parameter optimization for the learning rate of the first 20 minibatches of the GAN sample

https://github.com/DiffSharp/DiffSharp/compare/examples/vae-gan-hypopt?expand=1

gan.fsx: https://github.com/dsyme/DiffSharp/blob/ef0bcd04575a67636b5557dcc953c4ab8e287598/examples/gan.fsx

The aim is simply to work out what the optimal training learning rate would be if we're only going to run training on precisely those same 20 minibatches. However my attempt doesn't work because the derivative of my train function is always zero according to the optimizer, e,g, my printfn addition to the SGD optimizer gives this:

f = tensor(16.0946), g = tensor(0.)

Here f is the sum of the generators losses for the first 20 minibatches (the result of train) and g is the gradient of the train function as reported to SGD. At first I thought this might have been due to noDiff erasing all derivatives. However switching to just a stripDiff that just gets the primal didn't change things.

Anyway if you can spot anything simple I'm doing wrong on a quick glance it would be instructive for me.

gbaydin commented 3 years ago

Just had a very quick look, so this might not be the whole answer, but: have a look at the reversible argument of the optimizer.

https://github.com/DiffSharp/DiffSharp/blob/8e1608cbd94fc80453e79dad3b76b5aaf6bcd29e/src/DiffSharp.Core/Optim.fs#L44

This was something I added by design to turn off nesting through the standard optimizers by default, for efficiency reasons of the standard use cases. reversible=true can do the trick.

Also we should add some unit test cases for this reversible behavior.

dsyme commented 3 years ago

I see I'll try that thanks!

FWIW I added the numerical gradient to

        let v, g = dsharp.fg f x
        let nv, ng = dsharp.numfg 0.0000001 f x
        printfn $"v = {v}, g = {g}, nv = {nv}, ng = {ng}"

and got

v = tensor(16.0946), g = tensor(0.), nv = tensor(16.0946, rev), ng = tensor(7352.83, rev)

so the gradient is really not zero :-)

gbaydin commented 3 years ago

Another approach could be to set up an optimization loop based on the diffsharp differentiation API (e.g., dsharp.grad), instead of the PyTorch-inspired optimizer classes. That way you could design the whole nested pipeline using higher-order functions.

An example of this is in the unit tests: https://github.com/DiffSharp/DiffSharp/blob/8e1608cbd94fc80453e79dad3b76b5aaf6bcd29e/tests/DiffSharp.Tests/TestOptim.fs#L62

dsyme commented 3 years ago

BTW I think we should use string interpolation everywhere e.g.

    printfn $"v = {v}, g = {g}, nv = {nv}, ng = {ng}"

and where necessary strongly typed ones:

    printfn $"v = %f{v}, g = %d{g}"

dsyme commented 3 years ago

Another approach could be to set up an optimization loop based on the diffsharp differentiation API (e.g., dsharp.grad), instead of the PyTorch-inspired optimizer classes. That way you could design the whole nested pipeline using higher-order functions.

Cool thanks

dsyme commented 3 years ago

So my example is using this SGD which doesn't have reversible

https://github.com/dsyme/DiffSharp/blob/t2/src/DiffSharp.Core/Optim.fs#L229

Any other ideas?

dsyme commented 3 years ago

Ah maybe you meant the inner Adam optimizers. I'll try that

dsyme commented 3 years ago

Hmmm that didn't work, I get this:

System.Exception: Cannot have TensorR and TensorF in the same nesting level
   at DiffSharp.Tensor.op_Subtraction(Tensor a, Tensor b) in E:\GitHub\dsyme\DiffSharp\src\DiffSharp.Core\Tensor.fs:line 991
   at DiffSharp.Optim.Adam.updateRule(String name, Tensor t) in E:\GitHub\dsyme\DiffSharp\src\DiffSharp.Core\Optim.fs:line 94
   at <StartupCode$DiffSharp-Core>.$Optim.step@27.Invoke(Tuple`2 tupledArg) in E:\GitHub\dsyme\DiffSharp\src\DiffSharp.Core\Optim.fs:line 27
   at DiffSharp.Model.ParameterDict.iter(FSharpFunc`2 f) in E:\GitHub\dsyme\DiffSharp\src\DiffSharp.Core\Model.fs:line 84

I'll take more of a look tomorrow

dsyme commented 3 years ago

I got this working to a first approximation. Interesting results and a lot of fun - hard to believe how easy it is (even if "doing it well" may be harder).

To get it working I did this:

Use a new tag for the inner reverse mode let tag = GlobalNestingLevel.Next() and generator.reverseDiff(tag=tag) and discriminator.reverseDiff(tag=tag).
Strip a level of differentiation from the models on each iteration discriminator.stripDiff() and if i <> 0 then generator.stripDiff() rather than noDiff
Return the sum of primals of glosses and dlosses. This may be pretty bogus and not the right overall objective function for optimizing the training but it's the first thing that came to mind.

(It feels like we need some improvement in the model differentiation API here to avoid having to do these things?)

Anyway for kicks I used four hyperparameters:

let gopt = Adam(generator, lr=hyp.[0], beta1=hyp.[2], reversible=true)
let dopt = Adam(discriminator, lr=hyp.[1], beta1=hyp.[3], reversible=true)

and

Optim.optim.sgd (train, x0=dsharp.tensor([0.0003,0.0003,0.5,0.5]), lr=dsharp.tensor(0.0000000005))

The very low learning rate on the outer SGD optimization is needed because some of the initial gradients on the learning rates are so high (given this is R^4 --> R is it possible or wise to use a matrix of outer learning rates?)

Repeated optimization over the first unshuffled 640 images (not a lot I know) in batches of 64 yields very strong gradients where the system wants the generator to learn much less quickly and the discriminator to learn more quickly. The betas are not changing, e.g.

hyp: tensor([0.000054771, 0.000319337, 0.5, 0.5]), sum-of-losses = tensor(16.852), g = tensor([1465.97, -3667.55, -0.0514107, 2.15906])
hyp: tensor([0.000054294, 0.000321055, 0.5, 0.5]), sum-of-losses = tensor(16.8428), g = tensor([1157.51, -3550.37, -0.0382381, 2.08002])

Because this is all just for the first 640 images perhaps it makes sense - I guess it's deciding that the "sum of losses" objective function is best optimized by initially having the discriminator learn something quickly (rather than the generator trying to match the random initial noise of the discriminator).

I then looked at optimizing the beta parameters independently of the learning rates - nothing useful from this but interesting to see how easy it is to set this up

Optim.optim.sgd ((fun hyp ->  train (dsharp.tensor(4.87555e-05), dsharp.tensor(0.000340143), hyp.[0], hyp.[1], 1, Some 20)), x0=dsharp.tensor([0.5, 0.5]), lr=dsharp.tensor(0.0001))

gbaydin commented 3 years ago

Awesome progress! This looks really cool and I will definitely check it out in detail.

Another quick reply in the meantime: the API here

https://github.com/DiffSharp/DiffSharp/blob/8e1608cbd94fc80453e79dad3b76b5aaf6bcd29e/src/DiffSharp.Core/DiffSharp.fs#L1225

namely dsharp.nest() and the others next to it was designed for this type of thing. I haven't looked at it in detail and added tests for it yet, but the intention is to expose a nice lightweight api to the user to achieve the thing you did a bit tidier.

dsyme commented 3 years ago

Just to mention I tried the same thing in vae.fsx here https://github.com/DiffSharp/DiffSharp/compare/dev...dsyme:t2?expand=1#diff-5f944dc2993b2616546c10e0c5c00913d434683d822b41944aa5a2128843dc38R156

However I got very large hypergradients for the learning rate with pretty much immediate divergence to NaN, no matter what I did. Perhaps I'm doing something wrong or perhaps it's in the nature of the data or the model or the loss function.

It feels hard to debug this condition, to work out why the gradient is diverging. I think it's because even two steps of training the model are compacting again and again the effect of the learning rate. However a "fail on first NaN gradient" condition or similar might be quite useful?

Epoch: 1/1 minibatch: 0/1875 loss: 547.7877197 lr: tensor(1e-05, fwd)
l.primal = tensor(548.344, fwd)
l.primal.derivative = tensor(-14354.9)
Epoch: 1/1 minibatch: 1/1875 loss: 548.3441772 lr: tensor(1e-05, fwd)
l.primal = tensor(548.275, fwd)
l.primal.derivative = tensor(NaN)     <--- gradient of the loss w.r.t. learning rate diverges right away

dsyme commented 3 years ago

However I got very large hypergradients for the learning rate with pretty much immediate divergence to NaN,

https://github.com/DiffSharp/DiffSharp/issues/287 gives a possible solution for this. After applying the change to Adam there I can do hyper-gradient optimization of the learning rate.

For VAE, for example, I was able to optimize the learning rate (objective function is to minimise the sum of the losses over all batches in the last epoch). The hyper-gradient optimization progressively selects slightly larger learning rates and does reduce the sum of losses in training, but doesn't lead to actual improvement during validation. But there may be other reasons for that.

dsyme commented 3 years ago

A little more on this.

For VAE, using hyper-gradient optimization on the learning rate (optimizing w.r.t. the sum of the losses in the 2nd epoch) gave me a learning rate of 0.00136 for VAE instead of the default of 0.001 (and then sudden divergence of the learning rate to negative as it hit some kind of wall). This results in the following loss graph as training proceeeds (10-minibatch rolling average)

The original learning rate first reaches a rolling-average loss of 130 after 390 steps, the hyper-optimized loss reaches is after 375 steps. Likewise reaches 120 after 670 steps v. 526 steps. The scores on the validation set of data are 109.7948151 for the default learning rate and 109.1915359 for the hyper-parameter optimized learning rate.

Presumably for this example the default learning rate was from the python code for a paper which had already performed some kind of parameter sweep.

Of course I had to run the training multiple times to work out the optimized learning rate and that is effectively a form of hyper-gradient-guided parameter sweep. I suppose methodologically it would be better to divide the data into three sets

hyper-training set used to choose the learning rate
actual training set used to train the model at a fixed learning rate
validation set

But overall I'm impressed with the ease of doing the hyper-parameter optimization.

dsyme commented 3 years ago

I pushed the code in my experimental branch to "examples/vae-gan-hypopt" as a record of this experiment

https://github.com/DiffSharp/DiffSharp/compare/examples/vae-gan-hypopt?expand=1

dsyme commented 3 years ago

Including @awf, who I just showed this to.

His comments on the VAE example were:

The outer learning rate is 2e-10. That’s a red flag. Take the log of the inner learning rate it may work out better
Parameter sweeps often work better (use less training runs) if there is only one parameter. Try to optimize over two or three parameters instead, e.g. the Beta1 and Beta2 parameters of the inner Adam optimizer, which are using their default values of 0.9 and 0.99 here.

For the GAN example the comments were:

The objective function of “sum of dlosses + glosses” seems ok
Again take the log of the learning rate

dsyme commented 3 years ago

Taking into account @awf's feedback, we can optimize ln(learningRate) and 1-ln(beta1) and 1-ln(beta2) at the same time:

Optim.optim.sgd ((fun hyp -> train(hyp.[0].exp(), 1.0-hyp.[1].exp(), 1.0-hyp.[2].exp(), None, None)), 
                 x0=dsharp.tensor([log 0.001; log (1.0-0.9); log (1.0-0.999)]), lr=v 0.0001)

This gives a different set of hyper-parameters and some further improvements:

let lr = 0.001075
let beta1 = 0.777373
let beta2 = 0.999467
Validation loss: 108.8958969  (baseline 109.7948151)
Iterations to loss of 130: 299   (baeline 390)
Iterations to loss of 120: 399   (baseline 629)

so training

In any case, these blunt techniques work, sort of, and the mixed/nested mode gradient-taking looks accurate. The techniques still require re-running training multiple times however and will always have to compete with massively-parallelizable brute force parameter sweeps.

gbaydin commented 3 years ago

Hi @dsyme it is a good indication that these are working! I expect to look at the results closely in the coming days.

In the meantime, I would like to point you and @awf to the paper I published exactly on this subject at ICLR: https://arxiv.org/abs/1703.04782 In the paper we address many of the comments you've been raising. It took a lot of effort to explain this technique to the reviewers and the community at ICLR, so the paper is the best summary of everything around this subject so far. I hope you and @awf can have a careful look at it (if you have a bit of time, of course!).

I think two most important comments are:

There are online and offline varieties of hypergradient-based hyperparameter optimization. These have completely different complexities and computational cost. What you've been doing so far is the offline variety (I say this after a quick look at your code), where you propagate hypergradients through whole optimization sessions. This is in effect equivalent to what Maclaurin et al. did in their ICML paper: https://arxiv.org/abs/1502.03492 (By the way some of the examples in that paper would be really cool to implement in diffsharp!)
My ICLR paper I linked above is mainly about the online variety which has lots of advantages and it's actually quite cool. The most important thing to note is that it is not a direct competitor to something like a massive grid search. It's actually something more attractive than that. We provide something that, without any massive, costly grid-based tuning (i.e., running just a single training run where you jointly optimize the base model parameters and the hyperparameters, in other words parameter update and hyperparameter update happening together in each iteration), does almost as good as what one can achieve after a massive, costly grid-based tuning. This significantly reduces the need for hyperparameter tuning in practice, and people indeed use our technique a lot in many settings (some of them pretty large-scale). See especially Figure 2, 4 and associated discussions in the paper. This repo has the poster that can also help to quickly get the point I'm talking about: https://github.com/gbaydin/hypergradient-descent

We should implement the online variety as well. There are also lots of research directions that follow from this and some student projects that are currently ongoing on my side.

As a note: the paper initially started as result of a few quick experiments we did with DiffSharp 0.7 at that time. :)

awf commented 3 years ago

Nice one! You should log how many function/gradient evaluations the optimizer took -- less than 50 is pretty good, more than 500 is too close to grid search.

To significantly reduce function eval with a 3-parameter optimization problem, I would expect a Damped [Gauss-]Newton optimizer (e.g. Damped Newton or Levenberg Marquardt) to do much better than SGD, maybe 10x better in terms of function evaluations. Normally G-N is impractical because it needs good analytic derivatives.... @gbaydin is there an efficient way to get the nx3 Jacobian or the 3x3 Hessian for the outer problem?

dsyme commented 3 years ago

@gbaydin Thank you yes the linked paper is great. I'll go back and read it carefully now I've played with this a bit.

Quick question: are the beta1 and beta2 of the Adam also tuned incrementally in that paper? It looks like you're computing the analytical incrementalization w.r.t. learning rate?

I was quite surprised that tuning those seemed to make more of a difference than the learning rate in this example

dsyme commented 3 years ago

To significantly reduce function eval with a 3-parameter optimization problem, I would expect a Damped [Gauss-]Newton optimizer (e.g. Damped Newton or Levenberg Marquardt) to do much better than SGD, maybe 10x better in terms of function evaluations. Normally G-N is impractical because it needs good analytic derivatives....

Thanks!

is there an efficient way to get the nx3 Jacobian or the 3x3 Hessian for the outer problem?

Wel, reverse-on-reverse should do it, but blows up the memory on my machine for this example. There may be ways around that, or there may be a mistake in my code.

gbaydin commented 3 years ago

Quick question: are the beta1 and beta2 of the Adam also tuned incrementally in that paper? It looks like you're computing the analytical incrementalization w.r.t. learning rate?

@dsyme We had to focus that paper on the learning rate only. I think it's a great thing to extend this to other things including Adam's betas, other hyperparameters can be momentum rates, regularization coefficients, etc. Another one is to learn per-parameter learning rates (instead of a single scalar learning rate), which is indeed possible but was excluded from that paper.

To significantly reduce function eval with a 3-parameter optimization problem, I would expect a Damped [Gauss-]Newton optimizer (e.g. Damped Newton or Levenberg Marquardt) to do much better than SGD, maybe 10x better in terms of function evaluations. Normally G-N is impractical because it needs good analytic derivatives.... @gbaydin is there an efficient way to get the nx3 Jacobian or the 3x3 Hessian for the outer problem?

@awf This sounds actually great and looking at things other than SGD for the outer problem would be great!

Here's a quick list of things we have available right now in this code base:

jacobian-vector product (single forward eval)
vector-jacobian product (forward-reverse eval)
hessian-vector product (forward-reverse eval, i.e., reverse-on-forward)
vector-hessian product (forward-reverse-forward, reverse-on-reverse)
we can get a 3x3 jacobian via either "one-forward-then-three-reverse evals" (this is the standard way in pytorch, etc.) or just "three forward evals" (forward mode, should be significantly faster as there is no backpropagation)
we can get the 3x3 hessian via one of the two hessian vector products I mentioned above
there are some other things @barak would know! Like the hessian-inverse-vector product.
compositions of any of these are possible as well

dsyme commented 3 years ago

Another one is to learn per-parameter learning rates (instead of a single scalar learning rate), which is indeed possible but was excluded from that paper.

I was wondering about this. Also wondered about different learning rates for different layers/parts of models (so not necessarily one-per-parameter but one-per-layer - different parts of the brain with different levels of activity).

dsyme commented 3 years ago

@gbaydin I implemented the technique from the paper for Adam. I might have made a mistake though.

The orange and blue lines indicate the adaptive learning (with 0.001 and 0.0001 starting respectively) - the green line the fixed 0.001. The orange and blue look great at the start - they increase to learn very fast - but then the learning rate goes very low and the end score is 116 instead of 110. Have you seen this kind of problem before with adaptive learning rates? This is two epoch learning VAE over MNIST

I'll check the formulae. The Adam/SGD code for the hyperparameter adaption is in the branch at the top of this issue.

dsyme commented 3 years ago

Other things I noticed

Just to mention I added the multiplicative update rule from the paper and ran a longer training run and still have that effect where the fixed learning rate ultimately does better:

gbaydin commented 3 years ago

@dsyme I wouldn't care too much about a small difference between these two curves beyond 2k iterations. I think in this example, the choice of 0.001 was already very close to optimal. I would, for example, look at the adaptive/non-adaptive behavior for other learning rates such as 0.1, 0.01, 0.0001, 0.00001 etc. and see if the adaptive versions do a good enough job to bring the learning rate towards the optimal value and the loss towards the optimal. If they do, it's a winning result for some contexts because it shows you can adapt the learning rate on the fly to do much better than the non-adaptive algorithm, without any need for costly grid search. Depending on the setting, minute differences are not always important as there is never a magic method that is guaranteed to give you "lower than everything else everywhere". Optimization is a huge field and this thing you're expecting is kind of a holy grail.

If you're in a setting where you really want to make sure you have the absolute best loss and minute differences are important, you would definitely need the costly grid search, Bayesian optimization or other stuff. But this adaptive algorithm can even help with that because it can give you an indication of where to start your costly searches from, so you wouldn't need to cover a huge costly grid.

A general note: these two curves are just two individual realizations of running these two algorithms. I would definitely be interested in looking at this too: for each curve (orange and blue) I would run exactly the same experiment a handful of times (like ten times) with different random number seeds, and I would plot the mean and standard deviation of these curves to get a better picture of the expected behaviors. Looking at individual single realizations in stochastic optimization results can be misleading. (I actually want to add the shaded region plotting thing to the diffsharp pyplot wrapper for this purpose :) )

dsyme commented 3 years ago

@gbaydin Yes, it's possible, though the effect is quite strong - using the dynamic learning rate gives noticeably worse results in this instance, pushing the learning rate down to 0.00003 or so towards the end, giving the system a strong disinclination to learn anything at all. Strange since the higher learning rates do push the loss down. It's not explicable to me yet, I feel there may be some error in the implementation or methodology I'm using here. The theory is sound of course.

But this adaptive algorithm can even help with that because it can give you an indication of where to start your costly searches from

Well, in this case it wouldn't, as the adaptive algorithm is pushing the learning rate so low, and any manual searches would be in the wrong zone and miss the (current) lowest-loss solutions of 0.001 constant learning rate.

But again there may simply be some error in implementation or method here. Or it may indeed be an artifact of the implementation.

dsyme commented 3 years ago

@gbaydin @awf

One question about the methodology for hyper parameter optimization

In the examples above I was running minibatch N, computing the hyper-gradient using the incremental method from the paper, then adjusting the learning rate for use with minibatch N+1

Is this actually valid? Is the hyper-gradient of the learning rate going to be correct when used with highly varying data inputs from minibatch to minibatch? @gbaydin in your paper was this the method used, or was it from epoch to epoch with same input data at each step?

thanks

DiffSharp / DiffSharp

Trialling hyper-parameter optimization #286