deepfakes / faceswap-model

Tweaking the generative model

Learning rate, beta1, beta2 in the Adam optimizer in models #20

Open kvrooman opened 6 years ago

kvrooman commented 6 years ago

Does anyone have any background on why the values of the 3 parameters of the Adam optimizer were chosen? In particular for the learning rate: too low an initial learning rate can produce poor accuracy, while too high an initial learning rate can be unstable and inaccurate as well; there is a sweet spot. Higher learning rates also train faster.

Normal ranges suggested in the academic literature are 0.1 to 0.0001; the current setting, 0.00005, is lower than that. PS - the Adam optimizer naturally reduces the effective learning rate over the course of training anyway, so with such a low initial rate there isn't much room left to optimize.

http://openaccess.thecvf.com/content_ICCV_2017/papers/Korshunova_Fast_Face-Swap_Using_ICCV_2017_paper.pdf

A very similar model starts at 0.001 and overrides the Adam optimizer at set intervals during training, gradually reducing the rate to 0.0001.

Is there a potential accuracy gain if we experiment with altering the initial learning rate?
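For reference, a minimal sketch of where that learning rate lives, using a toy stand-in model (the real autoencoders are in Model.py; the beta values shown are only illustrative assumptions, not confirmed repo settings):

  from keras.layers import Dense, Input
  from keras.models import Model
  from keras.optimizers import Adam

  # Toy stand-in so the snippet runs; the real autoencoders live in Model.py.
  inp = Input(shape=(64,))
  autoencoder = Model(inp, Dense(64)(Dense(32, activation='relu')(inp)))

  # Current setting discussed here: lr = 5e-5 (0.00005). The beta_1/beta_2
  # values below are illustrative assumptions, not confirmed repo settings.
  autoencoder.compile(optimizer=Adam(lr=5e-5, beta_1=0.5, beta_2=0.999),
                      loss='mean_squared_error')

  # The linked Korshunova et al. paper starts at 1e-3 and decays toward 1e-4;
  # experimenting just means changing lr here, e.g. Adam(lr=1e-3, ...).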

Clorr commented 6 years ago

Maybe related https://github.com/deepfakes/faceswap/issues/120

kvrooman commented 6 years ago

Hmm, looks like they're experimenting with replacing the optimizer altogether, as well as potentially changing the activation layers of the NN away from Leaky ReLU.

There are some other choices, like the newer YellowFin optimizer or just plain SGD. Other ReLU variants such as RReLU or PReLU are also available in Keras and are mentioned as upgrades in the literature.
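As a rough sketch (layer sizes and the 0.1 slope are just placeholders, not necessarily what the repo uses), swapping the activation is a one-line change per block in Keras:

  from keras.layers import Conv2D, Input, LeakyReLU, PReLU
  from keras.models import Model

  inp = Input(shape=(64, 64, 3))

  # Roughly the current style of block: convolution followed by LeakyReLU.
  x = Conv2D(128, kernel_size=5, strides=2, padding='same')(inp)
  x = LeakyReLU(0.1)(x)

  # Candidate variant: PReLU learns the negative slope instead of fixing it.
  y = Conv2D(128, kernel_size=5, strides=2, padding='same')(inp)
  y = PReLU()(y)

  model = Model(inp, [x, y])  # builds both branches, just to show the swap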

Clorr commented 6 years ago

Yes, I linked that as "maybe" related. Maybe @gdunstone has more info on this, as he tried some things...

Clorr commented 6 years ago

Also I'm wondering here: what is the effect of the optimizer on train_on_batch? AFAIK the optimizer is meant to drive the learning rate over the whole fitting of the model, but here we just train one autoencoder on a batch, then stop to train the other autoencoder for another step. I wonder how the optimizer behaves here...
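To make the question concrete, here is a simplified, self-contained sketch of that ping-pong pattern, with toy dense models and random data standing in for the real autoencoders and batches:

  import numpy as np
  from keras.layers import Dense, Input
  from keras.models import Model
  from keras.optimizers import Adam

  def tiny_autoencoder():
      # Stand-in for the real shared-encoder/decoder models in Model.py.
      inp = Input(shape=(16,))
      out = Dense(16)(Dense(8, activation='relu')(inp))
      m = Model(inp, out)
      m.compile(optimizer=Adam(lr=5e-5), loss='mean_squared_error')
      return m

  autoencoder_A, autoencoder_B = tiny_autoencoder(), tiny_autoencoder()

  for iteration in range(10):
      warped_A = target_A = np.random.rand(4, 16)  # placeholder batches
      warped_B = target_B = np.random.rand(4, 16)

      # Each compiled model keeps its own Adam state, so the running moment
      # estimates persist across these alternating train_on_batch calls.
      loss_A = autoencoder_A.train_on_batch(warped_A, target_A)
      loss_B = autoencoder_B.train_on_batch(warped_B, target_B)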

Clorr commented 6 years ago

I found an answer here: https://github.com/keras-team/keras/issues/3303

Clorr commented 6 years ago

I'm playing a bit with that lr value. I tried:

  • 5e-3 gives erratic results, and leads to a complete mess
  • 5e-4 seems fine, still a bit erratic, but loss is decreasing similarly to 5e-5

I wanted to go a bit further and tried to completely overfit my network (one image, no warping). Both losses decrease fast to 0.002 and then:

  • 5e-4 stagnates around 0.001
  • 5e-5 keeps decreasing at a constant rate below 0.001 (0.0004 after 600 iterations)

Note that I'm using mse loss and that I'm only retraining the encoder.

gdunstone commented 6 years ago

I can't offer much. I saw that Nadam had some better results in some cases, but I can't remember how that related to the lr value.

kvrooman commented 6 years ago

I'd suggest mae (the L1 loss), as it usually gives better results.

The erratic results at the higher learning rates are likely a result of the level of regularization in the current model. With no dropout and no batch normalization, I'd think the risk of over-fitting is higher than if those features were present. Thus, we have to use an abnormally low learning rate to maintain model stability.

One trick is to start with a low rate for a short while, then stop the model and manually raise the LR to a much higher number (5e-3 or 5e-2); the model has found some stability and can afford the higher rate. Run for some more epochs at the higher rate until the loss stops decreasing, then stop again and divide the LR by 10. Restart, and stop again and again until you get back to 5e-5 or lower. You'll generally end up with a lower loss than if you had just run at 5e-5 the whole time. This is called a learning rate schedule and is mentioned frequently in the literature.
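A sketch of that kind of manual schedule, using the Keras backend to overwrite the optimizer's lr between runs (toy model and data; the rates and epoch counts are only for illustration, and mae is used as suggested above):

  import numpy as np
  from keras import backend as K
  from keras.layers import Dense, Input
  from keras.models import Model
  from keras.optimizers import Adam

  # Toy stand-in model and data; the real autoencoders live in Model.py.
  inp = Input(shape=(16,))
  model = Model(inp, Dense(16)(Dense(8, activation='relu')(inp)))
  model.compile(optimizer=Adam(lr=5e-5), loss='mean_absolute_error')  # mae = L1
  x = y = np.random.rand(64, 16)

  # Warm up at the low rate, jump to a high rate, then divide by 10 whenever
  # progress stalls, ending back at 5e-5 or lower.
  for lr in [5e-5, 5e-2, 5e-3, 5e-4, 5e-5]:
      K.set_value(model.optimizer.lr, lr)
      model.fit(x, y, epochs=5, batch_size=16, verbose=0)
      print('lr', lr, 'loss', model.evaluate(x, y, verbose=0))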

I was also playing around with a new model with high learning rates and some extra features.

h1vem1nd85 commented 6 years ago

Also I'm wondering here: what is the effect of the optimizer on train_on_batch? AFAIK the optimizer is meant to drive the learning rate over the whole fitting of the model, but here we just train one autoencoder on a batch, then stop to train the other autoencoder for another step. I wonder how the optimizer behaves here...

As far as I know, the optimizer works on the trainer's internal "epochs", not what we consider to be epochs (which are more like iterations of calling train_on_batch). I'm experimenting with a "unified" model trained with model.fit rather than train_on_batch, which yields two advantages:

First off, it simply joins the two existing autoencoder_A/B models with two sets of inputs and two sets of outputs. This causes the shared layers to be processed simultaneously (from what I can tell) and eliminates the ping-pong effect of passing different data through them between each train_on_batch call. Each of the "submodels" still saves its own weights, so there is no conflict with the existing model weight structure.

In Model.py:

  self.unified = KerasModel([a, b], [self.autoencoder_A(a), self.autoencoder_B(b)])

Second, using model.fit we can specify the number of epochs in a manner the trainer (and, I'm assuming, the optimizer) normally expects. The batch_size N would be the actual batch size, while the batch size we specify in our current code acts more like "load N training images". This allows less hard-disk read/write between "epochs", as you can load hundreds of images with random transform/warp data and still only train on a smaller batch.

In Trainer.py:

  losses = self.model.unified.fit([warped_A, warped_B], [target_A, target_B], epochs=100, batch_size=N)

I'm limited to CPU training, so I don't really have an opportunity to test this in a timely manner; I'm experimenting with 32x32 input images to reduce the computational load and prove the concept. Something like this at least runs and is producing less wobble in the losses.
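Filling that in, a self-contained sketch of the unified-model idea with toy layer sizes and random data (the real encoder/decoders come from Model.py; the names here are just placeholders):

  import numpy as np
  from keras.layers import Dense, Input
  from keras.models import Model as KerasModel
  from keras.optimizers import Adam

  # Toy stand-ins for the shared encoder and the two decoders in Model.py.
  def make_encoder():
      inp = Input(shape=(16,))
      return KerasModel(inp, Dense(8, activation='relu')(inp))

  def make_decoder():
      inp = Input(shape=(8,))
      return KerasModel(inp, Dense(16)(inp))

  encoder, decoder_A, decoder_B = make_encoder(), make_decoder(), make_decoder()

  xa, xb = Input(shape=(16,)), Input(shape=(16,))
  autoencoder_A = KerasModel(xa, decoder_A(encoder(xa)))
  autoencoder_B = KerasModel(xb, decoder_B(encoder(xb)))

  # Unified model: two inputs, two outputs, shared encoder updated in one pass.
  a, b = Input(shape=(16,)), Input(shape=(16,))
  unified = KerasModel([a, b], [autoencoder_A(a), autoencoder_B(b)])
  unified.compile(optimizer=Adam(lr=5e-5), loss='mean_squared_error')

  # Load a larger pool of (already warped) images once, then let fit iterate
  # over smaller batches of them.
  warped_A = target_A = np.random.rand(128, 16)
  warped_B = target_B = np.random.rand(128, 16)
  history = unified.fit([warped_A, warped_B], [target_A, target_B],
                        epochs=100, batch_size=16, verbose=0)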

Clorr commented 6 years ago

Ah, nice if we can use fit. I thought it was not possible because of the encoder that is shared. Having fit is also nice because we can use enhancements like callbacks.
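For instance (just one possibility, with a toy model, random data, and placeholder filenames), standard Keras callbacks like ReduceLROnPlateau or ModelCheckpoint would automate the manual learning-rate schedule discussed earlier:

  import numpy as np
  from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
  from keras.layers import Dense, Input
  from keras.models import Model

  # Toy model/data; in practice this would be the unified model above.
  inp = Input(shape=(16,))
  model = Model(inp, Dense(16)(Dense(8, activation='relu')(inp)))
  model.compile(optimizer='adam', loss='mean_absolute_error')
  x = y = np.random.rand(256, 16)

  callbacks = [
      # Divide lr by 10 when the training loss stops improving.
      ReduceLROnPlateau(monitor='loss', factor=0.1, patience=5, min_lr=5e-6),
      # Save weights whenever the monitored loss improves.
      ModelCheckpoint('unified_weights.h5', monitor='loss', save_best_only=True),
  ]

  model.fit(x, y, epochs=20, batch_size=16, callbacks=callbacks, verbose=0)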

h1vem1nd85 commented 6 years ago

As I said, I've only just started experimenting with the unified model; fit is certainly better than train_on_batch either way. The only way to truly have a simultaneously trained layer is to pass the two input images to their respective encoders and concatenate->dense before branching to the decoder, but this creates a dependency between input images A and B and would require two image inputs during the conversion step as well.