@dietercastel Thanks for asking. Actually, I have the same confusion [awkward face]. I probably have misunderstood this part. But when I add the additional learning rate decay, training converges faster and to a higher accuracy. Considering the cost of training, I cannot afford to train for too long, so I keep the learning rate decay. To be consistent with the paper, you can just delete this. I'll annotate this in the next version. Thanks again.
Thanks for your reply! Glad I'm not the only one in doubt about this. While my machine was running this Keras implementation of CapsNet-v1, I read into it some more. This became a big post with quite a few questions, but I'm not in a hurry, so don't feel rushed to answer anything.
I found some other references of people using learning rate decay with Adam, for example this topic on r/MachineLearning. But I still don't grasp why this would be, or even could be, beneficial. Adam already adapts the learning rate on a much more fine-grained, per-parameter scale, so why would an overall decaying learning rate matter that much? If anyone reading this is able to explain it, I would be very interested!
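For what it's worth, here is a minimal sketch of the Adam update (written from the standard formulation, not taken from this repo) that makes the interaction visible: the per-parameter adaptation only rescales each coordinate, while the global learning rate still multiplies every step, so decaying it shrinks all updates uniformly.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `lr` is the global learning rate that a scheduler would decay."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    # The adaptive part is the per-parameter denominator; the global `lr`
    # still scales every coordinate, so an epoch-wise decay of `lr` matters.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

So the per-parameter scaling only sets the relative step sizes; the absolute step size is still governed by `lr`, which is presumably why an additional epoch-wise decay can still change late-training behaviour.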
When CapsNet-v1 finished on my machine I had this final output after the 30th epoch:
loss: 0.0034 - out_caps_loss: 0.0034 - out_recon_loss: 0.2316 - out_caps_acc: 0.9986 - val_loss: 0.0065 - val_out_caps_loss: 0.0065 - val_out_recon_loss: 0.2317 - val_out_caps_acc: 0.9949
Below are the log.csv file and the plot of it:
I would like to compare this info to the statistical measures in the table in the Readme:
Method | Routing | Reconstruction | MNIST (%) | Paper |
---|---|---|---|---|
Baseline | -- | -- | -- | 0.39 |
CapsNet-v1 | 1 | no | 0.39 (0.024) | 0.34 (0.032) |
But I'm not sure how you obtained the numbers in this table. Which of the measures the script outputs is used for this? And how did you obtain the standard deviation? I can't seem to find it in TensorBoard either, but I'm pretty new to all this, so maybe I'm just overlooking things.
@dietercastel You can just run the same code many times and record the results. Then you can calculate the average and standard deviation of these validation accuracies.
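As an illustration only (the error values below are made up, not the repo's actual trials), the average and standard deviation over a few runs can be computed like this:

```python
import numpy as np

# Hypothetical test errors (in %) from three independent trials of one model
test_errors = np.array([0.39, 0.41, 0.37])

print(f"mean = {test_errors.mean():.2f}%")
print(f"std  = {test_errors.std(ddof=0):.3f}%")  # population std; whether the paper uses ddof=0 or 1 is not stated
```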
Ah of course, I could have known if I had read the paper more attentively!
> Table 1: CapsNet classification test accuracy. The MNIST average and standard deviation results are reported from 3 trials
With that info, I was wondering how many trials you ran for each version of CapsNet. Also 3?
I'm currently still training the last run of the first trial per model.
But so far it does indeed seem to be the case that without the additional learning rate decay I have a higher test error after 30 epochs, even outside the interval reported by your trial runs in the Readme. That holds if I assume that the standard deviation is also reported in percentages, which seems most logical.
I calculated the test error for the one trial as `1 - Test acc`, where `Test acc` is the measure reported by running `python capsulenet.py --is_training 0 --weights weights_best_epoch.h5` for each of the CapsNet versions. I'll post the test errors I have when finished.
@dietercastel I ran 3 trials for each version too. A test error of `0.39 (0.024)` means that the average is 0.39% and the standard deviation is 0.024%. Due to the randomness, there is no need to be surprised if your test error falls outside of that range. It would be nice if you could share your results here.
So here are my results:
Method | Routing | Reconstruction | MNIST (%) | Paper | MyTrial* |
---|---|---|---|---|---|
Baseline | -- | -- | -- | 0.39 | -- |
CapsNet-v1 | 1 | no | 0.39 (0.024) | 0.34 (0.032) | 0.64 |
CapsNet-v2 | 1 | yes | 0.37 (0.022) | 0.29 (0.011) | 0.65 |
CapsNet-v3 | 3 | no | 0.40 (0.016) | 0.35 (0.036) | 0.48 |
CapsNet-v4 | 3 | yes | 0.34 (0.009) | 0.25 (0.005) | 0.47 |
*I did a single trial for each network, with the additional learning rate decay disabled; cf. the noLRDecay branch of my fork.
From this tiny experiment it seems like Adam does indeed benefit from a globally decaying learning rate (by no means statistically verified, though). The why of this is still left unanswered for me, but I'm not going to dedicate any more resources to it right now. If anyone wants to dig into it, I would still be very curious about the results!
@dietercastel Thanks for sharing. As can be seen in your results, the routing is beneficial to the performance. Many interesting studies can be carried out on CapsNet beyond this issue. Thanks again.
@dietercastel I really like what you have tried so far with the different CapsNets. I have read the paper, but I wonder how I should interpret the errors here, given that with a simple CNN on the MNIST test data we can get an accuracy of more than 99% and a test error of lower than 2%?
I think your confusion might stem from the fact that the accuracy and test error reported here are expressed in percentages. E.g. the trial I ran in the post above has a test error of 0.64% = 0.0064, or, expressed the other way around, a test accuracy of 1 - 0.0064 = 0.9936. Does that clear it up for you, @sulaimanvesal?
@dietercastel Oh thank you, now it is totally clear; I was a bit confused about this. In that case, my model got a better test accuracy of 0.9968 after 17 epochs.
Hi,
First of all, let me compliment you on the swift implementation of CapsNet in Keras. It looks very interesting! I haven't gotten around to testing it myself, but when I was skimming through the source code after reading the CapsNet paper, I noticed the following line, which schedules updates of the learning rate using a Keras callback:
https://github.com/XifengGuo/CapsNet-Keras/blob/12aaa590514816eb2f7b0f57bdc50f1c3410cbcb/capsulenet.py#L90
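For readers who don't want to follow the link, here is a minimal sketch of that kind of callback (the exact starting rate, decay factor, and variable names in the repo may differ); it multiplies the global learning rate by a constant factor every epoch, on top of whatever Adam does per parameter:

```python
from tensorflow import keras

initial_lr = 0.001   # assumed starting learning rate
decay_factor = 0.9   # assumed per-epoch multiplicative decay

# Exponential epoch-wise schedule applied to the optimizer's global learning rate
lr_decay = keras.callbacks.LearningRateScheduler(
    schedule=lambda epoch: initial_lr * (decay_factor ** epoch)
)

# model.fit(..., callbacks=[lr_decay])  # passed alongside the other training callbacks
```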
This made me wonder whether this is also part of the setup described in the paper.
As far as I understand Adam, the optimiser already adapts the learning rate per parameter using exponentially decaying averages of past gradients. This makes me think no further learning rate decay is necessary.
Some time soon I plan to run some tests without the additional learning rate decay and see how it changes the results. In any case, I'd like to hear your thoughts on this. I can't seem to find anything conclusive as to whether Adam can benefit from additional learning rate decay.
Kind regards, Dieter