ChunyuanLI / pSGLD

AAAI & CVPR 2016: Preconditioned Stochastic Gradient Langevin Dynamics (pSGLD)

Why is noise scaled by Ntrain in RMSProp #2

jaak-s opened this issue 7 years ago

In SGLD_RMSprop.m the noise is scaled by opts.N, which is set to Ntrain in the DNN experiments: https://github.com/ChunyuanLI/pSGLD/blob/master/pSGLD_DNN/algorithms/SGLD_RMSprop.m#L51

Why is this the case? In the paper (https://arxiv.org/pdf/1512.07666v1.pdf) there is no such scaling.

I also checked SGLD_Adagrad.m and there is no scaling by Ntrain for the noise.
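For reference, the vanilla SGLD update from Welling & Teh (2011), which the paper builds on, injects noise whose variance equals the stepsize, with no Ntrain factor:

\Delta\theta_t = \frac{\epsilon_t}{2}\left(\nabla\log p(\theta_t) + \frac{N}{n}\sum_{i=1}^{n}\nabla\log p(x_{t_i}\mid\theta_t)\right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0,\,\epsilon_t)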

ChunyuanLI commented 7 years ago

The noise is scaled for faster convergence in practice; otherwise, according to the theory, we would need to train the model for a long time.

jaak-s commented 7 years ago

Is the choice of Ntrain for the scaling arbitrary? Or do you think it will work in general for almost any dataset?

ChunyuanLI commented 7 years ago

Ntrain is the number of data points in the training dataset.

jaak-s commented 7 years ago

Yes, but you could use other numbers for the scaling, such as a constant (e.g., 100) or the batch size. So my question is whether you expect Ntrain to be a good choice in practice that will work well for almost any dataset, or whether we should try several values for the scaling and choose the best.

ChunyuanLI commented 7 years ago

I expect that Ntrain is a good choice in practice.

The "grad" is mean of the gradients computed in the mini-batch. We should use opts.N*grad to approximate the true gradient of the full dataset.

Instead, we fold the scaling into the stepsize "lr", which gives the following update:

grad = lr*grad./pcder + sqrt(2*lr./pcder/opts.N).*randn(size(grad));

However, this would take a long time to converge. In practice, I recommend:

grad = lr*grad./pcder + sqrt(2*lr./pcder).*randn(size(grad))/opts.N;
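Putting this together, here is a minimal sketch of one such step. The names (theta, V, beta, epsilon) and the RMSprop constants are illustrative assumptions, not the repository's exact interface:

% One pSGLD step with an RMSprop preconditioner (sketch).
function [theta, V] = psgld_rmsprop_step(theta, grad, V, lr, opts)
beta    = 0.99;   % assumed RMSprop decay rate
epsilon = 1e-5;   % assumed damping constant
% Update the RMSprop statistic and form the preconditioner denominator.
V     = beta*V + (1 - beta)*grad.^2;
pcder = sqrt(V) + epsilon;
% Practical update: preconditioned gradient step plus injected Gaussian
% noise, with the noise divided by opts.N (= Ntrain) as recommended above.
update = lr*grad./pcder + sqrt(2*lr./pcder).*randn(size(grad))/opts.N;
% Assumes grad is the gradient of the log-posterior (ascent);
% flip the sign if grad is a loss gradient.
theta = theta + update;
end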

jaak-s commented 7 years ago

Thank you for the explanation. I saw that SGLD.m also uses the same scaling by opts.N. So in your experience, does the same slow convergence hold for the SGLD method too?

ChunyuanLI commented 7 years ago

Yes, the same slow convergence also holds for SGLD.
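The practical step for plain SGLD is analogous, just without the preconditioner (again a sketch with illustrative names, not the exact line from SGLD.m):

% Plain SGLD step with the same 1/opts.N scaling on the noise (sketch).
update = lr*grad + sqrt(2*lr)*randn(size(grad))/opts.N;
theta  = theta + update;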