apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Different learning rate initial values in mxnet and matconvnet #3101

Closed JohanCao closed 7 years ago

JohanCao commented 8 years ago

Hi all,

I am trying to re-implement my old matconvnet classification work in mxnet. The initial learning rate was set to 1e-6 in matconvnet; however, when I use the same small learning rate in mxnet, the performance is ~10% lower and the training accuracy does not improve. I also tried larger initial learning rates, but the results are still bad. The batch size is 1 and normalization is set to "null" in SoftmaxOutput. The optimizer is ccSGD and rescale_grad is set to 1/batch_size. Are there any other options that can influence the learning rate? Many thanks!
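For reference, a minimal sketch of the setup described above using the Module API; the toy network, data shapes, and layer names are illustrative stand-ins, not the actual model:

```python
import numpy as np
import mxnet as mx

# Toy data so the sketch runs end to end; the real inputs are the user's images.
X = np.random.rand(8, 3, 8, 8).astype('float32')
y = np.random.randint(0, 2, size=(8,)).astype('float32')
batch_size = 1
train_iter = mx.io.NDArrayIter(X, y, batch_size=batch_size)

# Illustrative stand-in for the VGG-16-based network discussed below.
data = mx.sym.Variable('data')
net = mx.sym.Flatten(data=data)
net = mx.sym.FullyConnected(data=net, num_hidden=2, name='fc')
net = mx.sym.SoftmaxOutput(data=net, name='softmax',
                           normalization='null')  # no gradient normalization

mod = mx.mod.Module(symbol=net, context=mx.cpu())
mod.fit(train_iter,
        optimizer='ccsgd',
        optimizer_params={'learning_rate': 1e-6,
                          'rescale_grad': 1.0 / batch_size},
        num_epoch=1)
```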

winstywang commented 8 years ago

Have you tried using sgd?

JohanCao commented 8 years ago

@winstywang Not yet; running it now. Is there any difference between ccSGD and SGD (apart from the C++ implementation)?
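For a quick check, both optimizers can be created by registered name with identical hyperparameters; a minimal sketch (the values mirror this thread):

```python
import mxnet as mx

# ccSGD was a C++-side implementation of SGD in older MXNet versions;
# both accept the same learning_rate / rescale_grad arguments.
opt_cc = mx.optimizer.create('ccsgd', learning_rate=1e-6, rescale_grad=1.0)
opt_sgd = mx.optimizer.create('sgd', learning_rate=1e-6, rescale_grad=1.0)
```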

winstywang commented 8 years ago

@JohanCao Not sure. I heard about some problems with ccSGD a long time ago, but I'm not sure whether they are related to your issue.

JohanCao commented 8 years ago

I've tried SGD; it doesn't make much difference.

piiswrong commented 8 years ago

Did you initialize the weights with the same distribution?

JohanCao commented 8 years ago

@piiswrong Yes, the network is initialized with the VGG-16 net, and the added conv layers are initialized using a Gaussian distribution.
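A hedged sketch of that initialization pattern with the Module API; the checkpoint prefix, input shape, and sigma are illustrative assumptions:

```python
import mxnet as mx

# Load pretrained VGG-16 weights (checkpoint prefix/epoch are illustrative).
_, arg_params, aux_params = mx.model.load_checkpoint('vgg16', 0)

# `net` is assumed to be the extended symbol containing the added conv layers.
mod = mx.mod.Module(symbol=net, context=mx.cpu())
mod.bind(data_shapes=[('data', (1, 3, 224, 224))],
         label_shapes=[('softmax_label', (1,))])
mod.init_params(initializer=mx.init.Normal(sigma=0.01),  # Gaussian fallback
                arg_params=arg_params, aux_params=aux_params,
                allow_missing=True)  # params absent from the checkpoint
                                     # fall back to the initializer above
```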

feiyulv commented 8 years ago

MXNet initializes biases to zero; will this affect your result?

JohanCao commented 8 years ago

I initialized the biases using a Gaussian as well. I also tried to initialize the net with the model that I trained with matconvnet; after only one iteration, some layer weights and gradients are NaN. The initial learning rate is 1e-7.
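Worth noting: MXNet's stock initializers route any parameter whose name ends in bias through _init_bias, which zeros it, so a plain mx.init.Normal will not touch biases. One sketch of a workaround, assuming a subclass is acceptable (sigma is illustrative):

```python
import mxnet as mx

# Subclass Normal so that biases also receive Gaussian noise instead of
# the default zeros applied by Initializer._init_bias.
class NormalWithBias(mx.init.Normal):
    def _init_bias(self, name, arr):
        mx.random.normal(0, self.sigma, out=arr)

init = NormalWithBias(sigma=0.01)
```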

piiswrong commented 8 years ago

It would be helpful if you could post your code.

JohanCao commented 8 years ago

The NaN problem is solved by updating to the newest version of mxnet. It turned out to be a bug in the pooling layer, and the behaviour was very strange: the output values of the previous layer (ReLU) were very small, but after one max pooling the values became extremely large.

Now I get the same result as my earlier matconvnet work, but still with a different initial learning rate (1e-4 vs. 1e-6). A possible explanation could be the normalization of the softmax output (the size of the output score map is 34 * 34 * n). But I set the normalization to "null". Doesn't that mean no normalization in the softmax calculation?
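For what it's worth, the documented behaviour is: normalization='null' applies no normalization (only grad_scale multiplies the gradient), 'batch' divides by the batch size, and 'valid' divides by the number of valid output positions. A sketch of the two relevant settings, with `net` assumed to be the score-map symbol:

```python
import mxnet as mx

# 'null': per-position gradients are only scaled by grad_scale, so the
# 34x34 score map effectively contributes a *sum* of gradients.
out_sum = mx.sym.SoftmaxOutput(data=net, multi_output=True,
                               normalization='null', name='softmax')

# 'valid': the gradient is divided by the number of valid output positions
# (up to 34*34 = 1156 here), shrinking the effective step by that factor.
out_avg = mx.sym.SoftmaxOutput(data=net, multi_output=True,
                               normalization='valid', name='softmax')
```

If matconvnet's loss averages over the spatial positions while 'null' sums them, a gap of roughly this size in the usable learning rate would be expected; that is an assumption about matconvnet's default, though.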

phunterlau commented 7 years ago

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!