Have you tried using SGD?
@winstywang Not yet; running it now. Is there any difference between ccSGD and SGD (other than the C++ implementation)?
@JohanCao Not sure. I heard about some problems with ccSGD a long time ago; I don't know whether they are related to your issue.
I've tried SGD; it doesn't make much of a difference.
Did you initialize the weights with the same distribution?
@piiswrong Yes, the network is initialized from the VGG16 net, and the added conv layers are initialized with a Gaussian distribution, roughly as sketched below.
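A minimal sketch of that setup, assuming the old Module API; the checkpoint prefix, epoch, and sigma are placeholders, and `net` / `train_iter` stand for the actual symbol and data iterator:

```python
import mxnet as mx

# Sketch: start from pretrained VGG16 params; the added conv layers are
# missing from the checkpoint and fall back to a Gaussian initializer.
sym, arg_params, aux_params = mx.model.load_checkpoint('vgg16', 0)  # hypothetical prefix/epoch

mod = mx.mod.Module(symbol=net)  # `net` = VGG16 + added conv layers (assumed)
mod.bind(data_shapes=train_iter.provide_data,
         label_shapes=train_iter.provide_label)
mod.init_params(initializer=mx.init.Normal(sigma=0.01),  # placeholder sigma
                arg_params=arg_params,
                aux_params=aux_params,
                allow_missing=True)  # new layers get the Gaussian init
```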
MXNet initializes biases to zero; could this affect your result?
I initialized the biases with a Gaussian as well (see the sketch below). I also tried initializing the net with the model I trained in MatConvNet; after only one iteration, some layer weights and gradients are NaN. The initial learning rate is 1e-7.
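For reference, a minimal sketch of forcing Gaussian biases, since MXNet's default initializer zeroes any array whose name ends in `bias`; the sigma values are placeholders:

```python
import mxnet as mx

# Sketch: override the default zero bias init with a Gaussian.
# Patterns are matched in order, so biases are handled before the rest.
init = mx.init.Mixed(
    ['.*bias', '.*'],
    [mx.init.Normal(sigma=0.01),   # Gaussian biases (placeholder sigma)
     mx.init.Normal(sigma=0.01)],  # Gaussian weights (placeholder sigma)
)
mod.init_params(initializer=init)  # `mod` is the bound Module (assumed)
```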
It would be helpful if you could post your code.
The NaN problem is solved by updating to the newest version of MXNet. It turned out to be a bug in the pooling layer, and the behaviour was very strange: the output values of the previous layer (ReLU) were very small, but after one max pooling the values became extremely large.
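In case anyone hits something similar: one way to spot a layer whose outputs suddenly blow up is the built-in monitor. A sketch, assuming a bound Module `mod` and iterator `train_iter`; the stat function just reports a size-normalized L2 norm per output array:

```python
import math
import mxnet as mx

# Sketch: print a normalized norm for every intermediate output so an
# exploding layer (e.g. the buggy pooling) shows up immediately.
def norm_stat(arr):
    return mx.nd.norm(arr) / math.sqrt(arr.size)

mon = mx.mon.Monitor(interval=1, stat_func=norm_stat, pattern='.*output')
mod.fit(train_iter, num_epoch=1, monitor=mon)  # `mod`, `train_iter` assumed
```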
Now I get the same result as my earlier MatConvNet work, but still with a different initial learning rate (1e-4 vs. 1e-6). A possible explanation could be the normalization of the softmax output (the output score map is 34 * 34 * n). But I set the normalization to "null"; doesn't that mean no normalization in the softmax calculation? For concreteness, the output layer looks roughly like the sketch below.
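A sketch of that output layer, with placeholder variable names and shapes taken from the numbers above; my understanding is that "null" leaves the gradient unscaled (a sum over the 34 * 34 positions), while "valid" would average it:

```python
import mxnet as mx

# Sketch: per-pixel softmax over a 34x34 score map with n classes.
data = mx.sym.Variable('data')    # score map, shape (batch, n, 34, 34)
label = mx.sym.Variable('label')  # per-pixel labels, shape (batch, 34*34)
out = mx.sym.SoftmaxOutput(
    data=data,
    label=label,
    multi_output=True,       # softmax over the channel axis at each pixel
    normalization='null',    # no gradient rescaling; 'valid' averages instead
)
```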
This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!
Hi all,
I am trying to re-implement my old MatConvNet classification work using MXNet. The initial learning rate was set to 1e-6 in MatConvNet; however, when I use the same small learning rate in MXNet, the performance is ~10% lower and the training accuracy is not improving. I also tried a larger initial learning rate, but the result is still bad. The batch size is 1 and normalization is set to "null" in SoftmaxOutput. The optimizer is ccSGD and rescale_grad is set to 1/batch_size (training call sketched below). Are there any other options that can influence the learning rate? Many thanks!
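For reference, the training call looks roughly like this; `net` and `train_iter` are placeholders for my actual symbol and data iterator, and the context and epoch count are arbitrary:

```python
import mxnet as mx

batch_size = 1

# Sketch of the training setup described above.
mod = mx.mod.Module(symbol=net, context=mx.gpu(0))  # `net` assumed; GPU context is an assumption
mod.fit(
    train_iter,
    optimizer='ccsgd',
    optimizer_params={
        'learning_rate': 1e-6,
        'rescale_grad': 1.0 / batch_size,
    },
    num_epoch=10,  # placeholder
)
```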