allenai / XNOR-Net

ImageNet classification using binary Convolutional Neural Networks
https://xnor.ai/

Backward gradient vanishes #6

Open luhaofang opened 7 years ago

luhaofang commented 7 years ago

Hi, thanks for your excellent work; I have been focused on it for a while now. I think the core of it is the gradient optimization, but I still haven't managed to reproduce your experiment. Could you give me some advice?

I built the network with the block mentioned in your paper (B->A->C->P), and the backward pass uses full-precision data (weights & gradients), passing the gradient only where ||r|| < 1.
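To be concrete, this is roughly what I implemented, written here as a PyTorch-style sketch rather than my own framework's code (so treat the class names and layer parameters as illustrative, not as this repo's Torch code):

```python
import torch


class BinActive(torch.autograd.Function):
    """Sign activation with a straight-through estimator (STE).

    Forward:  binarize the real-valued input r to {-1, +1}.
    Backward: pass the full-precision gradient through, but zero it
              wherever |r| > 1.
    """

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x.abs() > 1] = 0   # gradient only flows where |r| <= 1
        return grad_input


class Block(torch.nn.Module):
    """One B->A->C->P block: BatchNorm -> BinActiv -> Conv -> Pool."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.bn = torch.nn.BatchNorm2d(in_ch)
        self.conv = torch.nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = torch.nn.MaxPool2d(2)

    def forward(self, x):
        x = self.bn(x)
        x = BinActive.apply(x)
        x = self.conv(x)   # the weights would also be binarized; omitted here
        return self.pool(x)
```

So in the backward pass I keep the full-precision gradient and only zero it where |r| > 1.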

mrastegari commented 7 years ago

Are you running this code, or are you implementing it on another platform?

luhaofang commented 7 years ago

I built a compute platform myself and ran an engineering-oriented test of your method. On CPU it takes about 1 s per 224*224 image with ResNet-18, but I haven't found the right way to train the network yet. I am wondering: does bin_conv work in a shallow network?
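For reference, the inner loop of my CPU implementation follows the xnor + popcount product from the paper; here is a minimal Python sketch of just that arithmetic (illustrative only, not my actual kernel):

```python
import numpy as np

def pack_signs(v):
    """Pack a {-1, +1} vector into an integer, one bit per element (1 bit = +1)."""
    bits = 0
    for i, s in enumerate(v):
        if s > 0:
            bits |= 1 << i
    return bits

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors of length n from their packed bits.

    matches = popcount(~(a XOR b)) over the low n bits; each match contributes
    +1 and each mismatch -1, so the dot product is 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * matches - n

# quick check against the float dot product
rng = np.random.default_rng(0)
a = np.sign(rng.standard_normal(64)); a[a == 0] = 1
b = np.sign(rng.standard_normal(64)); b[b == 0] = 1
assert xnor_dot(pack_signs(a), pack_signs(b), 64) == int(a @ b)
```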

luhaofang commented 7 years ago

Hi @mrastegari, how do I initialize the layers' parameters if I want to train the network from scratch?

mrastegari commented 7 years ago

The initialization is here: util.lua (in the function rand_initialize(layer))
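If you are re-implementing on your own platform, a standard He/MSRA-style random init is the usual choice for training from scratch; a Python sketch of that common scheme is below, but check rand_initialize for the exact scheme and constants used here:

```python
import numpy as np

def rand_conv_init(out_ch, in_ch, k):
    """He/MSRA-style random init for a k x k conv layer: N(0, sqrt(2 / fan_in)).

    This is only a common from-scratch scheme, not necessarily what
    rand_initialize(layer) in util.lua does -- check that function for
    the repo's actual constants.
    """
    fan_in = in_ch * k * k
    std = np.sqrt(2.0 / fan_in)
    weight = np.random.normal(0.0, std, size=(out_ch, in_ch, k, k))
    bias = np.zeros(out_ch)
    return weight, bias
```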

luhaofang commented 7 years ago

@mrastegari I trained a shallow network and found the accuracy is ~13% lower than the full-precision model. Is the ~10% loss quoted for XNOR-Net what I should expect here? Does the batch size have a meaningful impact? Also, I have noticed an interesting result: in my experiments the test accuracy is unstable; runs with almost the same loss give totally different accuracies (differing by ~20%). So I think maybe the key point is the beta?

mrastegari commented 7 years ago

Yes, a ~10% loss is expected with XNOR. A bigger batch size helps give a better estimate of the gradient in XNOR. I do not fully understand your point about being unstable at test time. Can you please explain in more detail?

luhaofang commented 7 years ago

@mrastegari, I noticed that you mean-center the layer's weights before binarization. Is this a key point for the training process?
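In case I misread that step, this is my understanding of it as a Python sketch (my reading of the paper, not this repo's code):

```python
import numpy as np

def binarize_weights(W):
    """Mean-center each filter, then binarize it and compute its scale alpha.

    W has shape (out_ch, in_ch, kh, kw).  Per filter:
      - subtract the filter's mean (the centering in question),
      - alpha = mean(|W|)   (the L1-based scale from the paper),
      - B = sign(W), so the real-valued filter is approximated by alpha * B.
    """
    out_ch = W.shape[0]
    means = W.reshape(out_ch, -1).mean(axis=1).reshape(out_ch, 1, 1, 1)
    Wc = W - means
    alpha = np.abs(Wc).reshape(out_ch, -1).mean(axis=1)   # one scale per filter
    B = np.where(Wc >= 0, 1.0, -1.0)
    return alpha, B
```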

luhaofang commented 7 years ago

@mrastegari Hi, I am not sure about this idea, but in my understanding sparse coding is part of what makes a DNN work. In my experiments the loss does not correspond well to the validation accuracy, even after a really long training run, which means the embedded features are not really separating the classes. Going back to the structure of the XNOR network: the numerical differences in the convolution result come from alpha and beta, but according to your experiments beta does not have a serious impact on the results. Alpha scales the quantized convolution; maybe beta can increase the interval of the features' sampling quantization.
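To make sure we mean the same beta: what I compute is the per-location input scale K from the paper, i.e. the channel-wise mean of |I| convolved with a uniform kernel. A small Python sketch of my reading of it (again, not this repo's code):

```python
import numpy as np

def input_scale_map(I, kh, kw):
    """Per-location input scale K.

    I has shape (C, H, W).  A = channel-wise mean of |I|; K is A convolved with
    a uniform kh x kw kernel (stride 1, no padding), i.e. the mean magnitude of
    the input patch under every filter position.  The binary conv output is
    then rescaled elementwise by K and by each filter's alpha.
    """
    A = np.abs(I).mean(axis=0)                       # (H, W)
    H, W = A.shape
    K = np.empty((H - kh + 1, W - kw + 1))
    for y in range(K.shape[0]):
        for x in range(K.shape[1]):
            K[y, x] = A[y:y + kh, x:x + kw].mean()   # average magnitude of the patch
    return K
```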