zhaoweicai opened this issue 8 years ago
Let's double-check a few things first:
1. Could you get the same accuracy with the pretrained models?
2. Could you train the BWN?
3. I have noticed that in some versions of cuDNN the precision of division causes convergence issues. If you are using Adam, you can multiply all the gradients by a large number to prevent the precision error that leads to NaN.
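For illustration, here is a minimal sketch of that gradient-scaling trick in a plain Torch/optim training closure. This is not the repo's actual train.lua; the variable names and hyperparameters are assumptions. The idea is that Adam divides each update by a running RMS of the gradients, so a uniform rescale barely changes the step but keeps tiny gradient values away from the precision limit that produces NaN.

```lua
-- Minimal sketch (not the repo's train.lua): scale all gradients by a large
-- constant right before the Adam step to avoid float-precision underflow.
local optim = require 'optim'

-- 'model', 'criterion', 'inputs' and 'labels' are assumed to be set up as in
-- any standard Torch training loop.
local parameters, gradParameters = model:getParameters()
local optimState = { learningRate = 1e-3 }  -- assumed hyperparameter

local function feval(x)
   if x ~= parameters then parameters:copy(x) end
   gradParameters:zero()
   local outputs = model:forward(inputs)
   local err = criterion:forward(outputs, labels)
   model:backward(inputs, criterion:backward(outputs, labels))
   gradParameters:mul(1e+5)  -- uniform rescale; Adam's normalization absorbs it
   return err, gradParameters
end

optim.adam(feval, parameters, optimState)
```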
Hi @mrastegari. The accuracies I get for the two pretrained models are 56.67 and 42.37, respectively. I can train BWN; I stopped at epoch #30 with a top-1 accuracy of 25.57, but that training was several weeks ago, before you fixed some bugs. For XNOR-Net, however, I have not been able to make training converge. I don't know if others have encountered the same issue.
OK, try to fix the precision issue by adding `gradParameters:mul(1e+5)` after line 184 in train.lua.
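To make the placement concrete, here is a rough sketch of the intended ordering inside the training closure; the exact line number differs across versions of train.lua, so treat this as an assumption based on the exchange below rather than the file itself.

```lua
-- Assumed ordering inside the training closure (around the cited line 184):
-- ... forward/backward pass ...
updateBinaryGradWeight(convNodes)  -- rewrite gradients of the binarized conv layers
gradParameters:mul(1e+5)           -- precision fix, applied just before the optimizer step
```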
Just to make sure: add `gradParameters:mul(1e+5)` after `updateBinaryGradWeight(convNodes)`, right? It still doesn't work for me. Has anyone experienced the same issue?
After how many iterations do you see the divergence? Also, try to follow the paper by replacing the `updateBinaryGradWeight` function with:
```lua
function updateBinaryGradWeight(convNodes)
   -- Skip the first and last convolution layers, which stay full-precision.
   for i = 2, #convNodes - 1 do
      local n = convNodes[i].weight[1]:nElement()  -- elements per output filter
      local s = convNodes[i].weight:size()
      -- Straight-through estimator: zero the gradient where |w| >= 1.
      convNodes[i].gradWeight[convNodes[i].weight:le(-1)] = 0
      convNodes[i].gradWeight[convNodes[i].weight:ge(1)] = 0
      -- Add the 1/n term from the scaling factor and rescale the gradients.
      convNodes[i].gradWeight:add(1/n):mul(1 - 1/s[2])
   end
   if opt.nGPU > 1 then
      model:syncParameters()
   end
end
```
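For anyone reading along, `convNodes` here is the list of convolution modules the function iterates over. One way to collect them with the standard nn API is sketched below; the exact module class string is an assumption and depends on the backend in use.

```lua
-- Assumed helper: gather the convolution layers for updateBinaryGradWeight.
-- Use 'nn.SpatialConvolution' instead if the model is not built on cuDNN.
local convNodes = model:findModules('cudnn.SpatialConvolution')
```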
Hi @mrastegari, thanks for your help, but it still doesn't work for me. The training diverges at the very beginning with err=nan. I have started retraining BinaryNet now, and it seems to work well so far. XNOR-Net has never worked for me.
I just pushed a modification; can you check it?
Thanks for your help. First, I changed '-cache' to './cache/'. It still doesn't work: the error becomes nan at the beginning every time, even though I have run the experiment dozens of times with different random seeds. Has anyone successfully reproduced the XNOR experiments yet? I am confused. BTW, I re-ran the Binary-Net experiment and got an accuracy of 51.65% in the end. Does the XNOR code work well for you? What do you think the problem is?
There is definitely something wrong with your setup. I asked a friend to try on his machine, and he could reproduce the same result, ~43%. Which version of Binary-Net are you using? 51.65% top-1 is too good for binary-input-and-binary-weight. Do you have code for that?
Hi @mrastegari, I found the problem: running on multiple GPUs. When I switched to 1 GPU, the model started to converge. For the multi-GPU runs, maybe I used different CUDA and cuDNN versions. Could you share which versions you use? Thanks!
I use CUDA 7.5 and cuDNN 5. I have also had this problem with GPUs on some machines whose mainboards had compatibility issues with the GPUs.
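A quick way to confirm which versions a running Torch session actually sees is sketched below, using the standard cutorch/cudnn bindings; treat the printed fields as an assumption about those packages' APIs.

```lua
-- Print the GPU name and the cuDNN version bound into this Torch session.
local cutorch = require 'cutorch'
local cudnn = require 'cudnn'
print(cutorch.getDeviceProperties(cutorch.getDevice()).name)
print(cudnn.version)  -- e.g. 5005 for cuDNN 5
```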
Has anyone successfully run XNOR-Net? I have run the code dozens of times, but it has never converged; the error is always nan. Any ideas on how to make the training converge?