zhaoweicai opened this issue 8 years ago
Let's double-check a few things first:
1. Could you get the same accuracy with the pretrained models?
2. Could you train the BWN?
3. I have noticed that in some versions of cuDNN the precision of division causes convergence issues. If you are using Adam, you can multiply all the gradients by a large number to prevent the precision error that leads to NaN.
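For illustration, here is a minimal sketch of that gradient-scaling trick in a plain Torch/optim training closure. This is not the repo's actual train.lua; the variable names and hyperparameters are assumptions. The idea is that Adam divides each update by a running RMS of the gradients, so a uniform rescale barely changes the step but keeps tiny gradient values away from the precision limit that produces NaN.

```lua
-- Minimal sketch (not the repo's train.lua): scale all gradients by a large
-- constant right before the Adam step to avoid float-precision underflow.
local optim = require 'optim'

-- 'model', 'criterion', 'inputs' and 'labels' are assumed to be set up as in
-- any standard Torch training loop.
local parameters, gradParameters = model:getParameters()
local optimState = { learningRate = 1e-3 }  -- assumed hyperparameter

local function feval(x)
   if x ~= parameters then parameters:copy(x) end
   gradParameters:zero()
   local outputs = model:forward(inputs)
   local err = criterion:forward(outputs, labels)
   model:backward(inputs, criterion:backward(outputs, labels))
   gradParameters:mul(1e+5)  -- uniform rescale; Adam's normalization absorbs it
   return err, gradParameters
end

optim.adam(feval, parameters, optimState)
```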
Hi @mrastegari. The accuracies I get for the two pretrained models are 56.67 and 42.37, respectively. I can train BWN; I stopped at epoch #30 with a top-1 accuracy of 25.57, but that training was several weeks ago, before you fixed some bugs. For XNOR-Net, however, I have not been able to make training converge. I don't know if others have encountered the same issue.
OK, try to fix the precision issue by adding `gradParameters:mul(1e+5)` after line 184 in train.lua.
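To make the placement concrete, here is a rough sketch of the intended ordering inside the training closure; the exact line number differs across versions of train.lua, so treat this as an assumption based on the exchange below rather than the file itself.

```lua
-- Assumed ordering inside the training closure (around the cited line 184):
-- ... forward/backward pass ...
updateBinaryGradWeight(convNodes)  -- rewrite gradients of the binarized conv layers
gradParameters:mul(1e+5)           -- precision fix, applied just before the optimizer step
```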
Just to make sure: add `gradParameters:mul(1e+5)` after `updateBinaryGradWeight(convNodes)`, right? It still doesn't work for me. Has anyone experienced the same issue?
After how many iterations do you see the divergence? Also, try to follow the paper by replacing the `updateBinaryGradWeight` function with:
```lua
function updateBinaryGradWeight(convNodes)
   -- Skip the first and last convolution layers, which stay full-precision.
   for i = 2, #convNodes - 1 do
      local n = convNodes[i].weight[1]:nElement()  -- elements per output filter
      local s = convNodes[i].weight:size()
      -- Straight-through estimator: zero the gradient where |w| >= 1.
      convNodes[i].gradWeight[convNodes[i].weight:le(-1)] = 0
      convNodes[i].gradWeight[convNodes[i].weight:ge(1)] = 0
      -- Add the 1/n term from the scaling factor and rescale the gradients.
      convNodes[i].gradWeight:add(1/n):mul(1 - 1/s[2])
   end
   if opt.nGPU > 1 then
      model:syncParameters()
   end
end
```
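For anyone reading along, `convNodes` here is the list of convolution modules the function iterates over. One way to collect them with the standard nn API is sketched below; the exact module class string is an assumption and depends on the backend in use.

```lua
-- Assumed helper: gather the convolution layers for updateBinaryGradWeight.
-- Use 'nn.SpatialConvolution' instead if the model is not built on cuDNN.
local convNodes = model:findModules('cudnn.SpatialConvolution')
```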
Hi @mrastegari, thanks for your help, but it still doesn't work for me. The training diverges at the very beginning with err=nan. I have started retraining BinaryNet now, and it seems to work well so far. XNOR-Net has never worked for me.
I just pushed a modification; can you check it?
Thanks for your help. First, I changed '-cache' to './cache/'. It still doesn't work: the error becomes nan at the beginning every time, even though I have run the experiment dozens of times with different random seeds. Has anyone successfully reproduced the XNOR experiments yet? I am confused. BTW, I re-ran the Binary-Net experiment and got an accuracy of 51.65% in the end. Does the XNOR code work well for you? What do you think the problem is?
There is definitely something wrong with your setup. I asked a friend to try on his machine, and he could reproduce the same result, ~43%. Which version of Binary-Net are you using? 51.65% top-1 is too good for binary-input-and-binary-weight. Do you have code for that?
Hi @mrastegari, I found the problem: running on multiple GPUs. When I switched to 1 GPU, the model started to converge. For the multi-GPU runs, maybe I used different CUDA and cuDNN versions. Could you share which versions you use? Thanks!
I use CUDA 7.5 and cuDNN 5. I have also had this problem with GPUs on some machines whose mainboards had compatibility issues with the GPUs.
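A quick way to confirm which versions a running Torch session actually sees is sketched below, using the standard cutorch/cudnn bindings; treat the printed fields as an assumption about those packages' APIs.

```lua
-- Print the GPU name and the cuDNN version bound into this Torch session.
local cutorch = require 'cutorch'
local cudnn = require 'cudnn'
print(cutorch.getDeviceProperties(cutorch.getDevice()).name)
print(cudnn.version)  -- e.g. 5005 for cuDNN 5
```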
Has anyone successfully run XNOR-Net? I have run the code dozens of times, but it has never converged; the error is always nan. Any ideas on how to make the training converge?