shuangchenli closed this issue 7 years ago.
I guess you are using dropout 0.5? It should be 0.1
Thanks for your reply! But I already set dropout to 0.1...
Hmmm, that is weird, because with dropout 0.1 the train accuracy should be higher than the val accuracy after epoch 40.
Yes, it is weird. Do you have any suggestions for me, please? Thanks a lot!
This is my learning curve. I can see that even in the first epoch your accuracy is much less than mine
Thanks. It is just too weird. Just to confirm, I used the configuration `-nGPU 1 -batchSize 128 -netType alexnet -binaryWeight -dropout 0.1 -epochSize 10000` and the LR regimes:

```lua
{  1,  18, 1e-2, 5e-4 },
{ 19,  29, 5e-3, 5e-4 },
{ 30,  43, 1e-3,    0 },
{ 44,  52, 5e-4,    0 },
{ 53, 1e8, 1e-4,    0 },
```
Does that mean there is something wrong with my training dataset? Did you also resize the training data with `find . -name "*.JPEG" | xargs -I {} convert {} -resize "256^>" {}`?
I got a perfect 57.1% using full precision with imagenet-multiGPU.torch, indicating my data and Torch version are fine. I really don't know what I missed for BWN...
Ok, another sanity check is to run my code without `-binaryWeight`; then you should get similar performance. If this works, it means that I screwed something up in the training code.
Thanks for the hint. I will try it.
Well, I am running without `-binaryWeight`. After the first epoch I got 14.3% train and 22% val, pretty much what you have in your training curve, and even higher than what I got from imagenet-multiGPU.torch... Any ideas, please?
I will try to train it again to see what changed.
Thanks a lot!
Ok, I fixed a bug in the gradient. After one epoch I got 14.4% (train) and 22.2% (test) accuracy.
Thanks a lot! I really appreciate it.
Could you please do me one more favor? Could you please explain line 85,

```lua
m:add(1/(n)):mul(1-1/s[2]):mul(n);
```

in the function `updateBinaryGradWeight(convNodes)`? I get that the part `m:add(1/(n))` is used for calculating the 1/n term from the gradient of the scaling factor α, but what does `:mul(1-1/s[2]):mul(n)` do?
Sure: `:mul(1-1/s[2])` is the gradient of mean centering, and `:mul(n)` scales the gradient back with respect to the filter size. These two are not in the paper; I added them after publishing the paper.
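To make the three factors concrete, here is a numpy sketch of the gradient multiplier `m` (my own reconstruction of the Torch code, not from the repo; shapes and names are made up):

```python
import numpy as np

np.random.seed(0)
# Hypothetical conv weight with Torch layout (nOutput, nInput, kH, kW)
W = np.random.randn(8, 4, 3, 3)
grad = np.random.randn(*W.shape)   # incoming gradient w.r.t. the binary weights
n = W[0].size                      # convNodes[i].weight[1]:nElement() = 4*3*3
s2 = W.shape[1]                    # s[2] in the Lua code (nInput)

# alpha per output filter: mean absolute value of that filter, expanded to W's shape
alpha = np.abs(W).reshape(W.shape[0], -1).mean(axis=1)
m = np.broadcast_to(alpha[:, None, None, None], W.shape).copy()

# Straight-through estimator: zero the multiplier where the weight is clipped
m[W <= -1] = 0.0                   # weight:le(-1)
m[W >= 1] = 0.0                    # weight:ge(1)

# line 85: m:add(1/(n)):mul(1-1/s[2]):mul(n)
# 1/n comes from the gradient of alpha, (1-1/s[2]) from mean centering,
# and the final n rescales the gradient by the filter size
m = (m + 1.0 / n) * (1.0 - 1.0 / s2) * n

grad_scaled = grad * m             # gradWeight:cmul(m)
```

Note that at clipped positions (|W| >= 1) the multiplier reduces to `(1/n) * (1 - 1/s2) * n = 1 - 1/s2`, i.e. only the alpha and mean-centering terms survive.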
Thanks again! I got similar result now~
I still cannot understand `:mul(n)`. Could you please explain a bit more? Why do you need to "scale back" the gradient? The gradient is calculated with W̃ = sign(W) · α, so I thought it was never scaled?
Because we are multiplying the gradients with the scaling factors, but the weights are already scaled, we are doing the scaling twice, so I am scaling back with n. I have a reason for that which is complicated: basically, I am scaling the learningRate differently for each layer. This is a bit hacky, but if you want to be consistent with the paper you need to change it as follows:
```lua
function updateBinaryGradWeight(convNodes)
  for i = 2, #convNodes-1 do
    local n = convNodes[i].weight[1]:nElement()
    local s = convNodes[i].weight:size()
    convNodes[i].gradWeight[convNodes[i].weight:le(-1)] = 0;
    convNodes[i].gradWeight[convNodes[i].weight:ge(1)] = 0;
    convNodes[i].gradWeight:add(1/(n)):mul(1-1/s[2]);
  end
  if opt.nGPU > 1 then
    model:syncParameters()
  end
end
```
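The paper-consistent variant above can be sketched in numpy as follows (my own translation of the Lua; the function name and test arrays are hypothetical):

```python
import numpy as np

def update_binary_grad_weight_paper(weight, grad_weight):
    # Straight-through clipping plus the 1/n term from alpha and the
    # mean-centering factor, without the extra mul(n) rescale.
    n = weight[0].size            # weight[1]:nElement() in Lua (1-indexed)
    s2 = weight.shape[1]          # s[2]
    g = grad_weight.copy()
    g[weight <= -1] = 0.0         # weight:le(-1)
    g[weight >= 1] = 0.0          # weight:ge(1)
    return (g + 1.0 / n) * (1.0 - 1.0 / s2)

# Tiny example: two filters of two 1x1 inputs each, so n = 2 and s2 = 2
W = np.array([[[[0.5]], [[-2.0]]],
              [[[1.5]], [[0.1]]]])
g = np.ones_like(W)
out = update_binary_grad_weight_paper(W, g)
# clipped entries (-2.0 and 1.5) get (0 + 1/2) * (1/2) = 0.25,
# unclipped entries get (1 + 1/2) * (1/2) = 0.75
```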
Sorry that I keep bothering you... I got even more confused...
In my understanding, the reason we multiply by the scaling factor is that we need ∂C/∂W = ∂C/∂W̃ · ∂W̃/∂W, which in the paper is ∂C/∂W̃ · (1/n + α · ∂sign(W)/∂W). Then the code

```lua
local m = convNodes[i].weight:norm(1,4):sum(3):sum(2):div(n):expand(s);
m[convNodes[i].weight:le(-1)]=0;
m[convNodes[i].weight:ge(1)]=0;
```

is the α · ∂sign(W)/∂W part, and

```lua
m:add(1/(n)):mul(1-1/s[2]);
```

together with `convNodes[i].gradWeight:cmul(m)` then does the job, which turns out to be the code before the last commit... I still don't understand why we need an extra `mul(n)` to scale it back...
On the other hand, the code in your last answer should be calculating the gradient without the α scaling instead?
Thanks a lot!
You should see it in the chain rule from the previous layers. In the backward pass we are using the scaled binary weights, so all the weights and the outputs are scaled, and we multiply them with the current gradient in a specific layer.
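As a toy numeric check of this chain-rule point (my own illustration, not from the repo): in a layer using scaled binary weights, the gradient flowing back to earlier layers already carries the α factors.

```python
import numpy as np

# Toy linear layer y = W_b @ x with scaled binary weights W_b = alpha * sign(W)
W = np.array([[0.5, -0.25],
              [0.75, -0.5]])
alpha = np.abs(W).mean(axis=1, keepdims=True)   # per-row scaling factor
W_b = alpha * np.sign(W)

x = np.array([1.0, 2.0])
y = W_b @ x

# Backward for loss L = sum(y): dL/dx = W_b^T @ dL/dy.
# Because the forward pass used W_b (not sign(W)), dx is scaled by alpha,
# so upstream gradients arriving at any layer are already alpha-scaled.
dy = np.ones_like(y)
dx = W_b.T @ dy
```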
Great, I think I got it. Thank you so much for helping me and explaining!!
@shuangchenli Can you provide some info on which GPU you used for training and how much time it took. Thanks!
@mrastegari "Ok, I fixed a bug in the gradient. After one epoch I got 14.4% (train) and 22.2% (test) accuracy."
Is the epoch size set as 10000 or 2500?
```lua
if opt.optimType == 'sgd' then m:mul(n); end
```

I am still confused by this. Could you give a formulation?
Hi,
I somehow have trouble reproducing the AlexNet BWN result. I used your suggested configuration from the previous closed issue (1 GPU, default LR, 128 batch size, 10000 epoch size), but I still got only 45% after 55 epochs... Could you please help me out? Thanks a lot!
BTW, the pretrained model totally works (56.82%). I also tried to train with the configuration from the paper (512 batch size, 0.1 LR with decay 0.01 every 4 epochs), but it doesn't work either.
Here is my training result: ![image](https://cloud.githubusercontent.com/assets/15238256/17538351/4d74547c-5e59-11e6-896e-03da9e21a175.png)