shuangchenli closed this issue 7 years ago.
I guess you are using dropout 0.5? It should be 0.1
Thanks for your reply! But I already set dropout to 0.1...
Hmmm, that is weird, because with dropout 0.1 the train accuracy should be higher than the val accuracy after epoch 40.
Yes, it is weird. Do you have any suggestions for me, please? Thanks a lot!
This is my learning curve. I can see that even in the first epoch your accuracy is much less than mine
Thanks. It is just too weird. Just to confirm, I used the configuration `-nGPU 1 -batchSize 128 -netType alexnet -binaryWeight -dropout 0.1 -epochSize 10000` and the LR regimes:

```lua
{  1,  18, 1e-2, 5e-4 },
{ 19,  29, 5e-3, 5e-4 },
{ 30,  43, 1e-3,    0 },
{ 44,  52, 5e-4,    0 },
{ 53, 1e8, 1e-4,    0 },
```
Does that mean there is something wrong with my training dataset? Did you also resize the training data with `find . -name "*.JPEG" | xargs -I {} convert {} -resize "256^>" {}`?
I got a perfect 57.1% using full precision with imagenet-multiGPU.torch, indicating my data and Torch version are fine. I really don't know what I missed for BWN...
Ok, another sanity check is to run my code without `-binaryWeight`; then you should get similar performance. If this works, it means that I screwed something up in the training code.
Thanks for the hint. I will try it.
Well, I am running without `-binaryWeight`. After the first epoch I got 14.3% train and 22% val, pretty much what you have in your training curve, and even higher than what I got from imagenet-multiGPU.torch... Any ideas, please?
I will try to train it again to see what changed.
Thanks a lot!
Ok, I fixed a bug in the gradient. After one epoch I got 14.4% (train) and 22.2% (test) accuracy.
Thanks a lot! I really appreciate it.
Could you please do me one more favor? Could you please explain line 85,

```lua
m:add(1/(n)):mul(1-1/s[2]):mul(n);
```

in the function `updateBinaryGradWeight(convNodes)`? I get that the part `m:add(1/(n))` is used for calculating the 1/n term from the gradient of the scaling factor α, but what does `:mul(1-1/s[2]):mul(n)` do?
Sure: `:mul(1-1/s[2])` is the gradient of mean centering, and `:mul(n)` scales the gradient back with respect to the filter size. These two are not in the paper; I added them after publishing the paper.
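To make the three factors concrete, here is a numpy sketch of the gradient multiplier `m` (my own reconstruction of the Torch code, not from the repo; shapes and names are made up):

```python
import numpy as np

np.random.seed(0)
# Hypothetical conv weight with Torch layout (nOutput, nInput, kH, kW)
W = np.random.randn(8, 4, 3, 3)
grad = np.random.randn(*W.shape)   # incoming gradient w.r.t. the binary weights
n = W[0].size                      # convNodes[i].weight[1]:nElement() = 4*3*3
s2 = W.shape[1]                    # s[2] in the Lua code (nInput)

# alpha per output filter: mean absolute value of that filter, expanded to W's shape
alpha = np.abs(W).reshape(W.shape[0], -1).mean(axis=1)
m = np.broadcast_to(alpha[:, None, None, None], W.shape).copy()

# Straight-through estimator: zero the multiplier where the weight is clipped
m[W <= -1] = 0.0                   # weight:le(-1)
m[W >= 1] = 0.0                    # weight:ge(1)

# line 85: m:add(1/(n)):mul(1-1/s[2]):mul(n)
# 1/n comes from the gradient of alpha, (1-1/s[2]) from mean centering,
# and the final n rescales the gradient by the filter size
m = (m + 1.0 / n) * (1.0 - 1.0 / s2) * n

grad_scaled = grad * m             # gradWeight:cmul(m)
```

Note that at clipped positions (|W| >= 1) the multiplier reduces to `(1/n) * (1 - 1/s2) * n = 1 - 1/s2`, i.e. only the alpha and mean-centering terms survive.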
Thanks again! I got similar result now~
I still cannot understand `:mul(n)`. Could you please explain a bit more? Why do you need to "scale back" the gradient? The gradient is calculated with W̃ = sign(W) · α, so I thought it was never scaled?
Because we are multiplying the gradients with the scaling factors, but the weights are already scaled, we are doing the scaling twice, so I am scaling back with n. I have a reason for that which is complicated: basically, I am scaling the learningRate differently for each layer. This is a bit hacky, but if you want to be consistent with the paper you need to change it as follows:
```lua
function updateBinaryGradWeight(convNodes)
  for i = 2, #convNodes-1 do
    local n = convNodes[i].weight[1]:nElement()
    local s = convNodes[i].weight:size()
    convNodes[i].gradWeight[convNodes[i].weight:le(-1)] = 0;
    convNodes[i].gradWeight[convNodes[i].weight:ge(1)] = 0;
    convNodes[i].gradWeight:add(1/(n)):mul(1-1/s[2]);
  end
  if opt.nGPU > 1 then
    model:syncParameters()
  end
end
```
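The paper-consistent variant above can be sketched in numpy as follows (my own translation of the Lua; the function name and test arrays are hypothetical):

```python
import numpy as np

def update_binary_grad_weight_paper(weight, grad_weight):
    # Straight-through clipping plus the 1/n term from alpha and the
    # mean-centering factor, without the extra mul(n) rescale.
    n = weight[0].size            # weight[1]:nElement() in Lua (1-indexed)
    s2 = weight.shape[1]          # s[2]
    g = grad_weight.copy()
    g[weight <= -1] = 0.0         # weight:le(-1)
    g[weight >= 1] = 0.0          # weight:ge(1)
    return (g + 1.0 / n) * (1.0 - 1.0 / s2)

# Tiny example: two filters of two 1x1 inputs each, so n = 2 and s2 = 2
W = np.array([[[[0.5]], [[-2.0]]],
              [[[1.5]], [[0.1]]]])
g = np.ones_like(W)
out = update_binary_grad_weight_paper(W, g)
# clipped entries (-2.0 and 1.5) get (0 + 1/2) * (1/2) = 0.25,
# unclipped entries get (1 + 1/2) * (1/2) = 0.75
```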
Sorry that I keep bothering you... I got even more confused...
In my understanding, the reason we multiply by the scaling factor is that we need ∂C/∂W = ∂C/∂W̃ · ∂W̃/∂W, which in the paper is ∂C/∂W̃ · (1/n + α · ∂sign(W)/∂W). Then the code

```lua
local m = convNodes[i].weight:norm(1,4):sum(3):sum(2):div(n):expand(s);
m[convNodes[i].weight:le(-1)]=0;
m[convNodes[i].weight:ge(1)]=0;
```

is the α · ∂sign(W)/∂W part, and

```lua
m:add(1/(n)):mul(1-1/s[2]);
```

together with `convNodes[i].gradWeight:cmul(m)` then does the job, which turns out to be the code before the last commit... I still don't understand why we need an extra `mul(n)` to scale it back...
On the other hand, the code in your last answer should be calculating the gradient without the α scaling instead?
Thanks a lot!
You should see it in the chain rule from the previous layers. In the backward pass we are using the scaled binary weights, so all the weights and the outputs are scaled, and we multiply them with the current gradient in a specific layer.
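As a toy numeric check of this chain-rule point (my own illustration, not from the repo): in a layer using scaled binary weights, the gradient flowing back to earlier layers already carries the α factors.

```python
import numpy as np

# Toy linear layer y = W_b @ x with scaled binary weights W_b = alpha * sign(W)
W = np.array([[0.5, -0.25],
              [0.75, -0.5]])
alpha = np.abs(W).mean(axis=1, keepdims=True)   # per-row scaling factor
W_b = alpha * np.sign(W)

x = np.array([1.0, 2.0])
y = W_b @ x

# Backward for loss L = sum(y): dL/dx = W_b^T @ dL/dy.
# Because the forward pass used W_b (not sign(W)), dx is scaled by alpha,
# so upstream gradients arriving at any layer are already alpha-scaled.
dy = np.ones_like(y)
dx = W_b.T @ dy
```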
Great, I think I got it. Thank you so much for helping me and explaining!!
@shuangchenli Can you provide some info on which GPU you used for training and how much time it took. Thanks!
@mrastegari "Ok, I fixed a bug in the gradient. After one epoch I got 14.4% (train) and 22.2% (test) accuracy."
Is the epoch size set as 10000 or 2500?
```lua
if opt.optimType == 'sgd' then m:mul(n); end
```

I am still confused by this. Could you give a formulation?
Hi,
I somehow have trouble reproducing the AlexNet BWN result. I used your suggested configuration from the previous closed issue (1 GPU, default LR, 128 batch size, 10000 epoch size), but I still got only 45% after 55 epochs... Could you please help me out? Thanks a lot!
BTW, the pretrained model totally works (56.82%). I also tried to train with the configuration from the paper (512 batch size, 0.1 LR with decay 0.01 every 4 epochs), but it doesn't work either.
Here is my training result: ![image](https://cloud.githubusercontent.com/assets/15238256/17538351/4d74547c-5e59-11e6-896e-03da9e21a175.png)