Closed — abangdd closed this issue 8 years ago
Some explanations by you in the BN PR post
Let me, as the author of EltwiseAffine, say something against it in the special BatchNorm case :) I first implemented it to fully reproduce the results of the original paper. But now: 1) I don't see why the following convolution layer could not learn the scale or bias itself. 2) More importantly, I don't see any difference with or without EltwiseAffine. I am now running a quick experiment on BN-Caffenet-128 (128 instead of 227 image size, for speed): original; this BN; this BN + EltwiseAffine; this BN, but after ReLU instead of before. As soon as it finishes I'll put the results online, but so far the "this BN, but after ReLU" variant is leading by a great margin.
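To make the two placements concrete, here is a minimal Caffe prototxt sketch of the orderings being compared (layer names and conv parameters are hypothetical, not taken from the actual experiment files):

```
# Variant "BN before ReLU" (as in the original paper):  conv1 -> bn1 -> relu1
# Variant "BN after ReLU"  (leading in this experiment): conv1 -> relu1 -> bn1
layer { name: "conv1" type: "Convolution" bottom: "data"  top: "conv1"
        convolution_param { num_output: 96 kernel_size: 11 stride: 4 } }
layer { name: "relu1" type: "ReLU"      bottom: "conv1" top: "conv1" }
layer { name: "bn1"   type: "BatchNorm" bottom: "conv1" top: "conv1" }
```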
ChannelScalarLayer and BiasLayer make sense to me :)
hi @abangdd, you are right, no scale/shift yet. My EA implementation is buggy, which was found by @siddharthm83. But I am going to fix and finish it soon.
"lr_mult of BatchNorm is 0" is not relevant here: lr_mult=0 is required by the internal implementation of the BatchNorm layer.
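For context, the standard Caffe BatchNorm layer keeps three internal blobs (running mean, running variance, and a moving-average scale factor) that are updated by the layer itself rather than by the solver, so the usual convention is to freeze all three (layer names here are illustrative):

```
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  param { lr_mult: 0 }  # running mean
  param { lr_mult: 0 }  # running variance
  param { lr_mult: 0 }  # moving-average scale factor
}
```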
I see, thank you
@abangdd it looks like the problem was found: https://github.com/BVLC/caffe/pull/2996#issuecomment-170445314 So I have started to train a BN-EA-ReLU network.
@abangdd
So: BN after ReLU > BN after ReLU + EA > BN before ReLU + EA > BN before ReLU
great benchmark on how to apply BN !
Why is BN after ReLU better than BN after conv?
The results show BN after ReLU is better than BN after conv, but why? How do you justify that?
@Pratyeka In the batchnorm paper the authors state that the INPUT of a layer should be normalized (mean = 0, var = 1) to make learning easier. So BN after ReLU normalizes the input to the next conv layer. But this is only speculation and I am not a theory guy, sorry :)
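A small numpy sketch of that intuition (illustrative only, not the Caffe implementation): ReLU makes activations non-negative, shifting their mean above zero, and a BatchNorm placed after it re-normalizes exactly the tensor that the next conv layer consumes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Roughly zero-mean pre-activations of some layer.
pre_act = rng.normal(size=(10000,))

# ReLU output is non-negative, so its mean is pushed above zero;
# it is no longer normalized.
relu_out = np.maximum(pre_act, 0.0)

# BatchNorm after ReLU: normalize the very tensor the next conv sees.
eps = 1e-5
bn_out = (relu_out - relu_out.mean()) / np.sqrt(relu_out.var() + eps)

print(relu_out.mean())  # positive: ReLU shifted the mean
print(bn_out.mean())    # approximately 0
print(bn_out.var())     # approximately 1
```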
Where can I find the ElementwiseAffine layer?
This is my name for two Caffe layers: the Scale layer and the Bias layer. Both are standard Caffe. scale * x + bias is essentially an affine transformation.
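Concretely, in Caffe prototxt the pair can be folded into a single Scale layer via `bias_term` (layer and blob names here are hypothetical):

```
layer {
  name: "bn1_affine"
  type: "Scale"
  bottom: "bn1"
  top: "bn1"
  scale_param { bias_term: true }  # learns per-channel scale * x + bias
}
```

Equivalently, one can declare a separate `type: "Bias"` layer after the Scale layer.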
hi @ducha-aiki, I saw in the caffenet128_lsuv_no_lrn_BatchNormAfterReLU.prototxt file that the lr_mult of BatchNorm is 0, and there are no other layers doing scale/shift learning, so were the tests all done without scale/shift learning? Thank you.