Hi,

First of all, thanks for this repo with DenseNet for Keras. As I understand it, you ported the architecture and weights from Caffe. I just have a question about the actual purpose of your custom Scale layer after BatchNormalization in Keras v2: don't they perform the same work?
BatchNorm in Caffe works slightly differently from that of Keras. The BatchNorm layer in Caffe only performs "standardization" (i.e. it subtracts the mini-batch mean from each sample and divides by the square root of the mini-batch variance) without the scale and bias parameters. To compensate, the BatchNorm layer is usually followed by a Scale layer that introduces the missing scale and bias terms. Since we are porting the Caffe model to Keras, we have to define a custom Scale layer to incorporate the scale and bias terms into the model.
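For reference, here is a minimal sketch of what such a Scale layer can look like in Keras. This is illustrative only and may differ in details from the layer actually used in this repo; it assumes channels-last inputs:

```python
from tensorflow.keras.layers import Layer


class Scale(Layer):
    """Per-channel affine transform: y = x * gamma + beta."""

    def build(self, input_shape):
        channels = input_shape[-1]  # assumes channels-last data format
        self.gamma = self.add_weight(
            name="gamma", shape=(channels,), initializer="ones", trainable=True
        )
        self.beta = self.add_weight(
            name="beta", shape=(channels,), initializer="zeros", trainable=True
        )

    def call(self, inputs):
        return inputs * self.gamma + self.beta
```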
Yes, that's true. This follows the Caffe BatchNorm documentation, which explicitly says:
> Note that the original paper also included a per-channel learned bias and scaling factor. To implement this in Caffe, define a `ScaleLayer` configured with `bias_term: true` after each `BatchNormLayer` to handle both the bias and scaling factor.
So, as you said, Caffe's BatchNorm + Scale compute:

```
y_bn    = (x - mean) / sqrt(var)
y_scale = y_bn * gamma + beta
```
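The two-step form is the same as a single affine normalization step; here is a quick NumPy check (I've added the usual epsilon for numerical stability, which both frameworks use even though it's omitted in the formulas above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))               # 8 samples, 4 channels
mean, var = x.mean(axis=0), x.var(axis=0)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
eps = 1e-5

# Caffe: BatchNorm (standardization only) followed by Scale
y_bn = (x - mean) / np.sqrt(var + eps)
y_scale = y_bn * gamma + beta

# Keras: one BatchNormalization with center=True and scale=True
y_keras = (x - mean) / np.sqrt(var + eps) * gamma + beta

assert np.allclose(y_scale, y_keras)
```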
However, on the Keras side, these options are already included:
```
center: If True, add offset of `beta` to normalized tensor.
    If False, `beta` is ignored.
scale: If True, multiply by `gamma`.
    If False, `gamma` is not used.
    When the next layer is linear (also e.g. `nn.relu`),
    this can be disabled since the scaling
    will be done by the next layer.
beta_initializer: Initializer for the beta weight.
gamma_initializer: Initializer for the gamma weight.
```
which TensorFlow, for example, implements as

```
output = (x - mean) / sqrt(var + epsilon) * gamma + beta
```
So I just wonder: why isn't it possible to use `beta_initializer` and `gamma_initializer` directly for scaling purposes?
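E.g., instead of a separate Scale layer, the Caffe parameters could be loaded straight into the BatchNormalization layer. A sketch (the Caffe blob names below are hypothetical, and Keras stores the BatchNormalization weights in the order gamma, beta, moving mean, moving variance):

```python
from tensorflow.keras.layers import BatchNormalization

# Hypothetical arrays extracted from the Caffe model:
#   bn_mean, bn_var   from the BatchNorm layer (after dividing by its
#                     moving-average scale-factor blob)
#   sc_gamma, sc_beta from the following Scale layer
bn = BatchNormalization(epsilon=1e-5)  # Caffe's default eps
bn.build(input_shape=(None, 224, 224, sc_gamma.shape[0]))
bn.set_weights([sc_gamma, sc_beta, bn_mean, bn_var])
```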
Thanks
Thanks for your detailed elaboration. Yes, I think your suggested method of copying the beta and gamma parameters from Caffe's Scale layer directly into Keras' BatchNorm layer is completely valid. It just happened that I used a Caffe-to-Keras weight converter that defines a separate Scale layer to handle the scaling parameters. Both methods work.
Sure. It's just that in your version there are more trainable parameters than in, for example, the PyTorch implementation.
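A quick way to see the redundancy, using the Scale sketch from above: for C channels, BatchNormalization already carries 2*C trainable weights, so the extra Scale layer duplicates them.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import BatchNormalization

C = 64
model = Sequential([
    Input(shape=(32, 32, C)),
    BatchNormalization(),  # 4*C params: 2*C trainable + 2*C moving stats
    Scale(),               # another 2*C trainable gamma/beta
])
model.summary()
```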