flyyufelix / DenseNet-Keras

DenseNet Implementation in Keras with ImageNet Pretrained Models
MIT License

Scale layer vs BatchNormalization in Keras v2 #5

Closed · vfdev-5 closed this 6 years ago

vfdev-5 commented 7 years ago

Hi,

First of all, thanks for this repo with DenseNet for Keras. As I understand it, you ported the architecture and weights from Caffe. I just have a question about the actual purpose of your custom Scale layer after BatchNormalization in Keras v2: don't they perform the same work?

flyyufelix commented 7 years ago

BatchNorm in Caffe works slightly differently from BatchNorm in Keras. The BatchNorm layer in Caffe only performs "standardization" (i.e. it subtracts the mini-batch mean from each sample and divides by the mini-batch standard deviation) and has no scale or bias parameters. To compensate for that, the BatchNorm layer is usually followed by a Scale layer that introduces the missing scale and bias terms. Since we port the Caffe model to Keras, we have to define a custom Scale layer to incorporate the scale and bias terms into the model.
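For reference, here is a minimal sketch of what such a per-channel Scale layer can look like in Keras 2. It is only illustrative: the class name, imports, and defaults are assumptions and may differ from the actual custom layer shipped in this repo.

```python
# Minimal sketch of a per-channel Scale layer for Keras 2 (illustrative, not
# necessarily identical to the custom layer used in this repo).
from keras import backend as K
from keras.engine.topology import Layer  # in newer Keras: from keras.layers import Layer


class Scale(Layer):
    """Learns a per-channel scale (gamma) and bias (beta): y = gamma * x + beta."""

    def __init__(self, axis=-1, **kwargs):
        self.axis = axis
        super(Scale, self).__init__(**kwargs)

    def build(self, input_shape):
        shape = (int(input_shape[self.axis]),)
        # One gamma and one beta per channel, both trainable.
        self.gamma = self.add_weight(name='gamma', shape=shape, initializer='ones')
        self.beta = self.add_weight(name='beta', shape=shape, initializer='zeros')
        super(Scale, self).build(input_shape)

    def call(self, x, mask=None):
        # Broadcast gamma and beta along every axis except the channel axis.
        broadcast_shape = [1] * K.ndim(x)
        broadcast_shape[self.axis] = K.int_shape(x)[self.axis]
        return K.reshape(self.gamma, broadcast_shape) * x + K.reshape(self.beta, broadcast_shape)

    def get_config(self):
        config = {'axis': self.axis}
        base_config = super(Scale, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
```

In a converted model a layer like this would typically sit right after each BatchNormalization, e.g. `x = Scale(axis=-1, name='scale')(x)`.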

vfdev-5 commented 7 years ago

Yes, that's true. The Caffe BatchNorm documentation explicitly says:

Note that the original paper also included a per-channel learned bias and scaling factor. To implement this in Caffe, define a ScaleLayer configured with bias_term: true after each BatchNormLayer to handle both the bias and scaling factor.

So, as you said, Caffe's BatchNorm + Scale compute:

y_bn = (x - mean) / sqrt(var)
y_scale = y_bn * gamma + beta

However, on the Keras side, these options are already included in BatchNormalization:

        center: If True, add offset of `beta` to normalized tensor.
            If False, `beta` is ignored.
        scale: If True, multiply by `gamma`.
            If False, `gamma` is not used.
            When the next layer is linear (also e.g. `nn.relu`),
            this can be disabled since the scaling
            will be done by the next layer.
        beta_initializer: Initializer for the beta weight.
        gamma_initializer: Initializer for the gamma weight.

which, with the TensorFlow backend for example, is computed as

output = (x - mean) / sqrt(var + epsilon) * gamma + beta

So I just wonder: why isn't it possible to use beta_initializer and gamma_initializer directly for the scaling?

Thanks

flyyufelix commented 6 years ago

Thanks for your detailed elaboration. Yes, I think your suggested method of copying the beta and gamma parameters from Caffe's Scale layer directly into Keras' BatchNorm layer is completely valid. It just happened that I used a Caffe-to-Keras weight converter that defines a separate Scale layer to handle the scaling parameters. Both methods work.
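For illustration, a sketch of that folding, assuming the Caffe blobs have already been extracted into NumPy arrays (the `caffe_*` names, channel count, and input shape are placeholders, not the repo's actual converter code):

```python
# Hypothetical sketch: folding Caffe's BatchNorm + Scale parameters into a
# single Keras BatchNormalization layer. The caffe_* arrays stand in for
# values extracted from the corresponding Caffe blobs.
import numpy as np
from keras.layers import Input, BatchNormalization
from keras.models import Model

channels = 64
caffe_mean = np.zeros(channels, dtype='float32')   # running mean from the Caffe BatchNorm layer
caffe_var = np.ones(channels, dtype='float32')     # running variance from the Caffe BatchNorm layer
caffe_gamma = np.ones(channels, dtype='float32')   # weights of the Caffe Scale layer
caffe_beta = np.zeros(channels, dtype='float32')   # bias of the Caffe Scale layer (bias_term: true)

# axis=-1 assumes channels_last image ordering.
inp = Input(shape=(56, 56, channels))
out = BatchNormalization(axis=-1, epsilon=1e-5, name='bn')(inp)
model = Model(inp, out)

# Keras stores BatchNormalization weights as
# [gamma, beta, moving_mean, moving_variance].
model.get_layer('bn').set_weights([caffe_gamma, caffe_beta, caffe_mean, caffe_var])
```

(Note that Caffe's BatchNorm layer also stores a moving-average scale factor blob by which the mean and variance must be divided before copying.)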

vfdev-5 commented 6 years ago

Sure. It's just that in your version there are more trainable parameters than in, for example, the PyTorch implementation.
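As a rough illustration of where the extra trainable parameters come from (channel count and shapes are illustrative):

```python
# Trainable parameter count of a single BatchNormalization layer.
import numpy as np
from keras import backend as K
from keras.layers import Input, BatchNormalization
from keras.models import Model

channels = 64
inp = Input(shape=(56, 56, channels))
out = BatchNormalization(axis=-1)(inp)
model = Model(inp, out)

trainable = int(np.sum([K.count_params(w) for w in model.trainable_weights]))
print(trainable)  # 128 = 2 * channels (gamma + beta); moving stats are non-trainable

# Appending a separate per-channel Scale layer after each BatchNormalization
# adds another 2 * channels trainable parameters (its own gamma and beta) on
# top of these, compared to a single BatchNorm layer with built-in scale/offset.
```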