Use of TimeDistributed(BatchNormalization())

yhenon commented 7 years ago

I believe there is an issue with using TimeDistributed(BatchNormalization()) in Keras, as is done in keras-rcnn/keras_rcnn/classifiers/resnet.py (although I may be mistaken). As I understand it, this leads to the moving mean and variance not being updated.

A small example:

import numpy as np
from keras.layers import *
from keras.models import *
import keras.backend as K

img_size = 8
batch_size = 64
num_time_steps = 4
num_channels = 3

inputs = Input(shape=(num_time_steps, img_size, img_size, num_channels))
x = TimeDistributed(BatchNormalization(axis=3))(inputs)

model = Model(inputs=inputs, outputs=x)
model.compile(loss='mae', optimizer='sgd')

X = np.random.rand(batch_size, num_time_steps, img_size, img_size, num_channels)
Y = np.random.rand(batch_size, num_time_steps, img_size, img_size, num_channels)
history = model.fit(X, Y, epochs=4)

print(model.layers[1].get_weights()) # print the weights of the BN layer

And the output of the print() statement is:

[array([ 0.97972429,  0.97963089,  0.9796797 ], dtype=float32),
 array([ 0.00780913,  0.00766754,  0.00772369], dtype=float32),
 array([ 0.,  0.,  0.], dtype=float32),
 array([ 1.,  1.,  1.], dtype=float32)]

Note that [0, 0, 0] and [1, 1, 1] are the default, non-updated values of the moving mean and variance. As an alternative, I think using BatchNormalization(axis=bn_axis+1), with the +1 as an offset to account for the extra time dimension, is an OK fix for the problem.
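
For reference, here is a minimal NumPy sketch (not Keras code) of what a flat batch normalization over the channel axis of the 5D tensor is expected to do in training mode: reduce over every axis except the channel axis and fold the batch statistics into the moving averages. The helper name `update_moving_stats` is hypothetical and the update rule is an assumption for illustration; the momentum of 0.99 mirrors the Keras default.

```python
import numpy as np

def update_moving_stats(x, moving_mean, moving_var, axis=-1, momentum=0.99):
    """Return updated (moving_mean, moving_var) after one training batch.

    Statistics are reduced over every axis except `axis` (the channel axis).
    """
    axis = axis % x.ndim
    reduce_axes = tuple(i for i in range(x.ndim) if i != axis)
    batch_mean = x.mean(axis=reduce_axes)
    batch_var = x.var(axis=reduce_axes)
    # Exponential moving average (assumed update rule for this sketch).
    new_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    new_var = momentum * moving_var + (1.0 - momentum) * batch_var
    return new_mean, new_var

rng = np.random.RandomState(0)
X = rng.rand(64, 4, 8, 8, 3)  # (batch, time, height, width, channels)

moving_mean = np.zeros(3)  # same initial values as in the printed weights
moving_var = np.ones(3)
for _ in range(4):  # four updates, one per batch
    moving_mean, moving_var = update_moving_stats(
        X, moving_mean, moving_var, axis=4)

# After a few updates the values have drifted away from [0, 0, 0] / [1, 1, 1].
print(moving_mean, moving_var)
```

If the moving statistics of a layer stay at exactly zeros and ones after training, as in the output above, the update step is evidently not being run for that layer.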

0x00b1 commented 7 years ago

Hi, @yhenon!

I appreciate the comment and the explanation (I was puzzled by your custom BatchNormalization implementation)! I’ll take a look. The BatchNormalization(axis=bn_axis+1) sounds like the better way to go!

I’m curious, would you be interested in helping out? Your package was certainly an inspiration!

JihongJu commented 7 years ago

I was not aware of this problem. Thank you for pointing it out. As far as I understand, the TimeDistributed layer should apply the wrapped layer to tensors whose shape excludes the time dimension. If this is not the case for BatchNormalization, it might be an issue for Keras as well, because that would be inconsistent with the other layers. I'm not sure whether the issue is caused by the extra dimension or by something else. That seems interesting and I will look into it.

JihongJu commented 7 years ago

Hi @yhenon ,

I've tried the TimeDistributed BatchNormalization with the following sample:

import numpy as np
from keras.layers import *
from keras.models import *
import keras.backend as K

img_size = 8
batch_size = 64
num_time_steps = 4
num_channels = 3
K.set_learning_phase(1)
X = np.random.rand(batch_size, num_time_steps,
                   img_size, img_size, num_channels)
x = K.variable(X)
y = TimeDistributed(BatchNormalization(axis=-1))(x)
print(K.int_shape(y))

norm = K.eval(y)
for i in range(num_time_steps):
    for j in range(num_channels):
        print(norm[:, i, ..., j].mean(), norm[:, i, ..., j].std())

And the results were:

(64, 4, 8, 8, 3)
-5.02914e-08 0.994199
2.79397e-09 0.99404
-2.79397e-08 0.994136
-3.21306e-08 0.994103
3.53903e-08 0.99412
2.79397e-08 0.994144
3.35276e-08 0.993953
-8.3819e-09 0.994113
1.11759e-08 0.994066
-8.73115e-08 0.993973
-7.45058e-09 0.993843
-6.61239e-08 0.99407

This seems to match what we desired from the BatchNormalization. It returns normalized activations per batch and the normalization was applied independently to all the data streams.
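
The standard deviations above sit near 0.994 rather than exactly 1. One plausible explanation, assuming the layer used the Keras default epsilon of 1e-3 and inputs uniform on [0, 1) (variance about 1/12), is that epsilon is added to the variance before taking the square root. The NumPy sketch below is not the Keras implementation; `batch_norm_train` is a hypothetical helper, and for simplicity it pools statistics over the time axis instead of treating each time step separately.

```python
import numpy as np

def batch_norm_train(x, axis=-1, epsilon=1e-3):
    """Training-mode normalization using the statistics of `x` itself."""
    axis = axis % x.ndim
    reduce_axes = tuple(i for i in range(x.ndim) if i != axis)
    mean = x.mean(axis=reduce_axes, keepdims=True)
    var = x.var(axis=reduce_axes, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.RandomState(0)
X = rng.rand(64, 4, 8, 8, 3)
norm = batch_norm_train(X, axis=-1)

# Predicted std for a uniform input: sqrt(var / (var + epsilon)), var = 1/12.
predicted_std = np.sqrt((1.0 / 12.0) / (1.0 / 12.0 + 1e-3))
print(norm[..., 0].mean(), norm[..., 0].std(), predicted_std)
```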

yhenon commented 7 years ago

@JihongJu After looking over your code, I still think my issue stands (though I may be missing something). To clarify my point a bit:

TimeDistributed(BatchNorm()) seems to work fine at training time (as you point out), as it normalizes using the statistics of the mini-batch
TimeDistributed(BatchNorm()) does not work fine at test time, as it normalizes using statistics computed on the training set. However, these statistics never get updated when the BN layer is in a TimeDistributed wrapper.

The problem stems from your line K.set_learning_phase(1), which runs BN in training mode. However, keeping K.set_learning_phase(1) at test time is not an option, since it makes a number of layers (like dropout) behave undesirably.

Here's a more complete example, where we compute the stats on a batch at both train and test time, using both approaches to BN:

import numpy as np
from keras.layers import *
from keras.models import *
import keras.backend as K

def test_bn(batch_norm_type, learning_phase):
    K.set_learning_phase(learning_phase)

    img_size = 8
    batch_size = 64
    num_time_steps = 4
    num_channels = 3

    inputs = Input(shape=(num_time_steps, img_size, img_size, num_channels))

    if batch_norm_type == 'time_dist':
        # momentum increased for faster update of dataset statistics
        x = TimeDistributed(BatchNormalization(axis=-1, momentum=0.5))(inputs)
    elif batch_norm_type == 'flat':
        x = BatchNormalization(axis=4, momentum=0.5)(inputs)

    model = Model(inputs=inputs, outputs=x)
    model.compile(loss='mae', optimizer='sgd')

    X = np.random.rand(batch_size, num_time_steps, img_size, img_size, num_channels)
    Y = np.random.rand(batch_size, num_time_steps, img_size, img_size, num_channels)
    history = model.fit(X, Y, epochs=4, verbose=0)

    P = model.predict(X)

    print('bn_type: {:10} | learning_phase: {} | mean: {:14} | std: {:14}'.format(
        batch_norm_type, learning_phase, 
        P.mean(), P.std()))
    return

for batch_norm_type in ['time_dist', 'flat']:
    for learning_phase in [0, 1]:
        test_bn(batch_norm_type, learning_phase)

And the corresponding output:

bn_type: time_dist  | learning_phase: 0 | mean: 0.498111873865 | std: 0.287745058537
bn_type: time_dist  | learning_phase: 1 | mean: 0.00772533146665 | std: 0.973870813847
bn_type: flat       | learning_phase: 0 | mean: 0.0119582833722 | std: 0.961674869061
bn_type: flat       | learning_phase: 1 | mean: 0.0076981917955 | std: 0.973761022091
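
One way to read the first row: if the moving statistics are stuck at their initial values (mean 0, variance 1), inference-mode batch normalization is close to the identity, so the prediction keeps the statistics of the uniform input (mean near 0.5, standard deviation near 0.289). A NumPy sketch under that assumption follows; `batch_norm_inference` is a hypothetical helper, and gamma=1, beta=0 are chosen for simplicity rather than taken from the trained model.

```python
import numpy as np

def batch_norm_inference(x, moving_mean, moving_var,
                         gamma=1.0, beta=0.0, epsilon=1e-3):
    """Inference-mode normalization with fixed (moving) statistics."""
    return gamma * (x - moving_mean) / np.sqrt(moving_var + epsilon) + beta

rng = np.random.RandomState(0)
X = rng.rand(64, 4, 8, 8, 3)

# Case 1: statistics never updated (the TimeDistributed symptom).
stale = batch_norm_inference(X, moving_mean=np.zeros(3), moving_var=np.ones(3))

# Case 2: statistics that match the data (what a working update converges to).
per_channel_mean = X.mean(axis=(0, 1, 2, 3))
per_channel_var = X.var(axis=(0, 1, 2, 3))
fresh = batch_norm_inference(X, per_channel_mean, per_channel_var)

print('stale stats:  ', stale.mean(), stale.std())
print('updated stats:', fresh.mean(), fresh.std())
```

With stale statistics the output stays near mean 0.5 and std 0.29; with statistics that match the data it moves to roughly zero mean and unit std, which is the pattern in the other three rows.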

yhenon commented 7 years ago

@0x00b1 Hi! To be clear, in my implementation, I was just implementing what the paper said:

For the usage of BN layers, after pretraining, we compute the BN statistics (means and variances) for each layer on the ImageNet training set. Then the BN layers are fixed during fine-tuning for object detection. As such, the BN layers become linear activations with constant offsets and scales, and BN statistics are not updated by fine-tuning. We fix the BN layers mainly for reducing memory consumption in Faster R-CNN training.

This also provided a way of dealing with the above issue, so I left it.
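
The "linear activations with constant offsets and scales" remark in the quoted passage can be sketched in NumPy: once the statistics are frozen, the whole layer folds into y = scale * x + offset. The helper `fold_frozen_bn` is hypothetical and the parameter values are made up for illustration.

```python
import numpy as np

def fold_frozen_bn(gamma, beta, mean, var, epsilon=1e-3):
    """Fold frozen BN parameters into a per-channel scale and offset."""
    scale = gamma / np.sqrt(var + epsilon)
    offset = beta - mean * scale
    return scale, offset

# Illustrative per-channel parameters (not from any real model).
gamma = np.array([0.9, 1.0, 1.1])
beta = np.array([0.1, 0.0, -0.1])
mean = np.array([0.4, 0.5, 0.6])
var = np.array([0.08, 0.09, 0.10])

rng = np.random.RandomState(0)
x = rng.rand(2, 8, 8, 3)  # (batch, height, width, channels)

scale, offset = fold_frozen_bn(gamma, beta, mean, var)
folded = scale * x + offset
reference = gamma * (x - mean) / np.sqrt(var + 1e-3) + beta

# The folded form and the full BN formula agree up to rounding.
print(np.allclose(folded, reference))
```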

I would certainly be interested in helping - my original implementation is rather limited in scope and full of hacks, and a better quality keras frcnn would be desirable.

JihongJu commented 7 years ago

@yhenon Hmm, now I get the point. In that case, I agree with you, adding a flat BN, instead of a time distributed BN, to the 5D tensor seems fine.

hgaiser commented 7 years ago

Can I abuse this issue to ask why TimeDistributed layers are necessary? Is it to perform computation per ROI (meaning the term 'time distributed' is a bit poorly chosen here)? I noticed in py-faster-rcnn that they are limited to single batch training only, presumably because Caffe blobs are limited to 4d. If you have batch_size > 1 and ROIs, your blob would need 5 dimensions (batch_id, roi_id, height, width, channels). Is the use of TimeDistributed intended to get this fifth dimension?

In addition, I noticed that for Keras the moving average / variation is not updated when in test mode (see here). Wouldn't that be an issue? Shouldn't it be updated during test mode? Should this be fixed in Keras? So many questions :)

0x00b1 commented 7 years ago

@JihongJu I played with this too. I think @yhenon is correct. And I believe the suggestion by @yhenon will work (i.e. BatchNormalization(axis=bn_axis + 1)).

@yhenon Want to send a PR? 😄

0x00b1 commented 7 years ago

Can I abuse this issue to ask why TimeDistributed layers are necessary? Is it to perform computation per ROI (meaning the term 'time distributed' is a bit poorly chosen here)? I noticed in py-faster-rcnn that they are limited to single batch training only, presumably because Caffe blobs are limited to 4d. If you have batch_size > 1 and ROIs, your blob would need 5 dimensions (batch_id, roi_id, height, width, channels). Is the use of TimeDistributed intended to get this fifth dimension?

Yep. Your instincts are right. It’s a super clever hack by @yhenon to exploit the TimeDistributed wrapper’s batching to iterate across a variable number of regions. And, I agree, TimeDistributed is a bad name. I think Distributed (or Batched) would make more sense. (cc: @fchollet)
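
A rough NumPy sketch of that idea, assuming the wrapper behaves like a reshape that merges the batch and region axes, applies a function to the resulting 4D tensor, and then restores the two leading axes. `time_distributed`, `per_image_fn` and `global_average_pool` are hypothetical names, not Keras APIs, and the sketch uses a fixed number of regions.

```python
import numpy as np

def time_distributed(per_image_fn, x):
    """Apply `per_image_fn` to every (batch, step) slice of a 5D tensor."""
    batch, steps = x.shape[:2]
    # Merge the batch and step (region) axes into one leading axis.
    merged = x.reshape((batch * steps,) + x.shape[2:])
    out = per_image_fn(merged)
    # Restore the (batch, step) leading axes.
    return out.reshape((batch, steps) + out.shape[1:])

def global_average_pool(images):
    """Average over height and width: (n, h, w, c) -> (n, c)."""
    return images.mean(axis=(1, 2))

# (batch_id, roi_id, height, width, channels)
rois = np.random.RandomState(0).rand(2, 5, 7, 7, 16)
pooled = time_distributed(global_average_pool, rois)
print(pooled.shape)  # (2, 5, 16)
```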

0x00b1 commented 7 years ago

In addition, I noticed that for Keras the moving average / variation is not updated when in test mode (see here). Wouldn't that be an issue? Shouldn't it be updated during test mode? Should this be fixed in Keras? So many questions :)

Hrm. Why do you think it should be updated during test (i.e. inference or prediction)?

hgaiser commented 7 years ago

Hrm. Why do you think it should be updated during test (i.e. inference or prediction)?

I'm not sure, but it sounds like the moving average / variation is depending on your current data, not on the data you trained on. I will read more today on BatchNormalization to see how it should be.

waleedka commented 7 years ago

I pushed a PR to fix this issue here: https://github.com/fchollet/keras/pull/7467. I believe it's a more generic solution than the bn_axis+1 solution, and it fixes the root problem in the TimeDistributed layer.

yhenon commented 7 years ago

Thanks to @waleedka for making that PR, which has now been merged! Re-running the above snippet with a freshly checked-out Keras install gives:

bn_type: time_dist  | learning_phase: 0 | mean: 0.0126442806795 | std: 0.960675358772
bn_type: time_dist  | learning_phase: 1 | mean: 0.00776057131588 | std: 0.973823308945
bn_type: flat       | learning_phase: 0 | mean: 0.0131619861349 | std: 0.961250126362
bn_type: flat       | learning_phase: 1 | mean: 0.00772643135861 | std:  0.97383749485

Which is the desired output. This should keep the API a bit simpler, since TimeDistributed() can now be applied to all layers in the final stage classifier. It means people will need to update their keras version to the latest, but that's ok.

0x00b1 commented 7 years ago

Awesome! Thanks for the update, @yhenon and thanks for the work @waleedka!

@waleedka please feel free to add yourself to the CONTRIBUTORS file!

subhashree-r commented 5 years ago

What is the best way to extend this script to a batch inference / training? @yhenon

broadinstitute / keras-rcnn

Use of TimeDistributed(BatchNormalization()) #42