Closed yhenon closed 7 years ago
Hi, @yhenon!
I appreciate the comment and the explanation (I was puzzled by your custom BatchNormalization implementation)! I’ll take a look. The BatchNormalization(axis=bn_axis+1) sounds like the better way to go!
I’m curious, would you be interested in helping out? Your package was certainly an inspiration!
I was not aware of this problem. Thank you for pointing it out. As far as I understand, the TimeDistributed wrapper should apply the wrapped layer to a tensor whose shape excludes the time dimension. If this is not the case for BatchNormalization, it might be an issue for keras as well, because that would be inconsistent with the other layers. I'm not sure whether the issue is caused by the extra dimension or by something else. That seems interesting and I will look into it.
Hi @yhenon ,
I've tried the TimeDistributed BatchNormalization with the following sample:
import numpy as np
from keras.layers import *
from keras.models import *
import keras.backend as K
img_size = 8
batch_size = 64
num_time_steps = 4
num_channels = 3
K.set_learning_phase(1)
X = np.random.rand(batch_size, num_time_steps,
                   img_size, img_size, num_channels)
x = K.variable(X)
y = TimeDistributed(BatchNormalization(axis=-1))(x)
print(K.int_shape(y))
norm = K.eval(y)
for i in range(num_time_steps):
    for j in range(num_channels):
        print(norm[:, i, ..., j].mean(), norm[:, i, ..., j].std())
And the results were:
(64, 4, 8, 8, 3)
-5.02914e-08 0.994199
2.79397e-09 0.99404
-2.79397e-08 0.994136
-3.21306e-08 0.994103
3.53903e-08 0.99412
2.79397e-08 0.994144
3.35276e-08 0.993953
-8.3819e-09 0.994113
1.11759e-08 0.994066
-8.73115e-08 0.993973
-7.45058e-09 0.993843
-6.61239e-08 0.99407
This seems to match what we desired from the BatchNormalization. It returns normalized activations per batch and the normalization was applied independently to all the data streams.
@JihongJu After looking over your code, I still think my issue stands (though I may be missing something). To clarify my point a bit:
TimeDistributed(BatchNorm()) seems to work fine at training time (as you point out), as it normalizes using the statistics of the mini-batch.
TimeDistributed(BatchNorm()) does not work fine at test time, as it normalizes using statistics computed on the training set. However, these statistics never get updated when the BN layer is in a TimeDistributed wrapper.
The problem stems from your line K.set_learning_phase(1), which uses BN in train mode. However, having K.set_learning_phase(1) at test time is not a good option, since it makes a number of layers behave undesirably (like dropout).
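The difference between the two modes can be sketched without Keras. The snippet below is a toy, single-feature batch norm in plain Python, not the Keras implementation: the class name `ToyBatchNorm`, the `update_moving_stats` switch and the epsilon of 1e-3 are assumptions made for illustration. The switch mimics the situation described above, where the moving statistics are never updated.

```python
import math
import random


class ToyBatchNorm:
    """Single-feature batch norm; a toy model, not Keras's implementation."""

    def __init__(self, momentum=0.5, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        # Initial values of the moving statistics: mean 0, variance 1.
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def call(self, batch, training, update_moving_stats=True):
        if training:
            # Train mode: normalize with the statistics of this mini-batch.
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            if update_moving_stats:
                m = self.momentum
                self.moving_mean = m * self.moving_mean + (1 - m) * mean
                self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Test mode: normalize with the stored moving statistics.
            mean, var = self.moving_mean, self.moving_var
        return [(v - mean) / math.sqrt(var + self.epsilon) for v in batch]


def mean_std(values):
    mu = sum(values) / len(values)
    return mu, math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


rng = random.Random(0)
data = [rng.random() for _ in range(4096)]

healthy = ToyBatchNorm()
broken = ToyBatchNorm()
for _ in range(20):
    healthy.call(data, training=True)
    # Hypothetical failure mode: the moving-statistics updates are dropped.
    broken.call(data, training=True, update_moving_stats=False)

healthy_mean, healthy_std = mean_std(healthy.call(data, training=False))
broken_mean, broken_std = mean_std(broken.call(data, training=False))
print("updates applied: mean={:.3f} std={:.3f}".format(healthy_mean, healthy_std))
print("updates dropped: mean={:.3f} std={:.3f}".format(broken_mean, broken_std))
```

With the updates applied, test-mode output is normalized, and the standard deviation lands near 0.994 rather than 1 because of the epsilon term in the denominator. With the updates dropped, test-mode output is essentially the raw input.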
Here's a more complete example, where we compute the stats on a batch at both train and test time, using both approaches to BN:
import numpy as np
from keras.layers import *
from keras.models import *
import keras.backend as K

def test_bn(batch_norm_type, learning_phase):
    K.set_learning_phase(learning_phase)
    img_size = 8
    batch_size = 64
    num_time_steps = 4
    num_channels = 3
    inputs = Input(shape=(num_time_steps, img_size, img_size, num_channels))
    if batch_norm_type == 'time_dist':
        # momentum lowered from the default so the moving statistics update faster
        x = TimeDistributed(BatchNormalization(axis=-1, momentum=0.5))(inputs)
    elif batch_norm_type == 'flat':
        x = BatchNormalization(axis=4, momentum=0.5)(inputs)
    model = Model(inputs=inputs, outputs=x)
    model.compile(loss='mae', optimizer='sgd')
    X = np.random.rand(batch_size, num_time_steps, img_size, img_size, num_channels)
    Y = np.random.rand(batch_size, num_time_steps, img_size, img_size, num_channels)
    history = model.fit(X, Y, epochs=4, verbose=0)
    P = model.predict(X)
    print('bn_type: {:10} | learning_phase: {} | mean: {:14} | std: {:14}'.format(
        batch_norm_type, learning_phase,
        P.mean(), P.std()))
    return

for batch_norm_type in ['time_dist', 'flat']:
    for learning_phase in [0, 1]:
        test_bn(batch_norm_type, learning_phase)
And the corresponding output:
bn_type: time_dist | learning_phase: 0 | mean: 0.498111873865 | std: 0.287745058537
bn_type: time_dist | learning_phase: 1 | mean: 0.00772533146665 | std: 0.973870813847
bn_type: flat | learning_phase: 0 | mean: 0.0119582833722 | std: 0.961674869061
bn_type: flat | learning_phase: 1 | mean: 0.0076981917955 | std: 0.973761022091
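One way to read the first row: its mean and std are roughly those of the raw input, which is what a BN layer with untouched moving statistics would produce at test time. The arithmetic below is a rough check, assuming an epsilon of 1e-3 (the documented Keras default, as far as I know) and ignoring the small changes to the learned scale and offset during the four training epochs.

```python
import math

epsilon = 1e-3  # assumed BatchNormalization epsilon

# Statistics of data drawn uniformly from [0, 1), as np.random.rand produces.
input_mean = 0.5
input_std = 1.0 / math.sqrt(12.0)

# With moving mean 0 and moving variance 1, test-mode BN only rescales
# its input by 1 / sqrt(1 + epsilon).
scale = 1.0 / math.sqrt(1.0 + epsilon)
expected_mean = input_mean * scale
expected_std = input_std * scale

# Values reported in the thread for time_dist with learning_phase 0.
reported_mean = 0.498111873865
reported_std = 0.287745058537

print("expected: mean={:.4f} std={:.4f}".format(expected_mean, expected_std))
print("reported: mean={:.4f} std={:.4f}".format(reported_mean, reported_std))
```

The two pairs agree to within sampling noise, which is consistent with the moving mean and variance never having been updated in the time_dist case.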
@0x00b1 Hi! To be clear, in my implementation, I was just implementing what the paper said:
For the usage of BN layers, after pretraining, we compute the BN statistics (means and variances) for each layer on the ImageNet training set. Then the BN layers are fixed during fine-tuning for object detection. As such, the BN layers become linear activations with constant offsets and scales, and BN statistics are not updated by fine-tuning. We fix the BN layers mainly for reducing memory consumption in Faster R-CNN training.
This also provided a way of dealing with the above issue, so I left it.
I would certainly be interested in helping - my original implementation is rather limited in scope and full of hacks, and a better quality keras frcnn would be desirable.
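The quoted remark that fixed BN layers "become linear activations with constant offsets and scales" can be illustrated with a short sketch. The function names and the numeric values are made up for the example, and epsilon is assumed to be 1e-3.

```python
import math


def batch_norm_inference(x, gamma, beta, mean, var, epsilon=1e-3):
    """Standard test-mode batch norm with fixed statistics."""
    return gamma * (x - mean) / math.sqrt(var + epsilon) + beta


def frozen_bn_coefficients(gamma, beta, mean, var, epsilon=1e-3):
    """Fold fixed BN parameters into one scale and one offset."""
    scale = gamma / math.sqrt(var + epsilon)
    offset = beta - mean * scale
    return scale, offset


# Arbitrary illustrative values, not taken from any trained model.
gamma, beta, mean, var = 1.5, -0.2, 0.4, 2.0
scale, offset = frozen_bn_coefficients(gamma, beta, mean, var)

inputs = (-1.0, 0.0, 3.0)
folded = [scale * x + offset for x in inputs]
reference = [batch_norm_inference(x, gamma, beta, mean, var) for x in inputs]
print(folded)
print(reference)
```

Once the statistics are frozen, the layer is just `scale * x + offset`, so there are no moving averages left to update.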
@yhenon Hmm, now I get the point. In that case, I agree with you, adding a flat BN, instead of a time distributed BN, to the 5D tensor seems fine.
Can I abuse this issue to ask why TimeDistributed layers are necessary? Is it to perform computation per ROI (meaning the term 'time distributed' is a bit poorly chosen here)? I noticed in py-faster-rcnn that they are limited to single batch training only, presumably because Caffe blobs are limited to 4d. If you have batch_size > 1 and ROIs, your blob would need 5 dimensions (batch_id, roi_id, height, width, channels). Is the use of TimeDistributed intended to get this fifth dimension?
In addition, I noticed that for Keras the moving average / variation is not updated when in test mode (see here). Wouldn't that be an issue? Shouldn't it be updated during test mode? Should this be fixed in Keras? So many questions :)
@JihongJu I played with this too. I think @yhenon is correct. And I believe the suggestion by @yhenon will work (i.e. BatchNormalization(axis=bn_axis + 1)).
@yhenon Want to send a PR? 😄
Can I abuse this issue to ask why TimeDistributed layers are necessary? Is it to perform computation per ROI (meaning the term 'time distributed' is a bit poorly chosen here)? I noticed in py-faster-rcnn that they are limited to single batch training only, presumably because Caffe blobs are limited to 4d. If you have batch_size > 1 and ROIs, your blob would need 5 dimensions (batch_id, roi_id, height, width, channels). Is the use of TimeDistributed intended to get this fifth dimension?
Yep. Your instincts are right. It’s a super clever hack by @yhenon to exploit the TimeDistributed wrapper’s batching to iterate across a variable number of regions. And, I agree, TimeDistributed is a bad name. I think Distributed (or Batched) would make more sense. (cc: @fchollet)
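The idea of treating the region axis as a second batch axis can be sketched in plain Python. This is only an illustration of folding and unfolding the two leading axes, not the Keras wrapper itself; the helper name `fold_apply_unfold` is made up, and the sketch assumes every sample holds the same number of regions.

```python
def fold_apply_unfold(layer_fn, batch):
    """Apply layer_fn to every region of every sample.

    batch is a list of samples; each sample is a list of regions.
    layer_fn takes a flat list of regions and returns one output per region.
    """
    num_regions = len(batch[0])
    # Fold (batch, region) into a single leading axis.
    folded = [region for sample in batch for region in sample]
    # The wrapped layer sees an ordinary batch with no region axis.
    outputs = layer_fn(folded)
    # Unfold back into (batch, region).
    return [outputs[i:i + num_regions]
            for i in range(0, len(outputs), num_regions)]


def sum_pool(regions):
    """Stand-in for a per-region layer: reduce each region to one number."""
    return [sum(region) for region in regions]


# Two samples with three regions each, every region holding two values.
batch = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 12]],
]
result = fold_apply_unfold(sum_pool, batch)
print(result)
```

A layer that computes statistics across its batch axis, such as batch norm, would then see all regions of all samples as one large batch.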
In addition, I noticed that for Keras the moving average / variation is not updated when in test mode (see here). Wouldn't that be an issue? Shouldn't it be updated during test mode? Should this be fixed in Keras? So many questions :)
Hrm. Why do you think it should be updated during test (i.e. inference or prediction)?
Hrm. Why do you think it should be updated during test (i.e. inference or prediction)?
I'm not sure, but it sounds like the moving average / variation is depending on your current data, not on the data you trained on. I will read more today on BatchNormalization to see how it should be.
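One common argument for keeping the statistics fixed at test time, offered here as background rather than as something stated in this thread, is that a prediction should not depend on which other samples share its batch. A minimal sketch with made-up numbers and an assumed epsilon of 1e-3:

```python
import math


def batch_stats(batch):
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return mean, var


def normalize(batch, mean, var, epsilon=1e-3):
    return [(v - mean) / math.sqrt(var + epsilon) for v in batch]


sample = 0.8
batch_a = [sample, 0.1, 0.2]  # the sample next to small values
batch_b = [sample, 0.9, 1.0]  # the same sample next to large values

# Per-batch statistics: the same sample gets two different outputs.
out_a = normalize(batch_a, *batch_stats(batch_a))[0]
out_b = normalize(batch_b, *batch_stats(batch_b))[0]

# Fixed statistics (illustrative training-set values): identical outputs.
train_mean, train_var = 0.5, 1.0 / 12.0
fixed_a = normalize(batch_a, train_mean, train_var)[0]
fixed_b = normalize(batch_b, train_mean, train_var)[0]

print(out_a, out_b)
print(fixed_a, fixed_b)
```

With per-batch statistics the sample is pushed above zero in one batch and below zero in the other, while the fixed statistics give the same value both times.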
I pushed a PR to fix this issue here: https://github.com/fchollet/keras/pull/7467. I believe it's a more generic solution than the bn_axis+1 solution, and it fixes the root problem in the TimeDistributed layer.
Thanks to @waleedka for making that PR which has now been merged! Re-running the above snippet with a freshly checked out keras install gives:
bn_type: time_dist | learning_phase: 0 | mean: 0.0126442806795 | std: 0.960675358772
bn_type: time_dist | learning_phase: 1 | mean: 0.00776057131588 | std: 0.973823308945
bn_type: flat | learning_phase: 0 | mean: 0.0131619861349 | std: 0.961250126362
bn_type: flat | learning_phase: 1 | mean: 0.00772643135861 | std: 0.97383749485
Which is the desired output. This should keep the API a bit simpler, since TimeDistributed() can now be applied to all layers in the final stage classifier. It means people will need to update their keras version to the latest, but that's ok.
Awesome! Thanks for the update, @yhenon and thanks for the work @waleedka!
@waleedka please feel free to add yourself to the CONTRIBUTORS file!
What is the best way to extend this script to batch inference / training? @yhenon
I believe there is an issue with using TimeDistributed(BatchNormalization()) in keras, as is done in keras-rcnn/keras_rcnn/classifiers/resnet.py (although I may be mistaken). As I understand, this leads to the moving mean and variance not being updated. A small example:
And the output of the print() statement is:
Note that the [0, 0, 0] and [1, 1, 1] are the default, non-updated values of the mean and variance. As an alternative, I think doing BatchNormalization(axis=bn_axis+1), with the +1 as an offset to account for the extra time dimension, is an ok fix to the problem.
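Why the +1 offset matters can be shown on a tiny nested-list tensor. The sketch assumes a channels-last layout with bn_axis = 3 for the 4D case; the helper names and the tensor values are made up for illustration.

```python
import itertools


def build_tensor(shape, value_fn, prefix=()):
    """Build a nested-list tensor whose entries are value_fn(index)."""
    if len(prefix) == len(shape):
        return value_fn(prefix)
    return [build_tensor(shape, value_fn, prefix + (i,))
            for i in range(shape[len(prefix)])]


def mean_along(tensor, shape, axis):
    """Average over every axis except `axis`, one value per index on `axis`."""
    sums = [0.0] * shape[axis]
    counts = [0] * shape[axis]
    for index in itertools.product(*(range(n) for n in shape)):
        value = tensor
        for i in index:
            value = value[i]
        sums[index[axis]] += value
        counts[index[axis]] += 1
    return [s / c for s, c in zip(sums, counts)]


bn_axis = 3  # assumed: channels axis of (batch, height, width, channels)
shape_5d = (2, 3, 2, 2, 3)  # (batch, roi, height, width, channels)

# Channels sit at clearly different levels (0, 10, 20), plus the batch index.
tensor = build_tensor(shape_5d, lambda idx: 10.0 * idx[4] + idx[0])

per_channel = mean_along(tensor, shape_5d, bn_axis + 1)  # channels axis
per_width = mean_along(tensor, shape_5d, bn_axis)        # width axis by mistake
print(per_channel)  # [0.5, 10.5, 20.5]
print(per_width)    # [10.5, 10.5]
```

With the offset, the statistics are computed per channel. Without it, axis 3 of the 5D tensor is the width axis, so the three channels get averaged together.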