I'm implementing recurrent batch normalization (BN) in Keras, but after reading the original paper and papers citing it, one detail remains unclear to me: how are batch statistics computed? In the original, the authors state (p. 3; emphasis mine):
At training time, the statistics E[h] and Var[h] are estimated by the sample mean and sample variance of the *current minibatch*
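As I read that quote, the training-time computation amounts to something like the following (a minimal NumPy sketch; `bn_train_step`, `eps`, and the `(batch, features)` layout are my own illustration, not the paper's notation):

```python
import numpy as np

def bn_train_step(h, gamma, beta, eps=1e-3):
    """Normalize h using statistics of the current minibatch only.
    h: (batch, features); gamma/beta are the learned scale and shift.
    Function name and eps value are illustrative."""
    mean = h.mean(axis=0)                    # E[h] over the current minibatch
    var = h.var(axis=0)                      # Var[h] over the current minibatch
    h_hat = (h - mean) / np.sqrt(var + eps)  # normalize
    return gamma * h_hat + beta              # learned scale and shift
```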
Another paper (p. 3) that uses and cites it describes:
We subscript BN by time (BN_t) to indicate that each time step tracks its own mean and variance. In practice, we track these statistics as they change over the course of training using an exponential moving average (EMA)
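My reading of that description, as a sketch (one EMA pair per timestep `t`; all names and the momentum value are my own assumptions):

```python
import numpy as np

def track_bn_t_stats(h_t, moving_mean_t, moving_var_t, momentum=0.99):
    """Track timestep t's own statistics (BN_t) over the course of training
    via an EMA, as the quote describes. Called once per minibatch per
    timestep; h_t: (batch, features). Names and momentum are illustrative."""
    batch_mean = h_t.mean(axis=0)
    batch_var = h_t.var(axis=0)
    moving_mean_t = momentum * moving_mean_t + (1 - momentum) * batch_mean
    moving_var_t = momentum * moving_var_t + (1 - momentum) * batch_var
    return moving_mean_t, moving_var_t
```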
My question is thus two-fold:
1. Are minibatch statistics computed from the immediate minibatch alone, or as an EMA across minibatches?
2. How are the inference-time statistics computed? gamma and beta are shared across all timesteps; is the result of (1) simply averaged across all timesteps (e.g. averaging EMA_t over all t, as in the sketch below)?
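To make (2) concrete, here is what I mean by averaging EMA_t over all t (shapes and names purely illustrative; the random values only stand in for tracked statistics):

```python
import numpy as np

T, features = 10, 64
# Hypothetical per-timestep EMAs (EMA_t), one row per timestep t.
ema_means = np.random.randn(T, features)

# Collapse the per-timestep EMAs into a single statistic shared
# across timesteps by taking a simple average over t.
shared_mean = ema_means.mean(axis=0)  # shape: (features,)
```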
Existing implementations: Keras and TF implementations are linked below, but all are outdated, and I am unsure of their correctness.
All of the above agree that during training, immediate minibatch statistics are used for normalization, and that the moving (inference) mean and variance are updated as an EMA of these minibatch statistics; gamma and beta themselves are learned via backprop, not via an EMA.
Problem: the bn operation (in A, and presumably in B & C) is applied to a single timestep slice, which is then passed to the K.rnn control flow for iteration. Hence, the EMA is computed w.r.t. both minibatches and timesteps, which I find questionable:

- An EMA is used in place of a simple average when population statistics are dynamic (e.g. from minibatch to minibatch), whereas here we have access to all timesteps in a minibatch before having to update the moving statistics.
- An EMA is a worse, but at times necessary, alternative to a simple average; per the above, we can use the latter here, so why don't we? Timestep statistics can be cached, averaged at the end of the minibatch, and then discarded (see the sketch after this list); this also holds for stateful=True.
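To make the distinction concrete, a sketch of the two update schemes as I understand them (function names, shapes, and the momentum value are my own; only the mean is shown, the variance would follow the same pattern; I am not claiming either exactly matches the linked implementations):

```python
import numpy as np

def per_step_ema(h_seq, moving_mean, momentum=0.99):
    """What A (and presumably B & C) effectively do, as I read them:
    the EMA is updated once per timestep inside the RNN loop, so it mixes
    timestep-to-timestep variation with minibatch-to-minibatch variation.
    h_seq: (timesteps, batch, features)."""
    for h_t in h_seq:
        moving_mean = momentum * moving_mean + (1 - momentum) * h_t.mean(axis=0)
    return moving_mean

def cache_then_average(h_seq, moving_mean, momentum=0.99):
    """The alternative argued for above: cache each timestep's statistics,
    take a simple average over timesteps at the end of the minibatch, then
    perform a single EMA update across minibatches."""
    step_means = np.stack([h_t.mean(axis=0) for h_t in h_seq])  # (T, features)
    minibatch_mean = step_means.mean(axis=0)  # simple average over timesteps
    return momentum * moving_mean + (1 - momentum) * minibatch_mean
```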