cooijmanstim / recurrent-batch-normalization


What is the meaning of `dummy_states`? #5

Closed. wandering007 closed this issue 6 years ago.

wandering007 commented 6 years ago

I am not familiar with Theano and can only follow the code in general. The usage of `dummy_states` really confuses me. What is the meaning of this variable? And if I am not mistaken, it is added to the state updates, which is even more confusing.

cooijmanstim commented 6 years ago

Sorry for the late reply. It's a little trick we have to do in order to be able to differentiate with respect to particular hidden states. The dummy states are zero and so they have no effect on the function value, but the derivative of the function value with respect to the dummy state is equal to that with respect to the corresponding true state. We only use these derivatives for the gradient vanishing analysis in the paper.

The reason it's complicated to get these derivatives is an annoying technicality that comes with using symbolic loops (scan in Theano, while_loop in TensorFlow) inside symbolic computation graphs. These symbolic loops introduce a separation between the "inner graph" that computes the body of the loop and the "outer graph" that contains the loop. The hidden states with respect to which we want to differentiate live in the inner graph, while the loss is computed in the outer graph from the sequence of hidden states returned by scan, which is an outer-graph variable. If we directly differentiate the loss with respect to that sequence, we get only the partial derivatives through the classifier, not the total derivative that includes the effect of past states on future states.
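For concreteness, here is a minimal sketch of the trick (not the repository's actual code; the toy recurrence, variable names, and shapes are made up). A zero-valued `dummy_states` sequence is fed into `scan` from the outer graph, added to each state inside the loop, and then differentiated:

```python
import numpy as np
import theano
import theano.tensor as T

n_hidden = 3
x = T.tensor3('x')                        # (time, batch, n_hidden)
dummy_states = T.tensor3('dummy_states')  # same shape as the states; always fed with zeros
h0 = T.matrix('h0')                       # (batch, n_hidden)
W = theano.shared(np.eye(n_hidden, dtype=theano.config.floatX), name='W')

def step(x_t, dummy_t, h_prev):
    # dummy_t is zero, so it does not change h_t, but d loss / d dummy_t
    # equals the total derivative d loss / d h_t, including paths through
    # all future time steps.
    return T.tanh(T.dot(h_prev, W) + x_t) + dummy_t

hs, _ = theano.scan(step, sequences=[x, dummy_states], outputs_info=[h0])
loss = hs[-1].sum()

# Per-time-step state gradients, e.g. for a vanishing-gradient analysis.
state_grads = theano.grad(loss, dummy_states)

f = theano.function([x, dummy_states, h0], state_grads)
steps, batch = 5, 2
zeros = np.zeros((steps, batch, n_hidden), dtype=theano.config.floatX)
print(f(np.random.randn(steps, batch, n_hidden).astype(theano.config.floatX),
        zeros,
        np.zeros((batch, n_hidden), dtype=theano.config.floatX)))
```

Because `dummy_states` is an outer-graph input that enters the loop as a sequence, `theano.grad` accumulates every path from the loss back to each time step, which is exactly the total derivative with respect to the true state at that step.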

If you replaced the scan with a pure Python loop that collects the sequence of states in a list, there would be no inner/outer graph separation, and differentiating the loss with respect to the hidden states would work as you'd expect.
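A sketch of that alternative, again with made-up names and shapes: unrolling the recurrence in Python keeps every hidden state as an ordinary variable of a single graph, so `theano.grad` can be applied to the list of states directly, with no dummy variables:

```python
import numpy as np
import theano
import theano.tensor as T

n_hidden, seq_len = 3, 5
x = T.tensor3('x')                        # (time, batch, n_hidden)
h0 = T.matrix('h0')                       # (batch, n_hidden)
W = theano.shared(np.eye(n_hidden, dtype=theano.config.floatX), name='W')

# Unroll the recurrence with a Python loop instead of scan; every hidden
# state is then an ordinary node in one (outer) graph.
states, h = [], h0
for t in range(seq_len):
    h = T.tanh(T.dot(h, W) + x[t])
    states.append(h)
loss = states[-1].sum()

# Total derivatives d loss / d h_t for each t, including paths through
# later states.
state_grads = theano.grad(loss, states)
```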

Hope that helps!

wandering007 commented 6 years ago

@cooijmanstim Thanks very much for the details! It's really complicated... I guess I wouldn't need this technique if I re-implemented it in PyTorch, since autograd can handle everything for me.

cooijmanstim commented 6 years ago

Yes, because you don't have to deal with symbolic loops in PyTorch :-)
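For reference, a rough PyTorch equivalent (a hypothetical toy recurrence, not code from this repository): because the loop is ordinary Python, calling `retain_grad()` on each intermediate state is enough to recover the same per-step gradients, with no dummy variables:

```python
import torch

n_hidden, seq_len, batch = 3, 5, 2
W = torch.nn.Parameter(torch.eye(n_hidden))
x = torch.randn(seq_len, batch, n_hidden)
h = torch.zeros(batch, n_hidden)

states = []
for t in range(seq_len):
    h = torch.tanh(h @ W + x[t])
    h.retain_grad()          # keep the gradient of this non-leaf tensor
    states.append(h)

loss = states[-1].sum()
loss.backward()

# states[t].grad is the total derivative d loss / d h_t.
grads = [s.grad for s in states]
```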