JonathanRaiman / theano_lstm

:microscope: Nano size Theano LSTM module

LSTM model equations #9

Open mheilman opened 9 years ago

mheilman commented 9 years ago

The code says it implements the version of the LSTM from Graves et al. (2013), which I assume is this http://www.cs.toronto.edu/~graves/icassp_2013.pdf or http://www.cs.toronto.edu/~graves/asru_2013.pdf. However, it looks like the LSTM equations in those papers have both the output layer values and memory cell values from the previous time step as input to the gates.

E.g., in equation 3 of http://www.cs.toronto.edu/~graves/icassp_2013.pdf:

i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)

However, it looks like the code is doing the following:

i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
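
For concreteness, here's a rough Theano sketch of the two input-gate computations (the weight names follow the paper's notation rather than the repo's actual variables, and the shapes are just placeholders):

```python
import numpy as np
import theano
import theano.tensor as T

n_in, n_hid = 8, 16
rng = np.random.RandomState(0)
shared = lambda *shape: theano.shared(rng.randn(*shape).astype(theano.config.floatX))

W_xi, W_hi = shared(n_in, n_hid), shared(n_hid, n_hid)
w_ci = shared(n_hid)  # peephole weights; diagonal in the paper, so elementwise here
b_i = theano.shared(np.zeros(n_hid, dtype=theano.config.floatX))

x_t, h_tm1, c_tm1 = T.matrix("x_t"), T.matrix("h_tm1"), T.matrix("c_tm1")

# Graves (2013), eq. 3: the gate also sees the previous cell state
i_t_peephole = T.nnet.sigmoid(T.dot(x_t, W_xi) + T.dot(h_tm1, W_hi) + c_tm1 * w_ci + b_i)

# What the code here appears to do: the gate sees only x_t and h_{t-1}
i_t_plain = T.nnet.sigmoid(T.dot(x_t, W_xi) + T.dot(h_tm1, W_hi) + b_i)
```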

Am I missing something here? Is there another LSTM paper this is based on?

I doubt there's much of a practical difference between these two formulations, but it would be good if the documentation were accurate. Sorry if I'm misunderstanding something here (also sorry for the messy equations above).

JonathanRaiman commented 9 years ago

Michael, you read those diagrams closely :)

Yes, indeed it appears that referencing 2013 and Alex Graves is not as precise as I'd hoped. There are LSTM networks that use the cell activations (memory) as inputs to their gates, while others from 2013 that reference Alex Graves' architectures (namely Learning to Execute (http://arxiv.org/pdf/1410.4615v3.pdf), Grammar as a Foreign Language, and other good LSTM papers) reserve the memory cells solely for internal LSTM purposes (I guess this lets the cells "focus" on one thing during learning). The papers you mention use the cell memories, so I should make that apparent in the documentation.

In any case, it is informative to see how Andrej Karpathy describes LSTMs (in JavaScript) in https://github.com/karpathy/recurrentjs, and how Zaremba describes them (in Lua) in https://github.com/wojciechz/learning_to_execute.

To be fair, the most common implementation is the one present here, but the one you describe is potentially better. If you cross-validate one against the other, I'd be very interested in hearing whether there's a major difference.
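
A toy step function along these lines (not this repo's code; the names, shapes, and the use_peephole flag are made up for illustration) would let someone toggle between the two variants for such a comparison:

```python
import theano.tensor as T

def lstm_step(x_t, h_tm1, c_tm1, W_x, W_h, b, w_ci, w_cf, w_co, n_hid, use_peephole=False):
    """One LSTM step. W_x: (n_in, 4*n_hid), W_h: (n_hid, 4*n_hid), b: (4*n_hid,).
    w_ci/w_cf/w_co are per-unit peephole weights (Graves 2013 uses diagonal matrices)."""
    pre = T.dot(x_t, W_x) + T.dot(h_tm1, W_h) + b
    sl = lambda k: pre[:, k * n_hid:(k + 1) * n_hid]

    i_t = T.nnet.sigmoid(sl(0) + (c_tm1 * w_ci if use_peephole else 0))
    f_t = T.nnet.sigmoid(sl(1) + (c_tm1 * w_cf if use_peephole else 0))
    g_t = T.tanh(sl(2))                  # candidate cell update
    c_t = f_t * c_tm1 + i_t * g_t
    # in Graves (2013) the output gate peeks at the *new* cell state
    o_t = T.nnet.sigmoid(sl(3) + (c_t * w_co if use_peephole else 0))
    h_t = o_t * T.tanh(c_t)
    return h_t, c_t
```

Wrapping that in theano.scan over time and training the same task with the flag on and off would be one way to run the comparison.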

JonathanRaiman commented 9 years ago

Michael,

Quick follow-up. I ran a couple of models with the two different versions, and with the version you describe most models hit a local minimum much sooner in their training; in most cases training time is doubled or tripled before they exit it. The version implemented here (where memory does not feed back into the gates) reaches a lower local minimum and exits it more quickly. There may be some coupling with the type of gradient descent used (Adadelta vs Adam vs RMSProp or something else).

If you find a way of training them easily, or some combination that works well, I'd be curious to hear about it, but for now it appears that these cannot be used interchangeably without understanding where the optimisation troubles come from.
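
For anyone wanting to poke at that optimizer coupling, here is a bare-bones way to swap update rules in plain Theano (just a sketch with arbitrary hyperparameters, not what this library ships):

```python
import numpy as np
import theano
import theano.tensor as T

def sgd_updates(cost, params, lr=0.01):
    # plain gradient descent
    return [(p, p - lr * g) for p, g in zip(params, T.grad(cost, params))]

def rmsprop_updates(cost, params, lr=0.001, rho=0.9, eps=1e-6):
    # keep a running average of squared gradients per parameter
    updates = []
    for p, g in zip(params, T.grad(cost, params)):
        acc = theano.shared(np.zeros_like(p.get_value()))
        new_acc = rho * acc + (1 - rho) * g ** 2
        updates.append((acc, new_acc))
        updates.append((p, p - lr * g / T.sqrt(new_acc + eps)))
    return updates

# e.g. train = theano.function([x, y], cost, updates=rmsprop_updates(cost, params))
```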

mheilman commented 9 years ago

Thanks for your very detailed reply! I'll let you know if I find anything else useful related to this.

JonathanRaiman commented 9 years ago

You might be interested in a more thorough discussion in last week's arXiv paper.

mheilman commented 9 years ago

That's a very useful reference. Thanks!