Element-Research / rnn

Recurrent Neural Network library for Torch7's nn
BSD 3-Clause "New" or "Revised" License

FastLSTM outputs same values if bn=true #366

Open · jonathanasdf opened this issue 7 years ago

jonathanasdf commented 7 years ago

With FastLSTM.bn = true, it seems to output the same value for every step of a sequence. For example, try:

require 'rnn'

nn.FastLSTM.bn = true
a = nn.Sequencer(nn.FastLSTM(2, 2))
print(a:forward(torch.rand(5, 1, 2)))

Even when you train it and give it different gradients, it still outputs the same value for everything.
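A quick way to make the symptom explicit (a minimal sketch, assuming the rnn package is installed) is to compare each step's output against the first:

```lua
require 'rnn'

nn.FastLSTM.bn = true
local lstm = nn.Sequencer(nn.FastLSTM(2, 2))

-- input is seqlen x batchsize x inputsize
local output = lstm:forward(torch.rand(5, 1, 2))

-- if every step comes out identical, these norms are all 0
for step = 2, output:size(1) do
   print(step, (output[step] - output[1]):norm())
end
```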

rachtsingh commented 7 years ago

This isn't 100% related, but I'm curious whether the authors or anyone else have been able to validate that the batch normalization actually helps. I've implemented it in almost exactly the same way (and am trying variants), and it seems to be increasing convergence time, both in training iterations and wall-clock time.
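For reference, the scheme I'm following is the usual recurrent batch-normalization recipe (Cooijmans et al., 2016): BN applied separately to the input-to-hidden and hidden-to-hidden pre-activations before the gates. A rough Lua sketch of one step; the module names here are illustrative, not the actual FastLSTM internals:

```lua
require 'nn'

local inputSize, hiddenSize = 2, 2

-- input-to-hidden and hidden-to-hidden transforms for all four gates at once
local i2h = nn.Linear(inputSize, 4 * hiddenSize)
local h2h = nn.Linear(hiddenSize, 4 * hiddenSize)

-- separate batch normalization for each stream, as in recurrent BN
local bnInput  = nn.BatchNormalization(4 * hiddenSize)
local bnHidden = nn.BatchNormalization(4 * hiddenSize)

-- one step's pre-activations: BN(W_x * x_t) + BN(W_h * h_{t-1});
-- gate nonlinearities and the cell update are omitted for brevity.
-- note: the paper keeps separate BN statistics per time step, which is
-- exactly the kind of per-step state that must not be shared across clones.
local function preactivations(x, hPrev)
   return bnInput:forward(i2h:forward(x)) + bnHidden:forward(h2h:forward(hPrev))
end

-- example with a batch of 3
local x, hPrev = torch.rand(3, inputSize), torch.rand(3, hiddenSize)
print(preactivations(x, hPrev))
```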

@lakehanne maybe? Thanks for the help!

robotsorcerer commented 7 years ago

Hello @rachtsingh, I have used this algorithm in my work; see model.lua. It does help, especially for recurrent networks and their variants. When the dataset is small, however, I have found it to have little effect on speeding up convergence, but it has not worsened convergence in any of my implementations. I would be happy to see what sort of problem you are trying to use it with.

rachtsingh commented 7 years ago

Thank you for the quick reply! I'm just trying to enable it in the OpenNMT project, for which I have a fork with a batch-normalized LSTM here: https://github.com/rachtsingh/OpenNMT/blob/master/onmt/modules/LSTM.lua

The model is sequence-to-sequence with attention, translating English -> German on a small dataset of 100k sentence pairs.

Here are the results from the baseline: https://gist.github.com/rachtsingh/e97b4be011f4b86c47956848725e8095

And the results from enabling batch normalization (just the first 8 epochs, since that tells the story): https://gist.github.com/rachtsingh/d856ccdf71f8885b7ea535d820c9d7ef

Basically, the batch-normalized version does better in the first ~250 batches, but after that it seems to slow down, do worse, and never converge. Also, in wall-clock speed the batch-normalized version is at least 2x slower (that's the source tokens/s figure). Both tests were run on a K80.

Is my implementation just wrong? I don't think it differs substantially from the one in Element-Research/rnn.

rachtsingh commented 7 years ago

Ah, my apologies. I didn't fully understand how our memory optimizer worked: under the hood it was sharing some input tensors between time steps. After removing my errors, it now works correctly.
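For anyone hitting something similar, the failure mode was roughly this (an illustrative sketch, not OpenNMT's actual memory-optimizer code; the names are made up): when one buffer tensor is reused across time steps and each step stores a reference to it instead of a copy, every saved "input" ends up holding the last step's values.

```lua
require 'torch'

local buffer = torch.Tensor(2)
local savedInputs = {}

for step = 1, 3 do
   buffer:fill(step)          -- one tensor is reused for every time step
   savedInputs[step] = buffer -- stores a reference, not a copy
end

-- every saved "input" now holds the values from the last step
for step = 1, 3 do
   print(step, savedInputs[step][1]) -- prints 3 for all three steps
end

-- the fix is to keep a per-step copy:
-- savedInputs[step] = buffer:clone()
```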

Thanks again for verifying that it works!

robotsorcerer commented 7 years ago

Awesome. Glad you fixed it!

Sincerely, Lekan
