jonathanasdf opened this issue 7 years ago
This isn't 100% related, but I'm curious whether the authors or anyone else was able to validate that the batch normalization actually helps. I've implemented it in almost exactly the same way (and am trying variants), and it seems to be increasing convergence time, both in training iterations and wall-clock time.
@lakehanne maybe? Thanks for the help!
hello @rachtsingh, I have used this algorithm in my work. See model.lua. It actually helps, especially for recurrent networks and their variants. When the size of the dataset is small, however, I have found it to have little effect in speeding up convergence, but it has not worsened convergence in any of my implementations. I would be happy to see what sort of problem you are trying to use it with.
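For reference, enabling it in Element-Research/rnn is a one-line class flag. A minimal sketch (sizes are illustrative):

```lua
require 'rnn'

-- Class-level flag; it must be set before the FastLSTM is constructed.
nn.FastLSTM.bn = true

-- Illustrative sizes: 512-dim inputs, 512-dim hidden state.
local lstm = nn.FastLSTM(512, 512)

-- Step it over a table of inputs with a Sequencer, as usual.
local rnn = nn.Sequencer(lstm)
```

If I read Cooijmans et al.'s recurrent batch normalization correctly, the input-to-hidden and hidden-to-hidden pre-activations are normalized separately, with statistics kept per time step.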
Thank you for the quick reply! I'm just trying to enable it in the OpenNMT project, for which I have a fork with a batch-normalized LSTM here: https://github.com/rachtsingh/OpenNMT/blob/master/onmt/modules/LSTM.lua
The model is sequence-to-sequence with attention, translating from English to German, trained on a small dataset of 100k sentence pairs.
Here are the results from the baseline: https://gist.github.com/rachtsingh/e97b4be011f4b86c47956848725e8095
And the results from enabling batch normalization (just the first 8 epochs, since that tells the story): https://gist.github.com/rachtsingh/d856ccdf71f8885b7ea535d820c9d7ef
Basically, the batch-normalized version does better in the first ~250 batches, but after that it slows down, does worse, and never converges. Also, in wall-clock speed the batch-normalized version is at least 2x slower (measured by source tokens/s). Both tests were run on a K80.
Is my implementation just wrong? It doesn't differ substantially from the one in Element-Research/rnn, I think.
Ah, my apologies. I didn't fully understand how our memory optimizer functioned, and under the hood it was sharing some input tensors between time steps. After removing my errors it now works correctly.
Thanks again for verifying that it works!
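For anyone hitting something similar: my guess at the failure mode is that in torch/nn, backward() must receive the exact input used in the corresponding forward(), so a memory optimizer that reuses an input tensor's storage between time steps silently corrupts the gradients (and, for batch norm, the statistics computed from that input). A minimal, hypothetical illustration (not OpenNMT's actual optimizer):

```lua
require 'nn'

local bn = nn.BatchNormalization(4)
local x = torch.randn(8, 4)

local y = bn:forward(x)

-- Suppose a memory optimizer now reuses x's storage for the next step:
x:copy(torch.randn(8, 4))

-- backward() assumes x is still the input that produced y; passing the
-- clobbered tensor yields silently wrong gradients.
local gradInput = bn:backward(x, torch.ones(8, 4))
```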
Awesome. Glad you fixed it!
Sincerely, Lekan
With FastLSTM.bn = true, it seems to output the same value for every element of a sequence. For example, try something like this:
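(A minimal reproduction sketch, assuming Element-Research/rnn's nn.FastLSTM and nn.Sequencer; the sizes are illustrative.)

```lua
require 'rnn'

-- Must be set before the module is constructed.
nn.FastLSTM.bn = true

local lstm = nn.Sequencer(nn.FastLSTM(3, 4))

-- Five time steps of *different* random inputs (batch size 2).
local inputs = {}
for t = 1, 5 do
   inputs[t] = torch.randn(2, 3)
end

local outputs = lstm:forward(inputs)
for t = 1, 5 do
   print(outputs[t])  -- unexpectedly identical at every time step
end
```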
Even after training it with different gradients at each step, it still outputs the same value for every element of the sequence.