This probably won't matter once we switch to minibatches, but until then it speeds up convergence on one of my mega models by about 5x (1400 steps instead of 6800).
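A toy sketch of why this can happen: with per-example updates (no minibatch averaging), a step size near the stability limit overshoots the minimum and oscillates, so each step makes little net progress, while a smaller step converges in far fewer iterations. This is an illustrative example on f(w) = w², not the model from this PR; all names are made up.

```python
# Hypothetical sketch: count gradient steps until convergence on f(w) = w^2
# (gradient = 2w) for two learning rates. Not the actual model/optimizer
# from this PR -- just an illustration of step-size vs. convergence speed.

def steps_to_converge(lr, w=1.0, tol=1e-6, max_steps=10_000):
    """Plain gradient descent on f(w) = w^2; return steps until |w| < tol."""
    for step in range(max_steps):
        if abs(w) < tol:
            return step
        w -= lr * 2 * w  # SGD-style update; w is scaled by (1 - 2*lr) each step
    return max_steps

# Near the stability limit (lr close to 1), updates overshoot and oscillate:
# each step only shrinks |w| by a factor of 0.8, so many steps are needed.
print(steps_to_converge(0.9))
# A smaller rate shrinks |w| by 0.2 per step and converges in far fewer steps.
print(steps_to_converge(0.4))
```

The same qualitative effect goes away once minibatching averages out the per-example noise, which is why the tweak may stop mattering later.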
Coverage increased (+0.1%) to 85.342% when pulling 57498da5a7a777fceed0eb902e21e8d6cf39829c on slower-learning-rate into bc3909608a0a798d920ea7ff9b6b377ad9eaea99 on master.