Closed matt-gardner closed 7 years ago
Preliminary result: it looks like I can get BiDAF down to less than 45 minutes per epoch using this technique (@nelson-liu should be happy about that =) ). That's a ~10x speedup, better than the 4-5x I initially estimated. It's possible it'll still crash on me before finishing the first epoch, but I'm optimistic. The largest batches have somewhere around 400 instances in them.
wow, that's amazing! Great work @matt-gardner
It worked! Final epoch time: ~34 minutes!
Ok, this is close to done. I still need to add a test for the adaptive grouping, but other than that I think it's ready for review.
LGTM
Just a note for anyone who was following this thread:
You might not want to use as large a batch size as possible. This is kind of the whole point of SGD, anyway: you use minibatches, and the variance this introduces in your gradient can actually help in avoiding poor local minima. I ran BiDAF yesterday, fully optimizing GPU memory usage, and some batches had upwards of 400 instances in them. This gave a running time of 33-34 minutes per epoch, but it also had poorer learning behavior, and only got to ~53% exact match on the dev set, where it should be reaching closer to ~60% (this is using one randomly picked dev annotation, not all three, so numbers are lower than the official evaluation).
I'm running it again, capping batch size at 60 (batches are still smaller for the largest instances, so it fits on the GPU without any truncation), and it's taking ~1.1 hours per epoch, but the learning behavior seems to be a bit better. It hasn't finished yet, though. I'll update this thread again when it does.
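For anyone who wants to picture what "capping batch size while still using smaller batches for the largest instances" could look like, here is a minimal sketch. It is not the code in this PR: `adaptive_batches`, `max_padded_tokens`, and `max_batch_size` are hypothetical names, and using the padded token count as a stand-in for GPU memory is an assumption made only for illustration.

```python
from typing import List, Sequence


def adaptive_batches(
    lengths: Sequence[int],
    max_padded_tokens: int,  # hypothetical proxy for the GPU memory budget
    max_batch_size: int,     # hard cap on instances per batch (e.g. 60)
) -> List[List[int]]:
    """Group instance indices into batches of similar length.

    A batch is closed when adding one more instance would exceed either
    the instance cap or the padded-token budget.
    """
    # Sort by length so instances in a batch need little padding.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        # Lengths are ascending, so the incoming instance is the longest;
        # every instance in the batch would be padded to its length.
        padded_cost = lengths[idx] * (len(current) + 1)
        if current and (
            len(current) >= max_batch_size or padded_cost > max_padded_tokens
        ):
            batches.append(current)
            current = []
        # An instance that exceeds the budget by itself still gets its own
        # batch, so nothing is truncated or dropped in this sketch.
        current.append(idx)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    toy_lengths = [10, 10, 10, 10, 100, 100, 400]
    for batch in adaptive_batches(toy_lengths, max_padded_tokens=250, max_batch_size=3):
        print(batch, [toy_lengths[i] for i in batch])
```

With a setup like this, short instances fill batches up to the cap, while long instances end up in smaller batches because they hit the token budget first. Whether the batches should then be shuffled before training is a separate choice that this sketch leaves out.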
So the run with batch size capped at 60 got to about 55% accuracy on span begin, and 58% on span end, which, if we're being optimistic, translates to ~55% exact match. That's still a few percent below Min's implementation (which gets ~60% on this metric). We haven't tuned any of the learning parameters, or dropout, or regularization, or anything, though. The point is that the running time of this code is now a little faster than Min's, the code is much cleaner, and anyone who wants to could tune it to get comparable performance.
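On why ~55% is the optimistic reading: exact match requires both the begin and the end index to be right for the same instance, so it can never be higher than the lower of the two per-endpoint accuracies. A small sketch with made-up spans (the function name `span_metrics` and the toy data are hypothetical, not taken from the actual evaluation code):

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin index, end index)


def span_metrics(predicted: List[Span], gold: List[Span]) -> Tuple[float, float, float]:
    """Return (begin accuracy, end accuracy, exact match) over paired spans."""
    n = len(gold)
    begin_acc = sum(p[0] == g[0] for p, g in zip(predicted, gold)) / n
    end_acc = sum(p[1] == g[1] for p, g in zip(predicted, gold)) / n
    # Exact match counts an instance only when both endpoints agree.
    exact = sum(p == g for p, g in zip(predicted, gold)) / n
    return begin_acc, end_acc, exact


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    predicted = [(0, 2), (1, 3), (4, 5), (7, 9)]
    gold = [(0, 2), (1, 4), (3, 5), (7, 9)]
    begin_acc, end_acc, exact = span_metrics(predicted, gold)
    print(begin_acc, end_acc, exact)
    assert exact <= min(begin_acc, end_acc)
```

In the toy data the begin and end errors fall on different instances, so exact match lands below both endpoint accuracies. The real number depends on how much the begin and end errors overlap.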
Still very much in progress, just opening the PR to track things. If anyone cares at this point, this at least has my basic plan for implementing this in the docstrings, though I haven't done any of the actual work yet.