allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0

Generator re-pads shuffled instances which are already padded #360

Closed. DeNeutoy closed this issue 7 years ago

DeNeutoy commented 7 years ago

[Screenshot, 2017-05-16 11:00:04 AM]

Training with the data generator slows down over time: instances that are already padded get padded again, incrementally, because of the noise introduced during shuffling.

matt-gardner commented 7 years ago

The easiest fix is just to copy the dataset here, before creating batches. Doing it in a way that avoids this copying would be really intrusive to the instance code, because you'd have to keep track of (or figure out after the fact) what part of the instance is just padding, and what isn't. I don't think the additional copy should be a big deal.

I'll get to this soon, if you don't get to it first. My day's pretty busy today, but I should have time tomorrow. It should just be a one-line fix, though having a test to be sure it's doing the right thing would also be good.
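
To illustrate the copy-before-padding idea, here is a minimal, self-contained sketch. It is not deep_qa's actual code: `Instance`, `pad`, `batches`, and `total_length_after_epochs` are hypothetical names, and sorting by length plus random noise is only an assumed stand-in for the generator's shuffling. It shows that padding in place makes instances grow from epoch to epoch, while padding a copy leaves the originals untouched.

```python
import copy
import random


class Instance:
    """Hypothetical stand-in for a text instance: just a list of token ids."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def pad(self, length, pad_value=0):
        # Pads in place, so the true (unpadded) length is lost afterwards.
        self.tokens = self.tokens + [pad_value] * (length - len(self.tokens))


def batches(instances, batch_size, rng, copy_first=True):
    """Yield one epoch of padded batches built from `instances`."""
    # Copy before padding so the caller's instances keep their true lengths.
    working = copy.deepcopy(instances) if copy_first else instances
    # Assumed shuffling scheme: sort by length plus noise, so batch
    # membership changes a little from one epoch to the next.
    working = sorted(working, key=lambda i: len(i.tokens) + rng.uniform(-2, 2))
    for start in range(0, len(working), batch_size):
        batch = working[start:start + batch_size]
        max_len = max(len(i.tokens) for i in batch)
        for instance in batch:
            instance.pad(max_len)
        yield batch


def total_length_after_epochs(copy_first, epochs=5, seed=0):
    """Run several epochs and return the total length of the stored instances."""
    rng = random.Random(seed)
    data = [Instance(range(n)) for n in range(1, 21)]  # true lengths 1..20
    for _ in range(epochs):
        for _ in batches(data, batch_size=4, rng=rng, copy_first=copy_first):
            pass
    return sum(len(i.tokens) for i in data)
```

Comparing `total_length_after_epochs(copy_first=True)` with `total_length_after_epochs(copy_first=False)` shows the effect: with the copy, the stored instances keep their original total length; without it, they accumulate padding. The cost is one extra copy of the dataset per epoch.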

DeNeutoy commented 7 years ago

#361