allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0

Data generators and dynamic padding #295

Closed · matt-gardner closed this 7 years ago

matt-gardner commented 7 years ago

This PR allows for what I'll call dynamic padding. This means that padding happens only at the batch level. If you sort instances by size and then sample contiguous slices from the sorted list, you should be able to dramatically decrease runtime, because the amount of padding you actually need is minimized.
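For concreteness, here is a minimal sketch of the idea (this is not deep_qa's actual `DataGenerator` API; instances are assumed to be plain lists of token ids, and `pad_batch` / `dynamic_padding_batches` are hypothetical names):

```python
import random
from typing import List

def pad_batch(batch: List[List[int]], pad_value: int = 0) -> List[List[int]]:
    # Pad each instance only up to the longest instance in this batch,
    # not to a global maximum over the whole dataset.
    max_length = max(len(instance) for instance in batch)
    return [instance + [pad_value] * (max_length - len(instance)) for instance in batch]

def dynamic_padding_batches(instances: List[List[int]], batch_size: int):
    # Sorting by length puts similarly-sized instances in the same batch,
    # which minimizes the padding each batch needs.
    instances = sorted(instances, key=len)
    batches = [instances[i:i + batch_size] for i in range(0, len(instances), batch_size)]
    # Shuffle the order of batches (not individual instances) so training
    # still sees batches in a random order.
    random.shuffle(batches)
    for batch in batches:
        yield pad_batch(batch)
```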

So far I've implemented everything but the sorting, for the simplest model case. Individual models will probably have to do a little work to support this for particular padding dimensions, but the changes should be minimal in most cases.

matt-gardner commented 7 years ago

Note that this handles more than just what dynamic_rnn does. In this particular setting, we're not doing RNNs at all - we're distributing a character-level CNN encoder across an unknown number of words. This isn't something you would do with dynamic_rnn.

I agree that we need to be careful with this, though, which is why there's so much documentation about how you should probably use TimeDistributed if it works for you. This limits the kinds of reshaping you can easily do in your models. I should probably have the word and character tokenizer use the EncoderWrapper by default when use_dynamic_padding is False; otherwise you just won't be able to build certain kinds of models that use character-level encoders.
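As an illustration of the TimeDistributed pattern being discussed, here is a minimal Keras sketch of distributing a character-level CNN encoder across the word axis (plain Keras rather than deep_qa's EncoderWrapper; all sizes and layer choices are made up for the example):

```python
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, TimeDistributed
from keras.models import Model, Sequential

# Made-up sizes for illustration; with dynamic padding these would vary per batch.
num_words, num_chars = 30, 15
char_vocab_size, char_embedding_dim = 100, 16

# A character-level CNN encoder: maps one word's character ids to a single vector.
char_encoder = Sequential([
    Embedding(char_vocab_size, char_embedding_dim, input_length=num_chars),
    Conv1D(filters=32, kernel_size=3, activation='relu'),
    GlobalMaxPooling1D(),
])

# Distribute that encoder across the word axis: input is (batch, num_words, num_chars),
# output is (batch, num_words, 32), one encoded vector per word.
word_character_ids = Input(shape=(num_words, num_chars), dtype='int32')
encoded_words = TimeDistributed(char_encoder)(word_character_ids)
model = Model(inputs=word_character_ids, outputs=encoded_words)
```

The idea with dynamic padding is that the fixed `num_words` and `num_chars` above would instead be determined per batch rather than globally over the whole dataset.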

DeNeutoy commented 7 years ago

Yeah sure, you're right. Feel free to merge.