Allow use of a data generator for training

allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)

Apache License 2.0


Allow use of a data generator for training #283

Closed matt-peters closed 7 years ago

matt-peters commented 7 years ago

This commit allows Trainer subclasses to use a data generator for training and evaluation instead of pre-computed X and y arrays.

Behavior is controlled by the use_data_generator and steps_per_epoch keys in the params. The default behavior is unchanged (no generator is used).
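As a rough illustration of how the two keys could interact, here is a minimal, self-contained sketch. It is not the code from this PR: the helper names (read_training_options, endless_batches, run_epoch) are hypothetical, and the rule that steps_per_epoch must be set whenever the generator is enabled is an assumption made for the example. The point it shows is that a generator can loop over the data indefinitely, so something like steps_per_epoch is needed to say where one epoch ends.

```python
import itertools


def read_training_options(params):
    """Read the two keys from a params dict; the generator is off by default."""
    use_data_generator = params.get("use_data_generator", False)
    steps_per_epoch = params.get("steps_per_epoch")
    # Assumption for this sketch: an endless generator needs an explicit epoch length.
    if use_data_generator and steps_per_epoch is None:
        raise ValueError("steps_per_epoch is required when use_data_generator is set")
    return use_data_generator, steps_per_epoch


def endless_batches(instances, batch_size):
    """Hypothetical generator: cycles over the data forever, one batch at a time."""
    while True:
        for start in range(0, len(instances), batch_size):
            yield instances[start:start + batch_size]


def run_epoch(batch_generator, steps_per_epoch):
    """Consume exactly steps_per_epoch batches; this is what delimits an epoch."""
    return list(itertools.islice(batch_generator, steps_per_epoch))


params = {"use_data_generator": True, "steps_per_epoch": 3}
use_generator, steps = read_training_options(params)
epoch = run_epoch(endless_batches(list(range(10)), batch_size=4), steps)
```

With ten instances and a batch size of four, three steps cover the data exactly once; a larger steps_per_epoch would simply wrap around to the start of the data.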

matt-gardner commented 7 years ago

Have you actually run this? It'd be nice to have a test in here with a model that actually uses a generator, just to make sure this works.

matt-peters commented 7 years ago

Thanks for the comments. I have run this, and it works for my purposes. To get it working, I had to subclass Trainer and override self.prepare_data to return generators. That was beneficial in my case for a couple of reasons:

I need to change the indexing logic in the datasets to be consistent with another code base
I'd like to support padding each batch dynamically, for efficiency reasons
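The approach described above might look something like the following sketch. It is an illustration only, under stated assumptions: StandInTrainer is a stand-in class written for this example and is not the DeepQA Trainer, the prepare_data signature shown here is hypothetical, and pad_batch and padded_batch_generator are made-up helper names. The subclass overrides prepare_data to return a generator that pads each batch only to the longest sequence in that batch, rather than padding the whole dataset up front.

```python
def pad_batch(sequences, padding_value=0):
    """Pad every sequence to the length of the longest one in this list."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (longest - len(seq)) for seq in sequences]


def padded_batch_generator(inputs, labels, batch_size):
    """Yield (inputs, labels) batches forever, padding each batch on its own."""
    while True:
        for start in range(0, len(inputs), batch_size):
            batch_inputs = pad_batch(inputs[start:start + batch_size])
            batch_labels = labels[start:start + batch_size]
            yield batch_inputs, batch_labels


class StandInTrainer:
    """Stand-in base class for this sketch; NOT the DeepQA Trainer API."""

    def prepare_data(self, inputs, labels):
        # Default behavior: pad everything to the global maximum length up front.
        return pad_batch(inputs), labels


class GeneratorTrainer(StandInTrainer):
    """Hypothetical subclass that returns a batch generator instead of arrays."""

    def __init__(self, batch_size):
        self.batch_size = batch_size

    def prepare_data(self, inputs, labels):
        return padded_batch_generator(inputs, labels, self.batch_size)


inputs = [[1], [2, 3], [4, 5, 6, 7], [8, 9, 10]]
labels = [0, 1, 0, 1]
batches = GeneratorTrainer(batch_size=2).prepare_data(inputs, labels)
first = next(batches)   # padded to length 2, the longest sequence in this batch
second = next(batches)  # padded to length 4, the longest sequence in this batch
```

In this toy data the first batch is padded to length two instead of four, which is where the efficiency gain would come from when sequence lengths vary widely.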

I'll look at adding a test, but since it needs a working subclass of Trainer to run, it's not super easy to do.

nelson-liu commented 7 years ago

Ah, yeah, I was just pointing out that this wouldn't work out of the box with one of our TextTrainers. Thanks for the additional context.

matt-gardner commented 7 years ago

I can worry about adding a test for this. What I'm thinking of is pretty simple: overriding the TrueFalseModel in the test file itself, just to make sure this code is tested. I'll do that soon and push it to the PR before merging (but probably not today).

matt-gardner commented 7 years ago

Closing this, as it's superseded by #295. Thanks for this @matt-peters, it was helpful in figuring out the right way to integrate this functionality into DeepQA.