Closed matt-gardner closed 7 years ago
Ok, this is ready for another look, I think, @DeNeutoy.
Nice to see the model / embedding specific stuff removed from the tokenizers 👍
Since I'm not familiar with some of the details of how the Instances work and this is the first time I've looked at most of this code, I'm having a hard time seeing how these pieces all fit together, to e.g. take a list of sentences and return some padded arrays. Hopefully that will become apparent when tests are added.
Otherwise, I have two general comments:
(1) These classes seem to have a large public API footprint. If these methods are necessary then keep them -- but otherwise it's better to remove them or make them private sooner rather than later.
(2) I appreciate the use of `Params` for specifying options, but I'd strongly consider moving `Params` up a layer to the `Trainer` or elsewhere and using more traditional arguments for these lower-level components. For someone not familiar with this library it makes it difficult to determine how to construct these objects (and even adds an additional hurdle to just importing and using these classes, since it also requires understanding how `Params` works). It also means in most cases one needs to inspect the code to see what arguments are required and what the defaults are, as it is essentially using a `def __init__(self, **kwargs):` signature for the constructors.
On `Params` - that's how we're handling dependency injection, so you can just specify a list of parameters in a config file for easy experimentation. An alternative would be to have each class that currently takes `Params` instead have a `@classmethod` called `from_params()`, which lets the experiment framework still construct the experiment while not making the data code itself require the use of the experiment framework (there's still a code dependency, through the `from_params()` method, but it's no longer required for using it). Would that resolve your concerns with it?
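To make the proposal concrete, here is a rough sketch of the `from_params()` pattern being discussed. The class name, parameters, and the use of a plain dict in place of the real `Params` object are all hypothetical simplifications, not the actual code in this PR:

```python
# Hypothetical sketch: the class keeps an ordinary constructor, and the
# experiment framework hooks in through a classmethod instead.
class WordTokenizer:  # hypothetical name, not an actual class in the PR
    def __init__(self, lowercase: bool = False):
        self.lowercase = lowercase

    def tokenize(self, text: str):
        tokens = text.split()
        return [t.lower() for t in tokens] if self.lowercase else tokens

    @classmethod
    def from_params(cls, params: dict):
        # In the real library this would take a Params object; a plain
        # dict stands in for it here.
        return cls(lowercase=params.get("lowercase", False))

# Power users can ignore the experiment framework entirely:
tokenizer = WordTokenizer(lowercase=True)

# The experiment framework constructs the same object from a config:
tokenizer2 = WordTokenizer.from_params({"lowercase": True})
```

The point of the pattern is that only `from_params()` knows about the config machinery; the constructor stays a traditional, self-documenting signature.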
Also, the parts where the pieces fit together aren't part of this PR, so it's not surprising that you can't see it yet =). I'm almost done with those. There's an `Instance` class that contains a dictionary of named `Fields`, and a `Dataset` that batches `Instances` together. The `Dataset` can read itself from a file (or list of files), constructing `Instances` with specific `Fields`. This `Dataset` can then either be converted into one large set of arrays (keyed by field name), for `model.fit()`, or split up into batches (which are also just `Dataset` objects) by a data generator, for `model.fit_generator()`.
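A stripped-down sketch of that structure, with all names and behavior heavily simplified and hypothetical (the real classes would do much more, e.g. vocabulary indexing, file reading, and multiple field types):

```python
# Hypothetical sketch: an Instance holds named fields, and a Dataset pads
# its instances into arrays keyed by field name.
class Instance:
    def __init__(self, fields: dict):
        self.fields = fields  # e.g. {"tokens": [3, 7, 2]}

class Dataset:
    def __init__(self, instances):
        self.instances = instances

    def as_arrays(self):
        # Pad every field to the longest sequence in the dataset --
        # roughly the shape of thing model.fit() would consume.
        arrays = {}
        for name in self.instances[0].fields:
            seqs = [inst.fields[name] for inst in self.instances]
            max_len = max(len(s) for s in seqs)
            arrays[name] = [s + [0] * (max_len - len(s)) for s in seqs]
        return arrays

data = Dataset([Instance({"tokens": [3, 7, 2]}),
                Instance({"tokens": [5]})])
padded = data.as_arrays()  # {"tokens": [[3, 7, 2], [5, 0, 0]]}
```

Batching for `model.fit_generator()` would then just slice the instance list into smaller `Dataset` objects and call the same padding logic per batch.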
RE: Params, a `from_params` method would resolve it. I'm not sure how much additional overhead that adds. I was just looking for a way to decouple the experiment framework from the data generator piece. The code dependency is fine -- by using this library someone has already introduced that dependency into their stack.

EDIT to add: looking at the `Params` class, this is essentially a wrapper around a dict, right? So to construct a class with it, it's just `obj = KlassName(Params(**kwargs))`? That's pretty lightweight for a power user who is using pieces of the library. In that case I'd ignore my previous comment about it.
I think separating the experiment framework from the data code is a reasonable thing to do, and I've been conflicted about exactly how `Params` works ever since we started using it - any method for doing dependency injection has issues. I think making a `from_params` method is a reasonable compromise, so you can just ignore the whole thing if you want to use some other method.
Ok, I think I addressed all of the outstanding issues with this - unless I get a more comments, I'm going to merge this tomorrow morning sometime, and submit a new PR with the instance and dataset classes.
NOTE: This is a PR to the `new_data_api` branch, not to master. This will not pass tests as it is, I am sure. But to avoid having one massive PR once I've finished passing all of the tests, I'm going to submit pieces to the `new_data_api` branch as I go along, recognizing that tests will be broken as I'm pulling things apart and rebuilding them.

So, this PR is just to review the basic structure of how I've set this up. So far I've just got the notion of `Fields`, where a `TextField` can be tokenized and represented as arrays in various ways. `Instances` will be built up of multiple `Fields`, and `Datasets` will have multiple `Instances`. Once all of that is done, I'll worry about embedding stuff for each of the token representations.