Closed matt-gardner closed 7 years ago
Ok, this is ready for another look, I think, @DeNeutoy.
Nice to see the model / embedding specific stuff removed from the tokenizers 👍
Since I'm not familiar with some of the details of how the Instances work and this is the first time I've looked at most of this code, I'm having a hard time seeing how these pieces all fit together, to e.g. take a list of sentences and return some padded arrays. Hopefully that will become apparent when tests are added.
Otherwise, I have two general comments:
(1) These classes seem to have a large public API footprint. If these methods are necessary then keep them -- but otherwise it's better to remove them or make them private sooner rather than later.
(2) I appreciate the use of `Params` for specifying options, but I'd strongly consider moving `Params` up a layer to the `Trainer` or elsewhere and using more traditional arguments for these lower-level components. For someone not familiar with this library it makes it difficult to determine how to construct these objects (and even adds an additional hurdle to just importing and using these classes, since it also requires understanding how `Params` works). It also means in most cases one needs to inspect the code to see what arguments are required and what the defaults are, as it is essentially using a `def __init__(self, **kwargs):` signature for the constructors.
On `Params` - that's how we're handling dependency injection, so you can just specify a list of parameters in a config file for easy experimentation. An alternative would be to have each class that currently takes `Params` instead have a `@classmethod` called `from_params()`, which lets the experiment framework still construct the experiment while not making the data code itself require the use of the experiment framework (there's still a code dependency, through the `from_params()` method, but it's no longer required for using it). Would that resolve your concerns with it?
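To make the proposal concrete, here is a rough sketch of the `from_params()` pattern being discussed. The class name, parameters, and the use of a plain dict in place of the real `Params` object are all hypothetical simplifications, not the actual code in this PR:

```python
# Hypothetical sketch: the class keeps an ordinary constructor, and the
# experiment framework hooks in through a classmethod instead.
class WordTokenizer:  # hypothetical name, not an actual class in the PR
    def __init__(self, lowercase: bool = False):
        self.lowercase = lowercase

    def tokenize(self, text: str):
        tokens = text.split()
        return [t.lower() for t in tokens] if self.lowercase else tokens

    @classmethod
    def from_params(cls, params: dict):
        # In the real library this would take a Params object; a plain
        # dict stands in for it here.
        return cls(lowercase=params.get("lowercase", False))

# Power users can ignore the experiment framework entirely:
tokenizer = WordTokenizer(lowercase=True)

# The experiment framework constructs the same object from a config:
tokenizer2 = WordTokenizer.from_params({"lowercase": True})
```

The point of the pattern is that only `from_params()` knows about the config machinery; the constructor stays a traditional, self-documenting signature.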
Also, the parts where the pieces fit together aren't part of this PR, so it's not surprising that you can't see it yet =). I'm almost done with those. There's an `Instance` class that contains a dictionary of named `Fields`, and a `Dataset` that batches `Instances` together. The `Dataset` can read itself from a file (or list of files), constructing `Instances` with specific `Fields`. This `Dataset` can then either be converted into one large set of arrays (keyed by field name), for `model.fit()`, or split up into batches (which are also just `Dataset` objects) by a data generator, for `model.fit_generator()`.
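A stripped-down sketch of that structure, with all names and behavior heavily simplified and hypothetical (the real classes would do much more, e.g. vocabulary indexing, file reading, and multiple field types):

```python
# Hypothetical sketch: an Instance holds named fields, and a Dataset pads
# its instances into arrays keyed by field name.
class Instance:
    def __init__(self, fields: dict):
        self.fields = fields  # e.g. {"tokens": [3, 7, 2]}

class Dataset:
    def __init__(self, instances):
        self.instances = instances

    def as_arrays(self):
        # Pad every field to the longest sequence in the dataset --
        # roughly the shape of thing model.fit() would consume.
        arrays = {}
        for name in self.instances[0].fields:
            seqs = [inst.fields[name] for inst in self.instances]
            max_len = max(len(s) for s in seqs)
            arrays[name] = [s + [0] * (max_len - len(s)) for s in seqs]
        return arrays

data = Dataset([Instance({"tokens": [3, 7, 2]}),
                Instance({"tokens": [5]})])
padded = data.as_arrays()  # {"tokens": [[3, 7, 2], [5, 0, 0]]}
```

Batching for `model.fit_generator()` would then just slice the instance list into smaller `Dataset` objects and call the same padding logic per batch.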
RE: Params, a `from_params` method would resolve it. I'm not sure how much additional overhead that adds. I was just looking for a way to decouple the experiment framework from the data generator piece. The code dependency is fine -- by using this library someone has already introduced that dependency into their stack.

EDIT to add: looking at the `Params` class, this is essentially a wrapper around a dict, right? So to construct a class with it, it's just `obj = KlassName(Params(**kwargs))`? That's pretty lightweight for a power user who is using pieces of the library. In that case I'd ignore my previous comment about it.
I think separating the experiment framework from the data code is a reasonable thing to do, and I've been conflicted about exactly how `Params` works ever since we started using it - any method for doing dependency injection has issues. I think making a `from_params` method is a reasonable compromise, so you can just ignore the whole thing if you want to use some other method.
Ok, I think I addressed all of the outstanding issues with this - unless I get a more comments, I'm going to merge this tomorrow morning sometime, and submit a new PR with the instance and dataset classes.
NOTE: This is a PR to the `new_data_api` branch, not to master. This will not pass tests as it is, I am sure. But to avoid having one massive PR once I've finished passing all of the tests, I'm going to submit pieces to the `new_data_api` branch as I go along, recognizing that tests will be broken as I'm pulling things apart and rebuilding them.

So, this PR is just to review the basic structure of how I've set this up. So far I've just got the notion of `Fields`, where a `TextField` can be tokenized and represented as arrays in various ways. `Instances` will be built up of multiple `Fields`, and `Datasets` will have multiple `Instances`. Once all of that is done, I'll worry about embedding stuff for each of the token representations.