allenai / deep_qa

A deep NLP library, built on Keras / TensorFlow, focused on question answering (but useful for other NLP tasks too)
Apache License 2.0

Refactor instances and datasets #402

Closed matt-gardner closed 7 years ago

matt-gardner commented 7 years ago

I redid the way Fields work a bit, making them simpler and making it easier for one field to depend on another. I also moved the DatasetReaders over to return instances using this new API, and removed all of the old Instance classes. This is a mostly final implementation of how you go from text files to numpy arrays; I think all of the pieces for that part are in place. I'll update the tests as soon as you think this design is reasonable.

Actually committing this to master has to wait until we've implemented the embedding stuff, though, and changed how experiments run... It'll be a while still before we can actually switch over.
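To make the design concrete, here is a minimal, hypothetical sketch of a Field/Instance pipeline like the one described above. The class names, method signatures, and the `@@UNKNOWN@@` padding convention are illustrative assumptions, not the library's actual API.

```python
# Hypothetical sketch: an Instance holds named Fields, and each Field knows
# how to turn its tokens into a padded index array. This is an illustration
# of the general idea, not deep_qa's real implementation.
import numpy as np

class TextField:
    def __init__(self, tokens):
        self.tokens = tokens

    def as_array(self, vocab, padding_length):
        # Map tokens to indices, falling back to an unknown-word index,
        # then truncate/pad to a fixed length (0 is the padding index).
        indices = [vocab.get(token, vocab["@@UNKNOWN@@"]) for token in self.tokens]
        indices = indices[:padding_length]
        indices += [0] * (padding_length - len(indices))
        return np.array(indices, dtype=np.int64)

class Instance:
    def __init__(self, fields):
        self.fields = fields  # dict: field name -> Field

    def as_arrays(self, vocab, padding_length):
        return {name: field.as_array(vocab, padding_length)
                for name, field in self.fields.items()}

vocab = {"@@UNKNOWN@@": 1, "who": 2, "wrote": 3, "hamlet": 4}
instance = Instance({"question": TextField(["who", "wrote", "hamlet"])})
arrays = instance.as_arrays(vocab, padding_length=5)
```

The point of the indirection is that a DatasetReader only needs to build Instances of Fields; padding and array conversion happen uniformly afterwards.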

matt-gardner commented 7 years ago

@DeNeutoy, on your top-level comment: isn't the whole point of importing things in the module `__init__.py` that we can then import them from that module? You avoid things like `from data.vocabulary import Vocabulary` and can just do `from data import Vocabulary`. I'm not sure I understand your concern here.
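The re-export pattern under discussion can be demonstrated end to end. This sketch builds a throwaway `data` package on disk; the `data` / `Vocabulary` names mirror the comment above, and the temp-directory setup is just scaffolding for the demo.

```python
# Demonstrate the __init__.py re-export pattern: importing a class inside
# a package's __init__.py lets users write `from data import Vocabulary`
# instead of `from data.vocabulary import Vocabulary`.
import sys
import tempfile
from pathlib import Path

pkg_root = Path(tempfile.mkdtemp())
data_dir = pkg_root / "data"
data_dir.mkdir()

# The submodule that actually defines the class.
(data_dir / "vocabulary.py").write_text(
    "class Vocabulary:\n"
    "    pass\n"
)
# Re-export it in __init__.py so it is importable from the package itself.
(data_dir / "__init__.py").write_text(
    "from data.vocabulary import Vocabulary\n"
)

sys.path.insert(0, str(pkg_root))
from data import Vocabulary  # works thanks to the __init__.py re-export
```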

Other than that, I think I've addressed the outstanding issues, and if you're happy with it, feel free to merge and work on the next stuff.

DeNeutoy commented 7 years ago

That should be the objective for outside users of the library. Within the library, though, it causes problems, because it makes the order of imports in `__init__.py` significant. It also makes circular imports easy to introduce (as here, where the abstract class imports the token_indexers from the package `__init__`, while the `__init__` itself imports the abstract class). At the very least, we should avoid doing this within the library's own modules.
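The circular-import hazard described above can be reproduced in a few lines. This sketch builds a throwaway package whose `__init__.py` imports a base module that, in turn, imports a name back out of the still-partially-initialized `__init__`; the `mypkg` / `TokenIndexer` / `Base` names are illustrative stand-ins for the real ones.

```python
# Minimal reproduction of a circular import through a package __init__.py.
import sys
import tempfile
from pathlib import Path

pkg_root = Path(tempfile.mkdtemp())
pkg = pkg_root / "mypkg"
pkg.mkdir()

(pkg / "token_indexers.py").write_text(
    "class TokenIndexer:\n"
    "    pass\n"
)
# base.py pulls TokenIndexer from the package __init__, not the submodule.
(pkg / "base.py").write_text(
    "from mypkg import TokenIndexer  # __init__ hasn't defined this yet\n"
    "class Base:\n"
    "    pass\n"
)
# __init__.py imports base first, so base.py runs before TokenIndexer is
# re-exported -- the import below it never gets a chance to execute.
(pkg / "__init__.py").write_text(
    "from mypkg.base import Base\n"
    "from mypkg.token_indexers import TokenIndexer\n"
)

sys.path.insert(0, str(pkg_root))
try:
    import mypkg
    error = None
except ImportError as exc:
    error = exc  # "cannot import name 'TokenIndexer' from partially initialized module"
```

Importing `TokenIndexer` directly from `mypkg.token_indexers` inside `base.py` would break the cycle, which is the fix being suggested: submodules import from each other, and only `__init__.py` does the re-exporting.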