allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Model serialization and deserialization: ensuring vocabularization consistency? #4285

Closed johntiger1 closed 4 years ago

johntiger1 commented 4 years ago

In this line, it says that we can use our model for inference by simply doing torch.load(...).

https://github.com/allenai/allennlp/blob/e52fea2801fefc07808ee2039a086a9abbf21a1e/allennlp/training/trainer.py#L910

However, don't we need to ensure certain things are consistent? For instance, the mapping from tokens to indices, the instances yielded by the dataset_reader, and so on.

I looked into https://github.com/allenai/allennlp/blob/master/allennlp/models/archival.py, and it seems like we need some archiving, but then why doesn't the default training code seem to use archives? How is it able to restore training without the dataset reader?

matt-gardner commented 4 years ago

What you linked is internal documentation of a private method (it begins with an underscore), and it is not complete. It only points to what you would have to do to replace that particular method. See the code example here for what goes into saving and loading a model: https://allennlp-course.apps.allenai.org/building-your-model#3.

The section on what to do if you're not using config files isn't written yet, but the gist is that you have to create the model using the same constructor arguments as when the model was saved, then call the model.load_state_dict method seen there, in addition to the vocabulary handling shown in what I linked.
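A minimal end-to-end sketch of that no-config workflow, using plain PyTorch rather than AllenNLP's actual classes (the model class, file names, and JSON vocabulary format here are all illustrative assumptions, not AllenNLP's API):

```python
import json
import torch
from torch import nn

class TinyClassifier(nn.Module):
    """Stand-in for an AllenNLP model: the embedding is sized from the vocab,
    which is why the vocabulary must be restored before construction."""
    def __init__(self, vocab_size: int, num_labels: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 8)
        self.out = nn.Linear(8, num_labels)

    def forward(self, token_ids):
        return self.out(self.embed(token_ids).mean(dim=1))

# --- saving: weights alone are not enough; persist the vocab too ---
token_to_index = {"@@PADDING@@": 0, "the": 1, "cat": 2}
model = TinyClassifier(vocab_size=len(token_to_index))
torch.save(model.state_dict(), "weights.th")
with open("vocab.json", "w") as f:
    json.dump(token_to_index, f)

# --- loading: rebuild the vocab first, then construct the model with the
# *same* constructor arguments before calling load_state_dict ---
with open("vocab.json") as f:
    restored_vocab = json.load(f)
restored = TinyClassifier(vocab_size=len(restored_vocab))
restored.load_state_dict(torch.load("weights.th"))
restored.eval()  # switch to inference mode
```

If the constructor arguments drift between save and load (say, a different vocab size), load_state_dict fails with a shape mismatch, which is exactly the consistency problem the question is about.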

johntiger1 commented 4 years ago

Thanks @matt-gardner . Will take a look. In the meantime I think I've found a way that simply pickles the dataset reader + vocab, but this is completely ad-hoc and likely brittle.
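For concreteness, the ad-hoc pickling approach described above might look like this (the reader class and file name are hypothetical stand-ins, not actual AllenNLP objects):

```python
import pickle

class MyDatasetReader:
    """Stand-in for a real DatasetReader; any pickle-able reader works."""
    def __init__(self, max_length: int):
        self.max_length = max_length

reader = MyDatasetReader(max_length=128)
vocab = {"the": 1, "cat": 2}

# Pickle the reader and vocab alongside the model weights so that
# inference code can rebuild the same token-to-index mapping.
with open("preprocessing.pkl", "wb") as f:
    pickle.dump({"reader": reader, "vocab": vocab}, f)

with open("preprocessing.pkl", "rb") as f:
    restored = pickle.load(f)
```

Note the brittleness the thread mentions: unpickling requires the reader's class to be importable under the same module path at load time, so refactoring the code can silently break old pickles.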

matt-gardner commented 4 years ago

Yes, that's fine too, as long as all of your objects are pickle-able. Most ways of saving and loading are somewhat brittle, unless you have a standard format that includes configuration; hence our config file approach to things.

johntiger1 commented 4 years ago

Thanks @matt-gardner that makes sense. I have my own opinions on the config-based approach (for instance, memory blow-up using allennlp-train vs precise memory and garbage collection control via a code-first approach) but won't bore you with the details. Would love to give back and contribute something (a guide, opinion piece) after EMNLP though!

matt-gardner commented 4 years ago

The only memory difference between the two approaches should be which Instances are in memory when, and you should be able to control that with `lazy`. If you're seeing drastic memory differences between the two, we'd like to know about them.
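The distinction `lazy` controls can be sketched in plain Python (illustrative functions, not AllenNLP code): an eager reader materializes every instance as a list, while a lazy reader yields them one at a time, so only the current instance needs to be resident.

```python
from typing import Iterator, List

# Eager: every instance lives in memory for the whole run.
def read_eager(raw: List[str]) -> List[dict]:
    return [{"tokens": line.split()} for line in raw]

# Lazy: instances are produced one at a time each epoch, so peak memory
# is bounded by the batch, not the dataset (what lazy reading does conceptually).
def read_lazy(raw: List[str]) -> Iterator[dict]:
    for line in raw:
        yield {"tokens": line.split()}

data = ["the cat sat", "on the mat"]
eager = read_eager(data)
lazy = list(read_lazy(data))  # identical instances, different residency
```

The trade-off is that a lazy reader re-reads (and re-tokenizes) the data every epoch, which is why eager loading is the default when the dataset fits in memory.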