bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

model/embedding versioning? #31

Closed aparrish closed 5 years ago

aparrish commented 5 years ago

To save time and space, I'd like to be able to pre-process the text in my corpus and only store the encoded IDs from encode_ids(). But I'm worried about running into a situation where the models/embeddings are updated (retrained on a different corpus, different version of sentencepiece, etc.) such that the IDs I've stored no longer line up with the IDs in the model I used to pre-process the corpus.
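To make this concrete, here's roughly the round trip I have in mind (a minimal sketch; the corpus file name is made up):

```python
import json

from bpemb import BPEmb

# English model: 10k-symbol BPE vocabulary, 100-dim embeddings
bpemb_en = BPEmb(lang="en", vs=10000, dim=100)

# Pre-process once and store only the subword IDs, not the text
ids = bpemb_en.encode_ids("This is a sentence.")
with open("corpus_ids.json", "w") as f:
    json.dump(ids, f)

# Later, possibly on another machine: these IDs are row indices into
# bpemb_en.vectors, so they only mean the same thing if both machines
# ended up with byte-identical model files
with open("corpus_ids.json") as f:
    ids = json.load(f)
print(bpemb_en.decode_ids(ids))
```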

This could come up if, e.g., I pre-processed the text on one machine and then wanted to train a model with the pre-processed data on another machine: because the model files are downloaded and loaded at run-time, there's no guarantee (in the code, or in the documentation) that using pip install BPEmb==0.3.0 will lead to the same model being used across two different installations.

There are some simple downstream workarounds for this, I guess, like manually packaging the cached download as part of my own deployment process, or forking the code so that the download URLs point to my own copies of the models. And I assume that training these vectors is time- and energy-consuming, and you're not in the habit of updating them frequently in ways that would break backwards compatibility. Still, it would be nice to have some assurance (technical or verbal) that installations of the package and the data are easily and fully repeatable!
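For instance, the "package the cached download" route could be made a bit safer by recording a checksum of the model file after the first download and verifying it on every machine before use. A rough sketch (the cache path is what I see on my machine, and the hash value is a placeholder):

```python
import hashlib
from pathlib import Path

# Placeholder value: record the real hash after the first download
EXPECTED_SHA256 = "0000...0000"  # hypothetical

# Default bpemb cache location on my machine; adjust if you pass cache_dir
model_file = Path.home() / ".cache" / "bpemb" / "en" / "en.wiki.bpe.vs10000.model"

def file_sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

assert file_sha256(model_file) == EXPECTED_SHA256, "cached model file changed"
```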

(One idea would be to make copies of the model files you're distributing so that they have names like eo.wiki.bpe.vs10000.d100.v001.w2v.txt.tar.gz or something? And the v001 could be updated to v002 if the model is updated in compatibility-breaking ways. Mappings between models and Python package versions could be hard-coded, or the BPEmb constructor function could give the option to download a particular version of the model.)
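(Sketched very roughly, the hard-coded mapping might look like this; nothing like it exists in bpemb today, and the names and URL scheme are made up for illustration:

```python
# Hypothetical version registry; all names and URLs invented for illustration
MODEL_URLS = {
    ("eo", 10000, 100, "v001"):
        "https://nlp.h-its.org/bpemb/eo/eo.wiki.bpe.vs10000.d100.v001.w2v.txt.tar.gz",
    # ("eo", 10000, 100, "v002"): ...  # added only on compatibility-breaking retrains
}

def model_url(lang, vs, dim, version="v001"):
    """Resolve a (language, vocab size, dimension, version) tuple to a download URL."""
    return MODEL_URLS[(lang, vs, dim, version)]
```
)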

bheinzerling commented 5 years ago

I considered versioning the embeddings when fixing this issue https://github.com/bheinzerling/bpemb/issues/2, since versioning has the benefits you pointed out, but ultimately decided against it.

So I don't foresee any compatibility-breaking changes to the embeddings or the SentencePiece models, and I think it should be quite safe to store the IDs. I can't speak for the SentencePiece package itself, though: a new SentencePiece version might read the same model file differently than a previous one and then assign different IDs. But I think this is very unlikely, and since SentencePiece is on pip, you can pin the package version.
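For example, pinning both packages in a requirements file should keep the ID assignment stable across installations (the sentencepiece version below is just an example; use whatever version you tested with):

```
bpemb==0.3.0
sentencepiece==0.1.85
```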

aparrish commented 5 years ago

Okay, thank you for the clarification!