Closed aparrish closed 5 years ago
I considered versioning the embeddings when fixing this issue https://github.com/bheinzerling/bpemb/issues/2, since versioning has the benefits you pointed out, but ultimately decided against it, mainly because:
So I don't foresee any compatibility-breaking changes in the embeddings and SentencePiece models and think it should be quite safe to store the IDs. I cannot speak for SentencePiece, though: it might be the case that a new SentencePiece version reads the same SentencePiece model differently than a previous version and then assigns different IDs, but I think this is very unlikely, and since it's on pip, you can pin the SentencePiece package version.
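If you do pin the SentencePiece version, a small guard at the start of your own pipeline can confirm that the pin actually took effect on each machine. This is only a sketch: `check_pinned` is a hypothetical helper, not part of BPEmb or SentencePiece, and the expected version string is whatever you recorded when you pre-processed the corpus.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pinned(package, expected, get_version=version):
    """Return True if the installed `package` is exactly at version `expected`.

    `get_version` defaults to importlib.metadata.version; it is a parameter
    only so the check can be exercised without the package being installed.
    """
    try:
        installed = get_version(package)
    except PackageNotFoundError:
        # Package is not installed at all, so the pin cannot be satisfied.
        return False
    return installed == expected
```

You would call this with the package name and the version you recorded at pre-processing time, and refuse to decode stored IDs if it returns `False`.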
Okay, thank you for the clarification!
To save time and space, I'd like to be able to pre-process the text in my corpus and only store the encoded IDs from `encode_ids()`. But I'm worried about running into a situation where the models/embeddings are updated (retrained on a different corpus, different version of sentencepiece, etc.) such that the IDs I've stored no longer line up with the IDs in the model I used to pre-process the corpus.

This could come up if, e.g., I pre-processed the text on one machine and then wanted to train a model with the pre-processed data on another machine: because the models are loaded at run-time, there's no guarantee (in the code, or in the documentation) that using `pip install BPEmb==0.3.0` will lead to the same model being used across two different installations.

There are some simple downstream workarounds for this, I guess, like manually packaging the cached download as part of my own deployment process, or forking the code so that the download URLs point to my own copies of the models. And I assume that training these vectors is time- and energy-consuming, and you're not in the habit of updating them frequently in ways that would break backwards compatibility. Still, it would be nice to have some assurance (technical or verbal) that installations of the package and the data are easily and fully repeatable!
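Another downstream workaround, not something BPEmb provides, would be to store a checksum of the model file next to the encoded IDs and verify it before using them on another machine. A rough sketch, assuming you know the path of the cached model file (`model_path` here is a placeholder, and all three function names are hypothetical):

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def save_ids_with_fingerprint(ids, model_path, out_path):
    """Write the encoded IDs together with the model file's checksum."""
    record = {"model_sha256": file_sha256(model_path), "ids": ids}
    Path(out_path).write_text(json.dumps(record))


def load_ids_checked(in_path, model_path):
    """Load stored IDs, failing loudly if the local model file differs."""
    record = json.loads(Path(in_path).read_text())
    if record["model_sha256"] != file_sha256(model_path):
        raise ValueError("model file differs from the one used to encode these IDs")
    return record["ids"]
```

This doesn't make installations repeatable, but it at least turns a silent ID mismatch into an explicit error.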
(One idea would be to make copies of the model files you're distributing so that they have names like `eo.wiki.bpe.vs10000.d100.v001.w2v.txt.tar.gz` or something? And the `v001` could be updated to `v002` if the model is updated in compatibility-breaking ways. Mappings between models and Python package versions could be hard-coded, or the `BPEmb` constructor function could give the option to download a particular version of the model.)
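To make the naming idea concrete, here is a minimal sketch of how such versioned file names and a hard-coded package-to-model mapping might look. Everything in it is hypothetical: `versioned_model_name` and `MODEL_VERSION_FOR_PACKAGE` are not part of BPEmb, and the single mapping entry is only an illustration.

```python
# Hypothetical mapping from Python package version to model version.
# The entry below is illustrative only, not an actual BPEmb release policy.
MODEL_VERSION_FOR_PACKAGE = {
    "0.3.0": 1,
}


def versioned_model_name(lang, vocab_size, dim, model_version):
    """Build a model file name that carries an explicit version tag.

    Follows the pattern suggested above, e.g.
    eo.wiki.bpe.vs10000.d100.v001.w2v.txt.tar.gz
    """
    return (
        f"{lang}.wiki.bpe.vs{vocab_size}.d{dim}"
        f".v{model_version:03d}.w2v.txt.tar.gz"
    )


name = versioned_model_name("eo", 10000, 100, MODEL_VERSION_FOR_PACKAGE["0.3.0"])
print(name)  # eo.wiki.bpe.vs10000.d100.v001.w2v.txt.tar.gz
```

A constructor option for choosing the model version could then default to the value from the mapping while still letting callers request an older version explicitly.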