bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

load vectors from path #23

Closed: alejandrojcastaneira closed this issue 5 years ago

alejandrojcastaneira commented 5 years ago

Hello, sorry if this is a newbie question. Is there a way to load a locally downloaded BPEmb vectors file, instead of reading it from the cache every time, or downloading it again if the local cache was cleared?

Best Regards.

bheinzerling commented 5 years ago

If you manually downloaded BPEmb vectors and SentencePiece models and put them into /some/folder, you can point the cache_dir argument of the BPEmb constructor there:

from pathlib import Path
from bpemb import BPEmb

bpemb = BPEmb(lang='en', cache_dir=Path('/some/folder'))
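Once constructed this way, the object behaves like one loaded from the default cache, so a quick check that the local files were picked up might look like this (the example pieces are illustrative; actual output depends on the vocabulary size of your files):

print(bpemb.encode("Stratford"))       # subword pieces, e.g. ['▁strat', 'ford']
print(bpemb.embed("Stratford").shape)  # (number of pieces, embedding dimension)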

If you trained your own embeddings and SentencePiece model, this is currently not supported (though it would not be difficult to implement). As a workaround, you should be able to monkey-patch your own embeddings and SentencePiece model onto a BPEmb instance:

from bpemb import BPEmb
from bpemb.util import sentencepiece_load, load_word2vec_file

# construct with a pre-trained model, then swap in your own files
bpemb = BPEmb(lang='en')
bpemb.spm = sentencepiece_load('/some/folder/sentencepiece.model')
bpemb.emb = load_word2vec_file('/some/folder/my_byte_pair_emb.w2v.bin')
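One caveat: attributes set in the constructor, such as bpemb.dim, still describe the downloaded English model after this patch. A minimal consistency sketch, assuming gensim's KeyedVectors attributes and the classic sentencepiece API:

# keep derived attributes in sync with the swapped-in files
bpemb.dim = bpemb.emb.vector_size

# every SentencePiece piece should have a corresponding vector
assert bpemb.spm.GetPieceSize() == bpemb.emb.vectors.shape[0]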