bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

Vocabulary size issue #43

Closed aimanmutasem closed 4 years ago

aimanmutasem commented 4 years ago

Dear @all,

I'm trying to load the English BPEmb model with vocabulary size 30k and 300-dimensional embeddings.

from bpemb import BPEmb
bpemb_en = BPEmb(lang="en", vs=30000, dim=300)

Every time I get the same error:

    BPEmb fallback: en from vocab size 30000 to 200000

    RuntimeError                              Traceback (most recent call last)
    in
    ----> 1 bpemb_en = BPEmb(lang="en", vs=30000, dim=300)

    ~/anaconda3/lib/python3.6/site-packages/bpemb/bpemb.py in __init__(self, lang, vs, dim, cache_dir, preprocess, encode_extra_options, add_pad_emb, vs_fallback, segmentation_only, model_file, emb_file)
        172             model_file = self.model_tpl.format(lang=lang, vs=vs)
        173             self.model_file = self._load_file(model_file)
    --> 174             self.spm = sentencepiece_load(self.model_file)
        175             self.vocab_size = self.vs = self.spm.get_piece_size()
        176             if encode_extra_options:

    ~/anaconda3/lib/python3.6/site-packages/bpemb/util.py in sentencepiece_load(file)
          7     from sentencepiece import SentencePieceProcessor
          8     spm = SentencePieceProcessor()
    ----> 9     spm.Load(str(file))
         10     return spm
         11

    ~/anaconda3/lib/python3.6/site-packages/sentencepiece.py in Load(self, filename)
        116
        117     def Load(self, filename):
    --> 118         return _sentencepiece.SentencePieceProcessor_Load(self, filename)
        119
        120     def LoadOrDie(self, filename):

    RuntimeError: Internal: /sentencepiece/src/sentencepiece_processor.cc(73) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

Any suggestions to fix this error?

Regards,

bheinzerling commented 4 years ago

Vocab size 30000 is not supported. Please choose one of the vocab sizes listed here: https://nlp.h-its.org/bpemb/en/
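
If it helps, here is a minimal sketch of picking a supported size up front instead of relying on the fallback. `nearest_supported_vs` and `load_bpemb` are hypothetical helpers, not part of bpemb, and the list of sizes is an assumption for illustration only; please check the page linked above for the sizes actually published for your language.

```python
# Hypothetical helpers, not part of the bpemb package.
# ASSUMPTION: this tuple is illustrative only. Verify the sizes that are
# actually published for your language on the BPEmb download page.
ASSUMED_SIZES = (1000, 3000, 5000, 10000, 25000, 50000, 100000, 200000)


def nearest_supported_vs(requested, supported=ASSUMED_SIZES):
    """Return the supported vocab size closest to `requested`.

    Ties are resolved in favour of the smaller size.
    """
    if not supported:
        raise ValueError("supported must not be empty")
    # sorted() makes min() see the smaller size first when distances tie.
    return min(sorted(supported), key=lambda size: abs(size - requested))


def load_bpemb(lang, requested_vs, dim):
    """Load a BPEmb model using the closest assumed-supported vocab size."""
    # Imported lazily so the helper above can be used without bpemb installed.
    from bpemb import BPEmb

    vs = nearest_supported_vs(requested_vs)
    return BPEmb(lang=lang, vs=vs, dim=dim)
```

With the assumed list, a request for 30000 would be mapped to 25000 before the model is loaded, e.g. `load_bpemb("en", 30000, 300)`.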

aimanmutasem commented 4 years ago

Thank you @bheinzerling, this is my fault.

But I'm a little confused, because it worked well for Arabic:

bpemb_ar = BPEmb(lang="ar", vs=30000, dim=300)