bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

Encoder not splitting words into subwords #16

Closed: SamLynnEvans closed this issue 5 years ago

SamLynnEvans commented 5 years ago

(Screenshot, 2018-11-26: output of bpemb_en.encode)

Running bpemb_en.encode only splits the words on spaces, commas, etc., and doesn't split them into subwords... do you know what's up? Thanks :)
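For reference, roughly what the call in the screenshot looked like (the sentence below is just an illustration, not the one from the screenshot, and the printed output is what I'd expect rather than a verbatim copy):

```python
from bpemb import BPEmb

# Load the pre-trained English BPE model; vs is the subword vocabulary size.
bpemb_en = BPEmb(lang="en", vs=10000, dim=100)

# Most frequent words come back whole instead of being split into subwords.
print(bpemb_en.encode("The hotel is in Stratford"))
# e.g. ['▁the', '▁hotel', '▁is', '▁in', '▁strat', 'ford']
```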

bheinzerling commented 5 years ago

Thanks for noticing, that's a mistake in the example. Whether and how a word gets split depends on the vocabulary size. Generally, a smaller vocabulary size yields a segmentation into many subwords, while a larger vocabulary size results in frequent words not being split (see the table and the snippet after it):

| vocabulary size | segmentation |
| --- | --- |
| 1000 | `['▁str', 'at', 'f', 'ord']` |
| 3000 | `['▁str', 'at', 'ford']` |
| 5000 | `['▁str', 'at', 'ford']` |
| 10000 | `['▁strat', 'ford']` |
| 25000 | `['▁stratford']` |
| 50000 | `['▁stratford']` |
| 100000 | `['▁stratford']` |
| 200000 | `['▁stratford']` |
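A quick way to see this yourself; this is just a sketch, and note that each vocabulary size is a separate pre-trained model, so the loop downloads three models on first use:

```python
from bpemb import BPEmb

# Compare how the same word is segmented under different vocabulary sizes.
for vs in (1000, 10000, 50000):
    bpemb_en = BPEmb(lang="en", vs=vs, dim=25)
    print(vs, bpemb_en.encode("Stratford"))

# Expected (per the table above):
# 1000  ['▁str', 'at', 'f', 'ord']
# 10000 ['▁strat', 'ford']
# 50000 ['▁stratford']
```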

It turns out I had loaded the English model with the default vocabulary size (10,000) but copy&pasted the wrong line for loading the 50,000 model.

I'll add a better explanation to the website that also shows that not all vocabulary sizes give good segmentations.

SamLynnEvans commented 5 years ago

Brilliant, thanks for the swift reply!