bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

Encoder not splitting words into subwords #16

Closed: SamLynnEvans closed this issue 5 years ago

SamLynnEvans commented 5 years ago

(Screenshot, 2018-11-26: output of bpemb_en.encode)

Running bpemb_en.encode only splits the words on spaces, commas, etc., and doesn't split them into subwords... do you know what's up? Thanks :)
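For reference, roughly what the call in the screenshot looked like (the sentence below is just an illustration, not the one from the screenshot, and the printed output is what I'd expect rather than a verbatim copy):

```python
from bpemb import BPEmb

# Load the pre-trained English BPE model; vs is the subword vocabulary size.
bpemb_en = BPEmb(lang="en", vs=10000, dim=100)

# Most frequent words come back whole instead of being split into subwords.
print(bpemb_en.encode("The hotel is in Stratford"))
# e.g. ['▁the', '▁hotel', '▁is', '▁in', '▁strat', 'ford']
```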

bheinzerling commented 5 years ago

Thanks for noticing, that's a mistake in the example. Whether and how a word gets split depends on the vocabulary size. Generally, a smaller vocabulary size yields a segmentation into many subwords, while a larger vocabulary size results in frequent words not being split (see the table and the snippet after it):

| vocabulary size | segmentation |
| --- | --- |
| 1000 | `['▁str', 'at', 'f', 'ord']` |
| 3000 | `['▁str', 'at', 'ford']` |
| 5000 | `['▁str', 'at', 'ford']` |
| 10000 | `['▁strat', 'ford']` |
| 25000 | `['▁stratford']` |
| 50000 | `['▁stratford']` |
| 100000 | `['▁stratford']` |
| 200000 | `['▁stratford']` |
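A quick way to see this yourself; this is just a sketch, and note that each vocabulary size is a separate pre-trained model, so the loop downloads three models on first use:

```python
from bpemb import BPEmb

# Compare how the same word is segmented under different vocabulary sizes.
for vs in (1000, 10000, 50000):
    bpemb_en = BPEmb(lang="en", vs=vs, dim=25)
    print(vs, bpemb_en.encode("Stratford"))

# Expected (per the table above):
# 1000  ['▁str', 'at', 'f', 'ord']
# 10000 ['▁strat', 'ford']
# 50000 ['▁stratford']
```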

It turns out I had loaded the English model with the default vocabulary size (10,000) but copy&pasted the wrong line for loading the 50,000 model.

I'll add a better explanation to the website that also shows that not all vocabulary sizes give good segmentations.

SamLynnEvans commented 5 years ago

Brilliant, thanks for the swift reply!