Thanks for noticing, that's a mistake in the example. Whether and how a word gets split depends on the vocabulary size. Generally, a smaller vocabulary yields a segmentation into many subwords, while a larger vocabulary results in frequent words not being split at all:
| vocabulary size | segmentation |
| --- | --- |
| 1000 | ['▁str', 'at', 'f', 'ord'] |
| 3000 | ['▁str', 'at', 'ford'] |
| 5000 | ['▁str', 'at', 'ford'] |
| 10000 | ['▁strat', 'ford'] |
| 25000 | ['▁stratford'] |
| 50000 | ['▁stratford'] |
| 100000 | ['▁stratford'] |
| 200000 | ['▁stratford'] |
Turns out I had loaded the English model with the default vocabulary size (10,000) but copy-pasted the wrong line for loading the 50,000 model.
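For anyone who hits the same thing, here's a minimal sketch of how the vocabulary size passed at load time changes what `encode` returns. It assumes the standard BPEmb constructor with `lang`, `vs`, and `dim` arguments; `dim=100` is just an illustrative choice.

```python
from bpemb import BPEmb

# Compare segmentations across vocabulary sizes (vs).
# Each call loads (and, if needed, downloads) the corresponding English BPE model.
for vs in [1000, 10000, 50000]:
    bpemb_en = BPEmb(lang="en", vs=vs, dim=100)
    # encode() returns the subword segmentation as a list of strings.
    print(vs, bpemb_en.encode("stratford"))

# Per the table above, this should print something like:
# 1000  ['▁str', 'at', 'f', 'ord']
# 10000 ['▁strat', 'ford']
# 50000 ['▁stratford']
```

If `vs` is left at its default (10,000), you get the coarser two-piece segmentation above, which is exactly what happened with the copy-pasted loading line.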
I'll add a better explanation to the website that also shows that not all vocabulary sizes give good segmentations.
Brilliant, thanks for the swift reply!
Running bpemb_en.encode only splits the words on spaces, commas, etc., rather than splitting them into subwords... do you guys know what's up? Thanks :)