dmlc / gluon-nlp


BPE's default alpha with sentencepiece #1239

Closed: haven-jeon closed this issue 4 years ago

haven-jeon commented 4 years ago

Description

Since BPE-dropout was recently added to sentencepiece, tokenization can be sampling-based. alpha=1 is intended for regularized training, not for inference.

The current default alpha=1 is not appropriate, because most users, and the models shipped with GluonNLP, expect deterministic tokenization.

https://github.com/google/sentencepiece/issues/371
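
For context, this maps onto the underlying sentencepiece Python API: encode_as_pieces is deterministic, while sample_encode_as_pieces performs subword regularization / BPE-dropout controlled by alpha. A minimal sketch of the distinction, assuming the sentencepiece package is installed; the local model path below is a placeholder:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('kogpt2_news_wiki_ko_cased_818bfa919d.spiece')  # placeholder local path

# Deterministic segmentation: identical output on every call.
print(sp.encode_as_pieces('안녕하세요.'))

# Sampled segmentation: positional args are (text, nbest_size, alpha); with a BPE
# model, alpha acts as the merge-dropout probability, so output can differ per call.
print(sp.sample_encode_as_pieces('안녕하세요.', 0, 0.5))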

To Reproduce

import gluonnlp as nlp
from mxnet import gluon

path = gluon.utils.download('https://kobert.blob.core.windows.net/models/kogpt2/tokenizer/kogpt2_news_wiki_ko_cased_818bfa919d.spiece')

# Default arguments (num_best=0, alpha=1.0): the sentence comes back split into single characters.
tok = nlp.data.SentencepieceTokenizer(path)
tok('안녕하세요.')
['▁', '안', '녕', '하', '세', '요', '.']

# Sampling explicitly requested (num_best=0, alpha=0.5): the segmentation changes between calls.
tok = nlp.data.SentencepieceTokenizer(path, 0, 0.5)
tok('안녕하세요.')
['▁', '안', '녕', '하', '세요', '.']
tok('안녕하세요.')
['▁안', '녕', '하', '세요', '.']
tok('안녕하세요.')
['▁안녕', '하', '세요', '.']
tok('안녕하세요.')

# alpha=0 disables dropout and gives the expected deterministic segmentation.
tok = nlp.data.SentencepieceTokenizer(path, num_best=0, alpha=0)
tok('안녕하세요.')
['▁안녕하세요', '.']
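
One possible direction for a fix, as a rough sketch only (not what the eventual PR actually does): keep the sampling parameters, default alpha to 0.0, and call the sampling API only when sampling is explicitly requested. The class name and structure here are hypothetical:

import sentencepiece as spm

class DeterministicByDefaultTokenizer:
    # Hypothetical wrapper illustrating the proposed defaults; not GluonNLP's actual class.
    def __init__(self, model_path, num_best=0, alpha=0.0):
        self._sp = spm.SentencePieceProcessor()
        self._sp.load(model_path)
        self._num_best = num_best
        self._alpha = alpha

    def __call__(self, text):
        if self._alpha:
            # Subword regularization / BPE-dropout, opt-in only.
            return self._sp.sample_encode_as_pieces(text, self._num_best, self._alpha)
        # Default path: deterministic segmentation.
        return self._sp.encode_as_pieces(text)

With alpha defaulting to 0.0, pretrained models keep reproducing the segmentation they expect, while users who want BPE-dropout can still pass a non-zero alpha.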
szha commented 4 years ago

agreed. feel free to propose a PR to update this.

sxjscience commented 4 years ago

Nice observation. I've fixed it in the new version: see https://github.com/dmlc/gluon-nlp/pull/1225