Closed. haven-jeon closed this issue 4 years ago.
Description

Since BPE-dropout was recently added to sentencepiece, tokenization can be sampled. alpha=1 is intended for training with subword regularization, not for inference. The default alpha=1 is not appropriate, because most users and the models provided by gluonnlp expect deterministic tokenization.

https://github.com/google/sentencepiece/issues/371

To Reproduce
>>> import gluonnlp as nlp
>>> from mxnet import gluon
>>> path = gluon.utils.download('https://kobert.blob.core.windows.net/models/kogpt2/tokenizer/kogpt2_news_wiki_ko_cased_818bfa919d.spiece')
>>> tok = nlp.data.SentencepieceTokenizer(path)
>>> tok('안녕하세요.')
['▁', '안', '녕', '하', '세', '요', '.']
>>> tok = nlp.data.SentencepieceTokenizer(path, 0, 0.5)
>>> tok('안녕하세요.')
['▁', '안', '녕', '하', '세요', '.']
>>> tok('안녕하세요.')
['▁안', '녕', '하', '세요', '.']
>>> tok('안녕하세요.')
['▁안녕', '하', '세요', '.']
>>> tok('안녕하세요.')
>>> tok = nlp.data.SentencepieceTokenizer(path, num_best=0, alpha=0)
>>> tok('안녕하세요.')
['▁안녕하세요', '.']
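The same contrast can be seen with the sentencepiece Python API directly. The snippet below is only an illustrative sketch, assuming the sentencepiece and mxnet packages are installed; it re-downloads the same model file used in the reproduction above.

import sentencepiece as spm
from mxnet import gluon

# Same model file as in the reproduction above.
path = gluon.utils.download('https://kobert.blob.core.windows.net/models/kogpt2/tokenizer/kogpt2_news_wiki_ko_cased_818bfa919d.spiece')

sp = spm.SentencePieceProcessor()
sp.Load(path)

# Deterministic (best-path) segmentation: identical pieces on every call.
print(sp.EncodeAsPieces('안녕하세요.'))

# Sampled segmentation (subword regularization): pieces can change per call.
# nbest_size=-1 samples over all candidate segmentations; alpha controls the
# smoothness of the sampling distribution and is meant for training time.
print(sp.SampleEncodeAsPieces('안녕하세요.', -1, 0.1))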
agreed. feel free to propose a PR to update this.
Nice observation. I've fixed it in the new version: see https://github.com/dmlc/gluon-nlp/pull/1225
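For anyone who cannot upgrade yet, the idea behind a deterministic-by-default tokenizer looks roughly like the sketch below. This is only an illustration, not the contents of the linked PR; the class name and defaults are assumptions.

import sentencepiece as spm

class SPTokenizer:
    """Illustrative wrapper: deterministic unless sampling is requested."""

    def __init__(self, path, num_best=0, alpha=0.0):
        self._sp = spm.SentencePieceProcessor()
        self._sp.Load(path)
        self._num_best = num_best
        self._alpha = alpha

    def __call__(self, text):
        if self._alpha == 0.0:
            # Default: plain (deterministic) segmentation for inference.
            return self._sp.EncodeAsPieces(text)
        # Opt-in: sampled segmentation for training with subword regularization.
        return self._sp.SampleEncodeAsPieces(text, self._num_best, self._alpha)

With alpha left at 0.0, the example sentence above tokenizes the same way on every call; sampling only happens when a non-zero alpha is passed explicitly.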