[Numpy Refactor] Tokenizers wishlist

We revised the implementation of tokenizers in the new version of GluonNLP.

Basically we have integrated the following tokenizers:

whitespace
spacy
jieba
SentencePiece
YTTM
HuggingFaceBPE
HuggingFaceByteBPE
HuggingFaceWordPiece

For all tokenizers, we support the following methods:

Encode into a list of integers: encode('hello world!', int)
Encode into a list of string tokens: encode('hello world!', str)
Encode into a list of integers + offsets: encode_with_offsets('hello world!', int)
Encode into a list of string tokens + offsets: encode_with_offsets('hello world!', str)

To give an example, we load the tokenizer in ALBERT, which is a SentencePieceTokenizer and illustrate these functionalities:

In [1]: from gluonnlp.models.albert import get_pretrained_albert                

In [2]: cfg, tokenizer, _,_ = get_pretrained_albert()                           

In [3]: tokenizer                                                               
Out[3]: 
SentencepieceTokenizer(
   model_path = /Users/xjshi/.mxnet/models/nlp/google_albert_base_v2/spm-65999e5d.model
   do_lower = True, nbest = 0, alpha = 0.0
   vocab = Vocab(size=30000, unk_token="<unk>", pad_token="<pad>", cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]")
)

In [4]: tokenizer.encode('hello world!', int)                                   
Out[4]: [10975, 126, 187]

In [5]: tokenizer.encode('hello world!', str)                                   
Out[5]: ['▁hello', '▁world', '!']

In [6]: tokenizer.encode_with_offsets('hello world!', str)                      
Out[6]: (['▁hello', '▁world', '!'], [(0, 5), (5, 11), (11, 12)])

In [7]: tokenizer.encode_with_offsets('hello world!', int)                      
Out[7]: ([10975, 126, 187], [(0, 5), (5, 11), (11, 12)])

However, there are a lot of other commonly used tokenizers. We can consider to integrate:

[ ] BlingFire: https://github.com/microsoft/BlingFire
[ ] Mecab: A commonly used tokenizer for Japanese: https://taku910.github.io/mecab/

dmlc / gluon-nlp

[Numpy Refactor] Tokenizers wishlist #1242