dmlc / gluon-nlp

NLP made easy
Apache License 2.0

[Numpy Refactor] Tokenizers wishlist #1242

Open sxjscience opened 4 years ago


We revised the implementation of tokenizers in the new version of GluonNLP.

Basically we have integrated the following tokenizers:

For all tokenizers, we support the following methods:

As an example, we load the ALBERT tokenizer, which is a SentencePieceTokenizer, and illustrate these methods:

In [1]: from gluonnlp.models.albert import get_pretrained_albert                

In [2]: cfg, tokenizer, _,_ = get_pretrained_albert()                           

In [3]: tokenizer                                                               
   model_path = /Users/xjshi/.mxnet/models/nlp/google_albert_base_v2/spm-65999e5d.model
   do_lower = True, nbest = 0, alpha = 0.0
   vocab = Vocab(size=30000, unk_token="<unk>", pad_token="<pad>", cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]")

In [4]: tokenizer.encode('hello world!', int)                                   
Out[4]: [10975, 126, 187]

In [5]: tokenizer.encode('hello world!', str)                                   
Out[5]: ['▁hello', '▁world', '!']

In [6]: tokenizer.encode_with_offsets('hello world!', str)                      
Out[6]: (['▁hello', '▁world', '!'], [(0, 5), (5, 11), (11, 12)])

In [7]: tokenizer.encode_with_offsets('hello world!', int)                      
Out[7]: ([10975, 126, 187], [(0, 5), (5, 11), (11, 12)])
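The offsets returned by `encode_with_offsets` read as half-open `(start, end)` character spans into the input string. The sketch below checks that reading in plain Python, using the tokens and offsets copied from `Out[6]` above rather than calling GluonNLP; the helper name `spans_from_offsets` is hypothetical and not part of the library.

```python
# Values copied from the session above (Out[6]); no GluonNLP install needed.
sentence = 'hello world!'
tokens = ['▁hello', '▁world', '!']
offsets = [(0, 5), (5, 11), (11, 12)]


def spans_from_offsets(text, offsets):
    """Slice `text` with each (start, end) pair, treating `end` as exclusive.

    Hypothetical helper for illustration only.
    """
    return [text[start:end] for start, end in offsets]


spans = spans_from_offsets(sentence, offsets)

# Pair each token with the raw substring its offset points to.
for token, span in zip(tokens, spans):
    print(repr(token), '->', repr(span))
```

In this example the spans are contiguous, so joining them reproduces the original sentence, and the `▁` marker in `'▁world'` lines up with the space that precedes `world` in the input. Whether that holds for every tokenizer and every input (for instance with lowercasing or Unicode normalization) is not something this sketch establishes.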

However, many other commonly used tokenizers are not yet covered. We could consider integrating: