sxjscience opened this issue 4 years ago
We revised the implementation of tokenizers in the new version of GluonNLP. Basically, we have integrated the following tokenizers:
For all tokenizers, we support the following methods:
- `encode('hello world!', int)`
- `encode('hello world!', str)`
- `encode_with_offsets('hello world!', int)`
- `encode_with_offsets('hello world!', str)`
To give an example, we load the tokenizer used in ALBERT, which is a `SentencepieceTokenizer`, and illustrate these functionalities:
```python
In [1]: from gluonnlp.models.albert import get_pretrained_albert

In [2]: cfg, tokenizer, _, _ = get_pretrained_albert()

In [3]: tokenizer
Out[3]:
SentencepieceTokenizer(
   model_path = /Users/xjshi/.mxnet/models/nlp/google_albert_base_v2/spm-65999e5d.model
   do_lower = True, nbest = 0, alpha = 0.0
   vocab = Vocab(size=30000, unk_token="<unk>", pad_token="<pad>", cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]")
)

In [4]: tokenizer.encode('hello world!', int)
Out[4]: [10975, 126, 187]

In [5]: tokenizer.encode('hello world!', str)
Out[5]: ['▁hello', '▁world', '!']

In [6]: tokenizer.encode_with_offsets('hello world!', str)
Out[6]: (['▁hello', '▁world', '!'], [(0, 5), (5, 11), (11, 12)])

In [7]: tokenizer.encode_with_offsets('hello world!', int)
Out[7]: ([10975, 126, 187], [(0, 5), (5, 11), (11, 12)])
```
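The character offsets returned by `encode_with_offsets` are `(start, end)` spans into the original string, so each token can be mapped back to the surface text it came from. A small sketch, reusing the `tokenizer` loaded above:

```python
text = 'hello world!'
tokens, offsets = tokenizer.encode_with_offsets(text, str)
for token, (start, end) in zip(tokens, offsets):
    # Each (start, end) pair indexes into the original string.
    print(token, repr(text[start:end]))
# ▁hello 'hello'
# ▁world ' world'
# ! '!'
```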
However, there are many other commonly used tokenizers that we can consider integrating:
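Whichever tokenizers we add, they should expose the same `encode` / `encode_with_offsets` interface described above. Below is a hypothetical sketch (not part of GluonNLP) of how a third-party tokenizer could be wrapped to match that interface; it assumes the huggingface `tokenizers` package, and the wrapper class name is illustrative only.

```python
from tokenizers import Tokenizer


class HuggingFaceTokenizerWrapper:
    """Hypothetical adapter exposing the GluonNLP-style tokenizer interface."""

    def __init__(self, json_path: str):
        # Load a tokenizer serialized by the huggingface `tokenizers` library.
        self._tokenizer = Tokenizer.from_file(json_path)

    def encode(self, sentence: str, output_type=str):
        encoding = self._tokenizer.encode(sentence)
        return encoding.tokens if output_type is str else encoding.ids

    def encode_with_offsets(self, sentence: str, output_type=str):
        encoding = self._tokenizer.encode(sentence)
        tokens = encoding.tokens if output_type is str else encoding.ids
        # encoding.offsets are (start, end) character spans into the original
        # sentence, matching the convention shown in the ALBERT example above.
        return tokens, encoding.offsets
```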