Thank you for reporting the bug @hymzoque.
@eric-haibin-lin are you aware of this issue? Could it be related to https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/hub_interface.py#L38-L56
Also, GPT2BPETokenizer currently hardcodes the vocabulary. To support XLM-R (https://github.com/pytorch/fairseq/tree/master/examples/xlmr), for example, we need to add support for a separate vocabulary.
I think that is how it encodes the space character. As far as I can tell, it doesn't lead to any error. For instance, if you apply nlp.data.GPT2BPEDetokenizer to the tokenized text, you successfully recover the original text.
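A minimal round-trip sketch (assuming GPT2BPEDetokenizer returns the reconstructed string when called on the token list):

```python
import gluonnlp as nlp

tokenizer = nlp.data.GPT2BPETokenizer()
detokenizer = nlp.data.GPT2BPEDetokenizer()

tokens = tokenizer('Hello world!')
print(tokens)               # ['Hello', 'Ġworld', '!']; Ġ marks the leading space
print(detokenizer(tokens))  # 'Hello world!': the original text is recovered
```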
Yes, that's working as intended. (cc @sxjscience)
We are now considering revising the GPT2Tokenizer to make it more similar to https://github.com/sxjscience/gluonnlp-gpt2. For this issue, it's not a bug: the GPT-2 tokenizer operates on bytes, and Ġ is simply the printable stand-in for the space byte.
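For context, this is roughly the byte-to-unicode table that byte-level BPE uses (a sketch adapted from the GPT-2 reference encoder, not GluonNLP's exact code): bytes that are not printable on their own, including the space byte 0x20, get shifted up by 256, which is where Ġ (U+0120) comes from.

```python
def bytes_to_unicode():
    # Printable bytes keep their own code point; every other byte
    # (control characters, space, 0xad, ...) is remapped to 256 + n,
    # so each byte gets a visible, reversible character.
    bs = (list(range(ord('!'), ord('~') + 1))
          + list(range(ord('\xa1'), ord('\xac') + 1))
          + list(range(ord('\xae'), ord('\xff') + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(' ')])  # 'Ġ' (U+0120), the space byte shifted by 256
```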
@szha @sxjscience is there an open issue I can follow for supporting a separate vocabulary for GPT2BPETokenizer, as mentioned by @leezu? Or is it on the back burner since there is only one trained dataset for both GPT and RoBERTa?
Before closing this issue, at the very least we have to fix the example to use the GPT2BPEDetokenizer instead of simply looking up tokens in the vocab.
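A sketch of what the fixed example might look like (a hypothetical rewrite of the model-zoo snippet, reusing the detokenizer shown above):

```python
import gluonnlp as nlp
import mxnet as mx

model, vocab = nlp.model.get_model(
    'roberta_12_768_12',
    dataset_name='openwebtext_ccnews_stories_books_cased',
    use_decoder=False)
tokenizer = nlp.data.GPT2BPETokenizer()

tokens = tokenizer('Hello world!')
text = [vocab.bos_token] + tokens + [vocab.eos_token]
seq_encoding = model(mx.nd.array([vocab[text]]))

# Detokenize for display instead of printing the raw byte-level tokens.
print(nlp.data.GPT2BPEDetokenizer()(tokens))  # 'Hello world!'
```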
@zeeshansayyed let's open a second issue to track it. It's a real issue as other models use the same tokenizer with a separate vocab.
@leezu Should we close this?
Fixed in master
Description
I ran the code snippet from https://gluon-nlp.mxnet.io/model_zoo/bert/index.html and GPT2BPETokenizer produced a strange symbol, Ġ.
Error Message
```
In [1]: import gluonnlp as nlp; import mxnet as mx;
   ...: model, vocab = nlp.model.get_model('roberta_12_768_12', dataset_name='openwebtext_ccnews_stories_books_cased', use_decoder=False);
   ...: tokenizer = nlp.data.GPT2BPETokenizer();
   ...: text = [vocab.bos_token] + tokenizer('Hello world!') + [vocab.eos_token];
   ...: seq_encoding = model(mx.nd.array([vocab[text]]))

In [2]: print(text)
['<s>', 'Hello', 'Ġworld', '!', '</s>']
```