dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

GPT2BPETokenizer produces a strange symbol #1008

Closed hutao965 closed 4 years ago

hutao965 commented 4 years ago

Description

I ran the code snippet from https://gluon-nlp.mxnet.io/model_zoo/bert/index.html and the GPT2BPETokenizer produced a strange symbol, Ġ.

Error Message

```
In [1]: import gluonnlp as nlp; import mxnet as mx;
   ...: model, vocab = nlp.model.get_model('roberta_12_768_12', dataset_name='openwebtext_ccnews_stories_books_cased', use_decoder=False);
   ...: tokenizer = nlp.data.GPT2BPETokenizer();
   ...: text = [vocab.bos_token] + tokenizer('Hello world!') + [vocab.eos_token];
   ...: seq_encoding = model(mx.nd.array([vocab[text]]))

In [2]: print(text)
['<s>', 'Hello', 'Ġworld', '!', '</s>']
```


leezu commented 4 years ago

Thank you for reporting the bug @hutao965.

@eric-haibin-lin are you aware of this issue? Could it be related to https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/hub_interface.py#L38-L56

leezu commented 4 years ago

Also, GPT2BPETokenizer currently hardcodes the vocabulary. However, to add XLM-R (https://github.com/pytorch/fairseq/tree/master/examples/xlmr), for example, we need to add support for a separate vocabulary.

zeeshansayyed commented 4 years ago

I think that is how it encodes the space character. As far as I can tell, it doesn't lead to any error. For instance, if you apply nlp.data.GPT2BPEDetokenizer to the tokenized text, you successfully recover the original text.
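A quick round-trip check along these lines (a minimal sketch; it assumes GPT2BPEDetokenizer returns the reconstructed string when called on the token list):

```python
import gluonnlp as nlp

tokenizer = nlp.data.GPT2BPETokenizer()
detokenizer = nlp.data.GPT2BPEDetokenizer()

tokens = tokenizer('Hello world!')
print(tokens)               # ['Hello', 'Ġworld', '!']  -- 'Ġ' marks the leading space
print(detokenizer(tokens))  # should round-trip back to 'Hello world!'
```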

szha commented 4 years ago

Yes, that's working as intended. (cc @sxjscience)

sxjscience commented 4 years ago

We are now considering revising the GPT2 tokenizer to make it more similar to https://github.com/sxjscience/gluonnlp-gpt2. As for this issue, it's not a bug, because the tokenizer in GPT-2 operates on bytes.
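For context, here is a minimal sketch of the byte-to-unicode mapping used by OpenAI's GPT-2 encoder (GluonNLP's internal implementation may differ in details), which shows why a leading space surfaces as Ġ:

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character.

    Bytes that are not already printable (e.g. the space, 0x20) are shifted
    into the range starting at U+0100, which is how 0x20 becomes 'Ġ' (U+0120).
    """
    bs = (list(range(ord('!'), ord('~') + 1))
          + list(range(ord('¡'), ord('¬') + 1))
          + list(range(ord('®'), ord('ÿ') + 1)))
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

print(bytes_to_unicode()[ord(' ')])  # prints 'Ġ'
```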

zeeshansayyed commented 4 years ago

@szha @sxjscience Is there an open issue I can follow for supporting a separate vocabulary for GPT2BPETokenizer, as mentioned by @leezu? Or is it on the back burner since there is only one trained dataset for both GPT and RoBERTa?

leezu commented 4 years ago

Before closing this issue, at the very least we have to fix the example to use the GPT2BPEDetokenizer instead of simply looking up tokens in the vocab.

@zeeshansayyed let's open a second issue to track it. It's a real issue as other models use the same tokenizer with a separate vocab.
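For illustration, a minimal sketch of how the documentation example could be updated along those lines (assuming nlp.data.GPT2BPEDetokenizer returns the reconstructed string when called on the BPE tokens):

```python
import gluonnlp as nlp
import mxnet as mx

model, vocab = nlp.model.get_model('roberta_12_768_12',
                                   dataset_name='openwebtext_ccnews_stories_books_cased',
                                   use_decoder=False)
tokenizer = nlp.data.GPT2BPETokenizer()
detokenizer = nlp.data.GPT2BPEDetokenizer()

tokens = tokenizer('Hello world!')
text = [vocab.bos_token] + tokens + [vocab.eos_token]
seq_encoding = model(mx.nd.array([vocab[text]]))

# Decode the BPE tokens back to readable text instead of printing them raw
print(detokenizer(tokens))  # 'Hello world!' -- byte-level markers such as 'Ġ' resolved
```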

sxjscience commented 4 years ago

@leezu Should we close this?

leezu commented 4 years ago

Fixed in master