facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Tokens still have spaces #1617

Closed mortonjt closed 4 years ago

mortonjt commented 4 years ago

🐛 Bug

Long story short, I've trained RoBERTa with a custom dictionary and now I'm trying to extract features (code snippet below for reference).

from fairseq.models.roberta import RobertaModel

# path1: model directory, 'checkpoint_best.pt': checkpoint file, path2: data directory with the custom dict
roberta = RobertaModel.from_pretrained(
    path1, 'checkpoint_best.pt',
    path2,
    gpt2_encoder_json=custom_json,
    gpt2_vocab_bpe=custom_vocab)

tokens = roberta.encode('A B C D E G')

When I try to run this, I get the following error:

Traceback (most recent call last):
  File "attention_layers.py", line 80, in <module>
    tokens = roberta.encode(' '.join(list(s)))
  File "/home/jmorton/software/fairseq/fairseq/models/roberta/hub_interface.py", line 57, in encode
    bpe_sentence = '<s> ' + self.bpe.encode(sentence) + ' </s>'
  File "/home/jmorton/software/fairseq/fairseq/data/encoders/gpt2_bpe.py", line 40, in encode
    return ' '.join(map(str, self.bpe.encode(x)))
  File "/home/jmorton/software/fairseq/fairseq/data/encoders/gpt2_bpe_utils.py", line 110, in encode
    bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
  File "/home/jmorton/software/fairseq/fairseq/data/encoders/gpt2_bpe_utils.py", line 110, in <genexpr>
    bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
KeyError: 'Ġ'

Additional context

It turns out that there is still spacing in the tokens when parsing this particular example. The fix is presented here: https://github.com/mortonjt/fairseq/pull/1/files
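
For context on where the 'Ġ' comes from: GPT-2 BPE remaps raw bytes to printable symbols before the encoder.json lookup, and the space byte becomes 'Ġ'. A minimal sketch, using the gpt2_bpe_utils module from the traceback above:

from fairseq.data.encoders.gpt2_bpe_utils import bytes_to_unicode

# GPT-2 BPE first remaps every byte to a printable unicode character;
# the space byte (0x20) is remapped to 'Ġ', so space-prefixed tokens
# must exist in encoder.json or the self.encoder[...] lookup fails.
byte_to_symbol = bytes_to_unicode()
print(byte_to_symbol[ord(' ')])  # prints 'Ġ'

A custom encoder.json built without those space-prefixed entries hits exactly the KeyError shown in the traceback.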

I'm raising this issue mainly to bring awareness to the challenges of plugging in custom dictionaries. I can push a PR if there is interest.

myleott commented 4 years ago

Note that the GPT-2 BPE is intended to be fully reversible, so spaces are part of the vocab. In particular, the GPT-2 BPE uses leading spaces: https://github.com/pytorch/fairseq/blob/b31849aa9282755bbb9eecd9384b2e0fc2b9c0a1/fairseq/models/roberta/hub_interface.py#L47-L55
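
To illustrate the round trip, a minimal sketch using the public roberta.base model via torch.hub (not the custom dictionary from this issue):

import torch

# Pretrained RoBERTa with the standard GPT-2 BPE
roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
roberta.eval()

tokens = roberta.encode('Hello world')  # spaces are folded into 'Ġ'-prefixed BPE tokens
print(roberta.decode(tokens))           # 'Hello world', spacing is fully recovered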

myleott commented 4 years ago

Following up on this, by default RobertaModel.from_pretrained will use the GPT-2 BPE.

However, you can override this with any of the supported BPE modules: https://github.com/pytorch/fairseq/tree/master/fairseq/data/encoders

For example, XLM-R uses bpe='sentencepiece': https://github.com/pytorch/fairseq/blob/08dcd08d9c442ec2c35bb041356d2b768ffcb922/fairseq/models/roberta/model_xlmr.py#L26
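
A custom model trained with a SentencePiece vocabulary could be loaded along these lines (a minimal sketch with hypothetical paths, assuming the sentencepiece.bpe.model file sits next to the checkpoint so from_pretrained can locate it):

from fairseq.models.roberta import RobertaModel

# bpe='sentencepiece' overrides the default bpe='gpt2';
# assumes sentencepiece.bpe.model is present in /path/to/model_dir
roberta = RobertaModel.from_pretrained(
    '/path/to/model_dir', 'checkpoint_best.pt',
    bpe='sentencepiece')

tokens = roberta.encode('Hello world')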