facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Sentencepiece error during calling XLMR model.encode() #1616

Closed mukhal closed 4 years ago

mukhal commented 4 years ago

🐛 Bug

When running:

from fairseq.models.roberta import XLMRModel
xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')

ids = xlmr.encode('Hello world!')

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/add/anaconda3/envs/py35/lib/python3.6/site-packages/fairseq/models/roberta/hub_interface.py", line 57, in encode
    bpe_sentence = '<s> ' + self.bpe.encode(sentence) + ' </s>'
  File "/home/add/anaconda3/envs/py35/lib/python3.6/site-packages/fairseq/data/encoders/sentencepiece_bpe.py", line 30, in encode
    return ' '.join(self.sp.EncodeAsPieces(x))
TypeError: sequence item 0: expected str instance, bytes found

I am using fairseq 0.9.0 and sentencepiece 0.1.85.
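The traceback points at `' '.join(self.sp.EncodeAsPieces(x))` receiving `bytes` pieces where `str` is expected. A minimal sketch of that failure mode and the obvious workaround (the byte strings below are hypothetical stand-ins for what an incompatible sentencepiece build might return from `EncodeAsPieces`):

```python
# Hypothetical pieces: an incompatible sentencepiece build could hand back
# bytes instead of str from EncodeAsPieces.
pieces_bytes = [b'\xe2\x96\x81Hello', b'\xe2\x96\x81world', b'!']

try:
    # This is effectively what fairseq's sentencepiece_bpe.py does,
    # and it fails on bytes input.
    ' '.join(pieces_bytes)
except TypeError as e:
    print(e)  # sequence item 0: expected str instance, bytes found

# Decoding each piece to str first avoids the error.
joined = ' '.join(p.decode('utf-8') for p in pieces_bytes)
print(joined)
```

This only illustrates the mechanics of the error; as the thread concludes below, the real fix is installing a sentencepiece version that returns `str` pieces rather than patching the join.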

myleott commented 4 years ago

Hmm, can you check if the published XLM-R models work?

I just tried with the same sentencepiece version and this worked for me:

>>> import torch
>>> xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large', force_reload=True)
>>> xlmr.encode('Hello world')
tensor([    0, 35378,  8999,     2])
mukhal commented 4 years ago

I ran into the same problem when loading models from torch.hub. It turned out I had the wrong version of sentencepiece installed. Closing this issue now.
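Since the resolution was a version mismatch, a quick way to confirm which sentencepiece version is actually installed in the active environment (the thread reports 0.1.85 working with fairseq 0.9.0); this sketch assumes Python 3.8+ for `importlib.metadata`:

```python
# Check the installed sentencepiece version without importing the package.
from importlib.metadata import version, PackageNotFoundError

try:
    print(version('sentencepiece'))  # e.g. 0.1.85
except PackageNotFoundError:
    print('sentencepiece is not installed in this environment')
```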