facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

xlmr encode and sentence piece encoder off by one #2094

Closed ShubhC closed 4 years ago

ShubhC commented 4 years ago

There seems to be an inconsistency between the encodings returned by the hub interface and the sentencepiece model: the sentencepiece IDs appear to be off by one. Here are the code snippets:

import torch
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.encode('Hello World!')
tensor([0, 35378, 6661, 38, 2])

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("./xlmr.large/sentencepiece.bpe.model")
sp.EncodeAsIds('Hello World!')
[35377, 6660, 37]

Am I missing something here?

Particularly, I'm looking to fine-tune XLM-R for a binary sentence classification task. Which encoder should I use?

I'm using torch 1.4.0 and sentencepiece 0.1.86.

myleott commented 4 years ago

Unfortunately this is because we have a secondary fairseq dictionary that adds some special tokens.

Try this:

# (...)
sp_toks = ' '.join(sp.EncodeAsPieces('Hello world!'))
# '▁Hello ▁world !'

from fairseq.data import Dictionary
fs_dict = Dictionary.load('./xlmr.large/dict.txt')
fs_dict.encode_line(sp_toks)
# tensor([35378,  8999,    38,     2], dtype=torch.int32)
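The off-by-one in the snippets above is consistent with the fairseq Dictionary reserving one more special token at the front of the vocabulary than sentencepiece does, with `xlmr.encode` also wrapping the sentence in bos/eos. A minimal sketch using only the IDs reported in this thread (the +1 shift is an assumption inferred from those numbers; verify against your own `dict.txt`):

```python
# IDs reported in this thread:
sp_ids = [35377, 6660, 37]         # sp.EncodeAsIds('Hello World!')
hub_ids = [0, 35378, 6661, 38, 2]  # xlmr.encode('Hello World!')

# Assumed mapping: the fairseq Dictionary shifts every sentencepiece
# id by +1 (one extra special token at the start of the vocab), and
# xlmr.encode additionally wraps the sentence in bos (0) and eos (2).
reconstructed = [0] + [i + 1 for i in sp_ids] + [2]
print(reconstructed == hub_ids)  # True
```

If that holds for your model, either encoder works for fine-tuning as long as the IDs you feed the model come from the fairseq side of the mapping.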