facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

xlmr encode and sentence piece encoder off by one #2094

Closed ShubhC closed 4 years ago

ShubhC commented 4 years ago

There seems to be an inconsistency between the encodings returned by the hub interface and the sentencepiece model: the sentencepiece IDs appear to be off by one. Here are the code snippets:

import torch
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.encode('Hello World!')
tensor([0, 35378, 6661, 38, 2])

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("./xlmr.large/sentencepiece.bpe.model")
sp.EncodeAsIds('Hello World!')
[35377, 6660, 37]

Am I missing something here?

Particularly, I'm looking to fine-tune XLM-R for a binary sentence classification task. Which encoder should I use?

I'm using torch 1.4.0 and sentencepiece 0.1.86.

myleott commented 4 years ago

Unfortunately this is because we have a secondary fairseq dictionary that adds some special tokens.

Try this:

# (...)
sp_toks = ' '.join(sp.EncodeAsPieces('Hello world!'))
# '▁Hello ▁world !'

from fairseq.data import Dictionary
fs_dict = Dictionary.load('./xlmr.large/dict.txt')
fs_dict.encode_line(sp_toks)
# tensor([35378,  8999,    38,     2], dtype=torch.int32)
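The off-by-one in the snippets above is consistent with the fairseq Dictionary reserving one more special token at the front of the vocabulary than sentencepiece does, with `xlmr.encode` also wrapping the sentence in bos/eos. A minimal sketch using only the IDs reported in this thread (the +1 shift is an assumption inferred from those numbers; verify against your own `dict.txt`):

```python
# IDs reported in this thread:
sp_ids = [35377, 6660, 37]         # sp.EncodeAsIds('Hello World!')
hub_ids = [0, 35378, 6661, 38, 2]  # xlmr.encode('Hello World!')

# Assumed mapping: the fairseq Dictionary shifts every sentencepiece
# id by +1 (one extra special token at the start of the vocab), and
# xlmr.encode additionally wraps the sentence in bos (0) and eos (2).
reconstructed = [0] + [i + 1 for i in sp_ids] + [2]
print(reconstructed == hub_ids)  # True
```

If that holds for your model, either encoder works for fine-tuning as long as the IDs you feed the model come from the fairseq side of the mapping.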