Closed ShubhC closed 4 years ago
Unfortunately this is because we have a secondary fairseq dictionary which adds some special tokens.
Try this:
from fairseq.data import Dictionary

# (...)
# Tokenize with the sentencepiece model loaded above.
sp_toks = ' '.join(sp.EncodeAsPieces('Hello world!'))
# '▁Hello ▁world !'

# Map the pieces to IDs with fairseq's dictionary; note the appended EOS (id 2).
fs_dict = Dictionary.load('./xlmr.large/dict.txt')
fs_dict.encode_line(sp_toks)
# tensor([35378, 8999, 38, 2], dtype=torch.int32)
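As a toy illustration of where the off-by-one comes from (the vocabulary below is made up for the example, not the real XLM-R vocabulary): fairseq's Dictionary reserves four special symbols at the front, while the sentencepiece model reserves three, so every ordinary piece lands one index higher on the fairseq side.

```python
# Hypothetical sentencepiece vocab: <unk>=0, <s>=1, </s>=2, then pieces.
spm_vocab = ['<unk>', '<s>', '</s>', '▁Hello', '▁world', '!']
spm_id = {piece: i for i, piece in enumerate(spm_vocab)}

# fairseq's Dictionary puts four specials up front (<s>=0, <pad>=1,
# </s>=2, <unk>=3) and then appends the sentencepiece pieces, so each
# ordinary piece sits one index higher than its sentencepiece ID.
fs_vocab = ['<s>', '<pad>', '</s>', '<unk>'] + spm_vocab[3:]
fs_id = {piece: i for i, piece in enumerate(fs_vocab)}

for piece in ['▁Hello', '▁world', '!']:
    print(piece, spm_id[piece], fs_id[piece])  # fairseq id = spm id + 1
```

This is why you should always go through the fairseq dictionary (or the hub interface, which does this for you) rather than using raw sentencepiece IDs.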
There seems to be an inconsistency between the encodings returned by the hub_interface and the underlying BPE model: the sentencepiece encodings appear to be off by one. Here are the code snippets:
Am I missing something here?
In particular, I'm looking to fine-tune XLM-R for a binary sentence classification task. Which encoder should I use?
I'm using torch 1.4.0 and sentencepiece 0.1.86.
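On the fine-tuning question, here is a minimal sketch of a binary classification head. It assumes you take the features returned by the hub interface's extract_features and pool the first (<s>) token, RoBERTa-style; the feature tensor below is random stand-in data, not real model output.

```python
import torch
import torch.nn as nn

# Stand-in for xlmr.extract_features(tokens): shape (batch, seq_len, hidden).
hidden = 1024  # hidden size of xlmr.large
features = torch.randn(2, 7, hidden)

# Pool the first (<s>) token, as RoBERTa-style models do for classification.
pooled = features[:, 0, :]

# Binary classification head: dropout + linear layer over the pooled vector.
head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden, 2))
logits = head(pooled)
print(logits.shape)  # torch.Size([2, 2])
```

Either way, encode inputs with the hub interface (or the fairseq dictionary as shown above) so the token IDs match what the pretrained model expects.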