facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

XLM-R doesn't support extract_features_aligned_to_words ? #1447

Closed Luvata closed 4 years ago

Luvata commented 4 years ago

I get an AssertionError when running extract_features_aligned_to_words from XLMRModel. Is this a bug, or is there a difference between RoBERTa and XLMRModel?

Here is my code:

from fairseq.models.roberta import XLMRModel
xlmr = XLMRModel.from_pretrained('xlmr.large.v0.tar.gz', checkpoint_file='model.pt')
xlmr.eval()

doc = xlmr.extract_features_aligned_to_words("hello RoBERTa")

And the error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-72-37f3a81eb496> in <module>
----> 1 doc = xlmr.extract_features_aligned_to_words("hello RoBERTa")

~/workspace/fairseq/fairseq/models/roberta/hub_interface.py in extract_features_aligned_to_words(self, sentence, return_all_hiddens)
    120         spacy_toks = tokenizer(sentence)
    121         spacy_toks_ws = [t.text_with_ws for t in tokenizer(sentence)]
--> 122         alignment = alignment_utils.align_bpe_to_words(self, bpe_toks, spacy_toks_ws)
    123 
    124         # extract features and align them

~/workspace/fairseq/fairseq/models/roberta/alignment_utils.py in align_bpe_to_words(roberta, bpe_tokens, other_tokens)
     33 
     34     # strip leading <s>
---> 35     assert bpe_tokens[0] == '<s>'
     36     bpe_tokens = bpe_tokens[1:]
     37     assert ''.join(bpe_tokens) == ''.join(other_tokens)

AssertionError: 
Luvata commented 4 years ago

I tested RobertaModel and it also failed in extract_features_aligned_to_words. My current version is fairseq 0.8.0.

Both models were loaded with from_pretrained using the '.gz' files downloaded from the README.

myleott commented 4 years ago

It's not supported. XLM-R uses sentencepiece BPE whereas RoBERTa uses the GPT-2 BPE. Unfortunately, extract_features_aligned_to_words doesn't support sentencepiece BPE yet.

cc @ngoyal2707
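
A minimal interim sketch for getting word-level features out of XLM-R without extract_features_aligned_to_words, assuming the standard hub interface (encode / extract_features) and that the sentencepiece vocabulary marks word starts with '▁' (U+2581); the checkpoint path below is a placeholder:

import torch
from fairseq.models.roberta import XLMRModel

xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')
xlmr.eval()

sentence = 'hello RoBERTa'
tokens = xlmr.encode(sentence)                    # subword ids, including <s> and </s>
with torch.no_grad():
    features = xlmr.extract_features(tokens)[0]   # (seq_len, hidden_dim)

# Map each subword id back to its piece string; sentencepiece prefixes the
# first piece of every word with '\u2581', so group subwords on that marker.
pieces = [xlmr.task.source_dictionary[i] for i in tokens.tolist()]
word_feats, current = [], []
for piece, feat in zip(pieces[1:-1], features[1:-1]):   # skip <s> and </s>
    if piece.startswith('\u2581') and current:
        word_feats.append(torch.stack(current).mean(dim=0))
        current = []
    current.append(feat)
if current:
    word_feats.append(torch.stack(current).mean(dim=0))

print(len(word_feats), word_feats[0].shape)       # one pooled vector per word

Averaging subword vectors is just one pooling choice here; the RoBERTa helper aligns to spacy tokens instead, so the outputs won't match it exactly.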

ngoyal2707 commented 4 years ago

@myleott Actually, commit https://github.com/fairinternal/fairseq-py/commit/e8c0196e4927f77e980e4a15375bc6872066fb42#diff-c3ae106584251b0d35cc504bc481482e seems to have added stripping of the bos token in the string() call of dictionary.py, so it's kinda broken for both RoBERTa and XLM-R.
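
A quick, hedged way to see the precondition that breaks (exact behaviour depends on the installed fairseq commit): if the dictionary's string() no longer emits the <s> symbol, the assert bpe_tokens[0] == '<s>' in alignment_utils.align_bpe_to_words can't hold.

toks = xlmr.encode('hello RoBERTa')
# If '<s>' is missing from the start of this output, the bos token is being
# stripped, which is the regression described above.
print(xlmr.task.source_dictionary.string(toks))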

Will send out a fix

ngoyal2707 commented 4 years ago

Fix is merged to master
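
Once that fix is picked up, the GPT-2 BPE path (plain RoBERTa) should work again; a minimal usage sketch along the lines of the RoBERTa README example, assuming a placeholder checkpoint path and that spacy with an English model is installed for the word tokenizer (XLM-R's sentencepiece BPE remains unsupported here, per the comment above):

from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained('/path/to/roberta.large', checkpoint_file='model.pt')
roberta.eval()

doc = roberta.extract_features_aligned_to_words('hello RoBERTa')
for tok in doc:
    # each aligned token carries its feature vector
    print(str(tok), tok.vector.shape)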