Note that the GPT-2 BPE is intended to be fully reversible, so spaces are part of the vocab. In particular, the GPT-2 BPE encodes leading spaces: https://github.com/pytorch/fairseq/blob/b31849aa9282755bbb9eecd9384b2e0fc2b9c0a1/fairseq/models/roberta/hub_interface.py#L47-L55
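For illustration, here is a minimal sketch of that round-trip behaviour using the publicly released roberta.base via torch.hub (not code from this thread):

```python
# Sketch: GPT-2 BPE round-trips text, and leading spaces live inside the subword tokens.
import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
roberta.eval()

sentence = 'Hello world!'
tokens = roberta.encode(sentence)      # tensor of subword ids, with <s>/</s> added
print(roberta.decode(tokens))          # 'Hello world!' -- the BPE is fully reversible

# ' world' (with a leading space) and 'world' map to different BPE tokens:
print(roberta.bpe.encode(' world'))
print(roberta.bpe.encode('world'))
```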
Following up on this: by default, RobertaModel.from_pretrained will use the GPT-2 BPE. However, you can override this with any of the supported BPE modules: https://github.com/pytorch/fairseq/tree/master/fairseq/data/encoders
For example, XLM-R uses bpe='sentencepiece': https://github.com/pytorch/fairseq/blob/08dcd08d9c442ec2c35bb041356d2b768ffcb922/fairseq/models/roberta/model_xlmr.py#L26
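For a custom model, the override is just an extra keyword passed to from_pretrained. A hedged sketch (paths are illustrative, and the sentencepiece argument name may differ between fairseq versions):

```python
# Sketch only: loading a custom RoBERTa with a non-GPT-2 BPE.
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    '/path/to/checkpoints',                 # directory containing the checkpoint
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='/path/to/data-bin',  # binarized data with the custom dict.txt
    bpe='sentencepiece',                    # override the default gpt2 BPE
    sentencepiece_model='/path/to/spm.model',  # illustrative; keyword may vary by version
)
```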
🐛 Bug
Long story short: I've trained RoBERTa with a custom dictionary and am now trying to extract features (code snippet below for reference).
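The original snippet is not reproduced here; a representative sketch of the kind of call involved (names and paths are hypothetical) would look like:

```python
# Hypothetical sketch: load a RoBERTa trained with a custom dictionary and extract features.
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    '/path/to/custom_checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='/path/to/custom-data-bin',  # contains the custom dict.txt
)
roberta.eval()

tokens = roberta.encode('A B C D')            # sentence over the custom vocabulary
features = roberta.extract_features(tokens)   # shape: (1, seq_len, hidden_dim)
```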
When I try to run this, I get the error below.
Environment
Additional context
It turns out that there is still spacing in the tokens when parsing this particular example. The fix is presented here: https://github.com/mortonjt/fairseq/pull/1/files
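For anyone hitting the same thing, one way to sidestep the spacing issue (not necessarily what the linked PR does) is to bypass the GPT-2 BPE and encode with the task's source dictionary directly, which expects space-separated symbols from the custom dict.txt:

```python
# Sketch under that assumption: encode a whitespace-tokenized sentence with the
# custom dictionary instead of running it through the GPT-2 BPE.
import torch

def encode_with_custom_dict(roberta, sentence):
    d = roberta.task.source_dictionary
    # encode_line splits on whitespace and looks each symbol up in the dictionary
    ids = d.encode_line(sentence, append_eos=True, add_if_not_exist=False).long()
    # prepend <s> to match RoBERTa's expected input format
    return torch.cat([torch.tensor([d.bos()], dtype=torch.long), ids])

# usage: features = roberta.extract_features(encode_with_custom_dict(roberta, 'A B C'))
```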
I'm raising this issue mainly to bring awareness to the challenges of plugging in custom dictionaries. I can push a PR if there is interest.