sshleifer opened this issue 4 years ago
**Maintainer's reply:** This is indeed strange, and I don't really know what is happening in those cases; I need to investigate. The only explanation I can think of is that the SentencePiece model was trained on different data than the data I use for creating the vocabulary. That can indeed happen, since I keep the SentencePiece model constant even when I augment the data, but the basic data set should always be included. I have no immediate answer on that ...
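For anyone hitting this, a quick way to test that hypothesis is to check whether every piece the SentencePiece model can produce also appears in the exported vocabulary. A minimal sketch, assuming local copies of the model files; `source.spm` is a placeholder name, substitute the files shipped with the model:

```python
import sentencepiece as spm
import yaml

# Placeholder file names; substitute the files shipped with the model.
sp = spm.SentencePieceProcessor()
sp.Load("source.spm")

with open("opus.spm32k-spm32k.vocab.yml", encoding="utf-8") as f:
    vocab = yaml.safe_load(f)  # maps piece -> integer id

# Every piece the SentencePiece model can emit should appear in the vocab;
# any piece that doesn't will trigger a KeyError at encoding time.
missing = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())
           if sp.IdToPiece(i) not in vocab]
print(f"{len(missing)} pieces missing from the vocab; first few: {missing[:10]}")
```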
On the ber-es (Berber→Spanish) transformer, if I run:
you get:
But `▁Be` is not in `opus.spm32k-spm32k.vocab.yml`, so my Python tokenizer raises a `KeyError` when it encounters these tokens. This doesn't change if I run `preprocess.sh` first. When I run the pieced sequence through `marian_decoder`, I get a good translation with no error. This happens for other model/character combos; here is a list of (pair, missing key) entries from a random sample of models I tested:
Is this expected? Should my encoder use the `<unk>` id in these cases?
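If falling back is acceptable, one possible workaround (not confirmed here as the intended behaviour) is to map any piece that is missing from the vocab to the `<unk>` id instead of raising. A minimal sketch under the same placeholder file names as above, and assuming the vocab defines an `<unk>` entry:

```python
import sentencepiece as spm
import yaml

sp = spm.SentencePieceProcessor()
sp.Load("source.spm")  # placeholder file name

with open("opus.spm32k-spm32k.vocab.yml", encoding="utf-8") as f:
    vocab = yaml.safe_load(f)

# Assumption: the vocab defines "<unk>"; Marian vocabularies conventionally
# reserve a low id for it.
unk_id = vocab["<unk>"]

def encode(text: str) -> list[int]:
    """Piece the text with SentencePiece, mapping pieces that are missing
    from the vocab to unk_id instead of raising a KeyError."""
    return [vocab.get(piece, unk_id) for piece in sp.EncodeAsPieces(text)]

print(encode("Be this as it may"))  # a missing piece like `▁Be` maps to unk_id
```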