facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Vocab size does not match model input size #333

Open moment-of-peace opened 3 years ago

moment-of-peace commented 3 years ago

Why don't the vocab and model checkpoint provided in "II. Cross-lingual language model pretraining (XLM)" of the README match? For example, the vocab for "tokenize + lowercase + no accent + BPE" should have 95k entries (the embedding size of the model), but the downloaded vocab file actually has more than 120k lines.
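
A quick way to reproduce the mismatch is to count the vocab file's lines and compare them against the embedding matrix in the checkpoint. A minimal sketch follows; the file names are placeholders modeled on the XNLI-15 downloads in the README (adjust to whichever model you pulled), and the checkpoint key layout (`"model"` nesting, `"embedding"` in parameter names) is an assumption that may differ between releases:

```python
# Minimal sketch: compare the vocab file's line count with the checkpoint's
# embedding shapes. File names are placeholders; the checkpoint structure
# (weights nested under "model", embedding params named "*embedding*") is
# an assumption and may vary between XLM releases.
import torch

VOCAB_PATH = "vocab_xnli_15"            # placeholder: downloaded vocab file
CKPT_PATH = "mlm_tlm_xnli15_1024.pth"   # placeholder: downloaded checkpoint

# One token per line in the vocab file.
with open(VOCAB_PATH, encoding="utf-8") as f:
    n_vocab = sum(1 for _ in f)
print(f"vocab file lines: {n_vocab}")

# Load on CPU and print the shape of any 2-D weight whose name mentions
# "embedding"; its first dimension is the model's input vocab size.
ckpt = torch.load(CKPT_PATH, map_location="cpu")
state_dict = ckpt.get("model", ckpt)    # weights may be nested under "model"
for name, tensor in state_dict.items():
    if torch.is_tensor(tensor) and tensor.dim() == 2 and "embedding" in name:
        print(f"{name}: {tuple(tensor.shape)}")
```

If the printed embedding row count is 95k while the file has 120k+ lines, that confirms the discrepancy described above.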

PootieT commented 2 years ago

Similar issue here with the XLM-R 100-language model vocab file: it should have 200k entries, but the downloaded file has 239,776.
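
For XLM-R, one way to cross-check the expected vocab size is through the Hugging Face port rather than the raw download. This is a sketch under that assumption (the `transformers` package and the `xlm-roberta-base` hub name are not part of this repo's release):

```python
# Hedged sketch: compare the tokenizer's reported vocab size with the
# model's input embedding rows for XLM-R, via the Hugging Face port
# (an assumption; the report above concerns the raw files from this repo).
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaModel.from_pretrained("xlm-roberta-base")

print("tokenizer vocab size:", tokenizer.vocab_size)
print("embedding rows:", model.get_input_embeddings().weight.shape[0])
```

If those two numbers agree with each other but not with the downloaded vocab file's line count, the extra lines in the file are the thing to explain.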