facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Difference between code and vocabulary #325

Closed Hannibal046 closed 3 years ago

Hannibal046 commented 3 years ago

Hello guys!

I would like to understand the difference between the codes and the vocabulary. As far as I know, the BPE algorithm takes a corpus, splits it into base characters, and merges them iteratively; once the predefined vocab size is reached, it stops. So how should I interpret the BPE files here in XLM? After reading the demo Jupyter notebook, it seems that fastBPE, parameterized by the codes provided by XLM, just acts as a tokenizer, and dice.word2idx is just a map from each tokenized word to its index in the predefined vocab. Is that right? What do the codes in the BPE algorithm actually mean? Why is there an @ symbol? How can I restore the original sentence? Thanks very much!
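To make the question concrete, here is a minimal toy sketch of how I currently picture it. It is not fastBPE's actual implementation: the function names (`learn_bpe`, `apply_bpe`, `restore`) and the `</w>` end-of-word marker are my own assumptions for illustration. In this picture the "codes" are the ordered list of merge rules learned from the corpus, the "vocabulary" is the set of tokens you get after applying those rules, and `@@` marks a piece that continues into the next piece, so removing every `@@ ` restores the sentence.

```python
from collections import Counter

EOW = "</w>"  # assumed end-of-word marker, used only inside this sketch


def to_symbols(word):
    """Split a word into characters, tagging the last one as word-final."""
    return tuple(word[:-1]) + (word[-1] + EOW,)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Return the 'codes': merge rules in the order they were learned."""
    words = Counter(to_symbols(w) for w in corpus.split())
    codes = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        codes.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return codes


def apply_bpe(sentence, codes):
    """Tokenize with the learned codes; '@@' marks a non-final piece."""
    out = []
    for word in sentence.split():
        symbols = to_symbols(word)
        for pair in codes:  # replay the merges in learned order
            symbols = merge_pair(symbols, pair)
        out.extend(s + "@@" for s in symbols[:-1])
        out.append(symbols[-1][: -len(EOW)])
    return " ".join(out)


def restore(bpe_sentence):
    """Undo the segmentation by gluing continued pieces back together."""
    return bpe_sentence.replace("@@ ", "")


codes = learn_bpe("low low low lower", num_merges=2)
encoded = apply_bpe("low lower lowest", codes)
vocab = Counter(encoded.split())  # the 'vocabulary': tokens seen after BPE
```

If this picture is right, then the codes decide how a word is split, while the vocabulary only maps each resulting piece to an index, and an unseen word such as "lowest" still gets split into known smaller pieces.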