Hi, I have downloaded the 1.3B-dist checkpoint from this GitHub's NLLB section, and it reports an embedding dimension of 256206 tokens. However, the dictionary.txt file in this GitHub contains only 255997 token entries. Is this a compatibility issue, or is there an additional step I'm missing? I realize that padding the dimension to match makes the training error go away, but I suspect the mismatch could still cause off-by-some-offset issues in the dictionary, since the order of the token embeddings matters. I've also noticed that fairseq seems to add 6 extra tokens for NLLB; is that related to the mismatch? I would appreciate it if someone could answer my questions. Have a great day!
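For reference, here is roughly how I compared the two numbers (a minimal sketch; the file paths are placeholders for my local copies, and I'm assuming the usual fairseq checkpoint layout with a `model` state dict containing `encoder.embed_tokens.weight`):

```python
import torch
from fairseq.data import Dictionary

# Embedding rows reported by the 1.3B-dist checkpoint
ckpt = torch.load("checkpoint.pt", map_location="cpu")
emb = ckpt["model"]["encoder.embed_tokens.weight"]
print("checkpoint vocab dim:", emb.shape[0])   # 256206 for me

# Raw entries in dictionary.txt (one token per line)
with open("dictionary.txt", encoding="utf-8") as f:
    raw_entries = sum(1 for _ in f)
print("raw dictionary entries:", raw_entries)  # 255997 for me

# Size after fairseq loads it; Dictionary() prepends its own
# <s>, <pad>, </s>, <unk> special symbols, but the total still
# doesn't reach 256206 for me
d = Dictionary.load("dictionary.txt")
print("fairseq Dictionary size:", len(d))
```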