NLLB vocabulary missing common Chinese character/tokens

pluiez commented 2 years ago

Hi, I use the released NLLB checkpoint to decode flroes Chinese testset, overall the results looks good. However, I found that a lot of very common Chinese characters/tokens are missing from the dictionary, leading to those words never generated from other languages to Chinese and OOV tokens when translating from Chinese to other languages.

For example, "The eagle catches the chickens" translates to "老鹰捉小鸡" in Chinese, but NLLB model generates "▁ \<unk> 抓住了 \<unk>" since tokens for the two species are absent from the dictionary. This can be a practical problem, because the missing tokens are absolutely common in real-world situation.

The following is part of the tokens highly-frequent but missing from the dictionary:

饱
畅
湍
滩
岭
舱
诩
阔
荫
鸽
勋
鸡
鹰
裙
艳
哦
毋庸
稻
蔗
熔
亥
裤
氢
《
》
...

gmryu commented 2 years ago

I thought nllb uses a byte-level sentencepiece. Am I wrong? Is the dict you talked about is this https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/dictionary.txt ?

Since it is a byte-level dictionary, there is no actual word/character inside. They are meant to be decoded back to normal strings later. So I think those are generated just by chance.

pluiez commented 2 years ago

I thought nllb uses a byte-level sentencepiece. Am I wrong? Is the dict you talked about is this https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/dictionary.txt ?

Since it is a byte-level dictionary, there is no actual word/character inside. They are meant to be decoded back to normal strings later. So I think those are generated just by chance.

Yes, I did use this dict. The sentencepiece model and translation model dictionary are downloaded from https://github.com/facebookresearch/fairseq/tree/nllb#preparing-datasets-for-training

Here is the translation output from Chinese to English on flores devtest, there are a total of 447 \<unk>s in source language across 1012 sentences.

log.flores-test.checkpoint.NLLB-200-Distilled-600M.zh2en.txt

gmryu commented 2 years ago

Confirmed. The downloaded dictionary.txt does not have all byte chars. So there are actually a lot of words/characters considered as \<unk>.

I inspected the original dictionary with more logger.info inside fairseq/data/dictionary.py: (well, I wrote a expansion code of this dictionary to support unk2byte char.

original: 老鹰捉小鸡
after spm: ▁老 鹰 捉 小 鸡
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | ▁老 鹰 捉 小 
鸡
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | 鹰 not found 
in self.indices
# Since it is not found, I transfer 鹰 to its equivalent byte char string.
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | 鹰 := é¹°    
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | 鸡 not found 
in self.indices
# 鸡 to its equivalent byte char string.
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | 鸡 := é¸¡    
2022-07-10 13:45:57 | INFO | fairseq_cli.preprocess | tensor([230393, 248132,      3, 250174, 252996, 250014, 248132,      3, 249934, 
             2], dtype=torch.int32)

decoded: ▁老 é <unk> ° 捉 小 é <unk> ¡
bchar converted: ▁老<unk>捉小<unk>

There are two "3" appear in the tensor, which means byte char "¹" and "¸" does not exist inside the downloaded dictionary.txt as well. In sum there are 36 byte chars missing in the dictionary.txt,

p.s. you can use the tensor to find corresponding tokens inside the dictionary.txt. (open dictionrary.txt with uft-8) ▁老 is the first element = 230393 = (line number) 230390 - 1 (id starts from 0) + 4 (starts from bos,pad,eos,unk) = 230390 +3 So go to line 230390 of dictionary.txt. é is 248132, so go to line 248132 - 3 = 248129. ( it applies to almost fairseq dictionary and its dict.txt)

--

Well, adding those 36 byte chars into the dictionary.txt does not instantly fix the problem. Since the pretrained model's input dim is already decided, and you need to write a new fairseq dictionary.py to convert unknown word/char to byte char string before finishing dict encoding.

pluiez commented 2 years ago

Thank you for your nice explanation! Does this mean that the model may need fintuning on a extended vocabulary including the missing byte chars to fix this problem?

gmryu commented 2 years ago

I would wonder how the authors deal with those unknonw words. It feels like a huge hole and they would not have overlooked this.

In my case, I expanded fairseq/data/dictionary.py to overwrite least used tokens in the dict.txt with missing byte chars and put unknown words into byte char string. With these 2 setup, there are no more \<unk>. The vocab count is kept the same. Probably this is the minimum changes done to the pretrained model. It would be a disaster if those least used tokens are what you want most. However this time it is 36 tokens, which may be ignored. Then, yes you can finetune the model for your case without worrying any unknown words/symbols.

BrightXiaoHan commented 2 years ago

Other character base languages, such as Japanese, may have similar problems too.

vince62s commented 1 year ago

@huihuifan @edunov do you guys know if it could be possible to "update" the vocab for CJK and continue training so that those issues might be fixed ?

https://discuss.huggingface.co/t/nllb-3-3b-poor-translations-from-chinese-to-english/27695

vince62s commented 1 year ago

@pluiez I am working on a curated version of NLLB-200 to include these 26 symbols. Are you sure there are no other missing symbols for Chinese/Japanese/Korean ?

avidale commented 2 months ago

I have a tutorial on expanding NLLB with a new language, which also has a section of adding new tokens to its vocabulary. https://cointegrated.medium.com/a37fc706b865 However, I have to warn you that the model needs to be fine-tuned in order to use the new tokens adequately, and, unless you fine-tune it with enough data in all 202 languages, some forgetting may happen and the model would perform less well on some languages that didn't participate in the fine-tuning.

facebookresearch / fairseq

NLLB vocabulary missing common Chinese character/tokens #4560