facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

M2M-100 SentencePiece model produces tokens that are missing on the fixed dictionary #3463

Open jofregit opened 3 years ago

jofregit commented 3 years ago

🐛 Bug

The SentencePiece model for M2M-100 (https://dl.fbaipublicfiles.com/m2m_100/spm.128k.model) generates several tokens that are missing on the fixed dictionary (https://dl.fbaipublicfiles.com/m2m_100/data_dict.128k.txt)

To Reproduce

Steps to reproduce the behavior:

  1. Tokenize the following sentence with the SentencePiece model for M2M-100:

    import os
    import sentencepiece as spm

    sentence = "My dog perhaps enjoyed music."
    # model_path is the directory containing the downloaded spm.128k.model
    tokenizer = spm.SentencePieceProcessor(model_file=os.path.join(model_path, 'spm.128k.model'))
    tokenizer.EncodeAsPieces(sentence)
  2. See the tokens generated: ['▁My', '▁dog', '▁perhaps', '▁enjoyed', '▁music', '.']

  3. If you check the fixed dictionary (data_dict.128k.txt) you will notice that '▁perhaps' and '▁enjoyed' are missing, so during encoding these tokens are mapped to index 3, which corresponds to the "unknown" token (a quick check is sketched right after this list).

  4. The translations are inaccurate for such cases: "My dog perhaps enjoyed noises." --> (fr) "Mon chien a appris les bruits." (with num_beams = 1)
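
To confirm step 3, here is a minimal check, assuming data_dict.128k.txt has been downloaded to the working directory and that fairseq's Dictionary class can parse it (a sketch, not the official loading code):

    from fairseq.data import Dictionary

    # Load the fixed dictionary published for M2M-100.
    fixed_dict = Dictionary.load("data_dict.128k.txt")

    for piece in ['▁My', '▁dog', '▁perhaps', '▁enjoyed', '▁music', '.']:
        idx = fixed_dict.index(piece)  # index() falls back to unk() for out-of-vocabulary symbols
        print(piece, idx, idx == fixed_dict.unk())

With fairseq's default special symbols, unk() is index 3, which matches the id reported above.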

Other tokens that will be set to 3 (the "unknown" token) after encoding are listed here; a sketch for reproducing this list follows it:

{"…", "̈", "ঞ", "ઞ", "ଙ", "ଞ", "ඈ", "ၡ", "ầ", "ậ", "ẵ", "↳", "啡", \ "圳", "圾", "垃", "础", "礎", "萄", "雰", "됩", "밝", "얻", "|01f4fa", "\ |01f924", "୍ଚ", "୍ଷ", "ຜນ", "င့", "ည့", "ቃሴ", "ይማ", "ដើ", "ឌ្", \ "ほと", "やは", "ろん", "イベ", "ッフ", "パソ", "来越", "特朗", "西班", "бәп", "лөш", \ "үек", "խմբ", "سرے", "یین", "इते", "मीण", "িৎস", "ିରଫ", "සාහ", "คโน", \ "จจุ", "ถาน", "ษัท", "ียญ", "เสร", "ຂວງ", "ງິນ", "ຖິງ", "ລ້ວ", "ວາມ", \ "ຫ່ງ", "ຶ້ນ", "່ວມ", "ໍລິ", "အတြ", "គរប", "ភិវ", "ាណិ", "ូមិ", "េតុ", \ "ំនង", "្ងៃ", "システ", " иҫә", " луѓ", " мот", " հաղ", " ճան", " تجه", \ " هیو", " ټکن", " ڊيس", " તરી", " ରିପ", " മേഖ", "зеге", "шкил", \ "шөөр", "ідэн", "әүге", "әүеш", "میشہ", "ंसिर", "म्मू", "समें", \ "ক্টো", "ামলা", "েস্ক", "ਜਵਾਨ", "ਤੂਬਰ", "ਮੇਟੀ", "ਿਆਰਥ", "ંટણી", \ "துகா", "ಪರ್ಕ", "ಬೈಲ್", "ಾಜಿಕ", "මෙයි", "ญี่ป", "ดาห์", "รกิจ", \ "ริ่ม", "ัพท์", "าศาส", "าะห์", "ูนย์", "ຈົ້າ", "ດນາມ", "ມືອງ", \ "ສບຸກ", "ັກໂນ", "ໍາເນ", "က်နှ", "იტომ", "ំព័រ", "្ញុំ", "្មែរ", \ "្លួន", "្លេស", "かもしれ", " күрһ", " эшмә", " مقای", " उन्ह", " कोशि", \ " नोटि", " मोबा", " নিরা", " દિલ્", " માહિ", " ଓଡ଼ି", " ପଟ୍ଟ", " \ ಅಭ್ಯ", " ಕ್ಷೇ", " ಪೊಲೀ", " ವಾಣಿ", " කිහි", " පැමි", " ტერი", "версі", \ "клопе", "сьәлә", "һынса", "աքանչ", "րաժար", "ונטאג", "ترنتی", \ "ورسٹی", "پیوتر", "یبانی", "ंत्री", "क्राउ", "म्मीद", "তিবার", \ "বাদিক", "ুধবার", "ਹਾਨੂੰ", "ଭିନ୍ନ", "ബരിമല", "ගමැති", "ุงเทพ", \ "้อมูล", "ທະວີຕ", "ໍາລັບ", "თიერთ", "უხედა", "ძლიათ", "ხედრო", \ "លរដ្ឋ", "ីដេអូ", "្បាប់", " հանդի", " אוטוב", " דאנאר", " کارشن", " \ इस्ते", " उत्पा", " प्राथ", " ગુજરા", " അദ്ദേ", " ຂ່າວວ", "न्त्री", \ "सन्धान", "্যান্য", "வடிக்க", "ಮಾರ್ಟ್", "วเตอร์", "ังหวัด", "ວຽດນາມ", \ "აშორის", "ាមេរិក", "័ត៌មាន", "្នំពេញ", " тарафы", " төхөөр", " \ Հայաստ", " الفلسط", " ٹیکنال", " განმავ", "тегория", "улланыу", \ "פטעמבער", "বিদ্যাল", "র্জাতিক", "വനന്തപു", "ເຂົ້າຫາ", " қамтама", " \ ສົ່ງໃຫ້", " ສໍາຫລັບ", " სხვადას", "স্পতিবার", "ີໂອເອລາວ", " \ વ્યાખ્યાઓ", "abaihan", "abogon", " achieve", "ahabog", "ahabogang", \ "ahlekel", " akawnt", "akuada", "alakahle", "almudug", "altachd", " \ amih", "aminosang", " anvä", "aphuno", "arangang", "aroaupenc", " \ artíc", "ashayeli", " Azərbay", "ịbụ", " beispi", " benfas", " \ benveng", " bharra", "bingkil", "ịbụl", "BND", " Bucure", " \ businesses", "cabka", " certainly", " Chatro", " citt", "èhófà", \ "eklase", "emmuz", " enjoyed", "erantany", "erzlech", "eshimi", \ "esterd", "esye", " ettev", "ewé", " eyisi", "faktirè", "fthiwe", " \ giin", " Goom", "haichean", "haps", "hathast", " hemib", \ "heqululini", "holoni", " htt", "ibeat", "ibuli", "iddene", \ "idmatan", "igawas", "igbahin", "Igual", "íklad", "ilangkan", \ "imutangan", "isemane", "iyembre", " iyisig", " Izray", " kabungtor", \ " KAHAPON", "ketho", " kinaug", " któr", " lớ", "laseklase", \ "latego", "Lietuv", " lling", "ləq", " mainta", " mmad", " mopak", " \ mümk", "naqi", " nearly", " nëm", "ởng", " nghiệ", "oblèm", "ófà", " \ okuday", " øn", "ópez", " owesifazana", "owever", " paggam", "Pagh", \ "Paghimo", "panid", " particularly", " perhaps", " Phetol", " \ przecie", " qualc", "qubom", "ərçiv", " reported", " rəhb", "ríguez", \ "ərrü", " sagols", " sebaga", "Sekelo", "selves", " Sga", "sgol", " \ społ", " Srednj", "Sulod", "tatge", "though", "tirè", "tụrụ", \ "ughout", "ugnawan", "ujourd", "ulagway", "upenc", "uregwu", "utube", \ "utubong", "uwega", " Uyas", " véh", " vreemdel", "vrier", "winan", " \ wła", " wouldn", "XÍA", " xüs", "yembre", "ynəl", "ynnag", "yoné", " \ Zagre", "zində", "zköz", "zonia", \ "[Alpha][Rho]ί[Omicron][Upsilon]", " [CapitalDelta]\ 
[CurlyEpsilon][Kappa][CurlyEpsilon]", " [CurlyEpsilon][Pi]\ [Alpha][Gamma][Gamma][CurlyEpsilon][Lambda]", " [CapitalIota]\ [Omicron][Upsilon][Nu]", \ "[Mu][Beta][Rho]ί[Omicron][Upsilon]", " [CapitalNu][Omicron]\ [CurlyEpsilon]", " [CapitalOmicron][Kappa][Tau][Omega][Beta]", \ " [CapitalSigma][CurlyEpsilon][Pi][Tau][CurlyEpsilon]", " \ [CapitalSigma][CapitalUpsilon][CapitalRho][CapitalIota]\ [CapitalZeta]", "[Tau][Omega][Beta]"
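
A list like the one above can be reproduced by scanning the whole SentencePiece vocabulary against the fixed dictionary. A rough sketch, assuming both files sit in the working directory (control pieces such as <unk>, <s> and </s> would need to be filtered out separately):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="spm.128k.model")

    # The first whitespace-separated column of data_dict.128k.txt holds the dictionary symbols.
    with open("data_dict.128k.txt", encoding="utf-8") as f:
        dict_symbols = {line.split(" ")[0] for line in f if line.strip()}

    missing = [sp.id_to_piece(i) for i in range(sp.get_piece_size())
               if sp.id_to_piece(i) not in dict_symbols]
    print(len(missing), "SentencePiece pieces are absent from the fixed dictionary")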

Expected behavior

One would expect the SentencePiece model to produce only tokens that are present in the fixed dictionary (https://dl.fbaipublicfiles.com/m2m_100/data_dict.128k.txt).

jofregit commented 3 years ago

Any update on this issue? Thanks.

Mehrad0711 commented 3 years ago

Hi, thanks @jofregit for reporting this. I'm seeing a similar issue when using the model from Hugging Face. Here's a code snippet to reproduce it:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('facebook/m2m100_418M')
    tokenizer.src_lang = "en"
    en_text = "I perhaps went there"

    encoded = tokenizer(en_text, return_tensors="pt")
    print('encoded: ', encoded['input_ids'])
    print('output: ', tokenizer.batch_decode(encoded['input_ids'], skip_special_tokens=False, clean_up_tokenization_spaces=False))
    print('output_skip_special_tokens: ', tokenizer.batch_decode(encoded['input_ids'], skip_special_tokens=True, clean_up_tokenization_spaces=False))

stdout:

    encoded: tensor([[128022,    203,      3, 117292,  71586,      2]])
    output: ['__en__ I<unk> went there</s>']
    output_skip_special_tokens: ['I went there']

As mentioned in the post above, missing tokens such as "perhaps" are assigned the unknown id (3). This can cause silent errors, especially when skip_special_tokens is True, since all such tokens are then silently dropped from the output.
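
Until the two vocabularies are reconciled, one way to avoid this silent failure on the caller's side is to check the encoded ids for the unknown token before generating. A small sketch building on the snippet above:

    # Flag inputs whose encoding contains the unknown id before translating.
    num_unk = (encoded["input_ids"] == tokenizer.unk_token_id).sum().item()
    if num_unk:
        print(f"warning: {num_unk} token(s) were mapped to <unk>; the translation may silently drop them")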

Would be great if the fairseq team could take a look at this issue. Thanks!

Jourdelune commented 2 years ago

Indeed, I get the same thing with Unicode emoji and with this character: ▬
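
For example, reusing the facebook/m2m100_418M tokenizer from the snippet above (the emoji here is just a stand-in for any affected one), the mapping to <unk> can be checked per character:

    for text in ["▬", "📺"]:
        pieces = tokenizer.tokenize(text)
        ids = tokenizer.convert_tokens_to_ids(pieces)
        # Any id equal to tokenizer.unk_token_id is dropped when skip_special_tokens=True.
        print(text, pieces, ids, [i == tokenizer.unk_token_id for i in ids])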