facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

Bug in SeamlessM4T-Large: t2tt when target is 'yue' #64

Open freedomtan opened 1 year ago

freedomtan commented 1 year ago

When the seamlessM4T_large model is used for t2tt with tgt_lang='yue' and src_lang='eng', the returned results are in Mandarin with Simplified Han glyphs (the expected results are in Cantonese with Traditional Han glyphs).

import torch
from seamless_communication.models.inference import Translator

translator_medium = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
translator_large = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

message_to_translate = 'The forces of Syria’s president, Bashar al-Assad, struck soon after 2am. Residents of Ghouta, a Damascus suburb, told reporters that they heard a strange noise, as if someone was opening a bottle of Pepsi. A local doctor, fighting back tears, explained that many people had sought shelter underground, but the gas was heavier than air and it pooled in basements and cellars. Had they climbed the stairs instead, they would have lived.'
translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
# with the medium-size model, we get the expected Cantonese contents in Traditional Han glyphs
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
# with the large-size model, we get Mandarin contents in Simplified Han glyphs (NOT the expected yue in Traditional Han script)
print(f'from large model: {translated_text}')

The results:

from medium model: 敘利亞總統 Bashar al-Assad 嘅軍隊喺早上 2 點之後好快就擊中咗 達馬士革郊區 Ghouta 嘅居民話畀記者知 佢哋聽到一個奇怪嘅聲音 就好似有人打開一個<unk>酒瓶 一位當地醫生反抗眼淚 佢話好多人喺地下尋求庇護 但氣體比空氣好重 佢哋喺地下室同地下室聚集
from large model: 叙利亚总统巴沙尔·阿萨德 (Bashar al-Assad) 的军队在凌晨 2 点袭击. 大马士革郊区古塔 (Ghouta) 的居民告诉记者,他们听到一个奇怪的噪音,好像有人在打开百事可乐的瓶子. 一个当地医生,控制着眼泪,解释说许多人寻求地下避难所,但气体比空气更重,它聚集在地下室和地下室.如果他们爬上楼梯,他们会活下来.
dalyafaraj commented 1 year ago

Hello, how do I solve this problem? ModuleNotFoundError: No module named 'seamless_communication'

freedomtan commented 1 year ago

> Hello, how do I solve this problem? ModuleNotFoundError: No module named 'seamless_communication'

pip install git+https://github.com/facebookresearch/seamless_communication

If you are in a Jupyter notebook or Colab:

!pip install git+https://github.com/facebookresearch/seamless_communication
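
If the install succeeds, a quick sanity check (my suggestion, not an official step) is to confirm the package imports:

import seamless_communication
print(seamless_communication.__file__)  # prints where the package was installed
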
chenyunsai commented 1 year ago

Where can I see the language abbreviations, like eng?

chenyunsai commented 1 year ago

ValueError: lang must be a supported language, but is 'zh-cn' instead.

freedomtan commented 1 year ago

> Where can I see the language abbreviations, like eng?

In the paper and code in this repo 😀

https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/assets/cards/unity_nllb-100.yaml

https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/assets/cards/unity_nllb-200.yaml

Basically, they are from ISO 639-3.
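
If you want to read one of those cards programmatically, here is a hypothetical sketch; the card layout itself is an assumption, so it just dumps the parsed card rather than assuming any keys:

import urllib.request
import yaml  # pip install pyyaml

# Raw view of the unity_nllb-100 asset card linked above
URL = ("https://raw.githubusercontent.com/facebookresearch/seamless_communication/"
       "main/src/seamless_communication/assets/cards/unity_nllb-100.yaml")

with urllib.request.urlopen(URL) as resp:
    card = yaml.safe_load(resp.read().decode())

# Dump the parsed card to find where the language codes live
print(card)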

elbayadm commented 1 year ago

You can also check the list in https://github.com/facebookresearch/seamless_communication/tree/main/scripts/m4t/predict#supported-languages. For Mandarin Chinese, we have cmn (Hans script) and cmn_Hant (Hant script).
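
Regarding the zh-cn ValueError above, a minimal sketch of the fix (reusing the translator_large object from the first snippet in this thread):

# 'zh-cn' is not a SeamlessM4T language code; use the ISO 639-3-based codes,
# e.g. 'cmn' for Mandarin in Simplified script or 'cmn_Hant' for Traditional.
translated_text, _, _ = translator_large.predict("Hello, world!", "t2tt", 'cmn', src_lang='eng')
print(translated_text)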

freedomtan commented 1 year ago

> You can also check the list in https://github.com/facebookresearch/seamless_communication/tree/main/scripts/m4t/predict#supported-languages. For Mandarin Chinese, we have cmn (Hans script) and cmn_Hant (Hant script).

There is some inconsistency between this list and the YAML files for the medium and large models:

Some glyphs in the zho_Hant output, e.g., 喺 and 嘅, are usually used for yue only.
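
A rough way to spot this automatically (my own heuristic, nothing from the repo): check the output for glyphs characteristic of written Cantonese.

# Glyphs like 喺/嘅/唔/咗/哋 are characteristic of written Cantonese and rare
# in standard written Mandarin, so their presence suggests yue-style output.
CANTONESE_GLYPHS = set("喺嘅唔咗哋噉")

def looks_cantonese(text: str) -> bool:
    return any(ch in CANTONESE_GLYPHS for ch in text)

print(looks_cantonese("叙利亚总统的军队很快就会反击."))  # False: Mandarin-style
print(looks_cantonese("敘利亞總統嘅軍隊反擊"))  # True: contains 嘅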

elbayadm commented 1 year ago

That's correct, the list in the README is for the large model only. The medium model's tokenizer actually supports more languages (it's the same tokenizer as NLLB-200, with the languages listed in https://github.com/facebookresearch/fairseq/blob/nllb/examples/nllb/modeling/scripts/flores200/langs.txt). That said, cmn_Hant in the large model should be the same as zho_Hant in the medium model (in terms of training data).
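
To check which Chinese-related codes that list actually contains, a quick sketch (I haven't verified the file's delimiter, so the split below handles both commas and whitespace):

import re
import urllib.request

# Raw view of the NLLB-200 language list linked above
URL = ("https://raw.githubusercontent.com/facebookresearch/fairseq/nllb/"
       "examples/nllb/modeling/scripts/flores200/langs.txt")

text = urllib.request.urlopen(URL).read().decode()
langs = [l for l in re.split(r"[,\s]+", text) if l]
print([l for l in langs if l.startswith(("zho", "yue", "cmn"))])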

freedomtan commented 1 year ago

> That's correct, the list in the README is for the large model only. The medium model's tokenizer actually supports more languages (it's the same tokenizer as NLLB-200, with the languages listed in https://github.com/facebookresearch/fairseq/blob/nllb/examples/nllb/modeling/scripts/flores200/langs.txt). That said, cmn_Hant in the large model should be the same as zho_Hant in the medium model (in terms of training data).

Nope, something must be wrong. Judging from what translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'zho_Hant', src_lang='eng') returns, the output is closer to Cantonese (yue) than to Mandarin (cmn or cmn_Hant). A Mandarin speaker with no prior knowledge of Cantonese would probably ask why there are random gibberish glyphs :-)

elbayadm commented 1 year ago

Thanks @freedomtan for your observations. The major differences between the medium and large models are:

  1. Medium reuses the NLLB-600M distilled model from NLLB.
  2. Large uses a new version of NLLB that focuses on the 100 languages of SeamlessM4T (hence nllb-100). In training this new version of NLLB, we trained another tokenizer while enforcing the addition of frequent Chinese characters (see ¶ Training a Text Tokenizer, page 31 of the paper: https://ai.meta.com/research/publications/seamless-m4t/), so it should be better for cmn, cmn_Hant, and yue. If the chrF++ scores are to be believed, large is 5 chrF++ points better than medium on FLORES eng->cmn_Hant/zho_Hant (a sketch of computing chrF++ follows below).
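
For anyone who wants to reproduce that kind of number: chrF++ can be computed with sacrebleu (pip install sacrebleu). This is only an illustrative sketch with placeholder strings; the exact evaluation setup used for the paper may differ.

from sacrebleu.metrics import CHRF

# word_order=2 turns plain chrF into chrF++ (character n-grams plus word bigrams)
chrf = CHRF(word_order=2)

hypotheses = ["model output goes here"]              # system translations
references = [["reference translation goes here"]]   # one reference stream, aligned with hypotheses

score = chrf.corpus_score(hypotheses, references)
print(score.score)
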
freedomtan commented 1 year ago

@elbayadm to summarise what I know:

  1. for SeamlessM4T medium: zho_Hant, which is supposed to return Mandarin in Traditional Han, returned Cantonese in Traditional Han
  2. for SeamlessM4T large: yue, which is supposed to return Cantonese in Traditional Han, returned Mandarin in Simplified Han

I guess you do get better chrF++ points, but something at the character/glyph level is wrong.

elbayadm commented 1 year ago

@freedomtan I tested with another example from FLORES-200

import torch
from seamless_communication.models.inference import Translator

translator_medium = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
translator_large = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

message_to_translate = "If you visit the Arctic or Antarctic areas in the winter you will experience the polar night, which means that the sun doesn't rise above the horizon."
translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'cmn_Hant', src_lang='eng')
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'cmn_Hant', src_lang='eng')
print(f'from large model: {translated_text}')

from medium model: 如果你喺冬天去訪北極或者南極, 你會經歷極夜, 意思係太陽唔會喺地平線上升.
from large model: 如果你在冬天去北極或南極地區, 你會體驗北極夜,

And:

translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
print(f'from large model: {translated_text}')

from medium model: 如果你喺冬天去北極或者南極, 你會發現北極嘅夜晚, 即係話太陽唔會喺地平線上升.
from large model: 如果你喺冬天去北極或者南極 你會經歷北極嘅夜晚 即係話太陽唔喺地平線上面升起

AFAICT:

> for SeamlessM4T medium: zho_Hant, which is supposed to return Mandarin in Traditional Han, returned Cantonese in Traditional Han

This holds even for the FLORES example.

> for SeamlessM4T large: yue, which is supposed to return Cantonese in Traditional Han, returned Mandarin in Simplified Han

This one looks like it is correctly translated into Cantonese. If you agree with my assessment, then the issue in your first example could be caused by the code-switching with English in the output. I'll investigate this further to see whether the training data is wrongly labeled.

freedomtan commented 1 year ago

@elbayadm Thanks for spending time checking this issue.

> This one looks like it is correctly translated into Cantonese. If you agree with my assessment, then the issue in your first example could be caused by the code-switching with English in the output. I'll investigate this further to see whether the training data is wrongly labeled.

Yes, it turns out that yue in the large model is a bit tricky. I have a shorter example.

to_translate_1 = "The forces of Syria's president, Bashar al-Assad, fight back."
to_translate_2 = "The forces of Syria's president, Bashar al-Assad, fight back soon."

translated_text, _, _ = translator_large.predict(to_translate_1, "t2tt", 'yue', src_lang='eng')
print(f'from large model 1: {translated_text}')
translated_text, _, _ = translator_large.predict(to_translate_2, "t2tt", 'yue', src_lang='eng')
print(f'from large model 2: {translated_text}')

The result for to_translate_1 is Cantonese in Traditional Han. The result for to_translate_2 is Mandarin in Simplified Han.

from large model 1: 敘利亞總統巴沙爾·阿薩德嘅軍隊反擊
from large model 2: 叙利亚总统巴沙尔·阿萨德 (Bashar al-Assad) 的军队很快就会反击.