freedomtan opened this issue 1 year ago
Hello, how do I solve this problem? ModuleNotFoundError: No module named 'seamless_communication'
pip install git+https://github.com/facebookresearch/seamless_communication
If you are in a Jupyter notebook or Colab:
!pip install git+https://github.com/facebookresearch/seamless_communication
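A quick sanity check after installing (a minimal sketch; the Translator import path is the one used in the snippets later in this thread):

try:
    # Same import used in the translation snippets later in this thread.
    from seamless_communication.models.inference import Translator
    print("seamless_communication is installed")
except ModuleNotFoundError:
    print("seamless_communication is not visible to this Python environment; re-run the pip install above")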
Where can I see the abbreviations of the languages, similar to eng?
ValueError: lang must be a supported language, but is 'zh-cn' instead.
In the paper and code in this repo 😀
Basically, they are from ISO 639-3
You can also check the list in https://github.com/facebookresearch/seamless_communication/tree/main/scripts/m4t/predict#supported-languages. For Mandarin Chinese, we have cmn (Hans script) and cmn_Hant (Hant script).
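For example, passing a supported code such as cmn instead of zh-cn avoids the ValueError above. A small sketch reusing the predict() call pattern shown later in this thread (here translator is assumed to be an already-constructed Translator instance):

# `translator` is assumed to be a Translator built as in the later snippets in this thread.
# 'zh-cn' triggers the ValueError above; 'cmn' (Mandarin, Hans script) is in the supported list.
translated_text, _, _ = translator.predict("Hello, how are you?", "t2tt", "cmn", src_lang="eng")
print(translated_text)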
There is some inconsistency between this list and the YAML files for the medium and large models: cmn_Hant is only supported by the large model (or rather, the large tokenizer). There is zho_Hant in the medium tokenizer, but it behaves more like a variant of yue than like cmn_Hant in the large one, e.g.,
# to_translate is an English source sentence; translator_medium is set up as in the snippet later in this thread
translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'zho_Hant', src_lang='eng')
print(translated_text)
translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'yue', src_lang='eng')
print(translated_text)
results:
敘利亞總統 Bashar al-Assad 的軍隊喺早上 2 點之後好快擊中. 達馬士革郊區嘅Ghouta 居民話畀記者知,佢哋聽到一個奇怪嘅聲音,就好似有人打開一個<unk>酒瓶一樣. 一位當地醫生,反抗眼淚, 解釋咗好多人喺地下尋求庇護,但氣體比空氣重,而且聚集喺地下室同地下室.
敘利亞總統 Bashar al-Assad 嘅軍隊喺早上 2 點之後好快就擊中咗 達馬士革郊區 Ghouta 嘅居民話畀記者知 佢哋聽到一個奇怪嘅聲音 就好似有人打開一個<unk>酒瓶 一位當地醫生反抗眼淚 佢話好多人喺地下尋求庇護 但氣體比空氣好重 佢哋喺地下室同地下室聚集
Some glyphs in the zho_Hant output, e.g., 喺 and 嘅, are usually used for yue only.
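As a rough way to flag this automatically, a small heuristic sketch based on the glyphs mentioned above (the example strings below are illustrative, not model output):

def looks_cantonese(text: str) -> bool:
    # 喺 and 嘅 are the glyphs mentioned above; 咗, 唔, 佢, 哋 are other characters
    # typical of written Cantonese that also appear in the outputs in this thread.
    cantonese_only_glyphs = "喺嘅咗唔佢哋"
    return any(ch in text for ch in cantonese_only_glyphs)

print(looks_cantonese("佢哋喺度等緊你"))  # True: written Cantonese
print(looks_cantonese("他们在那里等你"))  # False: Mandarin in Simplified Han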
That's correct, the list in the README is for the large model only. The medium model's tokenizer actually supports more languages (it's the same tokenizer as NLLB-200, with the langs in https://github.com/facebookresearch/fairseq/blob/nllb/examples/nllb/modeling/scripts/flores200/langs.txt). That said, cmn_Hant in the large should be the same as zho_Hant in the medium (speaking of training data).
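As a side note, a quick sketch for checking whether a given code is in that NLLB-200 list. The raw-file URL below is derived from the GitHub link above and is an assumption, as is the split on commas/whitespace:

import re
import urllib.request

# Assumed raw-file counterpart of the langs.txt link above.
LANGS_URL = ("https://raw.githubusercontent.com/facebookresearch/fairseq/"
             "nllb/examples/nllb/modeling/scripts/flores200/langs.txt")

with urllib.request.urlopen(LANGS_URL) as resp:
    raw = resp.read().decode("utf-8")

# The file may be comma- or newline-separated; accept either.
nllb_langs = set(filter(None, re.split(r"[,\s]+", raw)))

for code in ("zho_Hans", "zho_Hant", "yue_Hant", "cmn", "cmn_Hant"):
    print(code, code in nllb_langs)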
Nope, something must be wrong. From what is returned by translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'zho_Hant', src_lang='eng'), the output is closer to Cantonese (yue) than to Mandarin (cmn or cmn_Hant). A Mandarin speaker with no prior knowledge of Cantonese would probably ask why there are some random gibberish glyphs :-)
Thanks @freedomtan for your observations. The major differences between the medium and large models concern cmn, cmn_Hant and yue. If I'm to believe chrF++ scores, then large is 5 chrF++ points better than medium on FLORES eng -> cmn_Hant/zho_Hant.
@elbayadm, to summarise what I know:
- the three NLLB-200 codes (zho_Hans, zho_Hant, yue_Hant) work as expected
- for SeamlessM4T medium: zho_Hant, which is supposed to return Mandarin in Traditional Han, returned Cantonese in Traditional Han
- for SeamlessM4T large: yue, which is supposed to return Cantonese in Traditional Han, returned Mandarin in Simplified Han
I guess you do get better chrF++ points, but something at the character/glyph level is wrong.
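For anyone who wants to reproduce this comparison in one go, a sketch that runs both models over the Chinese target codes discussed here. The constructor arguments mirror the setup snippets in this thread; the test sentence is illustrative, and codes unsupported by a given tokenizer are expected to raise the ValueError shown earlier:

import torch
from seamless_communication.models.inference import Translator

# Same constructor arguments as the setup snippets in this thread.
translator_medium = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
translator_large = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

sentence = "The weather is nice today."  # illustrative English input

# The two tokenizers accept different Chinese codes, so unsupported ones are caught and reported.
for model_name, translator in (("medium", translator_medium), ("large", translator_large)):
    for tgt in ("zho_Hans", "zho_Hant", "yue_Hant", "cmn", "cmn_Hant", "yue"):
        try:
            text, _, _ = translator.predict(sentence, "t2tt", tgt, src_lang="eng")
            print(f"{model_name:6s} {tgt:9s} -> {text}")
        except ValueError as err:
            print(f"{model_name:6s} {tgt:9s} -> not supported ({err})")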
@freedomtan I tested with another example from FLORES-200
import torch
from seamless_communication.models.inference import Translator
translator_medium = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
translator_large = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
message_to_translate = "If you visit the Arctic or Antarctic areas in the winter you will experience the polar night, which means that the sun doesn\'t rise above the horizon."
translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'cmn_Hant', src_lang='eng')
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'cmn_Hant', src_lang='eng')
print(f'from large model: {translated_text}')
from medium model: 如果你喺冬天去訪北極或者南極, 你會經歷極夜, 意思係太陽唔會喺地平線上升.
from large model: 如果你在冬天去北極或南極地區, 你會體驗北極夜,
And:
translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
print(f'from large model: {translated_text}')
from medium model: 如果你喺冬天去北極或者南極, 你會發現北極嘅夜晚, 即係話太陽唔會喺地平線上升.
from large model: 如果你喺冬天去北極或者南極 你會經歷北極嘅夜晚 即係話太陽唔喺地平線上面升起
AFAICT:
- "for SeamlessM4T medium: zho_Hant, which is supposed to return Mandarin in Traditional Han, returned Cantonese in Traditional Han" is true even for the FLORES example.
- "for SeamlessM4T large: yue, which is supposed to return Cantonese in Traditional Han, returned Mandarin in Simplified Han": this one looks like it's correctly translated in Cantonese.
If you agree with my assessment, then the issue in your first example could be caused by the code switching with English in the output. I'll investigate this further, to see whether the training data is wrongly labeled.
@elbayadm Thanks for spending time checking this issue.
Yes, I agree with your assessment. It turned out that yue in the large model is a bit tricky. I have a shorter example.
to_translate_1 = "The forces of Syria's president, Bashar al-Assad, fight back."
to_translate_2 = "The forces of Syria's president, Bashar al-Assad, fight back soon."
# translator_large is the same instance constructed in the earlier setup snippet
translated_text, _, _ = translator_large.predict(to_translate_1, "t2tt", 'yue', src_lang='eng')
print(f'from large model 1: {translated_text}')
translated_text, _, _ = translator_large.predict(to_translate_2, "t2tt", 'yue', src_lang='eng')
print(f'from large model 2: {translated_text}')
The result for to_translate_1 is Cantonese in Traditional Han. The result for to_translate_2 is Mandarin in Simplified Han.
from large model 1: 敘利亞總統巴沙爾·阿薩德嘅軍隊反擊
from large model 2: 叙利亚总统巴沙尔·阿萨德 (Bashar al-Assad) 的军队很快就会反击.
When the seamlessM4T_large model is used for t2tt with tgt_lang='yue', src_lang='eng', the returned results are in Mandarin with Simplified Han glyphs (the expected results are in Cantonese with Traditional Han glyphs).