tanshuai commented 9 months ago

For example, "Oh, Peter." is translated to "\<unk>,彼得." and "Oh, my god" is translated to "\<unk>,我的上帝".

Almost every "Oh" is translated into \<unk>, making this project almost unusable for Chinese and Cantonese.

$ m4t_predict "Oh, Peter."  t2tt cmn --src_lang eng
2023-09-20 03:22:20,215 INFO -- m4t_scripts.predict.predict: Running inference on the GPU in torch.float16.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
2023-09-20 03:22:24,949 INFO -- m4t_scripts.predict.predict: Translated text in cmn: <unk>,彼得.

$ m4t_predict "Oh, Peter."  t2tt cmn_Hant --src_lang eng
2023-09-20 03:22:48,454 INFO -- m4t_scripts.predict.predict: Running inference on the GPU in torch.float16.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
2023-09-20 03:22:53,404 INFO -- m4t_scripts.predict.predict: Translated text in cmn_Hant: <unk>, 彼得.

$ m4t_predict "Oh, Peter."  t2tt yue --src_lang eng
2023-09-20 03:21:16,073 INFO -- m4t_scripts.predict.predict: Running inference on the GPU in torch.float16.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
2023-09-20 03:21:20,886 INFO -- m4t_scripts.predict.predict: Translated text in yue: <unk>,彼得.
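
To check a larger batch of sentences for the same symptom, the final log lines can be scanned for `<unk>`. The snippet below is a minimal sketch, not part of the repository: it assumes the log format shown in the runs above, and the helper name `find_unk_outputs` is made up for illustration.

```python
import re

# Matches the final "Translated text in <lang>: <text>" log line shown above.
TRANSLATED = re.compile(r"Translated text in (?P<lang>\w+): (?P<text>.*)$")


def find_unk_outputs(log_text):
    """Return (lang, text, unk_count) for each translation that contains <unk>."""
    hits = []
    for line in log_text.splitlines():
        match = TRANSLATED.search(line)
        if match and "<unk>" in match.group("text"):
            text = match.group("text")
            hits.append((match.group("lang"), text, text.count("<unk>")))
    return hits


# The three result lines copied from the runs above.
sample_log = """\
2023-09-20 03:22:24,949 INFO -- m4t_scripts.predict.predict: Translated text in cmn: <unk>,彼得.
2023-09-20 03:22:53,404 INFO -- m4t_scripts.predict.predict: Translated text in cmn_Hant: <unk>, 彼得.
2023-09-20 03:21:20,886 INFO -- m4t_scripts.predict.predict: Translated text in yue: <unk>,彼得.
"""

for lang, text, count in find_unk_outputs(sample_log):
    print(lang, count, text)
```

Run against the three results above, this flags all three target languages, one `<unk>` each.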

Qubitium commented 5 months ago

Also confirmed on my end with the SeamlessM4TLarge model.

English input:

A witch can fly with a broom.

Chinese output:

一个女巫可以用扫<unk>飞.

Seamless t2t is based on NLLB, but NLLB does not have this issue.

Qubitium commented 5 months ago

@cndn you pushed a README update at https://github.com/facebookresearch/seamless_communication/commit/df2816adf3574016ffa99eb947ec3bff23310413, but the diff is related to audio alignment, while this issue is about text-to-text translation. Can you provide an example so we can fix this for t2t? Thanks.

aliencaocao commented 3 months ago

Same issue here with T2TT on the v2 model. How can the model be predicting a special token in the first place? This does not happen on the HF space demo, and I am using the same code as the demo. For me, it only happens for some instances of "Oh".

asulada commented 3 months ago

me too

skywindy commented 1 week ago

How can I fix this problem?

Qubitium commented 1 week ago

@skywindy I believe the tokenizer mapping in the open-sourced repo is completely wrong, which creates these UNK issues. Either that, or they trained with a broken tokenizer. There is no way for us to fix this without knowing whether the model or the tokenizer is broken.
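
One way to narrow down "model or tokenizer" is to check whether the characters that come out as \<unk> are covered by the target vocabulary at all. The snippet below is only a sketch under stated assumptions: `toy_vocab` is a made-up stand-in, not the released SeamlessM4T vocabulary, the reference translation is written by hand, and a character-level check is a simplification of how a subword tokenizer actually segments text. A real check would load the pieces from the released tokenizer instead.

```python
def uncovered_chars(text, vocab):
    """Return the characters of `text` that appear in no vocabulary piece."""
    covered = set()
    for piece in vocab:
        covered.update(piece)
    return [ch for ch in text if not ch.isspace() and ch not in covered]


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"▁一个", "女巫", "可以", "用", "扫", "飞", "."}

# Hand-written reference for "A witch can fly with a broom."
reference = "一个女巫可以用扫帚飞."

print(uncovered_chars(reference, toy_vocab))
```

If the real vocabulary turned out to be missing the same characters that show up as \<unk>, that would point at the tokenizer rather than the model weights. If every character is covered, the problem is more likely elsewhere.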

facebookresearch / seamless_communication

Outputs too many <unk> symbols with Mandarin Chinese (cmn & cmn_Hant) and Cantonese (yue) #168
