tanshuai opened this issue 9 months ago (status: Open)
Also confirmed on my end with the SeamlessM4T-Large model.
English input:
A witch can fly with a broom.
Chinese output:
一个女巫可以用扫<unk>飞. (the `<unk>` falls where 帚, the second character of 扫帚 "broom", should be)
Seamless T2T is based on NLLB, but NLLB does not have this issue.
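One way `<unk>` can appear in otherwise fluent output is when a surface form falls outside the tokenizer's vocabulary, so the encoder maps it to the unknown id before the model ever sees it. A minimal toy illustration of that diagnostic (the vocabulary here is hypothetical, not the actual SeamlessM4T vocab):

```python
# Toy illustration: if the tokenizer itself maps a token to the unknown
# id, the problem is in the vocab/tokenizer mapping, not the model.
# VOCAB and UNK_ID are hypothetical stand-ins for the real tokenizer.
VOCAB = {"一个": 5, "女巫": 6, "可以": 7, "用": 8, "扫": 9, "飞": 10, ".": 11}
UNK_ID = 0

def encode(tokens):
    # Any token missing from the vocabulary becomes UNK_ID.
    return [VOCAB.get(t, UNK_ID) for t in tokens]

ids = encode(["一个", "女巫", "可以", "用", "扫", "帚", "飞", "."])
print(ids)  # "帚" is out of vocab, so its position holds UNK_ID
```

Running the real tokenizer over a failing input and checking for the unk id in the encoded ids would distinguish "broken tokenizer mapping" from "model predicting `<unk>` at decode time".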
@cndn you pushed a readme update at https://github.com/facebookresearch/seamless_communication/commit/df2816adf3574016ffa99eb947ec3bff23310413 but the diff changes is related to audio alignment while this issue is text2text translation. Can you provide an example for we can fix this for t2t? Thanks.
Same issue here with the v2 model on T2TT... How can the model be predicting a special token in the first place?
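If the model (rather than the tokenizer) is emitting `<unk>`, a common decode-time mitigation is to mask the unk logit so search can never select it; Hugging Face `generate` exposes roughly this via options like `suppress_tokens` in the generation config (hedged; whether the demo does this is an assumption). A toy sketch of the masking idea:

```python
import numpy as np

# Toy illustration of banning <unk> at decode time: masking its logit
# to -inf means greedy/beam search can never pick it. UNK_ID and the
# logits are made up for illustration.
UNK_ID = 3
logits = np.array([1.0, 2.0, 0.5, 5.0])  # unk (index 3) scores highest
assert int(np.argmax(logits)) == UNK_ID  # without masking, unk wins
logits[UNK_ID] = -np.inf                 # ban the unk id
print(int(np.argmax(logits)))            # next-best token is chosen instead
```

This only hides the symptom; if the tokenizer mapping is wrong, the underlying translation quality issue remains.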
This does not happen on the HF space demo, and I am using the same code as the demo...
It only happens for some instances of "Oh" for me.
me too
How can I fix this problem?
@skywindy I believe the tokenizer mapping in the open-sourced repo is completely wrong, which creates these UNK issues. Either that, or they trained with a broken tokenizer. There is no way for us to fix this without knowing whether the model or the tokenizer is broken.
For example, "Oh, Peter." is translated to "<unk>,彼得." (彼得 = "Peter"), and "Oh, my god" to "<unk>,我的上帝" (我的上帝 = "my god").
Almost all instances of "Oh" are translated into `<unk>`, making this project almost unusable for Chinese and Cantonese.
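As a stopgap (not a fix for the underlying tokenizer problem), the literal `<unk>` markers can be stripped from decoded output before display; `strip_unk` is a hypothetical helper name:

```python
import re

def strip_unk(text: str) -> str:
    """Remove literal <unk> markers from decoded text (hypothetical helper)."""
    cleaned = text.replace("<unk>", "")
    # Collapse any doubled whitespace the removal leaves behind.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(strip_unk("<unk>,彼得."))  # → ",彼得."
print(strip_unk("用扫<unk>飞"))  # → "用扫飞"
```

This loses whatever word the model failed to produce (e.g. the "Oh"), so it only cleans up the output cosmetically.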