MirzaCickusic closed this issue 8 months ago
By default, the M4T model seems to be quite robust to incorrect choices of src_lang, so if you just input an arbitrary language there, it might still work:
import torch
from seamless_communication.models.inference import Translator

translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cpu"), torch.float16)
src_text = "Bonjour, comment allez-vous?"
translated_text, _, _ = translator.predict(src_text, "t2tt", tgt_lang="eng", src_lang="fra") # correct src_lang
print(translated_text) # Hello, how are you?
translated_text, _, _ = translator.predict(src_text, "t2tt", tgt_lang="eng", src_lang="deu") # incorrect src_lang
print(translated_text) # Hello, how are you?
If you want a more educated guess of the source language, you can use the FastText-based LID model released as a part of the NLLB project: https://github.com/facebookresearch/fairseq/tree/nllb#lid-model. Its set of language codes is different, but after conversion, its overlap with Seamless M4T languages is over 95%.
!pip install fasttext
!wget --content-disposition https://tinyurl.com/nllblid218e
import fasttext
tokenizer_langs = sorted(translator.text_tokenizer.langs)
print(tokenizer_langs) # ['ace', 'ace_Latn', 'acm', 'acq', 'aeb', ...
lid_model = fasttext.load_model("lid218e.bin")
renamed_langs = {
    "zho_Hans": "cmn",
    "zho_Hant": "cmn_Hant",
    # there are still about 10 more languages that need to be re-mapped
}
def convert_lang(input_lang):
    """Convert a language code from the LID format to the M4T format."""
    if input_lang.startswith("__label__"):
        input_lang = input_lang[9:]
    if input_lang in tokenizer_langs:
        return input_lang
    prefix = input_lang.split("_")[0]
    if prefix in tokenizer_langs:
        return prefix
    if input_lang in renamed_langs:
        return renamed_langs[input_lang]
    return input_lang
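To make the three fallback steps concrete (exact match, script-suffix stripping, manual rename), here is a self-contained copy of the function where a small stub set stands in for translator.text_tokenizer.langs — the stub codes are illustrative, not the real tokenizer list:

```python
# Stub standing in for translator.text_tokenizer.langs (illustrative only).
tokenizer_langs = {"fra", "eng", "cmn", "ace_Latn"}
renamed_langs = {"zho_Hans": "cmn", "zho_Hant": "cmn_Hant"}

def convert_lang(input_lang):
    """Convert a language code from the LID format to the M4T format."""
    if input_lang.startswith("__label__"):   # strip the fastText label prefix
        input_lang = input_lang[9:]
    if input_lang in tokenizer_langs:        # exact match, e.g. "ace_Latn"
        return input_lang
    prefix = input_lang.split("_")[0]        # drop the script suffix
    if prefix in tokenizer_langs:
        return prefix
    if input_lang in renamed_langs:          # manual re-mapping
        return renamed_langs[input_lang]
    return input_lang                        # give up: return unchanged

print(convert_lang("__label__fra_Latn"))  # fra  (prefix match)
print(convert_lang("__label__zho_Hans"))  # cmn  (manual rename)
```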
languages, scores = lid_model.predict(src_text, k=3) # k is the number of hypotheses
print(languages) # ('__label__fra_Latn', '__label__eng_Latn', '__label__nld_Latn')
print(scores) # [0.90197313 0.03340729 0.02291386]
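Since predict can return several hypotheses, one option is to walk the top-k list and take the first hypothesis whose converted code the tokenizer actually supports. This is a sketch with stubbed labels and a stub supported-language set, not part of the Seamless API:

```python
# Stub standing in for the set of M4T tokenizer languages.
SUPPORTED_LANGS = {"fra", "eng", "nld"}

def strip_label(label):
    # "__label__fra_Latn" -> "fra"
    return label[len("__label__"):].split("_")[0]

def pick_supported(labels, default="eng"):
    """Return the first top-k hypothesis supported by the tokenizer."""
    for label in labels:
        code = strip_label(label)
        if code in SUPPORTED_LANGS:
            return code
    return default  # nothing supported: fall back to a default

labels = ("__label__fra_Latn", "__label__eng_Latn", "__label__nld_Latn")
print(pick_supported(labels))  # fra
```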
With this, you can build a tiny pipeline of first LID-ing the source text, and then feeding the language code to the translator:
languages, scores = lid_model.predict(src_text, k=1)
src_lang = convert_lang(languages[0])
print(src_lang) # fra
translated_text, _, _ = translator.predict(src_text, "t2tt", tgt_lang="eng", src_lang=src_lang)
print(translated_text) # Hello, how are you?
Of course, LID accuracy is not perfect, so the source language will sometimes be misidentified; but, as demonstrated above, the translator is often able to generate a correct translation nevertheless.
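If you want to guard against low-confidence LID guesses, one simple option is a threshold: trust the top hypothesis only when its score is high enough, and otherwise fall back to some frequent language (which the model's robustness to a wrong src_lang makes tolerable). The threshold value and fallback code below are arbitrary choices, not part of the Seamless API:

```python
def choose_src_lang(languages, scores, threshold=0.5, fallback="eng"):
    """Trust the top LID hypothesis only above a confidence threshold."""
    top_lang, top_score = languages[0], scores[0]
    if top_score < threshold:
        return fallback  # low confidence: use a safe default
    return top_lang

# languages/scores would come from convert_lang + lid_model.predict
print(choose_src_lang(("fra",), [0.90]))  # fra
print(choose_src_lang(("fra",), [0.30]))  # eng
```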
I was hoping to use T2TT to translate from any language to English quickly, but the need to provide the src_lang parameter makes detecting the source language hard, to the point of making the whole T2TT tool unusable.
Are there any known workarounds?