facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

Requiring src_lang parameter makes T2TT unusable #178

Closed MirzaCickusic closed 8 months ago

MirzaCickusic commented 9 months ago

I was hoping to use T2TT to quickly translate from any language into English, but the requirement to provide the src_lang parameter is a problem: detecting the source language is hard enough that it makes the whole T2TT tool unusable for this use case.

Are there any known workarounds?

avidale commented 9 months ago

By default, the M4T model seems to be quite robust to incorrect choices of the src_lang, so if you just input an arbitrary language there, it might still work:

import torch
from seamless_communication.models.inference import Translator

translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cpu"), torch.float16)

src_text = "Bonjour, comment allez-vous?"
translated_text, _, _ = translator.predict(src_text, "t2tt", tgt_lang="eng", src_lang="fra")  # correct src_lang
print(translated_text)  # Hello, how are you?
translated_text, _, _ = translator.predict(src_text, "t2tt", tgt_lang="eng", src_lang="deu")  # incorrect src_lang
print(translated_text)  # Hello, how are you?

If you want a more educated guess of the source language, you can use the FastText-based LID model released as a part of the NLLB project: https://github.com/facebookresearch/fairseq/tree/nllb#lid-model. Its set of language codes is different, but after conversion, its overlap with Seamless M4T languages is over 95%.

!pip install fasttext
!wget --content-disposition https://tinyurl.com/nllblid218e 
import fasttext
tokenizer_langs = sorted(translator.text_tokenizer.langs)
print(tokenizer_langs)  # ['ace', 'ace_Latn', 'acm', 'acq', 'aeb', ...
lid_model = fasttext.load_model("lid218e.bin")

renamed_langs = {
    "zho_Hans": "cmn", 
    "zho_Hant": "cmn_Hant",
    # about 10 more languages still need to be re-mapped
}

def convert_lang(input_lang):
    """Convert a language from the LID format to the M4T format"""
    if input_lang.startswith("__label__"):
        input_lang = input_lang[9:]
    if input_lang in tokenizer_langs:
        return input_lang
    prefix = input_lang.split("_")[0]
    if prefix in tokenizer_langs:
        return prefix
    if input_lang in renamed_langs:
        return renamed_langs[input_lang]
    return input_lang
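To sanity-check the mapping logic without loading the model, the same function can be run against a stubbed language set (the small `tokenizer_langs` sample below is my own assumption for illustration; the real set comes from `translator.text_tokenizer.langs`):

```python
# Stubbed subset of the M4T tokenizer languages; the real set comes from
# translator.text_tokenizer.langs (this small sample is an assumption).
tokenizer_langs = {"eng", "fra", "deu", "cmn", "cmn_Hant"}

renamed_langs = {"zho_Hans": "cmn", "zho_Hant": "cmn_Hant"}

def convert_lang(input_lang):
    """Convert a language code from the LID format to the M4T format."""
    if input_lang.startswith("__label__"):        # strip the fasttext label prefix
        input_lang = input_lang[len("__label__"):]
    if input_lang in tokenizer_langs:             # exact match, e.g. "cmn_Hant"
        return input_lang
    prefix = input_lang.split("_")[0]             # drop the script tag: "fra_Latn" -> "fra"
    if prefix in tokenizer_langs:
        return prefix
    if input_lang in renamed_langs:               # explicit re-mappings
        return renamed_langs[input_lang]
    return input_lang                             # unknown code: pass through unchanged

print(convert_lang("__label__fra_Latn"))  # fra
print(convert_lang("__label__zho_Hans"))  # cmn
```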

languages, scores = lid_model.predict(src_text, k=3)  # k is the number of hypotheses
print(languages)  # ('__label__fra_Latn', '__label__eng_Latn', '__label__nld_Latn')
print(scores)  # [0.90197313 0.03340729 0.02291386]

With this, you can build a tiny pipeline of first LID-ing the source text, and then feeding the language code to the translator:

languages, scores = lid_model.predict(src_text, k=1)
src_lang = convert_lang(languages[0])
print(src_lang) # fra
translated_text, _, _ = translator.predict(src_text, "t2tt", tgt_lang="eng", src_lang=src_lang)
print(translated_text)  # Hello, how are you?

Of course, LID accuracy is not perfect, so the source language will sometimes be misidentified. But, as demonstrated above, the translator is often able to produce a correct translation nevertheless.
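One way to soften the impact of misidentification (a sketch only: `choose_src_lang`, the 0.5 threshold, and the `eng` default are my own choices, not part of either library) is to fall back to a fixed source language whenever the top LID score is low:

```python
def choose_src_lang(labels, scores, threshold=0.5, default="eng"):
    """Pick the top LID hypothesis, falling back to a default when confidence is low.

    labels/scores are the pair returned by lid_model.predict; the threshold
    and default are arbitrary illustrative choices.
    """
    top_label, top_score = labels[0], scores[0]
    if top_label.startswith("__label__"):          # strip the fasttext label prefix
        top_label = top_label[len("__label__"):]
    if top_score < threshold:                      # low confidence: use the default
        return default
    return top_label.split("_")[0]                 # crude conversion; use convert_lang in practice

print(choose_src_lang(("__label__fra_Latn",), [0.90]))  # fra
print(choose_src_lang(("__label__nld_Latn",), [0.30]))  # eng (low confidence)
```

In practice you would pass the chosen code through convert_lang before feeding it to translator.predict.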