asusdisciple opened this issue 10 months ago
For text inputs, Seamless indeed needs the source language code. This language code can be predicted with the NLLB text LID model (https://github.com/facebookresearch/fairseq/tree/nllb?#lid-model), which has language codes mostly consistent with M4T (one of the differences is the Chinese Mandarin language code: it is `zho` in NLLB but `cmn` in M4T).
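For reference, a minimal sketch of that text LID step with the fastText Python bindings (the checkpoint filename below is an assumption; the NLLB README linked above has the actual download link):

```python
import fasttext

# Load the NLLB LID checkpoint (filename assumed; see the fairseq NLLB README).
lid_model = fasttext.load_model("lid218e.bin")

text = "Wie viel kostet das Zimmer pro Nacht?"
labels, scores = lid_model.predict(text, k=1)

# fastText labels look like "__label__deu_Latn"; strip the prefix to get the code.
lang_code = labels[0].removeprefix("__label__")
print(lang_code, scores[0])  # e.g. "deu_Latn"

# M4T codes mostly drop the script suffix (deu_Latn -> deu), and remember the
# discrepancy above: NLLB's zho_* codes correspond to M4T's cmn / cmn_Hant.
```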
For speech inputs, Seamless doesn't need the source language code; instead, the Seamless speech encoder figures out the input language on its own (and it can work pretty well with crazy code-switching, e.g. where the language changes between English and Spanish mid-sentence). You need to provide only the target language code.
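A minimal sketch of that flow with the `Translator` class from `seamless_communication` (the card names and `predict` signature follow the project README, but treat them as assumptions for your installed version):

```python
import torch
from seamless_communication.inference import Translator

# M4T v2 model card and vocoder names as given in the seamless_communication README.
translator = Translator(
    "seamlessM4T_v2_large",
    "vocoder_v2",
    torch.device("cuda:0"),
    torch.float16,
)

# Speech-to-text translation: only tgt_lang is passed; the speech encoder
# infers the source language (and handles code-switching) on its own.
text_output, _ = translator.predict(
    input="input.wav",  # 16 kHz mono audio (assumed filename)
    task_str="S2TT",
    tgt_lang="eng",
)
print(text_output[0])
```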
If, however, for some application you still need a speech language identification model, we highly recommend LID models from our sibling project MMS (https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md#lid). They support up to 4000 spoken languages.
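If you prefer to stay in Python without the fairseq recipe, the MMS LID checkpoints are also published on the Hugging Face Hub; here is a hedged sketch with `transformers` (`facebook/mms-lid-126` is one of several checkpoint sizes, up to 4017 languages):

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# One of several MMS LID checkpoints (126/256/.../4017 languages).
model_id = "facebook/mms-lid-126"
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

wav, sr = torchaudio.load("input.wav")
wav = wav.mean(dim=0)  # downmix to mono
if sr != 16_000:       # MMS models expect 16 kHz audio
    wav = torchaudio.functional.resample(wav, sr, 16_000)

inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

lang_id = int(logits.argmax(-1))
print(model.config.id2label[lang_id])  # ISO 639-3 code, e.g. "eng"
```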
Oh I see, thanks for the clarification. However, is there any way to set the target language to the source language for ASR? For example, when I do not want a translation but just a pure transcription of an audio file, there is currently no way to tell the model to just use the source language.
Also, if I may add: unfortunately, none of the MMS versions includes all of the languages in m4t-v2. These languages are not supported by MMS, which makes it kinda hard to use it for transcription only:
`arb ary arz azj cmn_Hant fuv gaz khk lvs pbt pes uzn`
@asusdisciple thank you for the analysis! Based on the Glottolog database and a little bit of common sense, the M4T languages that are "missing" from the MMS language list can be mapped to MMS languages as follows:
| M4T code | MMS code | Comment |
|---|---|---|
| arb | ara | Modern Standard Arabic => Arabic macro-language |
| ary | ara | Moroccan Arabic => Arabic macro-language |
| arz | ara | Egyptian Arabic => Arabic macro-language |
| azj | aze | North Azerbaijani => Azerbaijani macro-language |
| cmn_Hant | cmn | cmn_Hant and cmn_Hans are both Mandarin, just with two different writing systems (Traditional or Simplified Chinese characters). In speech, there is supposed to be no difference. |
| fuv | ful | Nigerian Fulfulde => Fulah macro-language, with Nigerian Fulfulde as a popular dialect |
| gaz | orm | West Central Oromo => Oromo macro-language |
| khk | mon | Halh Mongolian => Mongolian macro-language |
| lvs | lav | Standard Latvian => Latvian macro-language |
| pbt | pus | Southern Pashto => Pashto (Pushto) macro-language |
| pes | fas | Western Persian => Persian macro-language |
| uzn | uzb | Northern Uzbek => Uzbek macro-language |
Thus, all the confusion in the language codes comes from the fact that in M4T, a language (macro-language) is often represented by a particular dialect (individual language), so its definition in M4T tends to be slightly narrower. But in most cases, M4T has only one dialect per language (for example, Azerbaijani in M4T is represented exclusively by North Azerbaijani), so the mapping between the narrower and broader languages is still 1:1.
So in most cases (the only exception being the Arabic languoids), MMS LID can unambiguously distinguish between the M4T languages. And in the case of Arabic, a safe solution would probably be to treat it as MSA, because speakers of all Arabic dialects are supposed to study it.
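For convenience, here is the table above expressed as a lookup in the MMS -> M4T direction; this is just a sketch, with the Arabic entry hard-coding the MSA fallback suggested above:

```python
# Sketch: map MMS LID output codes back to the codes M4T expects.
MMS_TO_M4T = {
    "ara": "arb",  # any Arabic languoid -> Modern Standard Arabic (safe fallback)
    "aze": "azj",  # Azerbaijani -> North Azerbaijani
    "ful": "fuv",  # Fulah -> Nigerian Fulfulde
    "orm": "gaz",  # Oromo -> West Central Oromo
    "mon": "khk",  # Mongolian -> Halh Mongolian
    "lav": "lvs",  # Latvian -> Standard Latvian
    "pus": "pbt",  # Pashto -> Southern Pashto
    "fas": "pes",  # Persian -> Western Persian
    "uzb": "uzn",  # Uzbek -> Northern Uzbek
    # MMS "cmn" stays "cmn" in M4T; pick "cmn_Hant" yourself if you want
    # Traditional-script output, since speech alone cannot decide the script.
}

def to_m4t_code(mms_code: str) -> str:
    """Return the M4T language code for an MMS LID prediction (identity otherwise)."""
    return MMS_TO_M4T.get(mms_code, mms_code)
```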
> However, is there any way to set the target language to the source language for ASR? For example, when I do not want a translation but just a pure transcription of an audio file, there is currently no way to tell the model to just use the source language.
Currently, it is not possible. So if you do not know the source language, the only way to transcribe is to detect this language with some other model, and then pass the detected language code to M4T.
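A sketch of that two-step workaround: detect the source language with an external LID model, then pass it as `tgt_lang` so that M4T's ASR task transcribes literally instead of translating. `detect_language_code()` is a hypothetical stand-in for whichever LID model you use (e.g. MMS LID plus the mapping table above):

```python
import torch
from seamless_communication.inference import Translator

def detect_language_code(audio_path: str) -> str:
    """Hypothetical helper: run an external LID model, return an M4T code like 'deu'."""
    raise NotImplementedError

translator = Translator("seamlessM4T_v2_large", "vocoder_v2", torch.device("cuda:0"))

audio_path = "input.wav"
lang = detect_language_code(audio_path)
# Transcription = the ASR task with tgt_lang set to the detected source language.
transcript, _ = translator.predict(input=audio_path, task_str="ASR", tgt_lang=lang)
print(transcript[0])
```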
In the future, we might release a version of M4T fine-tuned specifically for the ASR task, so that it always transcribes the input literally, instead of translating it. Or, alternatively, you or someone else from the community could fine-tune such a model.
So since you have to define the source language when you use m4t-v2, I wonder how you guys handle the problem of language identification? At the moment I use Whisper to detect the language and pass the input, along with the language string, to m4t-v2. Are there any good models out there that I am not aware of? I know there is fastText for text, but I did not find anything for audio or mel spectrograms yet.
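For completeness, the Whisper detection step described above typically looks like this with the `openai-whisper` package (note that Whisper returns two-letter ISO 639-1 codes such as `de`, which you still have to map to M4T's three-letter codes such as `deu` yourself):

```python
import whisper

model = whisper.load_model("base")

# Whisper detects the language on a 30-second log-Mel window.
audio = whisper.load_audio("input.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))  # e.g. "de"
```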