asusdisciple opened this issue 10 months ago
For text inputs, Seamless indeed needs the source language code. This language code can be predicted with the NLLB text LID model (https://github.com/facebookresearch/fairseq/tree/nllb?#lid-model), which has language codes mostly consistent with M4T (one of the differences is the Chinese Mandarin language code: it is `zho` in NLLB but `cmn` in M4T).
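For reference, a minimal sketch of that text LID step with the fastText Python bindings (the checkpoint filename below is an assumption; the NLLB README linked above has the actual download link):

```python
import fasttext

# Load the NLLB LID checkpoint (filename assumed; see the fairseq NLLB README).
lid_model = fasttext.load_model("lid218e.bin")

text = "Wie viel kostet das Zimmer pro Nacht?"
labels, scores = lid_model.predict(text, k=1)

# fastText labels look like "__label__deu_Latn"; strip the prefix to get the code.
lang_code = labels[0].removeprefix("__label__")
print(lang_code, scores[0])  # e.g. "deu_Latn"

# M4T codes mostly drop the script suffix (deu_Latn -> deu), and remember the
# discrepancy above: NLLB's zho_* codes correspond to M4T's cmn / cmn_Hant.
```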
For speech inputs, Seamless doesn't need the source language code; instead, the Seamless speech encoder figures out the input language on its own (and it can work pretty well with crazy code-switching, e.g. where the language changes between English and Spanish mid-sentence). You need to provide only the target language code.
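A minimal sketch of that flow with the `Translator` class from `seamless_communication` (the card names and `predict` signature follow the project README, but treat them as assumptions for your installed version):

```python
import torch
from seamless_communication.inference import Translator

# M4T v2 model card and vocoder names as given in the seamless_communication README.
translator = Translator(
    "seamlessM4T_v2_large",
    "vocoder_v2",
    torch.device("cuda:0"),
    torch.float16,
)

# Speech-to-text translation: only tgt_lang is passed; the speech encoder
# infers the source language (and handles code-switching) on its own.
text_output, _ = translator.predict(
    input="input.wav",  # 16 kHz mono audio (assumed filename)
    task_str="S2TT",
    tgt_lang="eng",
)
print(text_output[0])
```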
If, however, for some application you still need a speech language identification model, we highly recommend LID models from our sibling project MMS (https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md#lid). They support up to 4000 spoken languages.
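If you prefer to stay in Python without the fairseq recipe, the MMS LID checkpoints are also published on the Hugging Face Hub; here is a hedged sketch with `transformers` (`facebook/mms-lid-126` is one of several checkpoint sizes, up to 4017 languages):

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# One of several MMS LID checkpoints (126/256/.../4017 languages).
model_id = "facebook/mms-lid-126"
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

wav, sr = torchaudio.load("input.wav")
wav = wav.mean(dim=0)  # downmix to mono
if sr != 16_000:       # MMS models expect 16 kHz audio
    wav = torchaudio.functional.resample(wav, sr, 16_000)

inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

lang_id = int(logits.argmax(-1))
print(model.config.id2label[lang_id])  # ISO 639-3 code, e.g. "eng"
```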
Oh I see, thanks for the clarification. However, is there any way to set the target language to the source language for ASR? For example, when I do not want a translation but just a pure transcription of an audio file, there is currently no way to tell the model to just use the source language.
Also, if I may add: unfortunately, none of the MMS versions includes all of the languages in m4t-v2. These languages are not supported by MMS, which makes it kinda hard to use it for transcription only:
`arb ary arz azj cmn_Hant fuv gaz khk lvs pbt pes uzn`
@asusdisciple thank you for the analysis! Based on the Glottolog database and a little bit of common sense, the M4T languages that are "missing" from the MMS language list can be mapped to MMS languages as follows:
| M4T code | MMS code | Comment |
|---|---|---|
| arb | ara | Modern Standard Arabic => Arabic macro-language |
| ary | ara | Moroccan Arabic => Arabic macro-language |
| arz | ara | Egyptian Arabic => Arabic macro-language |
| azj | aze | North Azerbaijani => Azerbaijani macro-language |
| cmn_Hant | cmn | cmn_Hant and cmn_Hans are both Mandarin, just with two different writing systems (Traditional or Simplified Chinese characters). In speech, there is supposed to be no difference. |
| fuv | ful | Nigerian Fulfulde => Fulah macro-language, with Nigerian Fulfulde as a popular dialect |
| gaz | orm | West Central Oromo => Oromo macro-language |
| khk | mon | Halh Mongolian => Mongolian macro-language |
| lvs | lav | Standard Latvian => Latvian macro-language |
| pbt | pus | Southern Pashto => Pashto (Pushto) macro-language |
| pes | fas | Western Persian => Persian macro-language |
| uzn | uzb | Northern Uzbek => Uzbek macro-language |
Thus, all the confusion in the language codes comes from the fact that in M4T, a language (macro-language) is often represented by a particular dialect (individual language), so its definition in M4T tends to be slightly narrower. But in most cases, M4T has only one dialect per language (for example, Azerbaijani in M4T is represented exclusively by North Azerbaijani), so the mapping between the narrower and broader languages is still 1:1.
So in most cases (the only exception being the Arabic languoids), MMS LID can unambiguously distinguish between the M4T languages. And in the case of Arabic, a safe solution would probably be to treat it as MSA, because speakers of all Arabic dialects are supposed to study it.
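For convenience, here is the table above expressed as a lookup in the MMS -> M4T direction; this is just a sketch, with the Arabic entry hard-coding the MSA fallback suggested above:

```python
# Sketch: map MMS LID output codes back to the codes M4T expects.
MMS_TO_M4T = {
    "ara": "arb",  # any Arabic languoid -> Modern Standard Arabic (safe fallback)
    "aze": "azj",  # Azerbaijani -> North Azerbaijani
    "ful": "fuv",  # Fulah -> Nigerian Fulfulde
    "orm": "gaz",  # Oromo -> West Central Oromo
    "mon": "khk",  # Mongolian -> Halh Mongolian
    "lav": "lvs",  # Latvian -> Standard Latvian
    "pus": "pbt",  # Pashto -> Southern Pashto
    "fas": "pes",  # Persian -> Western Persian
    "uzb": "uzn",  # Uzbek -> Northern Uzbek
    # MMS "cmn" stays "cmn" in M4T; pick "cmn_Hant" yourself if you want
    # Traditional-script output, since speech alone cannot decide the script.
}

def to_m4t_code(mms_code: str) -> str:
    """Return the M4T language code for an MMS LID prediction (identity otherwise)."""
    return MMS_TO_M4T.get(mms_code, mms_code)
```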
> However, is there any way to set the target language to the source language for ASR? For example, when I do not want a translation but just a pure transcription of an audio file, there is currently no way to tell the model to just use the source language.
Currently, it is not possible. So if you do not know the source language, the only way to transcribe is to detect this language with some other model, and then pass the detected language code to M4T.
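A sketch of that two-step workaround: detect the source language with an external LID model, then pass it as `tgt_lang` so that M4T's ASR task transcribes literally instead of translating. `detect_language_code()` is a hypothetical stand-in for whichever LID model you use (e.g. MMS LID plus the mapping table above):

```python
import torch
from seamless_communication.inference import Translator

def detect_language_code(audio_path: str) -> str:
    """Hypothetical helper: run an external LID model, return an M4T code like 'deu'."""
    raise NotImplementedError

translator = Translator("seamlessM4T_v2_large", "vocoder_v2", torch.device("cuda:0"))

audio_path = "input.wav"
lang = detect_language_code(audio_path)
# Transcription = the ASR task with tgt_lang set to the detected source language.
transcript, _ = translator.predict(input=audio_path, task_str="ASR", tgt_lang=lang)
print(transcript[0])
```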
In the future, we might release a version of M4T fine-tuned specifically for the ASR task, so that it always transcribes the input literally, instead of translating it. Or, alternatively, you or someone else from the community could fine-tune such a model.
So since you have to define the source language when you use m4t-v2, I wonder how you guys handle the problem of language identification? At the moment I use Whisper to detect the language and pass the input, along with the language string, to m4t-v2. Are there any good models out there that I am not aware of? I know there is fastText for text, but I did not find anything for audio or mel spectrograms yet.
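For completeness, the Whisper detection step described above typically looks like this with the `openai-whisper` package (note that Whisper returns two-letter ISO 639-1 codes such as `de`, which you still have to map to M4T's three-letter codes such as `deu` yourself):

```python
import whisper

model = whisper.load_model("base")

# Whisper detects the language on a 30-second log-Mel window.
audio = whisper.load_audio("input.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))  # e.g. "de"
```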