flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Arabic Language #984

Closed MHumza3656 closed 2 years ago

MHumza3656 commented 2 years ago

Do you have any plan to train an Arabic model in the near future? These models are really good; adding more languages should be on the to-do list :).

Also, have you already documented this somewhere? If yes, point me to the link. My question is: do you have training information and criteria for the current models?

tlikhomanenko commented 2 years ago

The current models are https://github.com/flashlight/wav2letter/tree/master/recipes/rasr and are all trained on letter tokens with CTC (details are in the paper itself).
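
(To make the letter-token setup concrete, here is a minimal Python sketch of how a lexicon in this style could be built, spelling each word out as letter tokens plus a word-boundary token `|`. This is illustrative only; the exact token set and file layout of the rasr recipe may differ.)

```python
# Illustrative sketch: build letter-token lexicon entries in the wav2letter style,
# mapping each word to its spelling as letter tokens plus a word-boundary token "|".
# The exact token set / file layout used by the rasr recipe may differ.
def make_lexicon(words, boundary="|"):
    lines = []
    for word in sorted(set(words)):
        spelling = " ".join(list(word) + [boundary])
        lines.append(f"{word}\t{spelling}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(make_lexicon(["hello", "world"]))
    # hello	h e l l o |
    # world	w o r l d |
```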

What do you mean by the Arabic language - a multilingual model or Arabic itself?

MHumza3656 commented 2 years ago

Oh, so the transformer model (or any of the other models) is basically a multilingual model, and provided the right token & lexicon files it can produce text for those (mentioned) languages as well?

tlikhomanenko commented 2 years ago

Our transformer models are trained on English data only, so it is only English.

tlikhomanenko commented 2 years ago

We have released models and data for 8 languages (all models are single-language): https://github.com/flashlight/wav2letter/tree/master/recipes/mls. But Arabic is not there.

MHumza3656 commented 2 years ago

Alright, got it. Are these trained models as good as the transformer models? Do you have any stats/evaluation metrics for these models?

Yes, this is what I'm asking: are you planning to add Arabic to this collection of 8 models?

tlikhomanenko commented 2 years ago

They are also transformers, but rasr is better for English since it is trained on more varied data. Overall they should not be bad, especially if you fine-tune on your particular data, even if that data is small. For Arabic - no plan for now.

MHumza3656 commented 2 years ago

Right, thank you. Another question: does wav2letter provide some sort of speaker identification? By speaker identification I mean: say there are 2 people in an audio recording, then my transcript would be split into something like a transcript for speaker 1 and a transcript for speaker 2.

If not, will I have to fine-tune some model for it or re-train from scratch?

tlikhomanenko commented 2 years ago

We did not train specifically for speaker identification. It depends on the configuration you want to use for speaker ID - if the architecture itself is similar to ours (not some variation with two heads, for example), you can simply use our models as initialization and then fine-tune. You can even add the necessary layers at the end and then do fine-tuning.

MHumza3656 commented 2 years ago

Sorry for my lack of knowledge in this domain, I didn't understand much. Could you mention the components involved? I will look into this a bit more and ask a more relevant question. PS: I'm specifically referring to the W2L (speech-to-text) models.

Thank you so much @tlikhomanenko. Really appreciate your responses (+1000)

tlikhomanenko commented 2 years ago

The models we used in mls are similar to the architecture defined here: https://github.com/flashlight/wav2letter/blob/master/recipes/slimIPL/100h_supervised.cpp. So assume you have this structure and all layers are pretrained for ASR. You can use part of these layers and define other layers on top to do speaker identification.
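
(To make the "reuse pretrained layers, add new layers on top" idea concrete, here is a minimal PyTorch sketch, not flashlight code; `pretrained_encoder`, `hidden_dim`, and the number of speakers are placeholders.)

```python
import torch
import torch.nn as nn

class SpeakerIdHead(nn.Module):
    """Illustrative sketch: reuse a pretrained ASR encoder and add a small
    classification head for speaker identification. `encoder` is assumed to
    map (batch, time, features) -> (batch, time, hidden)."""

    def __init__(self, encoder, hidden_dim, num_speakers, freeze_encoder=True):
        super().__init__()
        self.encoder = encoder
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_speakers),
        )

    def forward(self, features):
        hidden = self.encoder(features)   # (batch, time, hidden)
        pooled = hidden.mean(dim=1)       # average over time
        return self.head(pooled)          # (batch, num_speakers) logits

# Usage sketch: fine-tune with cross-entropy on (utterance, speaker-label) pairs.
# model = SpeakerIdHead(pretrained_encoder, hidden_dim=768, num_speakers=2)
# loss = nn.CrossEntropyLoss()(model(batch_features), batch_speaker_ids)
```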

MHumza3656 commented 2 years ago

I was thinking about extracting different frequencies from the audio - I don't know much, it's just an assumption for now - maybe through a Fourier transform. Once I have the different frequencies (ideally one set per speaker), I do speech-to-text. Any thoughts?

tlikhomanenko commented 2 years ago

These models are trained on MFSC features (so an FFT is already included there); we used 80 filterbanks. Do you want to compute some features over the whole audio?
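
(For reference, 80-filterbank log mel (MFSC-style) features can be computed as in the sketch below; the exact window, hop, and normalization used for the released models are assumptions here.)

```python
import torch
import torchaudio

def log_mel_features(wav_path, n_mels=80):
    """Compute log mel filterbank (MFSC-style) features from an audio file.
    Sketch only: the exact windowing and normalization used to train the
    released models may differ."""
    waveform, sample_rate = torchaudio.load(wav_path)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=400,       # ~25 ms window at 16 kHz (assumed)
        hop_length=160,  # ~10 ms hop at 16 kHz (assumed)
        n_mels=n_mels,
    )(waveform)
    return torch.log(mel + 1e-6)  # (channels, n_mels, time)

# features = log_mel_features("utterance.wav")
```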

MHumza3656 commented 2 years ago

Yes, but not features exactly. Say there are 2 speakers within the audio. Either the ASR model differentiates between the transcripts of those 2 speakers, or we somehow pre-process and separate the audio for both speakers, feed each part to the model separately, and then patch up the transcripts. Do you get the idea?

tlikhomanenko commented 2 years ago

I would say you need a speaker identification system here. Either it is another head in the model which gives an additional output alongside the transcript indicating where and which the speaker is, or you train another network which does speaker identification (per frame). After that you need to split the ASR predictions by speaker and send the emissions to the beam-search decoder. You would need to code this simple logic in Decode.cpp (we didn't do that).
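
(A rough sketch of the "split emissions by speaker, then decode each chunk" logic, in Python rather than Decode.cpp; `decode_segment` is a placeholder for a call to the beam-search decoder.)

```python
def split_emissions_by_speaker(emissions, frame_speakers):
    """Group frame-level ASR emissions into contiguous single-speaker segments.

    emissions:      (time, num_tokens) array of per-frame token scores
    frame_speakers: (time,) sequence of per-frame speaker ids
    returns:        list of (speaker_id, emissions_chunk) in temporal order
    """
    segments = []
    start = 0
    for t in range(1, len(frame_speakers) + 1):
        if t == len(frame_speakers) or frame_speakers[t] != frame_speakers[start]:
            segments.append((int(frame_speakers[start]), emissions[start:t]))
            start = t
    return segments

# Tiny example: speakers [0, 0, 1, 1, 1, 0] yield three segments:
# (0, frames 0-1), (1, frames 2-4), (0, frame 5).
# Usage sketch: decode each chunk separately and label the transcript.
# for speaker, chunk in split_emissions_by_speaker(emissions, frame_speakers):
#     print(f"speaker {speaker}: {decode_segment(chunk)}")  # decode_segment = beam-search decoder (placeholder)
```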

MHumza3656 commented 2 years ago

Alright, this makes sense. Thank you for your help @tlikhomanenko. Appreciate it :) First I will utilize punctuation and then try solving this.