MHumza3656 closed this issue 2 years ago
Current models are at https://github.com/flashlight/wav2letter/tree/master/recipes/rasr and are all trained on letter tokens with CTC (details in the paper itself).
What do you mean for Arabic: a multilingual model, or an Arabic-only model?
Oh, so the transformer model (or any of the other models) is basically multilingual, and given the right token & lexicon files it can transcribe those mentioned languages as well?
Our transformer models are trained on English data only, so they support English only.
We have released models and data for 8 languages (all models are single-language): https://github.com/flashlight/wav2letter/tree/master/recipes/mls. However, Arabic is not among them.
Alright, got it. Are these trained models as good as the transformer models? Do you have any stats/evaluation metrics for these models?
Yes, this is what I'm asking: are you planning to add Arabic to this collection of 8 models?
They are also transformers, but rasr is better for English as it is trained on more varied data. Overall they should not be bad, especially if you finetune on your particular data, even if that data is small. For Arabic: no plan for now.
Right, thank you. Another question I have: does wav2letter provide some sort of speaker identification? By speaker identification I mean: if there are 2 people in an audio file, the transcript would be split into the text of speaker 1 and the text of speaker 2, something like that.
If not, will I have to finetune some model for it or retrain from scratch?
We did not train specifically for speaker identification. It depends on the configuration you want to use for speaker ID: if the architecture itself is similar to ours (not some variation with two heads, for example), you can simply use our models as initialization and then finetune. You can even add the necessary layers at the end and then do the finetuning.
Sorry for my lack of knowledge in this domain; I didn't understand much. Could you mention the components involved? I will look into this a bit more and come back with a more relevant question. PS: I'm specifically referring to the W2L (speech-to-text) models.
Thank you so much @tlikhomanenko. Really appreciate your responses (+1000)
The models we used in MLS are similar to the architecture defined here: https://github.com/flashlight/wav2letter/blob/master/recipes/slimIPL/100h_supervised.cpp. So assume you have this structure and all layers are pretrained for ASR. You can reuse part of these layers and define other layers on top to get speaker identification.
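To make the "reuse pretrained layers, add new layers on top" idea concrete, here is a minimal, language-agnostic sketch in plain Python. Everything here is a stand-in: `pretrained_encoder`, `mean_pool`, and `SpeakerHead` are hypothetical names, and the real trunk is the C++ architecture linked above, not this toy function.

```python
# Hypothetical sketch: reuse a (frozen) pretrained ASR encoder and put a
# new speaker-identification head on top. All names are made up; the real
# architecture lives in recipes/slimIPL/100h_supervised.cpp.
import random

random.seed(0)

def pretrained_encoder(frames):
    """Stand-in for the pretrained ASR layers: maps each input frame
    (a list of features) to a hidden vector. In practice this would be
    the transformer trunk loaded from a checkpoint."""
    return [[sum(f) / len(f)] * 4 for f in frames]  # toy 4-dim embedding

def mean_pool(hidden):
    """Average the frame embeddings into one utterance-level vector."""
    dim = len(hidden[0])
    return [sum(h[i] for h in hidden) / len(hidden) for i in range(dim)]

class SpeakerHead:
    """Newly added linear layer; only this part would be trained."""
    def __init__(self, in_dim, n_speakers):
        self.w = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(n_speakers)]

    def __call__(self, vec):
        return [sum(wi * vi for wi, vi in zip(row, vec)) for row in self.w]

# Usage: 10 frames of 80 filterbank features -> 2 speaker logits.
frames = [[random.random() for _ in range(80)] for _ in range(10)]
head = SpeakerHead(in_dim=4, n_speakers=2)
logits = head(mean_pool(pretrained_encoder(frames)))
speaker = max(range(len(logits)), key=lambda i: logits[i])
```

The design choice this illustrates: initialization from the ASR checkpoint plus a small trainable head, rather than training a speaker model from scratch.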
I was thinking about extracting different frequencies from the audio. I don't know much; it's just an assumption for now. Maybe through a Fourier transform: once I have the different frequencies (ideally one set per speaker), I do speech-to-text on each. Any thoughts?
These models are trained on MFSC features (so an FFT is already included there); we used 80 filterbanks. Do you want to have some features computed on the whole audio?
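For reference, here is a minimal sketch of how 80 mel-spaced filterbank center frequencies are typically laid out for MFSC-style features. This is just the standard HTK-style mel formula, not wav2letter's actual feature code, which additionally applies the FFT, triangular filters, and a log.

```python
# Sketch of 80 mel-filterbank center frequencies (HTK-style mel scale),
# assuming 16 kHz audio. The real MFSC pipeline also windows the signal,
# takes an FFT, applies triangular filters, and takes a log.
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filterbank_centers(n_filters=80, sample_rate=16000):
    """Center frequencies of n_filters triangular filters, spaced
    uniformly on the mel scale between 0 Hz and Nyquist."""
    lo, hi = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge points; the inner n_filters points are centers.
    mels = [lo + (hi - lo) * i / (n_filters + 1)
            for i in range(n_filters + 2)]
    return [mel_to_hz(m) for m in mels[1:-1]]

centers = filterbank_centers()
```

Note how the centers crowd together at low frequencies and spread out at high frequencies, which is why mel features are a common front end for speech models.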
Yes, but not features exactly. Let's say there are 2 speakers in the audio. Either the ASR model differentiates between the transcripts of those 2 speakers, or we somehow pre-process and separate the audio into one file per speaker and feed each to the model separately, then patch the transcripts back together. Do you get the idea?
I would say you need a speaker identification system here. Either it is another head in the model which gives an additional output alongside the transcript, indicating where and which speaker is talking; or you build another network which does speaker identification (per frame). After that you need to split the ASR predictions by speaker and send the emissions to the beam-search decoder. You would need to code this simple logic in Decode.cpp (we didn't do that).
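The "split the ASR predictions by speaker" step above can be sketched as follows. This is a minimal illustration of the splitting logic only, in plain Python for readability; the real implementation would live in Decode.cpp, and `split_by_speaker` is a hypothetical name.

```python
# Hedged sketch: given per-frame speaker labels and per-frame ASR
# emissions, group contiguous same-speaker frames into segments that
# could each be sent to the beam-search decoder separately.
from itertools import groupby

def split_by_speaker(speaker_per_frame, emissions):
    """Return (speaker, emission_slice) pairs, one per contiguous run
    of frames attributed to the same speaker."""
    assert len(speaker_per_frame) == len(emissions)
    segments, start = [], 0
    for speaker, run in groupby(speaker_per_frame):
        n = len(list(run))
        segments.append((speaker, emissions[start:start + n]))
        start += n
    return segments

# Usage: 6 frames; speaker 0 talks, then speaker 1, then speaker 0 again.
labels = [0, 0, 1, 1, 1, 0]
emissions = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
segments = split_by_speaker(labels, emissions)
# -> three segments; each would be decoded independently and the
#    per-speaker transcripts stitched back together in order.
```

Each segment's emissions would then go through the beam-search decoder on their own, and the resulting transcripts are tagged with the segment's speaker.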
Alright, this makes sense. Thank you for your help @tlikhomanenko, appreciate it :) First I will work on punctuation and then try solving this.
Do you have any plan to train an Arabic model in the near future? These models are really good; adding more languages should be a TODO :).
Also, have you already documented this somewhere? If yes, point me to the link. The question is: do you have the training setup and criteria of the current models?