Open kavyamanohar opened 3 years ago
Thank you Kavya, looks great! I'll try to add this model ASAP.
We also need to integrate Malayalam data from https://github.com/Open-Speech-EkStep/ULCA-asr-dataset-corpus eventually.
Thanks @nshmyrev.
The Malayalam dataset in https://github.com/Open-Speech-EkStep/ULCA-asr-dataset-corpus is currently 'unlabelled'. I think we cannot use it unless transcripts are available.
I reviewed this, and it looks like we need to work more on the model. As it stands, the error rates are too high.
Thanks for your time and effort in reviewing it, @nshmyrev. The WERs are higher on the test datasets where the OOV rates are high. Test set 1: 8% WER (1% OOV); Test set 2: 31% WER (8% OOV); Test set 3: 85% WER (36% OOV).
Considering the agglutinative nature of the Malayalam language, do you have any suggestions for improving the WER by working on the language modeling side? How are good WERs achieved in languages like German, which forms morphologically complex words? Thanks in advance for any pointers.
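For anyone who wants to reproduce the kind of WER and OOV figures quoted above on their own test sets, here is a minimal sketch of the standard definitions. It is not the script used in this repository; the function names `word_error_rate` and `oov_rate` are hypothetical, and tokenization is assumed to be plain whitespace splitting, which may differ from how the numbers above were obtained.

```python
from typing import Iterable, Set


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper: assumes whitespace tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def oov_rate(test_sentences: Iterable[str], vocabulary: Set[str]) -> float:
    """Fraction of test tokens that do not appear in the given vocabulary.

    Hypothetical helper: counts running tokens, not unique word types.
    """
    tokens = [tok for sentence in test_sentences for tok in sentence.split()]
    if not tokens:
        raise ValueError("test set must contain at least one token")
    unknown = sum(1 for tok in tokens if tok not in vocabulary)
    return unknown / len(tokens)
```

Note that a corpus-level WER is usually computed as total edits over total reference words across all utterances, rather than as an average of per-utterance values. Since every OOV word in a word-level model is a guaranteed error, the OOV rate acts as a rough lower bound on the WER for that test set.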
I have built a Vosk-compatible model for the Malayalam language. The source code, the model in zip format, links to the training and test data, and the test WERs are provided in this repository.
How can I help to get this model listed on the Vosk website?