alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Adding Malayalam language support to Vosk #701

Open kavyamanohar opened 2 years ago

kavyamanohar commented 2 years ago

I have built a Vosk-compatible model for the Malayalam language. The source code, the model in zip format, links to the training and test data, and the test WER are provided in this repository.

How can I help get this model listed on the Vosk website?

nshmyrev commented 2 years ago

Thank you Kavya, looks great! I'll try to add this model ASAP.

We also need to integrate Malayalam data from https://github.com/Open-Speech-EkStep/ULCA-asr-dataset-corpus eventually.

kavyamanohar commented 2 years ago

Thanks @nshmyrev.

The Malayalam dataset in https://github.com/Open-Speech-EkStep/ULCA-asr-dataset-corpus is currently 'unlabelled'. I think we cannot use it unless transcripts are available.

nshmyrev commented 2 years ago

I reviewed this, and it looks like we need to work more on the model. Otherwise the error rates are too high.

kavyamanohar commented 2 years ago

Thanks for your time and effort in reviewing it, @nshmyrev. The WERs are higher on the test datasets where the OOV rates are quite high:

- Test set 1: 8% WER (1% OOV)
- Test set 2: 31% WER (8% OOV)
- Test set 3: 85% WER (36% OOV)
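For anyone comparing these figures, here is a minimal sketch of how word-level WER and OOV rate are commonly computed. The function names and toy sentences are illustrative assumptions and are not taken from the model repository; the actual evaluation may tokenize or normalize text differently.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return errors / total


def oov_rate(references, vocabulary):
    """Fraction of reference tokens missing from the recognizer vocabulary."""
    tokens = [w for r in references for w in r.split()]
    return sum(w not in vocabulary for w in tokens) / len(tokens)


if __name__ == "__main__":
    # Toy data, purely hypothetical.
    refs = ["the cat sat on the mat"]
    hyps = ["the cat sat on mat"]
    vocab = {"the", "cat", "sat", "on"}
    print(f"WER: {wer(refs, hyps):.2%}, OOV: {oov_rate(refs, vocab):.2%}")
```

Since a word-level recognizer cannot output a word that is absent from its vocabulary, every OOV token in the reference contributes at least one error, which is consistent with WER rising alongside the OOV rate across the three test sets.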

Considering the agglutinative nature of the Malayalam language, do you have any suggestions for improving WER by working on the language modeling side? How are good WERs achieved in languages like German, which form morphologically complex words? Thanks in advance for any pointers.