alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Document procedure to train models #1406

Closed. severin-lemaignan closed this 1 year ago

severin-lemaignan commented 1 year ago

Hi,

I would like to train a new language model (for Norwegian). According to the website, I can follow the steps at https://github.com/alphacep/vosk-api/tree/master/training but the documentation there is 'TBD'. Any chance you could provide some more details, starting for instance from a Mozilla Common Voice dataset?

Thanks!

nshmyrev commented 1 year ago

It would be better to ask about specific details if something is not clear to you.

Common Voice has only a tiny amount of Norwegian data, so you'd be better off using https://www.nb.no/sprakbanken/

You can use the existing data preparation scripts from the Kaldi repo: https://github.com/kaldi-asr/kaldi/blob/master/egs/sprakbanken/s5/run.sh
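For anyone who still wants to start from Common Voice, Kaldi data preparation produces a data directory with `wav.scp`, `text` and `utt2spk` files. Below is a minimal sketch of that step, not the sprakbanken recipe itself. It assumes the Common Voice `validated.tsv` layout (`client_id`, `path`, `sentence` columns), that `ffmpeg` is installed to decode the MP3 clips, and that lowercasing is acceptable text normalization; the function name `commonvoice_to_kaldi` is made up for illustration.

```python
import csv
import io


def commonvoice_to_kaldi(tsv_text, clips_dir="clips"):
    """Build Kaldi wav.scp / text / utt2spk lines from Common Voice TSV text.

    Hypothetical helper: assumes the columns client_id, path and sentence.
    """
    wav_scp, text, utt2spk = [], [], []
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # Shorten the long client hash; Kaldi expects the utterance id
        # to start with the speaker id so both sort consistently.
        speaker = row["client_id"][:16]
        clip = row["path"]
        stem = clip.rsplit(".", 1)[0]
        utt = f"{speaker}-{stem}"
        # Very rough normalization (assumption): lowercase, collapse spaces.
        transcript = " ".join(row["sentence"].lower().split())
        # Piped wav.scp entry: decode to 16 kHz mono WAV on the fly.
        wav_scp.append(
            f"{utt} ffmpeg -nostdin -loglevel error -i {clips_dir}/{clip} "
            "-ar 16000 -ac 1 -f wav - |"
        )
        text.append(f"{utt} {transcript}")
        utt2spk.append(f"{utt} {speaker}")
    # Kaldi requires these files to be sorted (C locale order).
    return sorted(wav_scp), sorted(text), sorted(utt2spk)


# Tiny made-up sample, only to show the shape of the output.
sample = "client_id\tpath\tsentence\nabc123\tclip_1.mp3\tHei  Verden\n"
for lines in commonvoice_to_kaldi(sample):
    print("\n".join(lines))
```

Each returned list can be written to the file of the same name in a directory such as `data/train`. Real transcripts usually need language-specific cleanup (punctuation, numbers) before training.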

severin-lemaignan commented 1 year ago

Thanks a lot for the links. I'll have a look and come back to you with more specific questions if needed.