falabrasil / kaldi-br

☕🇧🇷 Scripts para o Kaldi em Português Brasileiro
MIT License

Question about publicly available VOSK pt-BR trained models #12

Open lfcnassif opened 2 years ago

lfcnassif commented 2 years ago

Hello. First thank you for this great project!

I would like to confirm whether you trained the vosk-model-pt-fb-v0.1.1-20220516_2113 model recently published on the VOSK site: https://alphacephei.com/vosk/models. I would also like to ask whether you trained the small model, vosk-model-small-pt-0.3, which is available there as well.

We recently ran some very informal, subjective tests with real-world audio (protected by privacy laws, unfortunately), manually listening to the recordings and comparing them against the transcriptions produced by both models through the vosk-0.3.32 Java library. It seems to us that the large model may be giving worse results than the small model for some audio, returning more words that do not exist in the speech. Perhaps the large model has a larger bias than the small one toward its training data set and generalizes worse to new audio, but that is just a hypothesis...
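One way to make such a comparison less subjective would be to hand-transcribe a small reference set and score each model with word error rate (WER). A minimal sketch in plain Python, with no external libraries (this is illustrative only and is not part of either project):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Running each model over the same audio set and averaging `wer(reference, hypothesis)` per file would give a single number per model, so "worse for some audios" becomes quantifiable.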

If you also trained the older small model, was it trained on the same data set as the large model? If not, I guess the newer large model used a larger data set? If so, are you planning to train a new small model on the same data set used for the large model?

I'm asking just to avoid duplicating effort: if the training data sets were different and the large model used a larger one, maybe I'll try to train a new small model on that latest data set (it is publicly available, right?).

Thank you very much for your attention!

lfcnassif commented 2 years ago

Just to add: we recently integrated the vosk Java library and the en/pt-BR small models into our open-source project, available to everyone: https://github.com/sepinf-inc/IPED/issues/248

cassiotbatista commented 2 years ago

Hi,

We did train the pt-fb-v0.1.1 model, but not the small-pt one. The datasets used to train the former come from this repo: https://github.com/falabrasil/speech-datasets. As for the latter, that info is probably unknown, but I'd guess it was indeed trained on less data.

You can probably turn the large model into a small one by removing the rescoring dir. You may also drop the HCLG.fst file, but only if Gr.fst and HCLr.fst exist under the graph dir; otherwise it definitely won't work (in that case, remove only the rescoring dir and keep the HCLG file).
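The shrinking step above can be sketched as a small script. The directory names (`graph/`, `rescore/`) follow the usual Vosk model layout, but treat them as assumptions and check your model's actual structure before deleting anything:

```python
import shutil
from pathlib import Path

def shrink_vosk_model(model_dir: str) -> None:
    """Sketch: strip rescoring data from a large Vosk model, per the
    advice above. Assumes a graph/ dir with the FSTs and a rescore/
    dir holding the rescoring LM; names are guesses, not verified."""
    root = Path(model_dir)
    graph = root / "graph"
    # HCLG.fst may only be dropped if the split graph files exist
    has_split = (graph / "Gr.fst").exists() and (graph / "HCLr.fst").exists()
    rescore = root / "rescore"
    if rescore.is_dir():
        shutil.rmtree(rescore)  # the rescoring LM is usually the largest part
    if has_split:
        (graph / "HCLG.fst").unlink(missing_ok=True)
    # if Gr.fst / HCLr.fst are absent, HCLG.fst must stay in place
```

Work on a copy of the model directory, since the deletions are irreversible.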

We didn't run any bias experiments and have no plans to train another model in the short term, but you're welcome to try using the scripts in this very repo. Gibberish words in the output are probably due to poor filtering in lexicon.txt; IIRC the small model has ~100k words while ours has ~250k.
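The kind of lexicon filtering mentioned above could look like the following sketch. The regex and the two-field line format (`<word> <phones...>`) are assumptions about what a cleaned-up Portuguese lexicon.txt might accept, not the repo's actual filtering rules:

```python
import re

# Hypothetical filter: keep only headwords made of Portuguese letters
# (optionally hyphenated), dropping noise tokens that can later
# surface as gibberish in decoding output.
PT_WORD = re.compile(r"^[a-záàâãéêíóôõúüç]+(-[a-záàâãéêíóôõúüç]+)*$")

def filter_lexicon(lines):
    """Yield only well-formed '<word> <phones...>' lexicon lines."""
    for line in lines:
        parts = line.strip().split()
        # require a headword plus at least one phone, and a clean headword
        if len(parts) >= 2 and PT_WORD.match(parts[0]):
            yield line
```

A pass like this over lexicon.txt before graph compilation would keep malformed entries out of the decoder's vocabulary entirely.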

And thanks for the feedback!

lfcnassif commented 2 years ago

Thank you very much, @cassiotbatista, for the fast response and for all the technical tips!

lfcnassif commented 2 years ago

Just a quick warning: the large model I mentioned, available on the VOSK models site, doesn't work with vosk-java. Take a look here: https://github.com/sepinf-inc/IPED/issues/248#issuecomment-1176860230