Just to complement: we recently integrated the vosk-java library and the en/pt-BR small models into our open source project, available to everyone: https://github.com/sepinf-inc/IPED/issues/248
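For anyone curious, the core of that integration is roughly the standard vosk-java decoding loop. Here is a minimal sketch (the model dir and WAV file names are placeholders, and the audio is assumed to be 16 kHz mono PCM):

```java
import org.vosk.LibVosk;
import org.vosk.LogLevel;
import org.vosk.Model;
import org.vosk.Recognizer;

import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.BufferedInputStream;
import java.io.FileInputStream;

public class TranscribeDemo {
    public static void main(String[] args) throws Exception {
        LibVosk.setLogLevel(LogLevel.WARNINGS);

        // "model-small-pt" and "audio.wav" are placeholders; audio assumed 16 kHz mono PCM
        try (Model model = new Model("model-small-pt");
             AudioInputStream ais = AudioSystem.getAudioInputStream(
                     new BufferedInputStream(new FileInputStream("audio.wav")));
             Recognizer recognizer = new Recognizer(model, 16000f)) {

            byte[] buffer = new byte[4096];
            int nread;
            while ((nread = ais.read(buffer)) >= 0) {
                if (recognizer.acceptWaveForm(buffer, nread)) {
                    System.out.println(recognizer.getResult()); // finalized segment (JSON)
                }
            }
            System.out.println(recognizer.getFinalResult()); // flush the remaining audio
        }
    }
}
```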
Hi,
We did train the `pt-fb-v0.1.1` model, but didn't train the `small-pt` one. Datasets used in the training of the former come from this repo: https://github.com/falabrasil/speech-datasets. As for the latter, such info is probably unknown, but I'd guess it was trained on less data indeed.
You can probably turn the large model into a small one by removing the `rescoring` dir and the `HCLG.fst` file, but make sure that `Gr.fst` and `HCLr.fst` exist under the `graph` dir, otherwise it definitely won't work (if they don't, remove only the `rescoring` dir and keep the `HCLG.fst` file).
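In code, that pruning could look like the sketch below (assuming the usual unpacked VOSK model layout; the model dir name is just an example and should be replaced with yours):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class ShrinkModel {
    public static void main(String[] args) throws IOException {
        // Model dir name is just an example; pass the real one as the first argument.
        Path model = Paths.get(args.length > 0 ? args[0] : "vosk-model-pt-fb-v0.1.1-20220516_2113");
        Path graph = model.resolve("graph");

        // Removing the rescoring dir alone should be safe either way.
        deleteRecursively(model.resolve("rescoring"));

        // Only drop HCLG.fst if the runtime graph pair is present.
        if (Files.exists(graph.resolve("Gr.fst")) && Files.exists(graph.resolve("HCLr.fst"))) {
            Files.deleteIfExists(graph.resolve("HCLG.fst"));
        } else {
            System.out.println("Gr.fst/HCLr.fst missing: keeping HCLG.fst");
        }
    }

    // Recursively delete a directory tree, children before parents.
    private static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) return;
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
    }
}
```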
We didn't do any bias experiments and have no plans to train another model in the short term, but you're welcome to try using the scripts under this very same repo. Gibberish words in the output are probably due to bad filtering in `lexicon.txt`; IIRC the small one has ~100k words and ours has ~250k.
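If you want to hunt for bad entries, a quick check over `lexicon.txt` could look like this (just a sketch: it assumes the Kaldi-style `word phone1 phone2 ...` layout, and the allowed-character heuristic is a made-up example, not something from our scripts):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class CheckLexicon {
    // Hypothetical heuristic: flag words containing characters outside the
    // Portuguese alphabet (plus hyphen and apostrophe).
    private static final Pattern PT_WORD =
            Pattern.compile("[a-záàâãéêíóôõúçüA-ZÁÀÂÃÉÊÍÓÔÕÚÇÜ'-]+");

    public static void main(String[] args) throws IOException {
        // Kaldi-style lexicon assumed: one "word phone1 phone2 ..." entry per line.
        try (var lines = Files.lines(Paths.get(args.length > 0 ? args[0] : "lexicon.txt"))) {
            lines.filter(line -> !line.isBlank())                // skip empty lines
                 .map(line -> line.split("\\s+")[0])             // keep the word field only
                 .filter(word -> !PT_WORD.matcher(word).matches())
                 .forEach(word -> System.out.println("suspicious: " + word));
        }
    }
}
```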
And thanks for the feedback!
Thank you very much @cassiotbatista for the fast response and for all the technical tips!
Just a quick warning: the large model I referenced, which is available on the VOSK models site, doesn't work with vosk-java; take a look here: https://github.com/sepinf-inc/IPED/issues/248#issuecomment-1176860230
Hello. First of all, thank you for this great project!
I would like to confirm whether you trained the `vosk-model-pt-fb-v0.1.1-20220516_2113` model recently published on the VOSK site: https://alphacephei.com/vosk/models. I would also like to ask whether you trained the `vosk-model-small-pt-0.3` small model available there.
We recently did some very informal and subjective testing with real-world audios (protected by privacy laws, unfortunately), manually listening to them and comparing against the transcription results of both models using the vosk 0.3.32 Java library. It seems to us that the large model may be giving worse results than the small model for some audios, returning more words that were not present in the speech. Maybe the large model has a larger bias than the small one towards the dataset used for training, generalizing worse to new audios; just a hypothesis...
If you also trained the older small model, was the dataset used for training the same as the one used to train the large model? If not, I guess the newer large model used a larger dataset? If so, are you planning to train a new small model using that same larger dataset?
I'm asking just to avoid duplicating efforts, because if the datasets used for training were different and the large model used a larger one, maybe I'll try to train a new small model myself using the latter dataset (it's publicly available, right?).
Thank you very much for your attention!