alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Post processing and sentence (re)construction for transcripts #1518

Closed gregtzar closed 4 months ago

gregtzar commented 4 months ago

I'm building a simple pipeline that converts audio/video tracks into human-readable text transcripts. At the most basic level I can, of course, treat each "Result" that the Vosk recognizer returns (based on its default timings) as a sentence and put the punctuation back in. But are there other open source libraries or methodologies you would recommend for more advanced post-processing of this kind? I imagine Vosk's complete output of words and timestamps could get a more intelligent treatment from some library. I have not had much luck finding any; maybe I'm not searching for the right terms. Thanks!
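
One simple form of this post-processing is to ignore the recognizer's own segmentation and regroup words by the silence between them. The sketch below is only an illustration: it assumes word-level output is enabled, so each result carries a "result" list whose entries have "word", "start" and "end" fields. The names `split_on_pauses` and `to_lines` and the 0.6-second threshold are hypothetical choices, not part of vosk-api.

```python
import json


def split_on_pauses(words, max_gap=0.6):
    """Group word entries into segments wherever the silence between two
    consecutive words exceeds max_gap seconds (an assumed threshold)."""
    segments, current = [], []
    for w in words:
        if current and w["start"] - current[-1]["end"] > max_gap:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)
    return segments


def to_lines(result_json, max_gap=0.6):
    """Turn one recognizer result (a JSON string) into
    (start, end, text) tuples, one per pause-delimited segment."""
    words = json.loads(result_json).get("result", [])
    return [
        (seg[0]["start"], seg[-1]["end"], " ".join(w["word"] for w in seg))
        for seg in split_on_pauses(words, max_gap)
    ]


# Hand-written sample shaped like a word-level result; not real recognizer output.
sample = json.dumps({
    "result": [
        {"word": "hello", "start": 0.0, "end": 0.4, "conf": 1.0},
        {"word": "there", "start": 0.45, "end": 0.8, "conf": 1.0},
        {"word": "how", "start": 1.9, "end": 2.1, "conf": 1.0},
        {"word": "are", "start": 2.1, "end": 2.3, "conf": 1.0},
        {"word": "you", "start": 2.3, "end": 2.6, "conf": 1.0},
    ],
    "text": "hello there how are you",
})

for start, end, text in to_lines(sample):
    print(f"[{start:.2f}-{end:.2f}] {text}")
```

Pause-based grouping only gives candidate sentence boundaries; casing and punctuation would still need a separate step.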

nshmyrev commented 4 months ago

We have models that restore case and punctuation: https://alphacephei.com/vosk/models/vosk-recasepunc-en-0.22.zip

gregtzar commented 4 months ago

@nshmyrev So, as a practical use case from the point of view of vosk-api: could I use one of these models as a substitute for the model I'm already using, in this case vosk-model-en-us-0.42-gigaspeech? Would the end result be that the recognizer segments the results based on it, and that the text property of the JSON results then contains the punctuation and casing? I appreciate the help; I'm just trying to get my head around how to use what you've linked in conjunction with the API. (Also, in my case I'm using your Go wrapper, but that's probably not relevant.)

nshmyrev commented 4 months ago

Unfortunately, you cannot use those models from Go yet; probably only through a separate Python server.

If you need punctuation, you can also try Whisper.
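
A "separate Python server" could be as small as the sketch below, which a Go program would call over HTTP with the raw transcript text. Everything here is an assumption for illustration: `punctuate` is a hypothetical placeholder that does not load the recasepunc model (wire in the real model call there), and the JSON shape, handler name and port are invented for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def punctuate(text):
    # Hypothetical stand-in for the real model call: it only capitalizes
    # the first letter and appends a period.
    text = text.strip()
    return text[:1].upper() + text[1:] + "." if text else text


def handle_payload(body):
    """Decode a {"text": ...} JSON body, run punctuate on it, and return
    the encoded {"text": ...} response."""
    request = json.loads(body.decode("utf-8"))
    return json.dumps({"text": punctuate(request.get("text", ""))}).encode("utf-8")


class PunctuationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload, status = handle_payload(self.rfile.read(length)), 200
        except (ValueError, AttributeError):
            # Malformed JSON, a non-object body, or a non-string text field.
            payload, status = b'{"error": "expected a JSON object with a text field"}', 400
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8000):
    """Block forever, answering POST requests on the given address."""
    HTTPServer((host, port), PunctuationHandler).serve_forever()
```

Calling `serve()` starts the server; the client then POSTs `{"text": "..."}` and reads the `text` field of the reply. Keeping the model in its own process also means it is loaded once rather than per request.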

gregtzar commented 4 months ago

Hmmm... I might be able to pull it off by embedding the Python script in Go with something like this or this, but I will try Whisper as well; I guess it might be better suited to what I'm trying to do.