alphacep / vosk-server

WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries
Apache License 2.0

start and end time for words jump by 1200 seconds #64

Closed by ghost 4 years ago

ghost commented 4 years ago

I am attempting to transcribe a long (over 30 minutes) stereo audio file (2 channels). I split the audio into left and right channels, transcribe them separately, and then parse the results to join them back together, because I do not see an option to recognize speakers or any indicator of which channel said what.

I noticed what looks like a timing issue in the results. I am writing the transcription results to a file to parse later. Below is a snippet of the returned results. Notice that the "start" time for the second "training" is 1203.15 seconds, and then the "start" time for the next word, "this", is 1808.0 seconds. I can guarantee that the audio files I am transcribing do not contain a 10-minute gap or silence. The issue appears to recur at fixed marks from 20 minutes onward (1200 seconds, 1800 seconds, 2400 seconds, etc.). This really breaks parsing, because I rely on the timestamps to join both audio channels back into a conversation.

{
  "result": [
    { "conf": 1.000000, "end": 1199.850000, "start": 1199.550000, "word": "training" },
    { "conf": 0.999999, "end": 1202.610000, "start": 1202.490000, "word": "for" },
    { "conf": 0.999999, "end": 1203.060000, "start": 1202.610000, "word": "quality" },
    { "conf": 0.999994, "end": 1203.150000, "start": 1203.060000, "word": "and" },
    { "conf": 1.000000, "end": 1203.480000, "start": 1203.150000, "word": "training" }
  ],
  "text": "training for quality and training"
},
{
  "result": [
    { "conf": 1.000000, "end": 1808.120000, "start": 1808.000000, "word": "this" },
    { "conf": 1.000000, "end": 1808.360000, "start": 1808.120000, "word": "call" },
    { "conf": 1.000000, "end": 1808.480000, "start": 1808.360000, "word": "may" },
    { "conf": 1.000000, "end": 1808.600000, "start": 1808.480000, "word": "be" }
  ]
}
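For anyone trying to confirm the same symptom in their own output, a jump like the one above can be found by scanning consecutive words for an implausibly large gap. The following is only a minimal sketch: the helper names `flatten_words` and `find_timestamp_jumps` and the 60-second threshold are assumptions made for illustration, not part of Vosk. It relies only on the "start", "end" and "word" fields shown in the snippet.

```python
def flatten_words(results):
    """Collect the word entries from a list of result dicts, in order."""
    words = []
    for res in results:
        # Results without a "result" list contribute no words.
        words.extend(res.get("result", []))
    return words


def find_timestamp_jumps(words, max_gap=60.0):
    """Return (previous_word, next_word, gap_seconds) for every gap
    between consecutive words that exceeds max_gap.

    max_gap is an arbitrary threshold chosen for this sketch.
    """
    jumps = []
    for prev, cur in zip(words, words[1:]):
        gap = cur["start"] - prev["end"]
        if gap > max_gap:
            jumps.append((prev["word"], cur["word"], gap))
    return jumps


# The word timings reported in the snippet above (confidence omitted).
sample = [
    {"result": [
        {"start": 1199.55, "end": 1199.85, "word": "training"},
        {"start": 1202.49, "end": 1202.61, "word": "for"},
        {"start": 1202.61, "end": 1203.06, "word": "quality"},
        {"start": 1203.06, "end": 1203.15, "word": "and"},
        {"start": 1203.15, "end": 1203.48, "word": "training"},
    ]},
    {"result": [
        {"start": 1808.00, "end": 1808.12, "word": "this"},
        {"start": 1808.12, "end": 1808.36, "word": "call"},
        {"start": 1808.36, "end": 1808.48, "word": "may"},
        {"start": 1808.48, "end": 1808.60, "word": "be"},
    ]},
]

jumps = find_timestamp_jumps(flatten_words(sample))
```

On the snippet's data this flags a single gap, between the second "training" and "this". The 2.64-second pause before "for" stays well under the threshold.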

nshmyrev commented 4 years ago

Which version are you using? This was fixed recently in 0.3.10.

ghost commented 4 years ago

I was using 0.3.9. I just tested with 0.3.10 and the issue no longer occurs. Thanks!
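With consistent timestamps, joining the two channels back into a conversation could be sketched roughly as follows. This is an illustration under stated assumptions, not the code actually used here: `merge_channels`, the "left"/"right" labels, and the turn structure are hypothetical, and each input is assumed to be a flat list of word entries with "start", "end" and "word" fields as in the results above.

```python
def merge_channels(left_words, right_words):
    """Interleave words from two channels by start time and group
    consecutive words from the same channel into turns."""
    # Tag each word with its channel (labels are hypothetical).
    tagged = [dict(w, channel="left") for w in left_words]
    tagged += [dict(w, channel="right") for w in right_words]
    # Stable sort: on equal start times, left-channel words come first.
    tagged.sort(key=lambda w: w["start"])

    turns = []
    for w in tagged:
        if turns and turns[-1]["channel"] == w["channel"]:
            # Same speaker continues: extend the current turn.
            turns[-1]["words"].append(w["word"])
            turns[-1]["end"] = w["end"]
        else:
            # Channel changed: start a new turn.
            turns.append({
                "channel": w["channel"],
                "start": w["start"],
                "end": w["end"],
                "words": [w["word"]],
            })
    return turns


# Made-up timings, purely for illustration.
left = [
    {"start": 0.0, "end": 0.4, "word": "hello"},
    {"start": 2.0, "end": 2.3, "word": "fine"},
]
right = [
    {"start": 1.0, "end": 1.5, "word": "hi"},
]

for turn in merge_channels(left, right):
    print(turn["channel"], " ".join(turn["words"]))
```

Sorting on "start" alone is the simplest choice. Overlapping speech would need a more careful policy, which this sketch does not attempt.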