alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Artificial Pause #397

Closed · aalsabag closed this issue 2 years ago

aalsabag commented 3 years ago

Hello,

  1. Is there a recommended way to create an artificial pause in a file so that rec.AcceptWaveform(data) returns True more frequently?
  2. Also, I am working on a transcription tool that creates captions for a stream recorded to a file (live). It is fairly accurate, but extremely slow (over 1 minute). Is there a better way to read (tail) from an s16le audio file that is constantly growing?
import os
import time

# rec is assumed to be an initialized vosk.KaldiRecognizer
f = open('/mnt/streaming-1234567', 'rb')
numberOfBytes = os.path.getsize('/mnt/streaming-1234567')
f.seek(numberOfBytes)  # skip to the current end of the file
time.sleep(2)

while True:
    data = f.read(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
    else:
        print(rec.PartialResult())
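
On the tailing question itself, a minimal sketch (assuming, as above, that rec is an initialized vosk.KaldiRecognizer and that the writer keeps appending to the same path): instead of breaking when read() returns no data, sleep briefly and retry, which is effectively what tail -f does.

import os
import time

# Sketch of a tail-style loop for a file that is still being written:
# when read() returns nothing, sleep briefly and retry instead of
# breaking. rec is assumed to be an initialized vosk.KaldiRecognizer;
# the path and chunk size are taken from the snippet above.
with open('/mnt/streaming-1234567', 'rb') as f:
    f.seek(0, os.SEEK_END)  # start at the current end of the file
    while True:
        data = f.read(4000)
        if not data:
            time.sleep(0.1)  # wait for the writer to append more audio
            continue
        if rec.AcceptWaveform(data):
            print(rec.Result())
        else:
            print(rec.PartialResult())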
sskorol commented 3 years ago

Hi @aalsabag,

I guess the first question is more about the latency issue, which is already filed. It depends on the model you use; for some languages an older architecture is used, which requires retraining. I believe @nshmyrev can give you more details on that, or you can read his blog post about it. rec.AcceptWaveform(data) gives you a final transcript, and there is a big window between the last interim result and the final transcript, so probably only changing the model, or doing custom analysis of the interim results, would help here.
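
A rough way to observe that window (a sketch only, assuming a live feed where f is the tailed audio file from the earlier snippet and rec is an initialized KaldiRecognizer): record when the last non-empty partial arrived and compare it to when the final result lands.

import json
import time

# Sketch: in a live stream, measure the gap between the last non-empty
# partial result and the final transcript. Assumes f is the tailed
# audio file and rec is an initialized vosk.KaldiRecognizer, as above.
last_partial_at = None
while True:
    data = f.read(4000)
    if not data:
        time.sleep(0.1)
        continue
    if rec.AcceptWaveform(data):
        if last_partial_at is not None:
            print('final lagged the last partial by %.2f s'
                  % (time.time() - last_partial_at))
        print(json.loads(rec.Result()).get('text', ''))
        last_partial_at = None
    elif json.loads(rec.PartialResult()).get('partial'):
        last_partial_at = time.time()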

For the second point: how long is your audio file? And why do you need to write captions to a file and then immediately read them back? Wouldn't it be simpler to save the captions in the background (if you really need to persist them) and stream the results to your service on the fly? I can't see the real use case yet.

aalsabag commented 3 years ago

Thank you. I really appreciate your response.

To clarify my second question:

- I have a service live-streaming to an audio file.
- I want to generate subtitles based on that growing audio file.
- I want to send every few words to another, completely separate service.

With regard to the first question, what does "custom interim results" mean? Currently the tool just waits for a pause in speech, I think; sometimes that can be a very long sentence. Are you saying AcceptWaveform depends entirely on the model I use? If so, I'll play around with other models.

sskorol commented 3 years ago

In this case, you may want to take rec.PartialResult() in the last else block and immediately send it to your service. You can implement it so that it looks like continuous speech refinement to the end user. In real life, when you start pronouncing something, the initial sound might be interpreted by the listener's ears as a wide range of words until you finish. That's what I call interim (or partial) results (a term from the Google STT vocabulary): something probably very close to what you said, but not 100% accurate yet.

You can send non-empty partial results to your service and implement this "refinement" feature on the client side. It will give the feeling that someone on the other side is trying to understand you in real time, and that's how you can eliminate those delays between the last partial result and the final transcript. Technically, it won't affect the real response time, but immediately displaying something to the user creates the illusion that there's no delay at all.
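
A minimal sketch of that approach (send_to_service is a hypothetical stand-in for whatever transport the separate service uses; rec is an initialized KaldiRecognizer): send each new non-empty partial as a replaceable caption, and the final result as a committed one.

import json

# Sketch: stream non-empty partials as replaceable captions and finals
# as committed ones. send_to_service() is a hypothetical stand-in for
# the transport to the separate service; rec is an initialized
# vosk.KaldiRecognizer.
last_partial = ''

def feed(data):
    global last_partial
    if rec.AcceptWaveform(data):
        text = json.loads(rec.Result()).get('text', '')
        if text:
            send_to_service({'type': 'final', 'text': text})  # commit caption
        last_partial = ''
    else:
        partial = json.loads(rec.PartialResult()).get('partial', '')
        if partial and partial != last_partial:  # skip duplicate partials
            send_to_service({'type': 'partial', 'text': partial})  # replace caption
            last_partial = partial

On the client side, each 'partial' message would overwrite the currently displayed caption line, and each 'final' message would commit it and start a new line.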

aalsabag commented 3 years ago

Yeah, that was what I was going to do. Just wanted to confirm. Thank you for your help. This ticket may be closed.