Open nshmyrev opened 4 years ago
@bedapudi6788
Few important things for the API design:
1) We still need to separate the models (Model, SpkModel) from the recognizer (KaldiRecognizer). The model is the storage for the data, and many recognizers can share the same model. You can check the server code for details.
2) Streaming should be first class, not file processing. We believe streaming is the right approach overall: batch processing can be emulated as streaming, not the other way around.
3) The pocketsphinx-python API is a more or less reasonable approach, i.e. iterators over the stream:

```python
model = Model()
recognizer = Recognizer(model)  # or recognizer = LiveRecognizer(model) for microphone
for result in recognizer:       # here we iterate over the stream ourselves
    print(result)
```
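The iterator-style API above could be sketched roughly as follows. This is a toy stand-in to illustrate the iterator protocol, not the vosk implementation; the `Model`/`Recognizer` names mirror the proposal, and the chunk source and result shape are assumptions.

```python
class Model:
    """Stands in for the shared model data that many recognizers can reuse."""
    pass

class Recognizer:
    """Iterates over an audio chunk source, yielding one result per chunk."""
    def __init__(self, model, chunks):
        self.model = model
        self.chunks = chunks

    def __iter__(self):
        for chunk in self.chunks:
            # a real recognizer would run decoding here; we just report sizes
            yield {"partial": f"decoded {len(chunk)} bytes"}

model = Model()
recognizer = Recognizer(model, [b"\x00" * 1000, b"\x00" * 500])
results = list(recognizer)  # same shape as "for result in recognizer"
```

A `LiveRecognizer` would differ only in where the chunks come from (a microphone stream instead of a list), which is what makes the iterator surface attractive.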
I also want a seamless switch between the local and cloud APIs, i.e. we should be able to stream the audio locally and sometimes over the network. For example, if we want a more accurate transcription, we might send the same audio for streaming recognition on the server again.
We also need to be able to return the audio recorded during live recognition for further processing.
Google's API uses a similar iterator (https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/microphone/transcribe_streaming_infinite.py):
```python
recognizer = Recognizer(model=model, sample_rate=16000, partial=True,
                        spk_model=spk_model, grammar=grammar)
```
See also https://stackoverflow.com/questions/11977279/builder-pattern-equivalent-in-python
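As the linked answer suggests, keyword arguments with defaults largely replace the builder pattern in Python. A minimal sketch of the constructor proposed above, assuming the parameter names shown and toy placeholder values:

```python
class Recognizer:
    """Sketch of a keyword-argument constructor; defaults are assumptions."""
    def __init__(self, model, sample_rate=16000, partial=False,
                 spk_model=None, grammar=None):
        self.model = model
        self.sample_rate = sample_rate
        self.partial = partial
        self.spk_model = spk_model
        self.grammar = grammar

# Callers set only what they need, builder-style, in a single call.
r = Recognizer(model="model", sample_rate=8000, partial=True)
```

Optional features (speaker model, grammar) simply stay at their defaults when unused, so no separate builder object is needed.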
Got it. I will let you know once I have the initial version.
@nshmyrev I am almost done with the refactor (https://github.com/bedapudi6788/vosk-api/tree/python-refactor). I have a question regarding KaldiRecognizer:
```python
from vosk import Model, KaldiRecognizer
import sys
import wave

model = Model("../.VOSK/en-us/")
rec = KaldiRecognizer(model, 16000)

for _ in range(2):
    print('\n\n\n======================\n\n\n')
    wf = wave.open(sys.argv[1], "rb")
    while True:
        data = wf.readframes(1000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)
    print(rec.FinalResult())
```
The first loop completed fine; the second loop gave the following error:
```
Traceback (most recent call last):
  File "test_words.py", line 17, in <module>
    rec.AcceptWaveform(data)
  File "/data/anaconda/envs/py36/lib/python3.6/site-packages/vosk/vosk.py", line 73, in AcceptWaveform
    return _vosk.KaldiRecognizer_AcceptWaveform(self, data)
RuntimeError: AcceptWaveform called after InputFinished() was called.
```
Does this mean KaldiRecognizer is not reusable? I.e., for every input audio, should I do `rec = KaldiRecognizer(model, 16000)`?
> Does this mean KaldiRecognizer is not reusable? I.e., for every input audio, should I do `rec = KaldiRecognizer(model, 16000)`?
Yes, correct. The recognizer is lightweight, so there is no problem creating a new one for each input.
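The recommended pattern, then, is a fresh recognizer per file. The toy class below only models the behavior seen above (unusable after `FinalResult()`); the method names mirror vosk, but the internals are stand-ins, not the real implementation:

```python
class KaldiRecognizer:
    """Toy stand-in: becomes unusable once FinalResult() has been called."""
    def __init__(self, model, sample_rate):
        self.finished = False
        self.nbytes = 0

    def AcceptWaveform(self, data):
        if self.finished:
            raise RuntimeError("AcceptWaveform called after InputFinished() was called.")
        self.nbytes += len(data)

    def FinalResult(self):
        self.finished = True  # input is now finished; recognizer is spent
        return f"{self.nbytes} bytes decoded"

def transcribe(chunks):
    # Fresh recognizer per input, since the old one cannot accept more audio.
    rec = KaldiRecognizer("model", 16000)
    for data in chunks:
        rec.AcceptWaveform(data)
    return rec.FinalResult()

first = transcribe([b"\x00" * 1000])   # each call gets its own recognizer,
second = transcribe([b"\x00" * 500])   # so repeated use is safe
```

Since creation is cheap, wrapping it in a small function like `transcribe` keeps the per-file lifecycle explicit.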
Got it, thank you.
Also the results returned from the recognizer should be changed too.
Here's my idea of how it should go:
What do you think?
edit: This would only be for the streaming API. We might want to have two APIs, one for the online decoding and another for the offline.
Another idea for the streaming API would be to return a generator that yields results. This would require giving the recognizer a generator of audio chunks.
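That generator-in, generator-out idea could be sketched like this. The function name, chunk sizes, and result shape are all illustrative assumptions, not the vosk API; decoding is faked by counting bytes:

```python
def recognize_stream(chunk_gen):
    """Consume a generator of audio chunks, yield results as they arrive."""
    buffered = 0
    for chunk in chunk_gen:
        buffered += len(chunk)
        yield {"partial": buffered}  # real code would yield decoded text here
    yield {"final": buffered}        # stream ended: emit the final result

def mic_chunks():
    # Pretend microphone reads; a live source would block on audio capture.
    for size in (320, 320, 160):
        yield b"\x00" * size

results = list(recognize_stream(mic_chunks()))
```

Because both sides are generators, the same pipeline works for a microphone, a file read in chunks, or a network stream, which fits the "batch emulated as streaming" principle from earlier in the thread.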
Hey, any progress on this? I'm trying to work with Python currently and don't want to invest too much if the API is about to change ;)
You shouldn't worry much, since the API is pretty small and you can adapt easily. There are some plans but no deadlines yet.
@nshmyrev this is my idea of a "pythonic" API.
Let me know if this is OK or if any changes are required.