alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Pythonic API design #31

Open nshmyrev opened 4 years ago

nshmyrev commented 4 years ago

@nshmyrev this is my idea of a "pythonic" API.

import vosk

# print available models
print(vosk.list_models())

# automatically downloads the model if it is not found locally
asr = vosk.load("en-us")

# word_list is optional
# if stream=True, returns an iterator over partial results
# if stream=False (the default), returns the final result
result = asr.recognize('wav_file_path', word_list, stream=True)

Let me know if this is OK or if any changes are required.

nshmyrev commented 4 years ago

@bedapudi6788

Few important things for the API design:

1) We still need to separate the models (Model, SpkModel) from the recognizer (KaldiRecognizer). The model is the storage of the data, and there can be many recognizers using the same model. You can check the server code for details.

2) Streaming should be first-class, not file processing. We believe streaming is the right approach overall: batch processing can be emulated as streaming, not the other way around.

3) The pocketsphinx-python API is a more or less reasonable approach, i.e. iterators over the stream:

model = Model()
recognizer = Recognizer(model)  # or recognizer = LiveRecognizer(model) for microphone

for result in recognizer:  # here we iterate over the stream ourselves
    print(result)

4) I also want a seamless switch between the local and cloud API, i.e. we should be able to stream the audio locally and sometimes over the network. For example, if we want to transcribe more accurately, we might send the same audio for streaming recognition on the server again.

5) We need to be able to return the audio recorded during live recognition for further processing.

6) Google's API uses a similar iterator (https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/microphone/transcribe_streaming_infinite.py)
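The iterator-over-stream style sketched above could be fleshed out as a thin wrapper. Everything here is hypothetical: StreamingRecognizer is an invented name, and a stub stands in for the real Kaldi recognizer so the sketch runs on its own.

```python
import json

class StubKaldiRecognizer:
    """Stand-in for the real Kaldi recognizer so the sketch runs on its own;
    the real one exposes the same AcceptWaveform/Result/FinalResult trio."""
    def __init__(self, model, sample_rate):
        self.model = model
        self.words = []
    def AcceptWaveform(self, data):
        self.words.append(data.decode())
        return True                      # pretend every chunk ends an utterance
    def Result(self):
        return json.dumps({"text": self.words.pop(0)})
    def FinalResult(self):
        return json.dumps({"text": " ".join(self.words)})

class StreamingRecognizer:
    """Hypothetical iterator wrapper: one shared model, many cheap recognizers."""
    def __init__(self, model, chunks, sample_rate=16000):
        # In real use this would be vosk.KaldiRecognizer(model, sample_rate).
        self.rec = StubKaldiRecognizer(model, sample_rate)
        self.chunks = chunks
    def __iter__(self):
        for chunk in self.chunks:
            if self.rec.AcceptWaveform(chunk):
                yield json.loads(self.rec.Result())
        yield json.loads(self.rec.FinalResult())

model = object()   # one Model instance can back many recognizers
for result in StreamingRecognizer(model, [b"hello", b"world"]):
    print(result["text"])
```

The point of the wrapper is that the model is created once and shared, while each stream gets its own short-lived recognizer behind the iterator.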

nshmyrev commented 4 years ago
We might want some kind of builder pattern for the recognizer, since there will be many parameters:

recognizer = Recognizer(model=model, sample_rate=16000, partial=True, spk_model=spk_model, grammar=grammar)

See also https://stackoverflow.com/questions/11977279/builder-pattern-equivalent-in-python
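As the Stack Overflow link notes, in Python the builder pattern is usually replaced by keyword arguments or a small config object. A minimal sketch, where RecognizerConfig and this Recognizer wrapper are hypothetical names, not part of the vosk API:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RecognizerConfig:
    """Hypothetical config object playing the builder role: it collects
    the many optional parameters with sensible defaults."""
    sample_rate: int = 16000
    partial: bool = True
    spk_model: Optional[object] = None
    grammar: Optional[List[str]] = None

class Recognizer:
    """Sketch only; the real KaldiRecognizer takes positional arguments."""
    def __init__(self, model, config=None):
        self.model = model
        self.config = config or RecognizerConfig()

# Override only the parameters that differ from the defaults.
rec = Recognizer(model=object(),
                 config=RecognizerConfig(sample_rate=8000, grammar=["yes", "no"]))
print(rec.config.sample_rate)
```

The dataclass gives the same "set only what you need" ergonomics as a builder without the extra method-chaining boilerplate.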

bedapudi6788 commented 4 years ago

Got it. I will let you know once I have the initial version.

bedapudi6788 commented 4 years ago

@nshmyrev I am almost done with the refactor (https://github.com/bedapudi6788/vosk-api/tree/python-refactor). I have a question regarding KaldiRecognizer:

from vosk import Model, KaldiRecognizer
import sys
import wave

model = Model("../.VOSK/en-us/")

rec = KaldiRecognizer(model, 16000)

for _ in range(2):
    print('\n\n\n======================\n\n\n')
    wf = wave.open(sys.argv[1], "rb")
    while True:
        data = wf.readframes(1000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)

    print(rec.FinalResult())

The first loop completed fine; the second gave the following error:

Traceback (most recent call last):
  File "test_words.py", line 17, in <module>
    rec.AcceptWaveform(data)
  File "/data/anaconda/envs/py36/lib/python3.6/site-packages/vosk/vosk.py", line 73, in AcceptWaveform
    return _vosk.KaldiRecognizer_AcceptWaveform(self, data)
RuntimeError: AcceptWaveform called after InputFinished() was called.

Does this mean KaldiRecognizer is not reusable? I.e., should I create a new rec = KaldiRecognizer(model, 16000) for every input audio?

nshmyrev commented 4 years ago

Does this mean KaldiRecognizer is not reusable? I.e., should I create a new rec = KaldiRecognizer(model, 16000) for every input audio?

Yes, correct. The recognizer is lightweight, so there is no problem creating it again.
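Given that answer, the earlier loop just needs to build a fresh recognizer per input. A minimal sketch: the factory argument is an assumption so the code runs without vosk installed (in real use it would be lambda: KaldiRecognizer(model, 16000)), and FakeRecognizer only mimics the one-shot behaviour seen in the traceback.

```python
def transcribe(make_recognizer, chunks):
    """Build a fresh recognizer for every input (they are lightweight),
    feed it all audio chunks, and return the final result."""
    rec = make_recognizer()            # new recognizer per audio file
    for data in chunks:
        rec.AcceptWaveform(data)
    return rec.FinalResult()

class FakeRecognizer:
    """Stub that mimics the one-shot behaviour from the traceback above."""
    def __init__(self):
        self.fed = []
        self.finished = False
    def AcceptWaveform(self, data):
        if self.finished:
            raise RuntimeError("AcceptWaveform called after InputFinished() was called.")
        self.fed.append(data)
    def FinalResult(self):
        self.finished = True
        return b"".join(self.fed).decode()

# Two passes over the same audio now work, because each pass gets its own recognizer.
for _ in range(2):
    print(transcribe(FakeRecognizer, [b"one ", b"two"]))
```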

bedapudi6788 commented 4 years ago

got it. thank you.

cjbassi commented 4 years ago

The results returned from the recognizer should also be changed.

Here's my idea of how it could work:

  1. KaldiRecognizer.AcceptWaveform should return a result instead of a boolean value.
  2. The result should be a dict or None.
  3. The dict should include a boolean field called 'is_final'.
  4. It should also include the text and the confidence data.

What do you think?

Edit: this would apply only to the streaming API. We might want two APIs, one for online decoding and another for offline.

Also, another idea for the streaming API would be to return a generator that yields results. This would require passing the recognizer a generator of audio chunks.
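That generator idea might look like the sketch below. stream_results is a hypothetical name, and StubRec stands in for KaldiRecognizer (whose real Result()/PartialResult()/FinalResult() return JSON strings) so the sketch runs stand-alone.

```python
import json

def stream_results(rec, audio_chunks):
    """Hypothetical streaming API: consume a generator of audio chunks and
    yield dicts carrying 'is_final', 'text' and 'confidence' fields."""
    for chunk in audio_chunks:
        if rec.AcceptWaveform(chunk):
            r = json.loads(rec.Result())
            yield {"is_final": True, "text": r.get("text", ""), "confidence": r.get("conf")}
        else:
            r = json.loads(rec.PartialResult())
            yield {"is_final": False, "text": r.get("partial", ""), "confidence": None}
    r = json.loads(rec.FinalResult())
    yield {"is_final": True, "text": r.get("text", ""), "confidence": r.get("conf")}

class StubRec:
    """Stand-in for KaldiRecognizer so the sketch runs without a model."""
    def __init__(self):
        self.buf = []
    def AcceptWaveform(self, data):
        self.buf.append(data.decode())
        return len(self.buf) % 2 == 0   # pretend every second chunk ends an utterance
    def Result(self):
        text, self.buf = " ".join(self.buf), []
        return json.dumps({"text": text, "conf": 0.9})
    def PartialResult(self):
        return json.dumps({"partial": " ".join(self.buf)})
    def FinalResult(self):
        return json.dumps({"text": " ".join(self.buf), "conf": 0.9})

for r in stream_results(StubRec(), iter([b"a", b"b", b"c"])):
    print(r["is_final"], r["text"])
```

Callers then treat partial and final hypotheses uniformly and just branch on the 'is_final' flag.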

xeruf commented 3 years ago

Hey, any progress on this? I'm currently working with the Python API and don't want to invest too much if it is about to change ;)

nshmyrev commented 3 years ago

You shouldn't worry much, since the API is pretty small and you can adapt easily. There are some plans but no deadlines yet.