daanzu / kaldi-active-grammar

Python Kaldi speech recognition with grammars that can be set active/inactive dynamically at decode-time
GNU Affero General Public License v3.0

A crude way of using OpenAI Whisper for alternative dictation in KaldiAG #73

shervinemami opened this issue 1 year ago

shervinemami commented 1 year ago

This is a fairly crude implementation, including various hard-coded settings for Whisper's "base.en" English model, and it currently only works on Linux & OS X since it has a hard-coded tmpfile path. But it's good enough to begin playing with.

mallorbc commented 1 year ago

Glad you found my code helpful and added it here. The Whisper model takes either a wav file or an array (I'm not sure of the exact format).

However, I could not get the array approach working in a timely manner, so I decided to just write to the filesystem. Using io.BytesIO, it should be possible to handle it all in memory.
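
For reference, a hedged sketch of that in-memory approach (assuming the openai-whisper and soundfile packages, and that the audio is already 16 kHz mono WAV; none of this is from the thread itself):

```python
# Sketch of the io.BytesIO idea: Whisper's transcribe() accepts a float32
# NumPy array sampled at 16 kHz, so the temp file can be skipped entirely.
import io

import soundfile as sf
import whisper

model = whisper.load_model("base.en")

def transcribe_wav_bytes(wav_bytes: bytes) -> str:
    # Wrap the raw WAV bytes in a file-like object and decode to float32.
    audio, sample_rate = sf.read(io.BytesIO(wav_bytes), dtype="float32")
    assert sample_rate == 16000, "Whisper expects 16 kHz audio"
    return model.transcribe(audio)["text"]
```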

LexiconCode commented 1 year ago

OS-agnostic temp path for `whisper_server.py`:

```python
import os
import tempfile

# Keep a reference to the TemporaryDirectory object itself; if only .name is
# taken, the object is garbage-collected and the directory may be removed early.
temp_dir = tempfile.TemporaryDirectory()
audio_filename = os.path.join(temp_dir.name, "whisper.wav")

temp_dir.cleanup()  # place near the end of the `die()` function
```

shervinemami commented 1 year ago

Thanks @LexiconCode for these two portability improvements; I've uploaded them now :-)

daanzu commented 1 year ago

@shervinemami I actually don't think these changes are necessary to support using Whisper in KaldiAG. Since the `alternative_dictation` config parameter naturally supports taking a callable, I think all of the work can (and should) be placed in your user code: specifically your dragonfly loader. But perhaps I've missed something, so feel free to correct me or ask any questions!
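
A minimal sketch of that suggestion, keeping all the Whisper glue in the user's dragonfly loader. The callable's exact signature isn't spelled out in this thread; the sketch assumes it receives the utterance audio as raw 16 kHz mono 16-bit PCM bytes and returns the transcribed text:

```python
# Hedged sketch: pass a callable as alternative_dictation from user code.
# The assumed audio format (16 kHz mono int16 PCM bytes) is an assumption,
# not confirmed against the KaldiAG source.
import numpy as np
import whisper

from dragonfly import get_engine

model = whisper.load_model("base.en")

def whisper_dictation(audio_data: bytes) -> str:
    # Convert raw 16-bit PCM to the float32 array Whisper expects.
    audio = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
    return model.transcribe(audio)["text"]

engine = get_engine("kaldi", alternative_dictation=whisper_dictation)
```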

I am adding a somewhat related note here from Gitter: you will likely find alternative dictation to work better for dictation utterances that don't include any "command parts". The problem is that, for the example you posted, KaldiAG tries its best to "cut out" the part of the audio where you preface the utterance by speaking "whisper", and only pass the rest of the audio to Whisper, but doing that is quite difficult and inexact. You might want to try something like having a command that enables a pure dictation rule ("<dictation>") for only the next utterance; see the sketch below. This is what I have migrated to usually using, although for different reasons (it lets me collect a better corpus of audio training examples for further improving my speech model).
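
A hedged sketch of that one-shot pattern using dragonfly primitives (the rule names and wiring are illustrative, not taken from KaldiAG): a "whisper" command enables a pure dictation rule, which disables itself again after one utterance.

```python
# Illustrative one-shot dictation rule: saying "whisper" arms the pure
# dictation rule for exactly one utterance, then it disables itself.
from dragonfly import Dictation, Function, Grammar, MappingRule, Text

grammar = Grammar("one_shot_dictation")

def _disable_dictation():
    # Turn the pure dictation rule back off after a single utterance.
    dictation_rule.disable()

class OneShotDictationRule(MappingRule):
    mapping = {"<text>": Text("%(text)s") + Function(_disable_dictation)}
    extras = [Dictation("text")]

class ArmDictationRule(MappingRule):
    mapping = {"whisper": Function(lambda: dictation_rule.enable())}

dictation_rule = OneShotDictationRule()
grammar.add_rule(dictation_rule)
grammar.add_rule(ArmDictationRule())
grammar.load()
dictation_rule.disable()  # start disabled; "whisper" arms it for one utterance
```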