dictation-toolbox / dragonfly

Speech recognition framework allowing powerful Python-based scripting and extension of Dragon NaturallySpeaking (DNS), Windows Speech Recognition (WSR), Kaldi and CMU Pocket Sphinx
GNU Lesser General Public License v3.0

Supporting whisper.cpp? #376

Closed tachyonicbytes closed 1 year ago

tachyonicbytes commented 1 year ago

Are there any plans to support OpenAI's Whisper automatic speech recognition? How hard would it be to do that? (I am unfamiliar with the codebase.)

From a performance standpoint, it seems to be currently one of the best engines, although I wouldn't necessarily trust OpenAI marketing.

From a licensing standpoint, it is FOSS, so it should not be a problem.

drmfinlay commented 1 year ago

Hello @tachyonicbytes,

Support for OpenAI Whisper has come up before, I think in the Gitter chat room. There are no current plans to support it in Dragonfly, at least not on its own. Shervin Emami (@shervinemami) managed to get it working together with Dragonfly's Kaldi engine last year. He was able to use Whisper, instead of Kaldi, for the dictation parts of grammar rules. If I remember correctly, this improved the recognition accuracy of those parts. See https://github.com/daanzu/kaldi-active-grammar/pull/73 for more on that.

In order to use Whisper for the command parts too, it would be necessary to write a dedicated Dragonfly-Whisper engine implementation. However, impressive as Whisper is, its natural language ASR models are quite unsuitable for the typical Dragonfly command phrases defined in speech grammars. Unless I am mistaken, there is no way to trim Whisper's recognition search tree in real time — to have the software strictly consider only those hypotheses which fit active Dragonfly grammars.
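To make the missing capability concrete, here is a toy sketch (hypothetical names, not Whisper's or Dragonfly's API) of what "trimming the search tree" would mean: at each decoding step, a grammar-aware decoder would keep only those hypotheses that are still a prefix of some phrase the active grammars allow. Whisper exposes no such hook.

```python
# Hypothetical active grammar, flattened into the phrases it accepts.
ACTIVE_PHRASES = [
    ["go", "right", "two"],
    ["go", "right", "three"],
    ["save", "file"],
]

def is_viable(hypothesis):
    """True if the hypothesis is a prefix of at least one allowed phrase."""
    return any(phrase[:len(hypothesis)] == hypothesis
               for phrase in ACTIVE_PHRASES)

def prune(beam):
    """Drop beam entries that no active grammar can ever complete."""
    return [hyp for hyp in beam if is_viable(hyp)]

beam = [["go", "right"], ["go", "left"], ["save"]]
print(prune(beam))  # ["go", "left"] is discarded; nothing can complete it
```

A grammar-constrained decoder built this way never wastes probability mass on utterances the grammars reject, which is exactly what engines like KaldiAG do natively and Whisper does not.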

If it becomes possible to do that, and if commands are recognisable with a high degree of accuracy and speed, then an engine implementation for Whisper might be worth considering. But those are two big ifs! I don't think the folks at OpenAI are capable of such sorcery. :-)

LexiconCode commented 1 year ago

I went ahead and made an inquiry. Thanks for the verbiage, Danesprite. Opening discussion https://github.com/ggerganov/whisper.cpp/discussions/870

There's an early implementation. "Guided mode" https://github.com/ggerganov/whisper.cpp/blob/master/examples/command/README.md

Example https://github.com/ggerganov/whisper.cpp/tree/master/examples/command

drmfinlay commented 1 year ago

Thank you for investigating further, Aaron. I was unaware of guided mode. It is a start, but would not be adequate without significant changes.

Since this mode takes a flat list of commands, a Dragonfly-Whisper implementation would have to output every possible command phrase to a text file. It would be simple enough to do this for a spec string like `go right <N>`. But for the continuous command recognition used in, say, Caster, it would be utterly impractical. This problem would be solved if guided mode could recognise commands efficiently from some sort of grammar file.
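A short sketch of the flattening problem described above (hypothetical code, not Dragonfly's API): a single spec with one numeric slot expands to a manageable list, but chaining commands in one utterance, as continuous command recognition does, multiplies the list combinatorially.

```python
# Stand-in for the values an <N> element could match.
NUMBERS = ["one", "two", "three", "four", "five"]

def expand_spec():
    """Flatten the spec 'go right <N>' into every concrete phrase."""
    return [f"go right {n}" for n in NUMBERS]

single = expand_spec()
print(len(single))   # 5 phrases -- easy to list for guided mode

# Chaining just three such commands in a single utterance:
chains = [f"{a} {b} {c}" for a in single
                         for b in single
                         for c in single]
print(len(chains))   # 125 -- grows as len(single) ** chain_depth
```

With a realistic grammar of hundreds of commands and longer chains, the flat list explodes far beyond what a text file for guided mode could hold, which is why a grammar file, not a phrase list, is the right interface.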

As you say in the linked discussion, Dragonfly also needs the ability to activate and deactivate command phrases. Without this, contexts wouldn't work properly. Another issue is that it would not be possible to recognise the dictation parts of commands in the same utterance.
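The activation behaviour Dragonfly needs can be sketched in a few lines (again hypothetical names, neither Whisper's nor Dragonfly's API): command phrases are switched on and off as the foreground context changes, which a flat, static command list cannot express.

```python
class CommandSet:
    """Toy registry of command phrases gated by context."""

    def __init__(self):
        self.commands = {}  # phrase -> required context (None = global)

    def add(self, phrase, context=None):
        self.commands[phrase] = context

    def active(self, current_context):
        """Phrases usable right now, given the foreground context."""
        return sorted(p for p, ctx in self.commands.items()
                      if ctx is None or ctx == current_context)

cmds = CommandSet()
cmds.add("save file")                     # global command
cmds.add("close tab", context="browser")  # browser-only command

print(cmds.active("editor"))   # ['save file']
print(cmds.active("browser"))  # ['close tab', 'save file']
```

An engine back-end has to be told about each activation change so it only recognises the currently active set; guided mode, as it stands, would have to be restarted with a new file on every context switch.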

This all seems unnecessary to me, really. Dragonfly already has several engine implementations that do these things well. Whisper, in my opinion, is just not the right tool for this type of work.

shervinemami commented 1 year ago

I've used Whisper in Dragonfly for dictation while using KaldiAG for commands, and I definitely agree with Danesprite that Whisper isn't suited to command mode, even if you're willing to put a lot of effort into customising it. Whisper works great on full sentences and is an excellent choice for long dictation, but it struggles with anything shorter than a few words: even dictating something as short as "hi how are you?" is very unreliable in Whisper. This leads me to expect it would really struggle if used specifically for single-word commands.

drmfinlay commented 1 year ago

Thanks, Shervin. Your point about accuracy for short phrases is important. Whisper's models were not trained for this purpose.

@tachyonicbytes, if you haven't already, I would like to suggest that you try out Dragonfly's KaldiAG engine. It is open source and fairly accurate, with low latency. The documentation for it is here. I think you'll find it is good enough.