dictation-toolbox / dragonfly

Speech recognition framework allowing powerful Python-based scripting and extension of Dragon NaturallySpeaking (DNS), Windows Speech Recognition (WSR), Kaldi and CMU Pocket Sphinx
GNU Lesser General Public License v3.0

Implementing New Engine Support with Whisper Models #383

Closed. NigelHiggs30 closed this issue 1 month ago.

NigelHiggs30 commented 7 months ago

I've been following this project for several years and previously used it with the built-in Windows speech recognition engine. The core project is impressive; the main limitation, and the primary barrier to wider adoption, was the capability of the speech recognition engines available at the time.

I believe now is the time to upgrade the project. Refactoring might be necessary for broader applicability, but the potential of the final product is significant. With the advances in open-source AI and speech-to-text technology, especially Whisper models, this project could reach new heights of performance and usability. Is there any updated documentation or support for integrating new engines, particularly Whisper models? I am considering opening a pull request to integrate these advancements.

LexiconCode commented 7 months ago

I think your request can be split into two separate points.

  1. Is there documentation on how to integrate new speech recognition engines with dragonfly?

@drmfinlay could possibly speak to the documentation, but the code for each engine supported thus far can be found at https://github.com/dictation-toolbox/dragonfly/tree/master/dragonfly/engines

Most engines have middleware outside of dragonfly that handles compiling dragonfly grammars down to an engine-specific implementation and spec. Examples of this middleware are Natlink and Kaldi Active Grammar.
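To make that concrete, here is a short sketch, using dragonfly's documented public API, of the kind of grammar such middleware has to translate into each engine's native format:

```python
# A small dragonfly grammar. When loaded, the active engine back-end
# (or its middleware, e.g. Natlink or Kaldi Active Grammar) compiles
# this specification into the engine's own grammar format.
from dragonfly import Grammar, Key, MappingRule

class SaveRule(MappingRule):
    mapping = {
        "save file": Key("c-s"),  # press Ctrl+S on "save file"
    }

grammar = Grammar("example")
grammar.add_rule(SaveRule())
grammar.load()  # hands the grammar to the engine for compilation
```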

  2. How to implement a back-end specifically for Whisper models.

This has previously been discussed in the following issue: https://github.com/dictation-toolbox/dragonfly/issues/376

That is not to say it can't be done; however, there doesn't seem to be a clear, performant path within the Whisper API, and possibly there is a limitation within the model itself.
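For context on the API limitation: the open-source Whisper package exposes batch transcription only. It takes a complete recording and returns free-form text, with no interface for constraining recognition to a command grammar or for streaming partial results. A minimal sketch, assuming the openai-whisper package:

```python
# Batch transcription with openai-whisper: a finished audio file goes in,
# free-form text comes out. There is no grammar constraint or partial-result
# stream, which is the mismatch with dragonfly's command-grammar model.
import whisper

model = whisper.load_model("base")          # small pretrained model
result = model.transcribe("utterance.wav")  # whole utterance at once
print(result["text"])
```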

LexiconCode commented 7 months ago

@NigelHiggs30 This looks interesting https://github.com/facebookresearch/seamless_communication

drmfinlay commented 4 months ago

Hello Nigel,

Thank you for opening this issue. I apologise for my late reply. This issue fell off my radar.

As @LexiconCode has mentioned above, support for Whisper has been discussed previously. Whisper is impressive, but not useful for everything. It simply is not an appropriate tool for this particular job. I went into the details in #376 and elsewhere (I think).

As for the documentation, it is in need of updating. I am no longer considering the addition of new engines within Dragonfly itself; the engines we have at the moment are quite sufficient, in my opinion. A new engine could, however, be implemented and used externally. One should only need to register an engine instance using the register_engine_init() function for things to work properly.
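For anyone attempting this, a minimal sketch of that external registration follows. WhisperEngine here is hypothetical, the import path for register_engine_init() is an assumption to check against the documentation, and a real engine would need to implement EngineBase's abstract methods (grammar loading, recognition dispatch, and so on):

```python
# A minimal sketch, not a working engine: hooking an externally
# implemented engine into dragonfly via register_engine_init().
from dragonfly.engines import register_engine_init  # assumed import path
from dragonfly.engines.base import EngineBase

class WhisperEngine(EngineBase):
    """Hypothetical engine implemented outside the dragonfly code base."""
    _name = "whisper"
    # ... EngineBase's abstract methods would need real implementations.

engine = WhisperEngine()
register_engine_init(engine)  # get_engine() should now return this instance
```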