MycroftAI / mycroft-core

Mycroft Core, the Mycroft Artificial Intelligence platform.
https://mycroft.ai
Apache License 2.0

Allow using the Whisper speech recognition model #3131

Open 12people opened 1 year ago

12people commented 1 year ago

Is your feature request related to a problem? Please describe. While Mimic has been continually improving, OpenAI just released their Whisper speech recognition model under the MIT license, which seems to be superior yet still usable offline.

Describe the solution you'd like It'd be great if Mycroft could either replace Mimic with Whisper or offer Whisper as an option.

krisgesling commented 1 year ago

Hi Mirek,

Thanks for starting this thread - Whisper is looking pretty interesting, certainly something that's come up in our own chats.

A couple of clarifications though. Mimic is our Text-to-Speech engine. It synthesizes spoken audio from some input text so that Mycroft can speak. Whisper as you've noted is for speech recognition or speech-to-text. That allows Mycroft to hear what the user is saying.

In terms of running offline you would need some decent hardware for this. I don't believe for example that it would be possible on the Mark II, which has a Raspberry Pi 4 inside. The max RAM you can assign to the GPU on the Pi 4 is 256MB. The smallest ("tiny") Whisper model requires 1GB VRAM. So yeah, unlikely to run on a Pi at all, but I'd be very interested if someone managed it.

More broadly I haven't seen any detail on what the training data for Whisper was. I'm assuming they're going the Microsoft / GitHub Copilot route of saying it doesn't matter and having a big team of lawyers ready to defend that. As a company we certainly don't have any position on this yet.

12people commented 1 year ago

Whoops, you're right, I thought Mimic was an STT engine instead — my bad. But it sounds like I was understood nevertheless. :)

You're right that the Raspberry Pi is certainly below the system requirements here. It'd be nice to see this as an option on Linux, though, where most modern systems meet the requirements for at least the smallest model.

ddurdin commented 1 year ago

I have Mycroft on my robot. I'm using an RPi 4 with mostly standalone TTS and STT programs - DeepSpeech and Mimic3. I would like to tackle using OpenAI Whisper as my STT engine. I understand that it may not be feasible for the Mycroft authors to implement Whisper, but I would like to try to do that myself. Is there anyone out there who could give me guidance with that? It seems like all the interfaces and APIs are there now, based on the number of different apps available.

forslund commented 1 year ago

You could have a look at the docs for creating STT-plugins: https://mycroft-ai.gitbook.io/docs/mycroft-technologies/mycroft-core/plugins/stt

You can also refer to the STT classes included in mycroft-core, MycroftSTT for example.
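For anyone wanting to experiment, here is a rough sketch of what such a plugin class could look like, assuming the openai-whisper package is installed; the class name, config key, and model choice are illustrative, and the actual plugin interface is described in the docs linked above.

```python
# Rough sketch of a Whisper-backed STT plugin (illustrative only).
# Assumes: pip install openai-whisper
import tempfile

import whisper

from mycroft.stt import STT


class WhisperSTT(STT):
    def __init__(self):
        super().__init__()
        # "tiny" is the smallest Whisper model; larger models are more
        # accurate but need considerably more memory and compute.
        self.model = whisper.load_model(self.config.get("model", "tiny"))

    def execute(self, audio, language=None):
        # `audio` is a speech_recognition.AudioData instance; write it to a
        # temporary WAV file so Whisper can load and resample it itself.
        lang = (language or self.lang).split("-")[0]
        with tempfile.NamedTemporaryFile(suffix=".wav") as f:
            f.write(audio.get_wav_data())
            f.flush()
            result = self.model.transcribe(f.name, language=lang)
        return result["text"].strip()
```

Once packaged with an STT plugin entry point and installed, it would be selected through the "stt" module setting in mycroft.conf, as described in the plugin docs.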

ddurdin commented 1 year ago

Thanks. That will get me started.

walking-octopus commented 1 year ago

> Whoops, you're right, I thought Mimic was an STT engine instead — my bad. But it sounds like I was understood nevertheless. :)
>
> You're right that the Raspberry Pi is certainly below the system requirements here. It'd be nice to see this as an option on Linux, though, where most modern systems meet the requirements for at least the smallest model.

Maybe it can use Whisper through an inference API, which can be hosted by HuggingFace, Mycroft, or on your local network.

el-tocino commented 1 year ago

> Maybe it can use Whisper through an inference API, which can be hosted by HuggingFace, Mycroft, or on your local network.

Something like whispering perhaps

walking-octopus commented 1 year ago

> Something like whispering perhaps

Do we need real-time? All we have to do is listen to a wake word, record the audio, stop when there's no more voice activity, send the audio to the server, and receive the transcription. I don't quite see how real-time streaming would be useful here...

krisgesling commented 1 year ago

Not "real-time" just means it will take longer to return the result. In a really bad scenario you would wake the device, speak your question/command then wait a few minutes for the transcription to come back before Mycroft can act upon it. Honestly you really want (at the very least) less than 2 seconds response time for STT or it just feels too slow. It quickly hits the point where you may as well whip out your phone and open an app or type a search query.

Self-hosting is great, as long as you have a decent GPU on your local network that is always running (at least running while your voice assistants are), which can noticeably add to your power bills.

Someone might publish a plugin that uses a publicly available API, however you would want to trust that API provider with your data and to check the terms of service. If someone in the community creates a plugin that violates a site's terms of service then it's up to each person whether they use it, but it's not necessarily something we can legally distribute as a company.

In terms of an official Mycroft hosted instance - it might be something we choose to host in the future, but it's not something we're working on right at this moment. We'd rather get better on-device STT. Something that can run in real-time on the Pi, and that has a high enough accuracy for the range of vocabulary that people expect a voice assistant to understand. Can't promise anything yet, but we'll see what happens...

samuela commented 1 year ago

The whisper-large-v2 model is available on HuggingFace, and they support a hosted API. I've not used it yet personally, but they appear to support streaming inference. Might be worth exploring!
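For reference, a minimal sketch of calling the hosted Inference API with a recorded WAV file might look like the following; the model ID, token placeholder, and response shape are assumptions, so check the HuggingFace docs for the current API.

```python
import requests

# Hypothetical token; the endpoint accepts raw audio bytes and returns JSON.
API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v2"
HEADERS = {"Authorization": "Bearer <your-hf-token>"}


def transcribe(wav_path):
    with open(wav_path, "rb") as f:
        response = requests.post(API_URL, headers=HEADERS, data=f.read())
    response.raise_for_status()
    return response.json().get("text", "")
```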

dgalli1 commented 1 year ago

> In terms of running offline you would need some decent hardware for this. I don't believe for example that it would be possible on the Mark II, which has a Raspberry Pi 4 inside. The max RAM you can assign to the GPU on the Pi 4 is 256MB. The smallest ("tiny") Whisper model requires 1GB VRAM. So yeah, unlikely to run on a Pi at all, but I'd be very interested if someone managed it.

There is also whisper.cpp (https://github.com/ggerganov/whisper.cpp), which uses the same models and runs entirely on the CPU. In my tests it works fine on entry-level ARM hardware, although it might be a little too slow for serious use; this would need some more testing.
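As a quick way to try whisper.cpp from Python on that kind of hardware, one could shell out to a locally built binary; the binary path, model path, and flags below are assumptions, so check the whisper.cpp README for the options your build supports.

```python
import subprocess


def transcribe_with_whispercpp(wav_path,
                               binary="./main",
                               model="models/ggml-tiny.bin"):
    # -nt drops timestamps so stdout is just the transcription text.
    result = subprocess.run(
        [binary, "-m", model, "-f", wav_path, "-nt"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()
```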

JarbasAl commented 1 year ago

https://github.com/OpenVoiceOS/ovos-stt-plugin-whisper
https://github.com/OpenVoiceOS/ovos-stt-plugin-whisper-tflite
https://github.com/OpenVoiceOS/ovos-stt-plugin-whispercpp

https://youtu.be/Aor6CFkcWzU