KoljaB / RealtimeSTT

A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription.
MIT License

Select languages to read #140

Open alex-crr opened 3 weeks ago

alex-crr commented 3 weeks ago

So I'm building an assistant that I'd like to be able to speak to in both French and English (mainly because the models have trouble understanding my accent). However, it sometimes thinks I'm speaking Portuguese, which is unfortunate.

KoljaB commented 3 weeks ago

As far as I know, Whisper only supports either fixing a single language or staying open to every language. Your use case would need it restricted to two languages, but that's not possible afaik, so currently you'd need to either pick French or English, or use the multilingual model, which might misunderstand sometimes. Maybe another user has an idea?
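One possible workaround, only as a rough sketch: if the backend exposes per-language detection probabilities (faster-whisper reports these as `all_language_probs` on its transcription info, as far as I know; whether RealtimeSTT surfaces them is an assumption), you could pick the most likely language among only the ones you allow and then transcribe with that language fixed. The helper `pick_allowed_language` and the sample numbers below are hypothetical and not part of RealtimeSTT.

```python
from typing import Iterable, Sequence, Tuple


def pick_allowed_language(
    language_probs: Iterable[Tuple[str, float]],
    allowed: Sequence[str] = ("en", "fr"),
) -> str:
    """Return the most probable language among `allowed`."""
    probs = dict(language_probs)
    # Languages missing from the detector output count as probability 0.
    return max(allowed, key=lambda lang: probs.get(lang, 0.0))


# Made-up probabilities for illustration: Portuguese wins overall,
# but French is the best guess among the two allowed languages.
example = [("pt", 0.48), ("fr", 0.40), ("en", 0.10), ("es", 0.02)]
print(pick_allowed_language(example))  # fr
```

The chosen code could then be passed as the fixed language for the actual transcription, at the cost of one extra detection step per utterance.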

homelab-00 commented 3 weeks ago

> As far as I know, Whisper only supports either fixing a single language or staying open to every language.

What KoljaB says is correct. However, even if you set it to a specific language, it can still detect other languages. I'm not sure about longer sentences, but in my experience, with the language set to Greek and speaking to it in Greek, it still correctly detects English words I throw in here and there.

Note, however, that you should use the regular Whisper models (medium, large-v2, large-v3, etc.) or the faster-whisper models (Systran/faster-whisper-large-v3, etc.). The distil versions (e.g. Systran/faster-distil-whisper-large-v3) will auto-translate to English.

andrey2620 commented 2 weeks ago

An idea I had: what about running two instances, one for English and another for French?

alex-crr commented 2 weeks ago

That's interesting. However, the model tries to interpret whatever you say in the language specified, so you'd get one correct French transcription and one very wrong English one, and you'd probably need some kind of layer to select which one makes more sense.
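A minimal sketch of such a selection layer, assuming each instance can report some confidence score alongside its text. Here that score is an average log-probability (faster-whisper segments carry an `avg_logprob` field; whether RealtimeSTT passes it through is an assumption). `Candidate`, `select_best`, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    language: str
    text: str
    avg_logprob: float  # assumed confidence score; closer to 0 is better


def select_best(candidates: List[Candidate]) -> Candidate:
    """Keep the transcription the model was most confident about."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: c.avg_logprob)


# Made-up results for the same utterance from two instances.
candidates = [
    Candidate("fr", "Quelle heure est-il ?", avg_logprob=-0.21),
    Candidate("en", "Kell or a teal", avg_logprob=-1.35),
]
print(select_best(candidates).language)  # fr
```

Log-probabilities from different language settings are not guaranteed to be directly comparable, so a language-model or dictionary check might be a more reliable tie-breaker in practice.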

homelab-00 commented 2 weeks ago

> That's interesting. However, the model tries to interpret whatever you say in the language specified, so you'd get one correct French transcription and one very wrong English one, and you'd probably need some kind of layer to select which one makes more sense.

You could have a toggle to mute one and unmute the other. But you'd have to keep them both loaded in memory, and depending on model size and hardware constraints that could be tough.
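Roughly, the toggle could look like the sketch below. It assumes both recorders are already loaded and that each one delivers its text tagged with its language; `LanguageToggle` is a hypothetical helper, and the `object()` placeholders stand in for real recorder instances.

```python
from typing import Dict, Optional


class LanguageToggle:
    """Tracks which of several already-loaded recorders is unmuted."""

    def __init__(self, recorders: Dict[str, object], active: str) -> None:
        if active not in recorders:
            raise KeyError(active)
        self.recorders = recorders
        self.active = active

    def switch(self, language: str) -> None:
        """Unmute `language` and implicitly mute every other recorder."""
        if language not in self.recorders:
            raise KeyError(language)
        self.active = language

    def handle_text(self, language: str, text: str) -> Optional[str]:
        """Pass text through only if it came from the active recorder."""
        return text if language == self.active else None


# Placeholders instead of real recorders, for illustration only.
toggle = LanguageToggle({"en": object(), "fr": object()}, active="en")
print(toggle.handle_text("fr", "Bonjour"))  # None
toggle.switch("fr")
print(toggle.handle_text("fr", "Bonjour"))  # Bonjour
```

This only drops the muted recorder's output; both models still sit in memory and still process audio unless the muted one is also paused.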

Another possible idea would be loading a single multilingual model in a server-like setup and then querying that server with two client scripts, each with a different language config. Although I'm not sure whether the recorder config can be changed on the fly without reloading the model.

EDIT: From the server/client README it looks like the 'language' argument is part of the server config, so you can't query the same multilingual model (on the server) with two different language clients. You'd need to set up two servers instead.