Too high confidence for unrelated words

SvenFackert commented 3 years ago

Hey guys,

First of all, thanks so much for this amazing project! Vosk as well as the Android project were so easy to set up and provide awesome speech recognition results. I am amazed and thankful!

I have integrated the speech recognition into an app, primarily for recognizing numbers. That's why I am feeding the KaldiRecognizer with a grammar (that contains mostly numbers from 0 to 200 and a few other commands). Recognition for those numbers works astonishingly well! Confidence is 1.0 almost all the time and only seldom less than that.

Now to a problem that I observed: with that fixed set of words, the recognizer tends to return arbitrary words from the grammar when unrelated words (not in the grammar) are spoken. I am using the German language model (vosk-model-small-de-0.15) with the latest Vosk for Android (com.alphacep:vosk-android:0.3.17). For example, saying the word "Waschmaschine" results in the word "Null" with a confidence of 1.0. This happens with many unrelated words, so talking while the app is running is almost impossible without the app picking up false commands.

Do you guys maybe have an idea where this problem could stem from? If it's a problem with the model, would it help to train it more with the words from the grammar? Or is it required to add random words to the grammar in order to prevent those false positives?

Looking forward to your response.

Kind regards, Sven

nshmyrev commented 3 years ago

> Or is it required to add random words to the grammar in order to prevent those false positives?

Yes, you need to add some frequent words to the grammar so they will catch out-of-grammar speech.

Overall, yes, our confidence scores are not perfect; we plan to work on them.

SvenFackert commented 3 years ago

Thanks very much for the immediate response! I tried adding some of the most frequent German words, but that made recognition worse: our initial set of words is no longer recognized as well, because the recognizer often picks one of the frequent words instead (e.g. "sehen" instead of "zehn", which sounds very similar, so I am not surprised). It may improve if I select only words that do not sound similar to the ones in our grammar. Looking forward to any improvements in confidence scores :-)

Also, another question regarding the recognition results: when I say a single word like "acht", I often receive multiple words as a result (for example "nur acht" or "sechs sieben acht"). Is this a known problem? It seems to me like there is some kind of context involved. Is there any way to prevent this behavior?

nshmyrev commented 3 years ago

> Also, another question regarding the recognition results: when I say a single word like "acht", I often receive multiple words as a result (for example "nur acht" or "sechs sieben acht"). Is this a known problem? It seems to me like there is some kind of context involved. Is there any way to prevent this behavior?

You probably specified context phrases incorrectly. If you want to recognize single words you add them into phrases as single words.

SvenFackert commented 3 years ago

Ahh, so you mean instead of ["one two three"] (as in the demo application) I should add the grammar as ["one", "two", "three"] ?

nshmyrev commented 3 years ago

Yes
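
To illustrate the exchange above, here is a minimal Java sketch that builds a grammar in which every word is its own phrase. `GrammarBuilder` and `toGrammarJson` are hypothetical helper names and not part of Vosk. The optional `[unk]` entry mirrors the grammar used in the demo application. The `Recognizer` call mentioned in the comment is an assumption about the vosk-android constructor and is not executed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Vosk): turns a word list into a JSON array
// string where each word is a separate phrase, e.g. ["eins", "zwei"].
final class GrammarBuilder {

    static String toGrammarJson(List<String> words, boolean includeUnknown) {
        List<String> entries = new ArrayList<>(words);
        if (includeUnknown) {
            // "[unk]" appears in the demo grammar; treating it as a catch-all
            // for out-of-grammar speech is an assumption of this sketch.
            entries.add("[unk]");
        }
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < entries.size(); i++) {
            if (i > 0) {
                json.append(", ");
            }
            json.append('"').append(escape(entries.get(i))).append('"');
        }
        return json.append("]").toString();
    }

    // Escapes backslashes and double quotes so the output stays valid JSON.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        String grammar = toGrammarJson(Arrays.asList("null", "eins", "zwei", "drei"), true);
        System.out.println(grammar);
        // Assumed usage with vosk-android (not run here):
        // Recognizer recognizer = new Recognizer(model, 16000.0f, grammar);
    }
}
```

Passing one string per word, rather than one string containing all words, is what tells the recognizer that each word is a complete utterance on its own.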

SvenFackert commented 3 years ago

Alright, I will try that - thanks! In the meantime, I removed from the grammar a few of the most frequent German words that sound similar to words in our initial grammar. This way I am getting very good results. The false positive rate dropped and (if I ignore confidences) the rate of correctly recognized numbers and words is well over 95%. That's awesome, thanks very much for your help!
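
The pruning step described above could be automated roughly as follows. This is only a sketch under a strong assumption: it uses Levenshtein distance on spelling as a crude stand-in for acoustic similarity, which is not how the thread's author selected words and not something Vosk provides. `FillerWordFilter`, `editDistance`, `dropSimilar`, and the threshold value are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: removes filler-word candidates whose spelling is too
// close to any command word in the grammar.
final class FillerWordFilter {

    // Classic Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keeps a filler only if its distance to every command is at least minDistance.
    static List<String> dropSimilar(List<String> fillers, List<String> commands, int minDistance) {
        List<String> kept = new ArrayList<>();
        for (String filler : fillers) {
            boolean tooClose = false;
            for (String command : commands) {
                if (editDistance(filler.toLowerCase(), command.toLowerCase()) < minDistance) {
                    tooClose = true;
                    break;
                }
            }
            if (!tooClose) {
                kept.add(filler);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> commands = Arrays.asList("zehn", "acht", "null");
        List<String> fillers = Arrays.asList("sehen", "haus", "nacht", "waschmaschine");
        // With a threshold of 3, "sehen" (close to "zehn") and "nacht"
        // (close to "acht") are dropped.
        System.out.println(dropSimilar(fillers, commands, 3));
    }
}
```

A phonetic comparison would likely match the "sehen" versus "zehn" problem more faithfully than spelling distance; the threshold here would need tuning against real recognition results.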

One last question remains: I noticed that the model is trained on 16 kHz audio data, so recognition probably works best (or even only?) with 16 kHz input from the mobile phone mic. All of my test devices support that sample rate, but do you have any idea how many Android devices in general support 16 kHz? I read that the AudioManager from Android is not guaranteed to work with 16 kHz (depending on the device). Or am I just misunderstanding this and 16 kHz is always available?

nshmyrev commented 3 years ago

You can record at 44100 Hz; it shouldn't matter. 44100 Hz must be supported on Android.

SvenFackert commented 3 years ago

Perfect, thanks so much. Appreciate it!

alphacep / vosk-android-demo

Too high confidence for unrelated words #122