alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

[Maybe bug] False Positive of vosk-android-demo while I shut up #730

Open diyism opened 2 years ago

diyism commented 2 years ago

I'm trying to build a single-syllable voice keyboard (without using the NATO phonetic alphabet) to speed up typing and to accurately input variable abbreviations.

Chinese characters are all single syllables, so I built a proof-of-concept prototype based on vosk-android-demo (https://github.com/alphacep/vosk-android-demo):

I replaced models/assets/model-en-us in vosk-android-demo with models/assets/vosk-model-small-cn-0.3 and modified app/java/org.vosk.demo/VoskActivity.java:

    StorageService.unpack(this, "vosk-model-small-cn-0.3", "model",     .....

 Recognizer rec = new Recognizer(model, 16000.0f, "[\"诶\", \"逼\", \"西\", \"第\", \"依\", \"飞\", \"鸡\", \"黑\", \"挨\", \"接\", \"尅\", \"勒\", \"摸\", \"你\", \"欧\", \"批\", \"哭\", \"啊\", \"斯 [unk]\", \"踢\", \"优\", \"微\", \"乌\", \"叉\", \"歪\", \"热\", \"[unk]\"]");

Each of the 26 Chinese characters stands for one of the 26 English letters.

I compiled the APK, installed it on my phone, and tapped "Recognize Microphone". It shows:
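The hand-escaped grammar string above is easy to mistype, so here is a minimal sketch of building it from a word list instead. `GrammarBuilder`, `quote`, and `toGrammarJson` are hypothetical helper names, not part of the Vosk API; the only assumption about Vosk is that the third `Recognizer` constructor argument is a JSON array of phrases, as in the snippet above.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper (not part of Vosk): builds the JSON grammar string
// that is passed as the third argument of the Recognizer constructor.
final class GrammarBuilder {

    /** Wraps a word in double quotes, escaping backslashes and quotes. */
    static String quote(String word) {
        return "\"" + word.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Joins the words and the unknown-word token into a JSON array. */
    static String toGrammarJson(List<String> words, String unknownToken) {
        StringJoiner json = new StringJoiner(", ", "[", "]");
        for (String word : words) {
            json.add(quote(word));
        }
        json.add(quote(unknownToken));
        return json.toString();
    }

    public static void main(String[] args) {
        // First three entries of the list above; extend to all 26 as needed.
        List<String> words = List.of("诶", "逼", "西");
        String grammar = toGrammarJson(words, "[unk]");
        System.out.println(grammar);
        // In the demo activity the string would then be used as:
        // Recognizer rec = new Recognizer(model, 16000.0f, grammar);
    }
}
```

Keeping the words in a list also makes it simple to drop individual entries or swap the unknown-word token when experimenting.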

{"partial": ""}
{"partial": ""}
{"partial": ""}
{"partial": ""}
.....

Then I began to speak, and the result is amazing: it is far more accurate and quicker than pocketsphinx, Google SpeechRecognizer (EXTRA_PREFER_OFFLINE), or TensorFlow Lite (Teachable Machine). When I say 诶(ey), 逼(bee), 西(shee), 第(dee), 依(yee), 飞(fey), ...., every Chinese character shows up correctly in the app.

But after I stop speaking, it keeps scrolling and shows:

{"partial": "飞"}
{"partial": "你"}
{"partial": "飞"}
{"partial": "你"}
.....

I think it should show {"partial": ""} when no human voice is detected, so this may be a bug.

Also, I would rather not use the Chinese model to recognize the 26 letters, but I can't work out the detailed steps for retraining with a single-syllable phonetic dictionary of only 26 letters. Can anyone give me a hint?

a    EY
b    B EE
c    S EE
d    D EE
e    EE
f    F EY
g    JH EE
h    HH EY
.....
nshmyrev commented 2 years ago

Hi

Thank you for the report.

Could you reproduce the same issue with Python on a desktop? Could you also try to reproduce the problem with an audio file?

nshmyrev commented 2 years ago

By the way, in the Chinese model [unk] should probably be <UNK>.

diyism commented 2 years ago

No luck after I changed it to:

Recognizer rec = new Recognizer(model, 16000.0f,
                        "[\"诶\", \"逼\", \"西\", \"第\", \"依\", \"飞\", \"鸡\", \"黑\", \"挨\", \"接\", \"尅\", \"勒\", \"摸\", \"你\", \"欧\", \"批\", \"哭\", \"啊\", \"斯\", \"踢\", \"优\", \"微\", \"乌\", \"叉\", \"歪\", \"热\", \"<UNK>\"]");

I recompiled it and installed it on my phone. While my room is silent, the app keeps scrolling with:

{"partial": "飞"}
{"partial": "飞"}
{"partial": "飞"}
{"partial": "飞"}
.....

It seems we need some threshold parameter to filter out background noise. The ambient noise level in my silent room is 38 dB (measured with this app: https://play.google.com/store/apps/details?id=com.gamebasic.decibel&hl=en_US&gl=US).
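As an illustration of the kind of threshold meant here, the following is a minimal sketch of an energy gate applied in application code to 16-bit PCM buffers. It is not a Vosk feature: `SilenceGate`, `rmsDbfs`, and `isLoudEnough` are hypothetical names, and the threshold is expressed in dBFS (relative to digital full scale), which is not the same unit as the 38 dB reading from a sound-level app and would have to be calibrated per device.

```java
// Hypothetical application-level gate (not part of Vosk): measures the RMS
// level of a 16-bit PCM buffer and compares it with a chosen threshold.
final class SilenceGate {

    /** Returns the RMS level in dBFS; 0 dBFS is full scale, silence is -Infinity. */
    static double rmsDbfs(short[] samples, int length) {
        if (length <= 0) {
            return Double.NEGATIVE_INFINITY;
        }
        double sumSquares = 0.0;
        for (int i = 0; i < length; i++) {
            double normalized = samples[i] / 32768.0; // scale to [-1, 1)
            sumSquares += normalized * normalized;
        }
        double rms = Math.sqrt(sumSquares / length);
        return rms == 0.0 ? Double.NEGATIVE_INFINITY : 20.0 * Math.log10(rms);
    }

    /** True when the buffer is at or above the threshold (in dBFS). */
    static boolean isLoudEnough(short[] samples, int length, double thresholdDbfs) {
        return rmsDbfs(samples, length) >= thresholdDbfs;
    }

    public static void main(String[] args) {
        short[] quiet = new short[1600]; // 0.1 s of digital silence at 16 kHz
        // -40 dBFS is an arbitrary example threshold, not a recommended value.
        System.out.println(isLoudEnough(quiet, quiet.length, -40.0));
    }
}
```

An app could, for example, ignore partial results while recent buffers stay below the threshold. Whether it is better to hide results or to withhold quiet audio from the recognizer altogether is a design choice that would need testing.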

diyism commented 2 years ago

Hmm, after I removed 飞(fey) and 你(knee):

Recognizer rec = new Recognizer(model, 16000.0f,
                        "[\"诶\", \"逼\", \"西\", \"第\", \"依\", \"鸡\", \"黑\", \"挨\", \"接\", \"尅\", \"勒\", \"摸\", \"欧\", \"批\", \"哭\", \"啊\", \"斯\", \"踢\", \"优\", \"微\", \"乌\", \"叉\", \"歪\", \"热\", \"[unk]\"]");

everything works now.

Could anyone do me a favor and write a detailed guide on how to retrain with a phonetic dictionary of only 26 letters?

a    EY
b    B EE
c    S EE
d    D EE
e    EE
f    F EY
g    JH EE
h    HH EY
i    AY
j    JH EY
k    K EY
l    L ER
m    M OW
n    N IY
o    OW
p    P IY
q    K UW
r    AA
s    S EH
t    T IY
u    IY OW
v    W IY
w    W UH
x    TS AA
y    W EY
z    Z IY
.....
nshmyrev commented 2 years ago

Yeah, unfortunately it is an issue with the current Chinese models. We have to retrain them with new data and a proper UNK handler. It will probably take a couple of weeks.

diyism commented 2 years ago

Maybe I can help. I'm experimenting with Google Colab (GPU/TPU), so if you need my help with anything, don't hesitate to email me (kexianbin@diyism.com).

nshmyrev commented 2 years ago

It's OK, thank you. Once we have a new version, I'll post an update here.

diyism commented 2 years ago

I've tested the default models/assets/model-en-us again with:

Recognizer rec = new Recognizer(model, 16000.0f, "[\"a\", \"be\", \"see\", \"d\", \"e\", \"faye\", \"gee\", 
\"her\", \"i\", \"jay\", \"k\", \"lie\", \"more\", \"knee\", \"oh\", \"p\", \"queue\", \"r\", \"say\", 
\"tea\", \"u\", \"we\", \"woo\", \"she\", \"why\", \"thee\", \"[unk]\"]");

When I speak these words continuously, the vosk-android-demo app shows:

{"partial": "a be"}
{"partial": "a be"}
{"partial": "a be see"}
{"partial": "a be see d"}
{"partial": "a be see d"}
{"partial": "a be see d e"}
{"partial": "a be see d e faye"}
{"partial": "a be see d e faye gee"}
{"partial": "a be see d e faye gee"}
{"partial": "a be see d e faye gee"}
{"partial": "a be see d e faye gee her"}
{"partial": "a be see d e faye gee her i"}
{"partial": "a be see d e faye gee her i jay k"}
{"partial": "a be see d e faye gee her i jay k lie"}
{"partial": "a be see d e faye gee her i jay k lie"}
{"partial": "a be see d e faye gee her i jay k lie"}
{"partial": "a be see d e faye gee her i jay k lie more"}
{"partial": "a be see d e faye gee her i jay k lie more knee"}

The syllable recognition is very accurate; I'd guess the accuracy is close to 98%. I can extract the recognized English letters in real time by diffing consecutive partial results.
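The diffing idea can be sketched roughly as follows. `PartialDiff`, `newWords`, and `toLetters` are hypothetical names, and the word-to-letter table is an assumption derived from the order of the 26 grammar words above. The sketch takes the text of the "partial" field (already extracted from the JSON) and treats every word after the common prefix as new; it does not handle the engine revising an earlier word in any smarter way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Vosk): finds the words appended between
// two consecutive partial results and maps them to letters.
final class PartialDiff {

    // Assumed mapping: the n-th grammar word stands for the n-th letter.
    static final Map<String, String> WORD_TO_LETTER = Map.ofEntries(
            Map.entry("a", "a"), Map.entry("be", "b"), Map.entry("see", "c"),
            Map.entry("d", "d"), Map.entry("e", "e"), Map.entry("faye", "f"),
            Map.entry("gee", "g"), Map.entry("her", "h"), Map.entry("i", "i"),
            Map.entry("jay", "j"), Map.entry("k", "k"), Map.entry("lie", "l"),
            Map.entry("more", "m"), Map.entry("knee", "n"), Map.entry("oh", "o"),
            Map.entry("p", "p"), Map.entry("queue", "q"), Map.entry("r", "r"),
            Map.entry("say", "s"), Map.entry("tea", "t"), Map.entry("u", "u"),
            Map.entry("we", "v"), Map.entry("woo", "w"), Map.entry("she", "x"),
            Map.entry("why", "y"), Map.entry("thee", "z"));

    /** Splits the text of a partial result into words; empty text gives no words. */
    static List<String> split(String partial) {
        String trimmed = partial.trim();
        return trimmed.isEmpty() ? List.of() : Arrays.asList(trimmed.split("\\s+"));
    }

    /** Returns the words of current that follow its common prefix with previous. */
    static List<String> newWords(String previous, String current) {
        List<String> prev = split(previous);
        List<String> cur = split(current);
        int common = 0;
        while (common < prev.size() && common < cur.size()
                && prev.get(common).equals(cur.get(common))) {
            common++;
        }
        return new ArrayList<>(cur.subList(common, cur.size()));
    }

    /** Maps words to letters; words outside the table (such as [unk]) are skipped. */
    static String toLetters(List<String> words) {
        StringBuilder letters = new StringBuilder();
        for (String word : words) {
            letters.append(WORD_TO_LETTER.getOrDefault(word, ""));
        }
        return letters.toString();
    }

    public static void main(String[] args) {
        String previous = "a be see d e faye gee her i";
        String current = "a be see d e faye gee her i jay k";
        System.out.println(toLetters(newWords(previous, current)));
    }
}
```

Because partial results may change before the utterance is finalized, letters emitted this way are provisional until the final result arrives.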

Great!

The only flaw is that sometimes, when I say "a be see d e faye gee", the "e" in the middle is swallowed by the Vosk engine, and the app shows: {"partial": "a be see d faye gee"}

Each word is a single syllable and takes only about 0.3 seconds to pronounce. I guess "dee eee" (about 0.3 + 0.3 = 0.6 seconds) was recognized as "deeeee". Is there any parameter or option to force the Vosk engine to match words that are only 0.3 seconds long?