alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Recognizer with grammar: German and Spanish models hardly ever produce [unk] #1017

Open omlins opened 2 years ago

omlins commented 2 years ago

With the small English model, recognizers with a grammar behave as expected: [unk] is returned when a spoken sentence is clearly different from anything specified in the grammar. With the small German and Spanish models [1], however, the recognizers hardly ever return [unk] when a sentence is spoken, even if it clearly has nothing to do with the specified grammar. It looks to me as if the recognizer immediately returns some result without analyzing the whole word group (even though there are no silences between the words).

For example, I said "was möchtest du als nächstes tun" and analyzed it with a recognizer using the following grammar: ["terminus", "punkt", "klein", "ausruf", "frage", "doppelpunkt", "zurück", "sprache", "komma", "vor", "gross", "paragraf", "buchstaben", "ziffern", "strichpunkt", "[unk]"]. The recognizer switched the partial result three times (from "buchstaben" to "gross" to "strichpunkt") and then returned the result "gross" after having processed only the beginning of the word group (probably about 20%), despite the absence of silence between the words of the spoken sentence.

@nshmyrev, how can one prevent the recognizer from returning results before it has analyzed the whole word group?

Thanks!!

[1] "vosk-model-small-de-0.15" and "vosk-model-small-es-0.22"
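
For readers trying to reproduce the behaviour described above, the following is a minimal sketch of how partial and full results arise when audio is fed to a recognizer chunk by chunk. It assumes an object that behaves like the Python bindings' `KaldiRecognizer` (methods `AcceptWaveform`, `PartialResult`, `Result`, `FinalResult`); the helper names `build_grammar` and `trace_recognition` are hypothetical and not part of Vosk.

```python
import json


def build_grammar(words):
    """Return the JSON grammar string for a word list, making sure "[unk]" is included."""
    words = list(words)
    if "[unk]" not in words:
        words.append("[unk]")
    # ensure_ascii=False keeps words such as "zurück" as plain UTF-8 text.
    return json.dumps(words, ensure_ascii=False)


def trace_recognition(rec, chunks):
    """Feed audio chunks to a recognizer and record each change of its hypothesis.

    Assumption: `rec` behaves like vosk.KaldiRecognizer, i.e. AcceptWaveform()
    returns a truthy value once the recognizer decides an utterance has ended
    (Result() then holds the full result); otherwise PartialResult() holds the
    current, still changeable hypothesis.
    """
    events = []
    last_partial = None
    for chunk in chunks:
        if rec.AcceptWaveform(chunk):
            events.append(("full", json.loads(rec.Result())["text"]))
            last_partial = None
        else:
            partial = json.loads(rec.PartialResult())["partial"]
            if partial != last_partial:  # log only changes of the partial result
                events.append(("partial", partial))
                last_partial = partial
    # Flush whatever is still buffered at the end of the audio.
    events.append(("final", json.loads(rec.FinalResult())["text"]))
    return events
```

With Vosk installed, `rec` could presumably be created as `KaldiRecognizer(Model(model_name="vosk-model-small-de-0.15"), 16000, build_grammar([...]))`; the resulting event list shows at which chunk a full result is emitted relative to the partial results before it.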

nshmyrev commented 2 years ago

I've just tried with the attached audio file, and it returned [unk]. Here is the full output:

test-de-unk-1017.zip

vosk-model-small-de-0.15.zip: 100%|████████████████████████████████████████████████████████████████| 44.4M/44.4M [00:00<00:00, 51.7MB/s]
LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /home/shmyrev/.cache/vosk/vosk-model-small-de-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /home/shmyrev/.cache/vosk/vosk-model-small-de-0.15/graph/HCLr.fst /home/shmyrev/.cache/vosk/vosk-model-small-de-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /home/shmyrev/.cache/vosk/vosk-model-small-de-0.15/graph/phones/word_boundary.int
LOG (VoskAPI:Recognizer():recognizer.cc:63) ["terminus", "punkt", "klein", "ausruf", "frage", "doppelpunkt", "zurück", "sprache", "komma", "vor", "gross", "paragraf", "buchstaben", "ziffern", "strichpunkt", "[unk]"]
LOG (VoskAPI:Estimate():language_model.cc:142) Estimating language model with ngram-order=2, discount=0.5
LOG (VoskAPI:OutputToFst():language_model.cc:209) Created language model with 17 states and 32 arcs.
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : "strichpunkt"
}
{
  "partial" : "strichpunkt"
}
{
  "partial" : "strichpunkt"
}
{
  "partial" : "ausruf"
}
{
  "partial" : ""
}
{
  "partial" : "strichpunkt"
}
{
  "partial" : "strichpunkt"
}
{
  "partial" : "[unk]"
}
{
  "partial" : "[unk]"
}
{
  "text" : "[unk]"
}

omlins commented 2 years ago

Thanks @nshmyrev for trying! I will try your example too and report back...

omlins commented 2 years ago

@nshmyrev: I tried with your audio and also got [unk]. One observation, though: the computer voice in the audio is pretty fast and there is no silence at all between words, which probably does not represent all kinds of natural speakers. I have therefore recorded the sentence myself in different ways. When I spoke very fast without any silences, I also got [unk] as desired. When I spoke rather slowly and clearly but without silences, I again got an undesired result (the audio is here: German_sentence_clearbutnosilences.zip):

┌ Debug: Dynamic recognizer created for the following grammar: ["terminus", "punkt", "klein", "ausruf", "frage", "doppelpunkt", "zurück", "sprache", "komma", "vor", "gross", "paragraf", "buchstaben", "ziffern", "strichpunkt", "[unk]"]
└ @ JustSayIt ~/tmpwdir/juliadev/JustSayIt/src/next_token.jl:193
│ Partial result: gross
│ Result: gross

Could you confirm that you obtain the same result? Thanks!!

omlins commented 2 years ago

@nshmyrev : it would be of great help if you could already confirm that you get the same result as I with the audio I shared above (to rule out that there is an issue on how I am calling vosk). I urgently need to find a solution here in order to finalize the PR adding multi-lang support to JustSayIt.jl this week - I need to submit the video for the JuliaCon conference shortly and it should feature multi-lang support! Also I would be very grateful for any comments/suggestions concerning this issue in general... Thanks a lot!!

nshmyrev commented 2 years ago

it would be of great help if you could already confirm that you get the same result as I with the audio I shared above (to rule out that there is an issue on how I am calling vosk).

Yes, I see the same thing as you.

nshmyrev commented 2 years ago

Well, honestly it returns 'gross [unk]' with second phrase [unk]

omlins commented 2 years ago

Well, honestly it returns 'gross [unk]' with second phrase [unk]

Thanks @nshmyrev . I am not quite sure what you mean and the devil is in the detail here! Thus, could you please tell me if it returns A) or B) in the following (noting "full result" as opposed to "partial result")?

A) full result 1: 'gross' full result 2: '[unk]'

B) full result 1: 'gross [unk]' full result 2: '[unk]'

After that, what do you think can be done to obtain exactly one result, namely '[unk]'?

omlins commented 2 years ago

@nshmyrev : I am back to this issue, which is blocking the merging of the multi-language support in JustSayIt. I really hope this can be solved before JuliaCon (July 27-29).

First, I tested the above by minimally modifying the example script from the vosk-api GitHub repository and found that it produces A) full result 1: 'gross' full result 2: '[unk]'. That is the same as what I get when calling Vosk from JustSayIt. So, unfortunately, the issue is definitely in Vosk itself.

Second, to allow for a better understanding of the issue, I have created another, more elaborate example that does not use JustSayIt. I created it in both German and English in order to compare the undesired behaviour in German against the behaviour in English, where things work very much as desired.

The example is very simple; it consists of:

  1. an English audio (en_short_fast.zip) and a German audio (de_short_fast.zip)

  2. a simple Python script to process the English audio (test_grammar_en.zip) and a second one to process the German audio (test_grammar_de.zip)

Each audio consists of three sentences, of which the first contains some words that are part of the grammar used for processing ("letters", "digits", "undo" and "redo" in English and "buchstaben", "ziffern", "gross" and "klein" in German).
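
The test scripts are only attached as archives, so the following is a sketch of what such a script might look like, not the author's actual code. It assumes the Vosk Python bindings (`Model`, `KaldiRecognizer`, `SetWords`, `AcceptWaveform`, `Result`, `FinalResult`) and a mono 16-bit PCM WAV file; the function names are hypothetical.

```python
import json
import wave


def check_mono_pcm16(wf):
    """Raise ValueError unless the opened WAV file is mono 16-bit PCM."""
    if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
        raise ValueError("audio must be a mono 16-bit PCM WAV file")


def iter_chunks(wf, frames_per_chunk=4000):
    """Yield raw PCM byte chunks from an opened WAV file until it is exhausted."""
    while True:
        data = wf.readframes(frames_per_chunk)
        if not data:
            return
        yield data


def recognize_file(wav_path, model_name, grammar):
    """Run a grammar-restricted recognizer over a WAV file and return all full results."""
    # Imported lazily so the helpers above work without Vosk installed.
    from vosk import Model, KaldiRecognizer

    results = []
    with wave.open(wav_path, "rb") as wf:
        check_mono_pcm16(wf)
        rec = KaldiRecognizer(
            Model(model_name=model_name),
            wf.getframerate(),
            json.dumps(grammar, ensure_ascii=False),
        )
        rec.SetWords(True)  # request per-word "conf", "start" and "end" fields
        for data in iter_chunks(wf):
            if rec.AcceptWaveform(data):  # truthy once an utterance has ended
                results.append(json.loads(rec.Result()))
        results.append(json.loads(rec.FinalResult()))
    return results
```

A call such as `recognize_file("de_short_fast.wav", "vosk-model-small-de-0.15", [...])` with the grammar word list would then return one dictionary per full result, comparable to the JSON objects printed in the outputs in this comment.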

The four results for the English audio all contain '[unk]', as expected and desired:

$ python3 test_grammar_en.py en_short_fast.wav 
LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /home/omlins/.cache/vosk/vosk-model-small-en-us-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /home/omlins/.cache/vosk/vosk-model-small-en-us-0.15/graph/HCLr.fst /home/omlins/.cache/vosk/vosk-model-small-en-us-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /home/omlins/.cache/vosk/vosk-model-small-en-us-0.15/graph/phones/word_boundary.int
LOG (VoskAPI:Recognizer():recognizer.cc:63) ["undo", "redo", "uppercase", "lowercase", "letters", "digits", "point", "comma", "colon", "semicolon", "exclammation", "interrogation", "paragrafh", "language", "[unk]"]
WARNING (VoskAPI:Recognizer():recognizer.cc:84) Ignoring word missing in vocabulary: 'exclammation'
WARNING (VoskAPI:Recognizer():recognizer.cc:84) Ignoring word missing in vocabulary: 'paragrafh'
LOG (VoskAPI:Estimate():language_model.cc:142) Estimating language model with ngram-order=2, discount=0.5
LOG (VoskAPI:OutputToFst():language_model.cc:209) Created language model with 14 states and 26 arcs.
{
  "text" : ""
}
{
  "result" : [{
      "conf" : 0.940491,
      "end" : 5.570606,
      "start" : 5.190000,
      "word" : "letters"
    }, {
      "conf" : 0.645985,
      "end" : 6.007534,
      "start" : 5.670000,
      "word" : "digits"
    }, {
      "conf" : 1.000000,
      "end" : 6.720000,
      "start" : 6.007534,
      "word" : "[unk]"
    }],
  "text" : "letters digits [unk]"
}
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 7.620000,
      "start" : 6.720000,
      "word" : "[unk]"
    }, {
      "conf" : 1.000000,
      "end" : 8.040000,
      "start" : 7.620000,
      "word" : "redo"
    }],
  "text" : "[unk] redo"
}
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 12.750000,
      "start" : 10.500000,
      "word" : "[unk]"
    }],
  "text" : "[unk]"
}
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 17.280000,
      "start" : 15.450000,
      "word" : "[unk]"
    }],
  "text" : "[unk]"
}
{
  "text" : ""
}

The results for the German audio, however, do not contain '[unk]', with the exception of the last result. This is unexpected and undesired, as the audio is constructed completely analogously to the English audio and should therefore lead to an analogous result (as mentioned above, the German audio also contains only four words that are part of the grammar, "buchstaben", "ziffern", "gross" and "klein", and they are all located in the first sentence). Here are the results for the German audio:

$ python3 test_grammar_de.py de_short_fast.wav 
LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /home/omlins/.cache/vosk/vosk-model-small-de-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /home/omlins/.cache/vosk/vosk-model-small-de-0.15/graph/HCLr.fst /home/omlins/.cache/vosk/vosk-model-small-de-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /home/omlins/.cache/vosk/vosk-model-small-de-0.15/graph/phones/word_boundary.int
LOG (VoskAPI:Recognizer():recognizer.cc:63) ["rückgängig", "wiederholen", "gross", "klein", "buchstaben", "ziffern", "punkt", "komma", "doppelpunkt", "strichpunkt", "ausrufezeichen", "fragezeichen", "paragraf", "sprache", "[unk]"]
LOG (VoskAPI:Estimate():language_model.cc:142) Estimating language model with ngram-order=2, discount=0.5
LOG (VoskAPI:OutputToFst():language_model.cc:209) Created language model with 16 states and 30 arcs.
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 5.250000,
      "start" : 4.770000,
      "word" : "buchstaben"
    }, {
      "conf" : 1.000000,
      "end" : 5.610000,
      "start" : 5.310000,
      "word" : "ziffern"
    }, {
      "conf" : 0.921601,
      "end" : 6.360000,
      "start" : 6.060000,
      "word" : "komma"
    }, {
      "conf" : 1.000000,
      "end" : 6.750000,
      "start" : 6.420000,
      "word" : "gross"
    }, {
      "conf" : 1.000000,
      "end" : 7.080000,
      "start" : 6.840000,
      "word" : "gross"
    }, {
      "conf" : 1.000000,
      "end" : 7.560000,
      "start" : 7.170000,
      "word" : "klein"
    }],
  "text" : "buchstaben ziffern komma gross gross klein"
}
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 12.270000,
      "start" : 12.000000,
      "word" : "ziffern"
    }, {
      "conf" : 0.601431,
      "end" : 13.080000,
      "start" : 12.570000,
      "word" : "sprache"
    }],
  "text" : "ziffern sprache"
}
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 14.400000,
      "start" : 13.890000,
      "word" : "rückgängig"
    }],
  "text" : "rückgängig"
}
{
  "result" : [{
      "conf" : 0.682580,
      "end" : 19.320000,
      "start" : 19.140000,
      "word" : "[unk]"
    }, {
      "conf" : 1.000000,
      "end" : 20.160000,
      "start" : 19.740000,
      "word" : "wiederholen"
    }],
  "text" : "[unk] wiederholen"
}
{
  "text" : ""
}

Could you please answer the following questions:

  1. Why do you think the behaviour in German is not analogous to the behaviour in English, i.e., why does it produce '[unk]' much less often in German?
  2. How do you think this can be fixed?

PS: note that this does not seem to be an isolated problem with German, but one that also affects other languages. I have had a similar experience with Spanish.
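
To make the comparison between the two outputs above easy to repeat for other models, here is a small sketch that counts how many non-empty full results contain "[unk]". The function name `unk_summary` is hypothetical; it relies only on the "text" field that appears in the JSON results shown in this comment.

```python
import json


def unk_summary(result_jsons):
    """Count non-empty full results and how many of them contain "[unk]".

    `result_jsons` is a list of JSON strings as returned by the recognizer,
    each with a "text" field (empty results are skipped).
    """
    texts = [json.loads(r).get("text", "") for r in result_jsons]
    texts = [t for t in texts if t]
    with_unk = sum("[unk]" in t.split() for t in texts)
    return {"results": len(texts), "with_unk": with_unk}


# The "text" fields from the two outputs above.
english = ["", "letters digits [unk]", "[unk] redo", "[unk]", "[unk]", ""]
german = ["buchstaben ziffern komma gross gross klein", "ziffern sprache",
          "rückgängig", "[unk] wiederholen", ""]

print(unk_summary(json.dumps({"text": t}) for t in english))
print(unk_summary(json.dumps({"text": t}) for t in german))
```

Note that in the German output most of the spurious words are reported with a "conf" of 1.000000, so filtering on the per-word confidence alone would not obviously separate them from correct matches.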

omlins commented 2 years ago

@nshmyrev: any insights on that?

omlins commented 2 years ago

@nshmyrev : to avoid confusion: the fact that I merged the multi-lang PR (see above) does not mean that the issue is solved. In fact, I had to deactivate support for German and Spanish (and leave only English and French).

nshmyrev commented 2 years ago

@omlins thanks for the information. For now it is a complex question without an easy solution. We are looking into similar problems, but it will take time.

In general, we recommend building bigger grammars / language models rather than relying on [unk]. See also https://github.com/alphacep/vosk-api/issues/319#issuecomment-1192207050
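
One possible reading of this recommendation, sketched under assumptions: extend the grammar with additional filler words so that out-of-command speech has somewhere to go, then map everything that is not a command back to "[unk]" in the application. The helper names and the filler list are hypothetical, and whether this actually improves results with the German or Spanish models has not been tested here.

```python
import json


def build_extended_grammar(commands, fillers):
    """Return a JSON grammar with the commands, extra filler words and "[unk]"."""
    # dict.fromkeys removes duplicates while keeping the original order.
    words = list(dict.fromkeys(list(commands) + list(fillers) + ["[unk]"]))
    return json.dumps(words, ensure_ascii=False)


def keep_commands(text, commands):
    """Replace every recognized word that is not a command with "[unk]".

    Consecutive non-command words are merged into a single "[unk]".
    """
    allowed = set(commands)
    out = []
    for word in text.split():
        token = word if word in allowed else "[unk]"
        if token == "[unk]" and out and out[-1] == "[unk]":
            continue
        out.append(token)
    return " ".join(out)


commands = ["gross", "klein", "buchstaben", "ziffern"]
fillers = ["was", "möchtest", "du", "als", "tun"]  # hypothetical filler words
grammar = build_extended_grammar(commands, fillers)
print(keep_commands("was möchtest du gross tun", commands))
```

Filler words need to exist in the model's vocabulary; as the English log earlier in this thread shows, Vosk otherwise ignores them with a "Ignoring word missing in vocabulary" warning.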

nshmyrev commented 2 years ago

@omlins please share if you have any news on your presentation, we'd be happy to check too

omlins commented 2 years ago

@nshmyrev : you can find the JuliaCon 2022 presentation here: https://www.youtube.com/watch?v=W7oQb7pLc04 The abstract is found here: https://pretalx.com/juliacon-2022/talk/H3N8UN/

Furthermore, JustSayIt is now in the process of being registered in the Julia package registry (a temporary note about this can be found on the GitHub repo: https://github.com/omlins/JustSayIt.jl ). Immediately after that, the first release will be made.

Finally, JustSayIt now has a documentation webpage: https://omlins.github.io/JustSayIt.jl

Thanks for your interest and I am looking forward to your feedback (please don't hesitate to reach out to me by private e-mail)!

nshmyrev commented 2 years ago

Thank you, amazing work! I'm impressed with the language switch.