Open omlins opened 2 years ago
I've just tried with the attached audio file, it returned [unk]. Here is the full output:
vosk-model-small-de-0.15.zip: 100%|████████████████████████████████████████████████████████████████| 44.4M/44.4M [00:00<00:00, 51.7MB/s]
LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /home/shmyrev/.cache/vosk/vosk-model-small-de-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /home/shmyrev/.cache/vosk/vosk-model-small-de-0.15/graph/HCLr.fst /home/shmyrev/.cache/vosk/vosk-model-small-de-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /home/shmyrev/.cache/vosk/vosk-model-small-de-0.15/graph/phones/word_boundary.int
LOG (VoskAPI:Recognizer():recognizer.cc:63) ["terminus", "punkt", "klein", "ausruf", "frage", "doppelpunkt", "zurück", "sprache", "komma", "vor", "gross", "paragraf", "buchstaben", "ziffern", "strichpunkt", "[unk]"]
LOG (VoskAPI:Estimate():language_model.cc:142) Estimating language model with ngram-order=2, discount=0.5
LOG (VoskAPI:OutputToFst():language_model.cc:209) Created language model with 17 states and 32 arcs.
{
"partial" : ""
}
{
"partial" : ""
}
{
"partial" : ""
}
{
"partial" : "strichpunkt"
}
{
"partial" : "strichpunkt"
}
{
"partial" : "strichpunkt"
}
{
"partial" : "ausruf"
}
{
"partial" : ""
}
{
"partial" : "strichpunkt"
}
{
"partial" : "strichpunkt"
}
{
"partial" : "[unk]"
}
{
"partial" : "[unk]"
}
{
"text" : "[unk]"
}
Thanks @nshmyrev for trying! I will try your example too and report back...
@nshmyrev : I tried with your audio and also got [unk]. One observation, though: the computer voice in the audio is pretty fast and there is no silence at all between words, which probably does not represent all kinds of natural speakers. I have therefore recorded the sentence myself in different fashions. When I spoke super fast without any silences, I also got [unk], as desired. When I spoke rather slowly and clearly but without silences, I again got an undesired result (the audio is here: German_sentence_clearbutnosilences.zip):
┌ Debug: Dynamic recognizer created for the following grammar: ["terminus", "punkt", "klein", "ausruf", "frage", "doppelpunkt", "zurück", "sprache", "komma", "vor", "gross", "paragraf", "buchstaben", "ziffern", "strichpunkt", "[unk]"]
└ @ JustSayIt ~/tmpwdir/juliadev/JustSayIt/src/next_token.jl:193
│ Partial result: gross
│ Result: gross
Could you confirm that you obtain the same result? Thanks!!
@nshmyrev : it would be of great help if you could already confirm that you get the same result as I with the audio I shared above (to rule out that there is an issue on how I am calling vosk). I urgently need to find a solution here in order to finalize the PR adding multi-lang support to JustSayIt.jl this week - I need to submit the video for the JuliaCon conference shortly and it should feature multi-lang support! Also I would be very grateful for any comments/suggestions concerning this issue in general... Thanks a lot!!
it would be of great help if you could already confirm that you get the same result as I with the audio I shared above (to rule out that there is an issue on how I am calling vosk).
Yes, I see the same thing as you.
Well, honestly it returns 'gross [unk]' with second phrase [unk]
Well, honestly it returns 'gross [unk]' with second phrase [unk]
Thanks @nshmyrev . I am not quite sure what you mean and the devil is in the detail here! Thus, could you please tell me if it returns A) or B) in the following (noting "full result" as opposed to "partial result")?
A) full result 1: 'gross'
full result 2: '[unk]'
B) full result 1: 'gross [unk]'
full result 2: '[unk]'
After that, what do you think can be done to obtain exactly one result, namely '[unk]'?
@nshmyrev : I am back to this issue, which is blocking the merging of the multi-language support in JustSayIt. I really hope this can be solved before JuliaCon (July 27-29).
First, I have tested the above with a minimally modified version of the example script from the vosk-api GitHub repository, and found that it produces A), i.e. full result 1: 'gross' and full result 2: '[unk]'. That is the same as what I get when calling it from JustSayIt. So, unfortunately, the issue is definitely in Vosk itself.
Second, in order to allow for a better understanding of the issue, I have created another, more elaborate example, without using JustSayIt. I created it in both German and English in order to be able to compare the undesired behaviour in German against the behaviour in English, where things work very much as desired.
The example is very simple; it consists of:
an English audio (en_short_fast.zip) and a German audio (de_short_fast.zip)
a simple Python script to process the English audio (test_grammar_en.zip) and a second one to process the German audio (test_grammar_de.zip)
Each audio consists of three sentences, where only the first sentence contains words that are part of the grammar used for processing ("letters", "digits", "undo" and "redo" in English and "buchstaben", "ziffern", "gross" and "klein" in German).
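The attached scripts are not reproduced in the thread. As a rough sketch of what such a script could look like, assuming the Python `vosk` package and a 16-bit mono PCM WAV file: the names `grammar_json` and `transcribe` and the chunk size of 4000 frames are illustrative choices, not taken from the attachments.

```python
import json
import wave


def grammar_json(words):
    """Serialize a word list into the JSON string passed to the recognizer as grammar."""
    # ensure_ascii=False keeps words such as "rückgängig" readable in the logs.
    return json.dumps(words, ensure_ascii=False)


def transcribe(wav_path, lang, words):
    """Run a grammar-restricted recognizer over a WAV file and collect all full results."""
    # Imported here so that grammar_json() can be used without vosk installed.
    from vosk import Model, KaldiRecognizer

    results = []
    with wave.open(wav_path, "rb") as wf:
        model = Model(lang=lang)  # fetches a small model for this language if not cached
        rec = KaldiRecognizer(model, wf.getframerate(), grammar_json(words))
        rec.SetWords(True)  # include per-word "conf", "start" and "end" in results
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            # AcceptWaveform returns True when the recognizer has closed an utterance.
            if rec.AcceptWaveform(data):
                results.append(json.loads(rec.Result()))
        results.append(json.loads(rec.FinalResult()))
    return results
```

It could be called as `transcribe("de_short_fast.wav", "de", ["gross", "klein", "buchstaben", "ziffern", "[unk]"])`, which returns one dictionary per full result.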
The four non-empty results for the English audio all contain '[unk]', as expected and desired:
$ python3 test_grammar_en.py en_short_fast.wav
LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /home/omlins/.cache/vosk/vosk-model-small-en-us-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /home/omlins/.cache/vosk/vosk-model-small-en-us-0.15/graph/HCLr.fst /home/omlins/.cache/vosk/vosk-model-small-en-us-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /home/omlins/.cache/vosk/vosk-model-small-en-us-0.15/graph/phones/word_boundary.int
LOG (VoskAPI:Recognizer():recognizer.cc:63) ["undo", "redo", "uppercase", "lowercase", "letters", "digits", "point", "comma", "colon", "semicolon", "exclammation", "interrogation", "paragrafh", "language", "[unk]"]
WARNING (VoskAPI:Recognizer():recognizer.cc:84) Ignoring word missing in vocabulary: 'exclammation'
WARNING (VoskAPI:Recognizer():recognizer.cc:84) Ignoring word missing in vocabulary: 'paragrafh'
LOG (VoskAPI:Estimate():language_model.cc:142) Estimating language model with ngram-order=2, discount=0.5
LOG (VoskAPI:OutputToFst():language_model.cc:209) Created language model with 14 states and 26 arcs.
{
"text" : ""
}
{
"result" : [{
"conf" : 0.940491,
"end" : 5.570606,
"start" : 5.190000,
"word" : "letters"
}, {
"conf" : 0.645985,
"end" : 6.007534,
"start" : 5.670000,
"word" : "digits"
}, {
"conf" : 1.000000,
"end" : 6.720000,
"start" : 6.007534,
"word" : "[unk]"
}],
"text" : "letters digits [unk]"
}
{
"result" : [{
"conf" : 1.000000,
"end" : 7.620000,
"start" : 6.720000,
"word" : "[unk]"
}, {
"conf" : 1.000000,
"end" : 8.040000,
"start" : 7.620000,
"word" : "redo"
}],
"text" : "[unk] redo"
}
{
"result" : [{
"conf" : 1.000000,
"end" : 12.750000,
"start" : 10.500000,
"word" : "[unk]"
}],
"text" : "[unk]"
}
{
"result" : [{
"conf" : 1.000000,
"end" : 17.280000,
"start" : 15.450000,
"word" : "[unk]"
}],
"text" : "[unk]"
}
{
"text" : ""
}
The results for the German audio, however, do not contain '[unk]', with the exception of the last result. This is unexpected and undesired, as the audio is constructed completely analogously to the English audio and should therefore lead to an analogous result (as mentioned above, the German audio also contains only four words that are part of the grammar, "buchstaben", "ziffern", "gross" and "klein", and they are all located in the first sentence). Here are the results for the German audio:
$ python3 test_grammar_de.py de_short_fast.wav
LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /home/omlins/.cache/vosk/vosk-model-small-de-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /home/omlins/.cache/vosk/vosk-model-small-de-0.15/graph/HCLr.fst /home/omlins/.cache/vosk/vosk-model-small-de-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /home/omlins/.cache/vosk/vosk-model-small-de-0.15/graph/phones/word_boundary.int
LOG (VoskAPI:Recognizer():recognizer.cc:63) ["rückgängig", "wiederholen", "gross", "klein", "buchstaben", "ziffern", "punkt", "komma", "doppelpunkt", "strichpunkt", "ausrufezeichen", "fragezeichen", "paragraf", "sprache", "[unk]"]
LOG (VoskAPI:Estimate():language_model.cc:142) Estimating language model with ngram-order=2, discount=0.5
LOG (VoskAPI:OutputToFst():language_model.cc:209) Created language model with 16 states and 30 arcs.
{
"result" : [{
"conf" : 1.000000,
"end" : 5.250000,
"start" : 4.770000,
"word" : "buchstaben"
}, {
"conf" : 1.000000,
"end" : 5.610000,
"start" : 5.310000,
"word" : "ziffern"
}, {
"conf" : 0.921601,
"end" : 6.360000,
"start" : 6.060000,
"word" : "komma"
}, {
"conf" : 1.000000,
"end" : 6.750000,
"start" : 6.420000,
"word" : "gross"
}, {
"conf" : 1.000000,
"end" : 7.080000,
"start" : 6.840000,
"word" : "gross"
}, {
"conf" : 1.000000,
"end" : 7.560000,
"start" : 7.170000,
"word" : "klein"
}],
"text" : "buchstaben ziffern komma gross gross klein"
}
{
"result" : [{
"conf" : 1.000000,
"end" : 12.270000,
"start" : 12.000000,
"word" : "ziffern"
}, {
"conf" : 0.601431,
"end" : 13.080000,
"start" : 12.570000,
"word" : "sprache"
}],
"text" : "ziffern sprache"
}
{
"result" : [{
"conf" : 1.000000,
"end" : 14.400000,
"start" : 13.890000,
"word" : "rückgängig"
}],
"text" : "rückgängig"
}
{
"result" : [{
"conf" : 0.682580,
"end" : 19.320000,
"start" : 19.140000,
"word" : "[unk]"
}, {
"conf" : 1.000000,
"end" : 20.160000,
"start" : 19.740000,
"word" : "wiederholen"
}],
"text" : "[unk] wiederholen"
}
{
"text" : ""
}
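To make the mismatch easier to see, here is a small sketch that lists every recognized word other than '[unk]' together with its start time and confidence. The word entries are copied from the second to fourth German results above; the helper name `grammar_hits` is made up for illustration, and the sketch assumes that the first result covers the whole first sentence, so that every hit listed here is a false positive.

```python
def grammar_hits(results, unknown="[unk]"):
    """Return (word, start, conf) for every recognized word other than `unknown`."""
    hits = []
    for res in results:
        # Results without speech carry only an empty "text" and no "result" list.
        for entry in res.get("result", []):
            if entry["word"] != unknown:
                hits.append((entry["word"], entry["start"], entry["conf"]))
    return hits


# Second to fourth full results of the German run, copied from the output above.
german_later = [
    {"result": [{"conf": 1.0, "end": 12.27, "start": 12.0, "word": "ziffern"},
                {"conf": 0.601431, "end": 13.08, "start": 12.57, "word": "sprache"}],
     "text": "ziffern sprache"},
    {"result": [{"conf": 1.0, "end": 14.4, "start": 13.89, "word": "rückgängig"}],
     "text": "rückgängig"},
    {"result": [{"conf": 0.68258, "end": 19.32, "start": 19.14, "word": "[unk]"},
                {"conf": 1.0, "end": 20.16, "start": 19.74, "word": "wiederholen"}],
     "text": "[unk] wiederholen"},
]

for word, start, conf in grammar_hits(german_later):
    print(f"{start:6.2f}s  conf={conf:.2f}  {word}")
```

Three of the four hits carry a confidence of 1.0, so simply thresholding on "conf" would not filter them out.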
Could you please answer the following questions:
PS: note that this does not seem to be an isolated problem with German, but one that also affects other languages. I have had a similar experience with Spanish.
@nshmyrev: any insights on that?
@nshmyrev : to avoid confusion: the fact that I merged the multi-lang PR (see above) does not mean that the issue is solved. In fact, I had to deactivate support for German and Spanish (and leave only English and French).
@omlins thanks for the information. For now it is a complex question without an easy solution. We are looking into similar problems, but it will take time.
In general we recommend building bigger grammars / language models rather than relying on [unk]. See also https://github.com/alphacep/vosk-api/issues/319#issuecomment-1192207050
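One possible reading of this recommendation, sketched here as an assumption rather than a confirmed fix: add ordinary non-command words to the grammar so that off-topic speech has something other than a command to match, then map everything that is not a command back to '[unk]' after recognition. The filler words are taken from the spoken sentence quoted in this thread; `build_grammar` and `only_commands` are hypothetical helper names, and any word missing from the model vocabulary would be ignored by Vosk, as the warnings in the English log show.

```python
import json

# Command words from the German grammar used in this thread.
COMMANDS = ["gross", "klein", "buchstaben", "ziffern"]
# Illustrative filler words (assumption); they must exist in the model vocabulary.
FILLERS = ["was", "möchtest", "du", "als", "tun"]


def build_grammar(commands, fillers, unknown="[unk]"):
    """Build the grammar JSON string from commands, fillers and the unknown token."""
    # dict.fromkeys removes duplicates while keeping the original order.
    words = list(dict.fromkeys([*commands, *fillers, unknown]))
    return json.dumps(words, ensure_ascii=False)


def only_commands(text, commands, unknown="[unk]"):
    """Replace every recognized word that is not a command with `unknown`."""
    allowed = set(commands)
    return [word if word in allowed else unknown for word in text.split()]


grammar = build_grammar(COMMANDS, FILLERS)
print(grammar)
print(only_commands("was gross tun", COMMANDS))
```

The string returned by `build_grammar` would be passed as the grammar argument when creating the recognizer, and `only_commands` would be applied to the "text" field of each full result. Whether this reduces the false positives with the small German model is untested here.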
@omlins please share if you have any news on your presentation, we'd be happy to check too
@nshmyrev : you can find the JuliaCon 2022 presentation here: https://www.youtube.com/watch?v=W7oQb7pLc04 The abstract is found here: https://pretalx.com/juliacon-2022/talk/H3N8UN/
Furthermore, JustSayIt is now in the process of being registered in the Julia package registry (a temporary note concerning this is found on the GitHub repo: https://github.com/omlins/JustSayIt.jl ). Immediately after that, the first release will be done.
Finally, JustSayIt has now a documentation webpage: https://omlins.github.io/JustSayIt.jl
Thanks for your interest and I am looking forward to your feedback (please don't hesitate to reach out to me by private e-mail)!
Thank you, amazing work! I'm impressed with language switch.
With the small English model, the recognizers with grammar behave as expected: [unk] is recognized if a sentence is spoken that is clearly something different from what is specified in the grammar. However, with the small German and Spanish models [1], the recognizers hardly ever return [unk] when a sentence is spoken, even if it clearly has nothing to do with the specified grammar. It looks to me like the problem is that the recognizer immediately gives back some result without analyzing the whole word group (even though there are no silences between the words).
For example, I said "was möchtest du als nachtes tun" and analyzed it with a recognizer with the following grammar: ["terminus", "punkt", "klein", "ausruf", "frage", "doppelpunkt", "zurück", "sprache", "komma", "vor", "gross", "paragraf", "buchstaben", "ziffern", "strichpunkt", "[unk]"]. The recognizer switched the partial result three times (from "buchstaben" to "gross" to "strichpunkt") and then gave back the result "gross", after having processed only the beginning of the word group (probably about 20%), despite the absence of silence between the words of the spoken sentence.
@nshmyrev , how can one avoid that the recognizer gives back results without analyzing the whole word group?
Thanks!!
[1] "vosk-model-small-de-0.15" and "vosk-model-small-es-0.22"