Open johngebbie opened 1 year ago
I have implemented this the best I can with what's there and the result is okay but not ideal. It quite often has to abort because the results include broken up phrases. For example, I have the phrases "down" and "function ten", and I say "down" and it often gives "ten" (and so my program does nothing).
I will try to work around this by building multiword phrases from words that already appear in other phrases, e.g. "function one zero" instead of "function ten": "one" and "zero" already exist, so it adds no new vocabulary for the recognizer to wrongly return. It's a shame, though.
Said differently, it would be good to be able to constrain the results to phrases rather than words, as broken-up phrases pollute the results.
Having thought about it more, grouping the results, though maybe convenient, is already something you can do yourself, and it would narrow the use cases.
What is missing is the ability to constrain the results to phrases with multiple words.
This could look like `SetKeyphrases(1)`, or, probably better, there could be a function to set the suggestion strength, like `SetSuggestionStrength(0.75)`, and you could then use `SetSuggestionStrength(1.0)`.
Hi @nshmyrev, what do you think of a `SetSuggestionStrength(0.123)` function?

I wanted to make a pull request for it, but I think the code is beyond me. (I think it would involve `Recognizer::UpdateGrammarFst` and the `LanguageModelEstimator`?)
I think it is more a question of a more accurate acoustic model. Ideally the model should recognize such things very reliably without the need to set probabilities, much as prompting works for Whisper.
As for constraining to phrases, you'd better get the n-best list and analyze each output with a regex, as explained before:
https://github.com/alphacep/vosk-api/issues/55#issuecomment-604104567
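For illustration, a minimal sketch of that approach with the Python binding (the phrase list and model path are placeholders, not from this thread):

```python
import json
from vosk import Model, KaldiRecognizer

# Hypothetical command set; replace with the phrases from your config.
PHRASES = {"down", "function ten"}

model = Model("model")
rec = KaldiRecognizer(model, 16000)
rec.SetMaxAlternatives(10)  # return an n-best list instead of a single result

def best_command(result_json):
    """Return the first n-best alternative that is exactly a known phrase."""
    for alt in json.loads(result_json).get("alternatives", []):
        text = alt.get("text", "").strip()
        if text in PHRASES:
            return text
    return None  # nothing matched a whole phrase; ignore the utterance
```

Calling `best_command(rec.Result())` then only fires when some alternative is exactly a known phrase.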
I didn't mean to tweak each phrase's probability, just how unbreakable all the phrase hints in the restricted grammar should be, so with `SetSuggestionStrength(1.0)` they would be like big words that are only returned as a whole. I thought that would be a more general-purpose way to allow constraining to phrases, but that was probably over-optimistic.
Using `SetMaxAlternatives` is a good shout though, thank you.
(The phrases come from user config files, so I'd rather keep it simple and not adapt models if possible. And I had a quick search, but I'm not familiar with how Whisper's prompting works.)
I had a similar program and solved it by adding `[unk]` and text post-processing.
Add `[unk]` to the keyphrases. It will match anything that is not in the keyphrases and greatly reduce false triggers:

`["function three", "three two one go", "go forth", "[unk]"]`
This API example shows the importance of `[unk]`: https://github.com/alphacep/vosk-api/blob/master/src/vosk_api.h#L134
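A rough sketch of passing that grammar in the Python binding (model path and sample rate are placeholders):

```python
import json
from vosk import Model, KaldiRecognizer

grammar = ["function three", "three two one go", "go forth", "[unk]"]

model = Model("model")
# The third argument restricts recognition to the listed phrases plus [unk].
rec = KaldiRecognizer(model, 16000, json.dumps(grammar))
rec.SetWords(True)  # word-level output makes the post-processing below easier
```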
Write a text post-processing program that triggers the corresponding operation when it detects that a phrase's words appear consecutively in the recognition result (and are not separated by `[unk]`).
To trigger actions as quickly as possible, the program performs keyword detection on partial results, but to avoid false triggers, the last word of a partial result is ignored (because the last word might still be a guess).

If a keyword is detected in a partial result and triggers a command, the position in the text is recorded; after that, only the text past this position is checked, to avoid repeated triggering.
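A minimal sketch of this logic, assuming the partial text has already been pulled out of the `PartialResult()` JSON (the phrase list is illustrative and the structure is just one way to do it, not the actual code):

```python
PHRASES = ["function three", "three two one go", "go forth"]  # illustrative

consumed = 0  # words already handled in the current utterance

def reset():
    """Call after each final result, when a new utterance starts."""
    global consumed
    consumed = 0

def check_partial(partial_text, on_command):
    """Fire on_command(phrase) when a phrase's words appear consecutively."""
    global consumed
    words = partial_text.split()
    stable = words[:-1]          # the last word may still change, so ignore it
    pending = stable[consumed:]  # only look past the last triggered position
    for phrase in PHRASES:
        pwords = phrase.split()
        for i in range(len(pending) - len(pwords) + 1):
            # An interleaved [unk] breaks the consecutive match automatically.
            if pending[i:i + len(pwords)] == pwords:
                on_command(phrase)
                consumed += i + len(pwords)  # avoid re-triggering on this span
                return
```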
> I have implemented this the best I can with what's there and the result is okay but not ideal. It quite often has to abort because the results include broken up phrases. For example, I have the phrases "down" and "function ten", and I say "down" and it often gives "ten" (and so my program does nothing).
According to my observation, "no match" is not allowed if `[unk]` is not in the grammar. That is, any non-silent portion of the audio stream must match some word in the grammar, even if it is unrelated speech or noise.

If you add `[unk]` to the grammar, the mismatch problem will likely improve, and you may get something like `[unk] down`, indicating extraneous noise before "down".
EDIT: I now propose a more general `SetSuggestionStrength(0.123)` function instead; see my later posts.

Hello, my program lets people use a computer with their voice. You type keys with syllables like "air", "bat" and "cap", and can transcribe a sentence after saying "scribe". I'd like to support phrases with multiple words, like "uppercase scribe" and "please caps lock".
Currently, if you specify a grammar like this:
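(An illustrative example only; the real phrase list comes from user config:)

```python
import json
from vosk import Model, KaldiRecognizer

# Hypothetical grammar; note there is deliberately no "[unk]" entry here.
grammar = ["function one", "go", "uppercase scribe", "please caps lock"]

model = Model("model")
rec = KaldiRecognizer(model, 16000, json.dumps(grammar))
```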
the groupings are only taken as hints and you can still get results like "one function go", as you say here: https://github.com/alphacep/vosk-api/issues/484
It would be great if there was something like `SetKeyphrases(1)` that would constrain the results and return them as words with `SetWords(1)`.

I have no clue how hard this would be, but I think it's a common use case. If you don't think this will happen, I'll try recombining the results after the fact, but I'd have to just bail if the result were invalid, like "one function three", whereas I'd prefer the best valid match, as it is currently when you don't put "[unk]" in the grammar.
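To illustrate the kind of fallback I mean, a rough sketch (greedy longest-phrase matching; the phrase list is illustrative, not my real config):

```python
PHRASES = ["down", "function one", "go", "uppercase scribe", "please caps lock"]

def recombine(text):
    """Greedily regroup recognized words into known phrases.

    Returns the phrase list, or None for an invalid grouping like
    "one function three", in which case the program just bails."""
    words = text.split()
    out, i = [], 0
    by_length = sorted(PHRASES, key=lambda p: -len(p.split()))
    while i < len(words):
        for phrase in by_length:
            pwords = phrase.split()
            if words[i:i + len(pwords)] == pwords:
                out.append(phrase)
                i += len(pwords)
                break
        else:
            return None  # bail: result can't be grouped into whole phrases
    return out
```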