alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0
8.06k stars 1.11k forks

Feature request: Keyphrases with multiple words #1231

Open johngebbie opened 1 year ago

johngebbie commented 1 year ago

EDIT: I now propose a more general SetSuggestionStrength(0.123) function instead, see my later posts.

Hello, my program lets people use a computer with their voice. You type keys with syllables like "air", "bat" and "cap", and can transcribe a sentence after saying "scribe". I'd like to support phrases with multiple words like "uppercase scribe" and "please caps lock".

Currently, if you specify a grammar like this:

["function three", "three two one go", "go forth"]

the groupings are only taken as hints and you can still get results like "one function go", as you say here: https://github.com/alphacep/vosk-api/issues/484

It would be great if there was something like SetKeyphrases(1) that would constrain the results and return them as words with SetWords(1):

{
  "result" : [{
      "conf" : 1.000000,
      "end" : 0.810000,
      "start" : 0.120000,
      "word" : "go forth"
    }, {
      "conf" : 1.000000,
      "end" : 1.260000,
      "start" : 0.840000,
      "word" : "function three"
    }],
  "text" : "go forth function three"
}

I have no clue how hard this would be, but I think it's a common use case. If you don't think this will happen, I'll try recombining the results after the fact, but I'd have to just bail if the result were invalid, like "one function three", whereas I'd prefer the best valid match, like how it is currently when you don't put "[unk]" in the grammar.
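For reference, the after-the-fact recombination I mean could look something like this. This is only a rough sketch: the function name, the greedy longest-phrase-first strategy, and the phrase list are mine, not anything vosk provides, and it assumes a SetWords(1)-style result JSON as shown above.

```python
# Hypothetical fallback: regroup the flat word list from a SetWords(1)
# result into the multi-word phrases of the grammar, bailing out (None)
# if the recognized words cannot be segmented into whole phrases.
import json

PHRASES = ["function three", "three two one go", "go forth"]

def regroup(result_json, phrases):
    """Greedily segment recognized words into known phrases.

    Returns the list of matched phrases, or None when the word sequence
    cannot be split into whole phrases (e.g. "one function three").
    """
    words = [w["word"] for w in json.loads(result_json).get("result", [])]
    split_phrases = [p.split() for p in phrases]
    out = []
    i = 0
    while i < len(words):
        # Prefer the longest phrase that matches at this position.
        for p in sorted(split_phrases, key=len, reverse=True):
            if words[i:i + len(p)] == p:
                out.append(" ".join(p))
                i += len(p)
                break
        else:
            return None  # broken-up phrase: bail
    return out
```

(Greedy longest-match can reject a sequence that a smarter segmentation would accept, but for short command grammars it is probably close enough.)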

johngebbie commented 1 year ago

I have implemented this the best I can with what's there and the result is okay but not ideal. It quite often has to abort because the results include broken up phrases. For example, I have the phrases "down" and "function ten", and I say "down" and it often gives "ten" (and so my program does nothing).

I will try to work around this by making multiword phrases from the same words as other phrases, like "function one zero" instead of "function ten" because there's already "one" and "zero" and that won't add any new vocab for it to wrongly return, but it's a shame.

Said differently, it would be good to be able to constrain the results to phrases rather than words as broken up phrases pollute the results.

johngebbie commented 1 year ago

Having thought about it more, grouping the results, though maybe convenient, is something you can already do yourself, and it narrows the use.

What is missing is the ability to constrain the results to phrases with multiple words. This could look like SetKeyphrases(1), or probably better, there could be a function to set the suggestion strength like SetSuggestionStrength(0.75) and you could then use SetSuggestionStrength(1.0).

johngebbie commented 1 year ago

Hi @nshmyrev, what do you think of a SetSuggestionStrength(0.123) function? I wanted to make a pull request of it but I think the code is beyond me. (I think it would involve Recognizer::UpdateGrammarFst and the LanguageModelEstimator?)

nshmyrev commented 1 year ago

I think it is more a question of a more accurate acoustic model. Ideally the model should recognize such things very reliably without the need to set probabilities, like how prompting works for Whisper.

nshmyrev commented 1 year ago

As for constraining to phrases, you'd better get nbest and analyze each output with regex. As explained before:

https://github.com/alphacep/vosk-api/issues/55#issuecomment-604104567
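A rough sketch of that post-processing (the function name and phrase list are illustrative, not part of vosk): after `rec.SetMaxAlternatives(n)`, results come back as an `"alternatives"` array, and you keep the first hypothesis whose full text is a sequence of whole allowed phrases.

```python
# Pick the best n-best alternative that consists only of whole keyphrases.
# Assumes the JSON shape produced by vosk after SetMaxAlternatives(n):
#   {"alternatives": [{"confidence": ..., "text": ...}, ...]}
import json
import re

PHRASES = ["function three", "three two one go", "go forth", "down"]

# A valid result is one or more whole phrases separated by single spaces.
_alts = "|".join(re.escape(p) for p in PHRASES)
PHRASE_RE = re.compile(rf"^(?:{_alts})(?: (?:{_alts}))*$")

def best_valid(nbest_json):
    """Return the first alternative matching the phrase grammar, else None."""
    for alt in json.loads(nbest_json).get("alternatives", []):
        text = alt.get("text", "").strip()
        if PHRASE_RE.fullmatch(text):
            return text
    return None
```

So even when the top hypothesis is a broken-up phrase like "ten", a lower-ranked but valid alternative such as "down" can still be used.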

johngebbie commented 1 year ago

I didn't mean to tweak each phrase's probability but just to tweak how unbreakable all the phrase hints in the restricted grammar should be, so with SetSuggestionStrength(1.0) they would be like big words that are only returned as a whole. I thought that would be a more general purpose way to allow constraining to phrases, but probably over optimistic.

Using SetMaxAlternatives is a good shout though, thank you.

(The phrases come from user config files so I'd rather keep it simple and not adapt models if possible. And I had a quick search, but I'm not familiar with how Whisper's prompting works.)

SwimmingTiger commented 1 year ago

I had a similar program and I solved it with adding [unk] and text postprocessing.

  1. Add [unk] to keyphrases. It will match anything that is not in the keyphrase list and greatly reduce false triggers.

    ["function three", "three two one go", "go forth", "[unk]"]

    This API example shows the importance of [unk]: https://github.com/alphacep/vosk-api/blob/master/src/vosk_api.h#L134

  2. Write a text post-processing program that triggers the corresponding operation when it detects that phrase words appear consecutively in the recognition result (and are not separated by [unk]).

  3. In order to trigger actions as quickly as possible, the program will perform keyword detection in partial results. But to avoid false triggers, the last word of partial results is ignored (because the last word might be a guess).

  4. If a keyword is detected in a partial result and triggers a command, the position in the text will be recorded. After that, only the text after this position is detected to avoid repeated triggering.
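A condensed sketch of steps 2 to 4 (the class and method names are illustrative, not my actual program): feed each partial result in, drop the trailing word since it may still change, treat [unk] as a separator, and remember how far we have already consumed so a phrase only fires once.

```python
# Keyphrase triggering over vosk partial results, per steps 2-4 above.
# Assumes partial-result JSON of the shape {"partial": "..."}.
import json

class PhraseTrigger:
    def __init__(self, phrases):
        self.phrases = [p.split() for p in phrases]
        self.consumed = 0  # words already handled in this utterance

    def reset(self):
        """Call on each final result, when a new utterance starts."""
        self.consumed = 0

    def feed_partial(self, partial_json):
        words = json.loads(partial_json).get("partial", "").split()
        if words:
            words = words[:-1]        # step 3: the last word may be a guess
        fired = []
        i = self.consumed
        while i < len(words):
            if words[i] == "[unk]":   # step 2: [unk] separates phrases
                i += 1
                continue
            for p in sorted(self.phrases, key=len, reverse=True):
                if words[i:i + len(p)] == p:
                    fired.append(" ".join(p))
                    i += len(p)
                    break
            else:
                break  # incomplete phrase so far: wait for more audio
        self.consumed = i             # step 4: avoid repeated triggering
        return fired
```

Each call returns only the newly completed phrases, so the same partial text never triggers a command twice.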

SwimmingTiger commented 1 year ago

> I have implemented this the best I can with what's there and the result is okay but not ideal. It quite often has to abort because the results include broken up phrases. For example, I have the phrases "down" and "function ten", and I say "down" and it often gives "ten" (and so my program does nothing).

According to my observation, "no match" is not allowed if [unk] is not in the grammar. That is, any non-silent portion of the audio stream must match a word, even if it is unrelated dialogue or noise.

If you add [unk] to the grammar, the mismatch problem will likely improve, and you may get something like this: [unk] down, indicating extraneous noise before down.