alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

[.net] Keyword spotting issue #484

Closed pauleffect90 closed 3 years ago

pauleffect90 commented 3 years ago

Regards,

First of all, great job! I'm really impressed by the accuracy of your engine.

I've applied my somewhat modest googling skills and come up with this solution for live keyword spotting in .NET 5:

var rec = new VoskRecognizer(model, 16000.0f, "[\"rate one two three four five hello world\", \"[unk]\"]");

I'm using vosk-model-en-us-daanzu-20200905-lgraph (from Kaldi-active-grammar project with configurable graph).

Using this I can get, for example, "rate five". But I can also get "hello five". If my targets are "rate one" through "rate five" and "hello world", but never "hello world one", how would I go about defining a multi-word keyword?

nshmyrev commented 3 years ago

var rec = new VoskRecognizer(model, 16000.0f, "[\"rate one\", \"rate two\", \"rate three\", \"hello world\", \"[unk]\"]");

pauleffect90 commented 3 years ago

I had already tried that exact solution. It outputs "rate", for example. One other thing:

var rec = new VoskRecognizer(model, 16000.0f, "[\"rate one\", \"rate two\", \"three\", \"hello world\", \"[unk]\"]");

This can output "rate three". How in the seven planes of Oblivion (Morrowind was better) is that possible, since "rate three" is not a registered keyword?

Any suggestions?

nshmyrev commented 3 years ago

The phrases you specify are not keywords, they are hints. If the user says "rate three", it will report "rate three". It reports what the user said.

If you need to check for "rate three", compare the recognizer's result with the required string.
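
A minimal sketch of that comparison in C#, assuming the recognizer's result is a JSON object with a `text` field. `CommandFilter`, `AllowedCommands` and `Match` are hypothetical names used only for illustration, and `System.Text.Json` is used just to keep the example free of extra packages:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class CommandFilter
{
    // Hypothetical whitelist: only these exact phrases count as commands.
    private static readonly HashSet<string> AllowedCommands =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "rate one", "rate two", "hello world"
        };

    // Takes the JSON string returned by the recognizer,
    // e.g. {"text": "rate three"}, and returns the recognized phrase
    // if it is an allowed command, or null otherwise.
    public static string Match(string resultJson)
    {
        using var doc = JsonDocument.Parse(resultJson);
        if (!doc.RootElement.TryGetProperty("text", out var textElement))
            return null;

        var text = (textElement.GetString() ?? "").Trim();
        return AllowedCommands.Contains(text) ? text : null;
    }
}
```

With this whitelist, a result whose text is "rate one" is returned as-is, while "rate three" yields null even though the recognizer reported it.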

pauleffect90 commented 3 years ago

I understand. I had a hunch, but I figured it was worth a shot asking. I'm going to keep this issue open for one day, two tops. If I come up with a viable solution in the meantime, I'll post it here as a closing comment. Thank you for your time, Mr. Nickolay.

pauleffect90 commented 3 years ago

Ok. Let's assume one needs to execute certain short commands. We'll take "rate one", "rate two", "rate three", "play" and "full screen" (for lack of "fullscreen" in the model). My best solution so far, in (more or less) pseudocode, is:

1. Build a grammar string & a Choices list, dictionary etc

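    // Grammar is a string of space-separated words.
    // ChoicesDictionary is a Dictionary<string, Choice>; Choice is, for now, basically a class with only one property, Text.
    // The code is cr**, but I'm sure you'll get the point.
    public static void AddChoice(string choice)
    {
        var key = choice.ToLower();
        foreach (var word in key.Split(' ', StringSplitOptions.RemoveEmptyEntries))
        {
            // Compare whole words: a plain Grammar.Contains(word) would also match
            // substrings, e.g. it would skip "one" if Grammar already held "someone".
            if (Array.IndexOf(Grammar.Split(' '), word) < 0) Grammar += " " + word;
        }
        // The indexer overwrites an existing entry; Add() would throw on a duplicate choice.
        ChoicesDictionary[key] = new Choice(choice);
    }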

So basically, when we feed this "rate one", it:

  1. checks if "rate" is in Grammar; if not, adds it.
  2. checks if "one" is in Grammar; if not, adds it.
  3. adds a new entry to the dictionary, with "rate one" as the key and a Choice as the value.

For "rate two", it:

  1. checks "rate", but skips it because Grammar already contains it.
  2. checks "two" and adds it to Grammar.

"rate three" works the same way as "rate two".
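
The steps above can be sketched as a small self-contained class. This is only an illustration under assumptions: `CommandGrammar`, `Words`, `Choices` and `ToGrammarJson` are hypothetical names, and the JSON shape (one entry with all the words, plus "[unk]") simply mirrors the grammar string from the first post:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public sealed class CommandGrammar
{
    // Unique words in insertion order; the HashSet makes the
    // "already added?" check an exact whole-word comparison.
    private readonly List<string> _words = new List<string>();
    private readonly HashSet<string> _seen = new HashSet<string>();

    // Full phrases that count as valid commands (lowercased).
    public HashSet<string> Choices { get; } = new HashSet<string>();

    public void AddChoice(string choice)
    {
        var words = choice.ToLowerInvariant()
                          .Split(' ', StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            // HashSet.Add returns false when the word is already present.
            if (_seen.Add(word)) _words.Add(word);
        }
        Choices.Add(string.Join(" ", words));
    }

    // Space-separated list of all unique words added so far.
    public string Words => string.Join(" ", _words);

    // JSON array with one entry holding all words, plus "[unk]".
    public string ToGrammarJson() =>
        JsonSerializer.Serialize(new[] { Words, "[unk]" });
}
```

After adding "rate one", "rate two" and "rate three", `Words` is "rate one two three", and the string from `ToGrammarJson()` could be passed as the third argument of the `VoskRecognizer` constructor shown earlier.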

2. Create a timer with a timeout value of say, 1000 ms.

[...]
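                    // AcceptWaveform returns false while the utterance is still in progress,
                    // so only a partial result is available at this point.
                    if (!rec.AcceptWaveform(frameBuffer, length))
                    {
                        var partialResult = JsonConvert.DeserializeObject<PartialResult>(rec.PartialResult());
                        if (!string.IsNullOrEmpty(partialResult.partial))
                        {
                            // Something was recognized: force a final result right away
                            // instead of waiting for the end of the utterance.
                            var finalResult = JsonConvert.DeserializeObject<FinalResult>(rec.FinalResult());
                            if (ChoicesTimer.Enabled)
                            {
                                // TIMER RUNNING: extend the current candidate
                                Candidate += " " + finalResult.text;
                                Console.WriteLine("CANDIDATE APPENDED: CANDIDATE = " + Candidate);
                            }
                            else
                            {
                                // TIMER NOT RUNNING: start a new candidate
                                Candidate = finalResult.text;
                                ChoicesTimer.Start();
                                Console.WriteLine("STARTED TIMER WITH CANDIDATE = " + Candidate);
                            }
                        }
                    }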
[...]

FinalResult and PartialResult here are simple classes generated with https://json2csharp.com/, which takes a JSON string and outputs a matching C# class.
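
For reference, a sketch of what those two classes might look like, assuming only the two fields the code above actually reads (`partial` and `text`). The generated classes may contain more. `ResultParser` is a hypothetical helper, and `System.Text.Json` stands in here for the `JsonConvert` call used in the snippet above:

```csharp
using System.Text.Json;

// Property names are lowercase on purpose: they match the JSON keys
// ("partial" and "text") that the code above reads.
public class PartialResult
{
    public string partial { get; set; }
}

public class FinalResult
{
    public string text { get; set; }
}

public static class ResultParser
{
    // Unknown JSON properties are ignored by default.
    public static PartialResult ParsePartial(string json) =>
        JsonSerializer.Deserialize<PartialResult>(json);

    public static FinalResult ParseFinal(string json) =>
        JsonSerializer.Deserialize<FinalResult>(json);
}
```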

Now, when we check the partial result, if partial is not empty (i.e. it contains actual recognized text), we fetch the final result. finalResult.text will then contain a full or a partial Choice. If the timer isn't running, we're starting a new Choice: assign finalResult.text to Candidate and start the timer. If the timer is running, append finalResult.text to Candidate. When the timer elapses, check whether the dictionary contains Candidate as a key. If it does, you've successfully recognized a multi-word command.

The timer is needed because you could have three commands, "rate", "five" and "rate five", that do different things. After recognizing "rate", the timer helps determine whether it's a standalone keyword or whether more words are coming after it.
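
The candidate logic described here can be sketched separately from the audio loop and the timer. This is a rough illustration, not working recognizer code: `CandidateTracker`, `Append` and `Resolve` are hypothetical names, and the class assumes the caller starts the timer and calls `Resolve` when it elapses:

```csharp
using System;
using System.Collections.Generic;

public sealed class CandidateTracker
{
    private readonly HashSet<string> _choices;
    private string _candidate = "";

    public CandidateTracker(IEnumerable<string> choices)
    {
        _choices = new HashSet<string>(choices, StringComparer.OrdinalIgnoreCase);
    }

    // True while a candidate is being collected, i.e. while the timer would be running.
    public bool IsCollecting => _candidate.Length > 0;

    // Call with each recognized fragment (finalResult.text in the snippet above).
    // Returns true when this fragment starts a new candidate,
    // which is the moment the caller should start the timer.
    public bool Append(string text)
    {
        text = (text ?? "").Trim();
        if (text.Length == 0) return false;

        bool startsNewCandidate = !IsCollecting;
        _candidate = startsNewCandidate ? text : _candidate + " " + text;
        return startsNewCandidate;
    }

    // Call when the timer elapses. Clears the candidate and returns it
    // if it is a registered choice, or null otherwise.
    public string Resolve()
    {
        var candidate = _candidate;
        _candidate = "";
        return _choices.Contains(candidate) ? candidate : null;
    }
}
```

With the choices "rate", "five" and "rate five", appending "rate" and then "five" before the timeout resolves to "rate five", while "rate" alone resolves to "rate". If the timer callback runs on another thread than the audio loop, access to the tracker would need a lock.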

This is faaaaar from working code, but it's a starting point. It would be really awesome if the recognizer could be limited to single words only, since it sometimes returns "rate one" in one piece and sometimes "rate" and "one" separately, which makes the timer redundant much of the time. But with some luck and tweaking, I think it could work.

Sorry for the messy post, I'm actually @ work and there's only so much time I can spend pretending to actually be working.