alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

How to use Vosk in C# #1298

Open securigy opened 1 year ago

securigy commented 1 year ago

So far, I have tried 3 different approaches to achieve dictation the way Microsoft does it in Word with the Dictate button.

I used the following:

  1. .NET Framework classes such as SpeechRecognitionEngine from System.Speech.Recognition
  2. I used Vosk with output to the screen
  3. I used Vosk with output to audio file (*.wav)

All the examples below were recognized perfectly by Word Dictation.

#1 was pretty bad - it recognized just a few words and the rest was nonsense.

#2 I had great hopes for, but it was bad as well. Here are some results, and believe me, I had a good headset mic and I spoke very clearly and a bit slower to separate the words...

| I said | Output |
| --- | --- |
| "This is the first example" | "the for example" |
| "This is the second example" | "the descending bull" |
| "I do not understand what is going on" | "i don't understand what going on" |
| "So, what is I am doing wrong here" | "the what is in the winter" |
| "Hello there, are you alive?" | "the previous peak of the blood" |
| "Why the speech recognition is so bad?" | "the white asparagus is bad" |

As you can see, it is complete nonsense as well. Therefore, I am asking myself: "What did I do wrong?" I downloaded "vosk-model-en-us-0.22" and point to its model folder (BTW, initialization takes 15 seconds!). The code below follows the example I got here, which used the PortAudioSharp library from NuGet. Based on this result, I did not even try generating an audio file using NAudio... Can anybody tell me what I am doing wrong? (I have a slight German accent in English.) I assume that all the projects using Vosk get better quality than this... How does MS Word do it with Dictate? What are they using?

When I click on the button in my WinForms app, the following is executed:

                string voskModelFolder = Path.Combine(Program.InstallDir, @"Gpt4AudioData\Models\vosk-model-en-us-0.22");
                VoskPortAudioSharp voskPortAudio = new VoskPortAudioSharp();
                voskPortAudio.SpeechRecognizedEvent += VoskPortAudio_SpeechRecognizedEvent;
                voskPortAudio.Init(voskModelFolder);
                voskPortAudio.Start();

and the functions:

public void Init(string modelFolder)
        {
            using (Log.VerboseCall())
            {
                try
                {
                    Model model = new Model(modelFolder);
                    recognizer = new VoskRecognizer(model, 16000.0f);

                    PortAudio.LoadNativeLibrary();
                    PortAudio.Initialize();

                    oParams.device = PortAudio.DefaultInputDevice;
                    if (oParams.device == PortAudio.NoDevice)
                    {
                        string err = "No default audio input device available";
                        Log.Verbose(err);
                        throw new Exception(err);
                    }

                    oParams.channelCount = 1;
                    oParams.sampleFormat = SampleFormat.Int16;
                    oParams.hostApiSpecificStreamInfo = IntPtr.Zero;

                    var callbackData = new VoskCallbackData()
                    {
                        textResult = String.Empty
                    };

                    mStream = new PortAudioSharp.Stream(
                        oParams,
                        null,
                        16000,
                        8192,
                        StreamFlags.ClipOff,
                        playCallback,
                        callbackData
                    );
                }
                catch (Exception ex)
                {
                    Log.Verbose(ex);
                }
            }
        }

        public void Start()
        {
            using(Log.VerboseCall())
            {
                try
                {
                    mStream.Start();
                    Log.Verbose("Started...");
                }
                catch(Exception ex)
                {
                    Log.Verbose(ex);
                }
            }
        }

        public void Stop()
        {
            using(Log.VerboseCall())
            {
                try
                {
                    mStream.Stop();
                    Log.Verbose("Stopped");
                }
                catch(Exception ex)
                {
                    Log.Verbose(ex);
                }
            }
        }

        class VoskCallbackData
        {
            public String textResult { get; set; }
        }

        private StreamCallbackResult playCallback( IntPtr input, 
                                                   IntPtr output,
                                                   System.UInt32 frameCount,
                                                   ref StreamCallbackTimeInfo timeInfo,
                                                   StreamCallbackFlags statusFlags,
                                                   IntPtr dataPtr )
        {
            try
            {
                // frameCount counts frames, not bytes: with mono Int16 samples
                // each frame is 2 bytes, so the buffer must hold frameCount * 2 bytes.
                // (The original code copied only frameCount bytes, i.e. half the audio.)
                int byteCount = (int)frameCount * sizeof(short);
                byte[] buffer = new byte[byteCount];
                Marshal.Copy(input, buffer, 0, byteCount);

                // Feed the buffer to Vosk directly; no intermediate MemoryStream is needed.
                if (recognizer.AcceptWaveform(buffer, byteCount))
                {
                    string result = recognizer.Result();
                    Log.VerboseFormat("Result: {0}", result);
                    SpeechRecognizedEvent?.Invoke(result);
                }
            }
            catch (Exception ex)
            {
                Log.Error(ex);
            }

            return StreamCallbackResult.Continue;
        }

EDIT: Half an hour later, I tried option #3: Vosk with NAudio, and it not only generated the correct audio file (no wonder here, because that use case worked for me 12 years ago), but also produced the correct text that I output to the screen... Problem solved!
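For reference, the NAudio route can be sketched roughly as below. This is a minimal sketch, not the poster's actual code; it assumes the Vosk and NAudio NuGet packages, and the class and event names are my own. NAudio's `WaveInEvent` reports `BytesRecorded` as a byte count, which sidesteps the frame-vs-byte confusion that is easy to hit with a raw PortAudio callback:

```csharp
using System;
using NAudio.Wave;   // NAudio NuGet package
using Vosk;          // Vosk NuGet package

class VoskNAudioSketch
{
    public event Action<string> SpeechRecognizedEvent;

    private VoskRecognizer recognizer;
    private WaveInEvent waveIn;

    public void Start(string modelFolder)
    {
        // The en-us model expects 16 kHz, 16-bit, mono PCM.
        recognizer = new VoskRecognizer(new Model(modelFolder), 16000.0f);

        waveIn = new WaveInEvent { WaveFormat = new WaveFormat(16000, 16, 1) };
        waveIn.DataAvailable += (s, e) =>
        {
            // e.BytesRecorded is already a byte count, so it can be
            // passed to AcceptWaveform as-is.
            if (recognizer.AcceptWaveform(e.Buffer, e.BytesRecorded))
                SpeechRecognizedEvent?.Invoke(recognizer.Result());
        };
        waveIn.StartRecording();
    }

    public void Stop()
    {
        waveIn.StopRecording();
        // FinalResult() flushes whatever audio is still buffered.
        SpeechRecognizedEvent?.Invoke(recognizer.FinalResult());
    }
}
```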

Still, I am wondering why Vosk initialization takes 15 seconds... probably due to the model size of 1 GB+ (?)

nshmyrev commented 1 year ago

You can remove the rescore folder from the model to make it load and run faster. Big models are meant for more powerful servers.

securigy commented 1 year ago

I have a good machine: an AMD Ryzen 7 5800X 8-core processor with an RTX 3060 GPU... I will try your advice...

Now that I have working code, I am wondering about the rest of the vosk capabilities...

  1. Do I need to call Dispose for the Recognizer (is it mandatory?)
  2. I see a lot of functions:
     1. What is GpuInit - is it beneficial to call it, and what does it do?
     2. What is GpuThreadInit - is it beneficial to call it, and what does it do?
  3. What is Reset() for, and when should I call it?
  4. What's the difference between Result and FinalResult?
  5. What is SpkModel? When to use it and what is it for? How does this parameter look in code?
  6. What is SetWords, and what is it for?
nshmyrev commented 1 year ago

You can check API docs here:

https://github.com/alphacep/vosk-api/blob/master/src/vosk_api.h
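To illustrate how several of those calls fit together, here is a hedged sketch based on my reading of the doc comments in `vosk_api.h` and the C# binding; the model path and the `ReadNextAudioChunk` helper are placeholders, and details should be checked against the header above:

```csharp
using System;
using Vosk;

class RecognizerLifecycleSketch
{
    static void Main()
    {
        using (var model = new Model("vosk-model-en-us-0.22"))
        using (var recognizer = new VoskRecognizer(model, 16000.0f))
        {
            // SetWords(true): include per-word timing and confidence
            // entries in the JSON results.
            recognizer.SetWords(true);

            byte[] chunk = ReadNextAudioChunk();  // hypothetical helper
            if (recognizer.AcceptWaveform(chunk, chunk.Length))
                Console.WriteLine(recognizer.Result());        // finalized utterance
            else
                Console.WriteLine(recognizer.PartialResult()); // in-progress hypothesis

            // FinalResult() flushes any still-buffered audio at end of stream;
            // Reset() clears state so the same recognizer can start a fresh utterance.
            Console.WriteLine(recognizer.FinalResult());
            recognizer.Reset();
        } // the using blocks call Dispose, freeing the native model/recognizer memory
    }

    static byte[] ReadNextAudioChunk() => new byte[3200]; // placeholder: 0.1 s of silence
}
```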

securigy commented 1 year ago

I still don't understand the entire 'spk' thing. It says: /* Loads speaker model data from the file and returns the model object
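As far as I understand it, the "spk" pieces refer to a separate speaker-identification model (downloaded separately, e.g. `vosk-model-spk-0.4`): attaching it does not improve transcription, but adds a "spk" x-vector to the result JSON that can be compared across utterances to tell speakers apart. A minimal sketch, assuming the Vosk NuGet package's `SpkModel` and `SetSpkModel` bindings and a placeholder model path:

```csharp
using System;
using Vosk;

class SpeakerModelSketch
{
    static void Main()
    {
        var model = new Model("vosk-model-en-us-0.22");
        // The speaker model is a separate download, e.g. vosk-model-spk-0.4.
        var spkModel = new SpkModel("vosk-model-spk-0.4");

        var recognizer = new VoskRecognizer(model, 16000.0f);
        recognizer.SetSpkModel(spkModel);

        // After AcceptWaveform(...), Result() JSON now carries a "spk" array:
        // an x-vector fingerprint of the speaker's voice. Comparing the cosine
        // distance of two such vectors indicates whether two utterances were
        // spoken by the same person.
    }
}
```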