alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Incorrect recognition with wrong data format #1075

Open BlackHawkCH91 opened 2 years ago

BlackHawkCH91 commented 2 years ago

This is my first time using Vosk, so please bear with me. I'm using .NET Core 6 with C# 10, Vosk 0.3.38 and the "vosk-model-en-us-0.22-lgraph" model (renamed the folder to "model"). The model appears to be loading fine, with no errors or warnings showing up.

The attached audio file says: static flexible rubber soles static

However, Vosk always outputs:

{
  "partial" : ""
}
{
  "partial" : ""
}
. . .
{
  "partial" : "the"
}
{
  "partial" : "the"
}
. . .
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 4.859122,
      "start" : 0.140574,
      "word" : "the"
    }],
  "text" : "the"
}
{
  "partial" : ""
}
{
  "partial" : ""
}
. . .
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 11.854490,
      "start" : 7.685259,
      "word" : "the"
    }],
  "text" : "the"
}

The output is the same even when using different audio files. I have also tried using the "vosk-model-small-en-us-0.15" model, but the output was mostly the same.

Here is the code for speech recognition:

void SpeechRec(string modelPath, string audioPath)
{
    //Convert mp3 to wav
    using (Mp3FileReader mp3 = new Mp3FileReader(audioPath))
    {
        using (WaveStream pcm = WaveFormatConversionStream.CreatePcmStream(mp3))
        {
            WaveFileWriter.CreateWaveFile(audioPath, pcm);
        }
    }

    //Create Vosk STT
    Model model = new Model(modelPath);
    VoskRecognizer rec = new VoskRecognizer(model, 16000);
    rec.SetMaxAlternatives(0);
    rec.SetWords(true);

    using (Stream source = File.OpenRead(audioPath))
    {
        byte[] buffer = new byte[4096];
        int bytesRead;
        while ((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            if (rec.AcceptWaveform(buffer, bytesRead))
            {
                Console.WriteLine(rec.Result());
            }
            else
            {
                Console.WriteLine(rec.PartialResult());
            }
        }
    }
    Console.WriteLine(rec.FinalResult());
}

To use the audio file, change the file extension from .mp4 to .mp3.

https://user-images.githubusercontent.com/49353890/179464547-3ee3a00e-2941-4443-b2a2-14165b21354f.mp4

Vosk console/debug messages.

output.txt

nshmyrev commented 2 years ago

Your audio is stereo; you also need to convert it to mono.
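For illustration, a minimal sketch of one way to downmix, assuming the WAV data is interleaved 16-bit little-endian stereo PCM. The class and method names (`PcmDownmix`, `StereoToMono16`) are hypothetical and are not part of Vosk or NAudio; the sketch uses only the .NET base library and simply averages the left and right samples.

```csharp
using System;
using System.Buffers.Binary;

static class PcmDownmix
{
    // Hypothetical helper: averages the left and right samples of interleaved
    // 16-bit little-endian stereo PCM and returns mono PCM in the same format.
    // byteCount is the number of valid bytes in the buffer (e.g. from Stream.Read).
    public static byte[] StereoToMono16(byte[] stereo, int byteCount)
    {
        const int frameSize = 4;              // 2 channels x 2 bytes per sample
        int frames = byteCount / frameSize;   // a trailing partial frame is ignored
        byte[] mono = new byte[frames * 2];

        for (int i = 0; i < frames; i++)
        {
            int offset = i * frameSize;
            short left = BinaryPrimitives.ReadInt16LittleEndian(stereo.AsSpan(offset));
            short right = BinaryPrimitives.ReadInt16LittleEndian(stereo.AsSpan(offset + 2));

            // The sum is computed as int, so it cannot overflow before halving.
            short mixed = (short)((left + right) / 2);
            BinaryPrimitives.WriteInt16LittleEndian(mono.AsSpan(i * 2), mixed);
        }
        return mono;
    }
}
```

The helper expects raw sample data aligned to 4-byte frames, so any WAV header has to be skipped first and a read that returns a byte count that is not a multiple of 4 needs extra handling. Converting the file to mono before it reaches the recognizer (for example with an audio library or ffmpeg) avoids both concerns.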

BlackHawkCH91 commented 2 years ago

Yep that was the issue, though it seems that it's quite inaccurate, now outputting: "the works should be within the rules" instead of "flexible rubber soles". Might need to experiment with the models.

Thanks for answering!

EDIT:

Yeah, the accuracy depends on the model. I changed to a different one and it was more accurate.

nshmyrev commented 2 years ago

> Yep that was the issue, though it seems that it's quite inaccurate, now outputting: "the works should be within the rules" instead of "flexible rubber soles".

The sample rate is also wrong: the file is 22 kHz, not 16 kHz. If you fix everything, it should output "flexible rubber soles", just as the vosk-transcriber command-line utility does.
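A rough way to see why both problems matter is to compare how many bytes per second the file actually contains with how many the recognizer was told to expect. The sketch below assumes 16-bit samples and takes "22 kHz" to mean 22050 Hz, which is a guess; the names (`RateMismatch`, `StretchFactor`) are hypothetical.

```csharp
using System;

static class RateMismatch
{
    // Bytes of 16-bit PCM that make up one second of audio.
    public static int BytesPerSecond(int sampleRate, int channels) =>
        sampleRate * channels * 2;

    // How much longer the audio appears to a recognizer that was told
    // "assumedRate, mono" but is fed data at actualRate with actualChannels.
    public static double StretchFactor(int actualRate, int actualChannels, int assumedRate) =>
        (double)BytesPerSecond(actualRate, actualChannels) / BytesPerSecond(assumedRate, 1);
}
```

Under these assumptions, `StretchFactor(22050, 2, 16000)` is 88200 / 32000 = 2.75625, so the 11.85 s end time in the original output would correspond to about 4.3 s of real audio. After the mono fix alone, `StretchFactor(22050, 1, 16000)` is still about 1.38, meaning the recognizer hears the speech slowed down and lowered in pitch, which likely explains the remaining errors.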

nshmyrev commented 2 years ago

Even with a small model

BlackHawkCH91 commented 2 years ago

Yep, it's a lot more accurate. Thanks for the help. I'm now getting the sample rate of the file and then passing it to the VoskRecognizer.
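The thread does not show how the sample rate was obtained. One possible approach, sketched here with only the .NET base library, is to read the "fmt " chunk of the WAV header. `WavHeader` and `ReadFormat` are hypothetical names, and the stream is assumed to be seekable (a `FileStream` or `MemoryStream`).

```csharp
using System;
using System.IO;
using System.Text;

static class WavHeader
{
    // Hypothetical helper: reads the "fmt " chunk of a RIFF/WAVE stream and
    // returns its channel count, sample rate and bit depth.
    public static (int Channels, int SampleRate, int BitsPerSample) ReadFormat(Stream stream)
    {
        using var reader = new BinaryReader(stream, Encoding.ASCII, leaveOpen: true);

        if (ReadTag(reader) != "RIFF")
            throw new InvalidDataException("Not a RIFF file.");
        reader.ReadInt32();                          // overall RIFF size, unused here
        if (ReadTag(reader) != "WAVE")
            throw new InvalidDataException("Not a WAVE file.");

        // Walk the chunk list until "fmt " is found; running off the end of
        // the stream raises EndOfStreamException.
        while (true)
        {
            string id = ReadTag(reader);
            int size = reader.ReadInt32();
            if (id == "fmt ")
            {
                reader.ReadInt16();                  // format tag (1 = PCM)
                int channels = reader.ReadInt16();
                int sampleRate = reader.ReadInt32();
                reader.ReadInt32();                  // average bytes per second
                reader.ReadInt16();                  // block align
                int bitsPerSample = reader.ReadInt16();
                return (channels, sampleRate, bitsPerSample);
            }
            // Skip any other chunk; chunk data is padded to an even length.
            stream.Seek(size + (size & 1), SeekOrigin.Current);
        }
    }

    // Reads a four-character chunk identifier.
    private static string ReadTag(BinaryReader reader) =>
        Encoding.ASCII.GetString(reader.ReadBytes(4));
}
```

With the format in hand, the recognizer can be created as `new VoskRecognizer(model, format.SampleRate)`, and a file whose `Channels` is not 1 can be rejected or downmixed first. Since the code above already uses NAudio, the same fields should also be available from `WaveFileReader.WaveFormat` without manual parsing.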