alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Add speaker vector to nbest #1199

Open ecsplendid opened 1 year ago

ecsplendid commented 1 year ago

Hello!

I have modified the Unity sample to do speaker diarization, but I can't seem to get the diarization information coming through.

voskRecognizerCreateMarker.Begin();
if (!_recognizerReady)
{

    var spkModel = new SpkModel(_decompressedModelPath);
    _recognizer = new VoskRecognizer(_model, 16000.0f);
    _recognizer.SetSpkModel(spkModel);

    //
    _recognizer.SetMaxAlternatives(1);
    //_recognizer.SetWords(true);

    _recognizerReady = true;

    Debug.Log("Recognizer ready");
}

I am on an M1 Mac, and to get it working at all I had to take the M1 build from your Python distribution and add it to the Unity project.

I also note that your code for pulling the PCM audio from Unity doesn't seem to work; it just produces gibberish. For now I just read some PCM shorts from a test wave file instead, and that works:

// stream bytes from output.wav in StreamingAssets folder
var stream = new FileStream(Path.Combine(Application.streamingAssetsPath, "output.wav"), FileMode.Open);
// this is a 16 kHz, 16-bit PCM wave file; read the raw audio bytes
var reader = new WaveFileReader(stream);
byte[] buffer = new byte[reader.Length];
reader.Read(buffer, 0, buffer.Length);
// feed the audio data to the recognizer 4000 bytes (2000 16-bit samples) at a time
byte[] buffer4000 = new byte[4000];

for (int i = 0; i < buffer.Length; i += 4000)
{
    // guard the final chunk, which may be shorter than 4000 bytes
    int count = Math.Min(4000, buffer.Length - i);
    Array.Copy(buffer, i, buffer4000, 0, count);

    if (_recognizer.AcceptWaveform(buffer4000, count))
    {
        var result = _recognizer.FinalResult();
        //_recognizer.Reset();
        Debug.Log(result);
        _threadedResultQueue.Enqueue(result);
    }
    else
    {
        var result = _recognizer.PartialResult();
        Debug.Log(result);
        _threadedResultQueue.Enqueue(result);
    }

    await Task.Delay(100);
}

These modifications live inside the sample's ThreadedWork method:

private async Task ThreadedWork()
{
    voskRecognizerCreateMarker.Begin();
    if (!_recognizerReady)
    {
        // ... recognizer setup as shown above ...
    }

There is no documentation on this, but I think this is right:

It runs and produces an English transcription, but I don't see any speaker embeddings in the final results.

Partials are coming through like this:

{
  "partial" : "just because you can execute a program on the computer"
}

And finals are coming through like this:

{
  "alternatives" : [{
      "confidence" : 677.793457,
      "text" : "just because you can execute a program on the computer to perform the task does not mean that the system you've created understands the dos"
    }]
}

I know I am probably being stupid here, what am I doing wrong?!

Help! Thanks!

nshmyrev commented 1 year ago

Hi.

Unfortunately, the speaker vector is only returned with 1-best results (SetMaxAlternatives(0)); we still need to implement that bit for n-best:

https://github.com/alphacep/vosk-api/blob/master/src/recognizer.cc#L525
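
For reference, a minimal sketch of the 1-best configuration that should make the speaker vector appear (variable names follow the snippet above; the path assumes it points at the decompressed *speaker* model):

```csharp
// Assumes _model is the loaded acoustic model and _decompressedModelPath
// points at the decompressed speaker model, as in the snippet above.
var spkModel = new SpkModel(_decompressedModelPath);
_recognizer = new VoskRecognizer(_model, 16000.0f);
_recognizer.SetSpkModel(spkModel);
// 0 = 1-best mode (the default); n-best mode currently drops the "spk" field
_recognizer.SetMaxAlternatives(0);
```

With this, final results should carry "spk" (the x-vector) and "spk_frames" alongside "text", per the recognizer source linked above.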

> I also note that your code for pulling the PCM audio from Unity doesn't seem to work, it just produces gibberish, for now I just read in some PCM shorts from a test wave file and that works

Unity usually returns floats, not PCM; you need to process them as a float array (or convert them to shorts). We have a demo project here:

https://github.com/alphacep/vosk-unity-asr

https://github.com/alphacep/vosk-unity-asr/blob/master/Assets/Scripts/VoiceProcessor.cs#L315
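
Since Unity's audio callbacks hand you float samples in [-1, 1] and the byte API expects 16-bit little-endian PCM, a minimal conversion sketch (an illustrative helper, not code from the repo):

```csharp
// Convert Unity float samples (-1..1) to 16-bit signed little-endian PCM bytes.
static byte[] FloatsToPcm16(float[] samples)
{
    var bytes = new byte[samples.Length * 2];
    for (int i = 0; i < samples.Length; i++)
    {
        // clamp, then scale to the short range
        float clamped = Math.Max(-1f, Math.Min(1f, samples[i]));
        short s = (short)(clamped * short.MaxValue);
        bytes[2 * i] = (byte)(s & 0xff);   // low byte first (little-endian)
        bytes[2 * i + 1] = (byte)(s >> 8); // then high byte
    }
    return bytes;
}
```

The C# bindings also appear to expose a float[] overload of AcceptWaveform, which would avoid the conversion entirely.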

ecsplendid commented 1 year ago

Thanks for the tip! I will try!

I was referring to the Unity project you linked; you do convert to PCM in there. I think you were using a public library for reading the audio, the one with the coroutines and the microphone and no AudioSource. It was interesting to see, actually; it was the first time I had seen that approach. We have been using OnAudioFilterRead with various hacks to improve latency, and we resorted to writing native code for Android because the Unity audio stack is so laggy and limited, e.g. if you plug in a BT mic, you can't select an on-device mic.

nshmyrev commented 1 year ago

Yes, Unity sound sucks a lot. We used OnAudioFilterRead before too; it is hopeless.

nshmyrev commented 1 year ago

@ecsplendid

Btw, you should not use FinalResult; it is for the end of the stream:

var result = _recognizer.FinalResult();
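
A minimal sketch of the intended pattern, where Result() is read after each accepted utterance and FinalResult() is called only once, after the last chunk (names follow the snippet above):

```csharp
if (_recognizer.AcceptWaveform(buffer4000, buffer4000.Length))
{
    // utterance boundary: Result() has the final text for this segment
    _threadedResultQueue.Enqueue(_recognizer.Result());
}
else
{
    _threadedResultQueue.Enqueue(_recognizer.PartialResult());
}

// ...and only once, after the whole stream has been fed:
_threadedResultQueue.Enqueue(_recognizer.FinalResult());
```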
nshmyrev commented 1 year ago

I'll reopen to track the change