Open ecsplendid opened 1 year ago
Hi.
Speaker vector is only returned with 1-best (SetMaxAlternatives 0) results, unfortunately; we still need to implement that part for n-best results:
https://github.com/alphacep/vosk-api/blob/master/src/recognizer.cc#L525
> I also note that your code for pulling the PCM audio from Unity doesn't seem to work, it just produces gibberish, for now I just read in some PCM shorts from a test wave file and that works:
Unity usually returns floats, not PCM; you need to process them as a float array (or convert them to shorts). We have a demo project here:
https://github.com/alphacep/vosk-unity-asr
https://github.com/alphacep/vosk-unity-asr/blob/master/Assets/Scripts/VoiceProcessor.cs#L315
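The float-to-short conversion described above can be sketched as follows. This is a minimal illustration in Python (the Unity side would be C#, but the arithmetic is the same); the function name is hypothetical, not part of the Vosk or Unity API:

```python
import struct

def floats_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] (the range Unity's audio
    callbacks deliver) to little-endian 16-bit PCM bytes, which is
    what the recognizer expects."""
    pcm = bytearray()
    for s in samples:
        # Clamp to the valid range, then scale to the int16 range.
        s = max(-1.0, min(1.0, s))
        pcm += struct.pack("<h", int(s * 32767))
    return bytes(pcm)
```

Feeding raw float bytes to a recognizer that expects 16-bit PCM is exactly what produces the "gibberish" mentioned earlier, since the byte layout is completely different.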
Thanks for the tip! I will try!
I was referring to the Unity project you linked; you do convert it to PCM in there. I think you were using a public library for reading the audio, the one with the coroutines and microphone and no AudioSource. It was interesting to see, actually; it was the first time I had seen that approach. We have been using OnAudioFilterRead with various hacks to improve latency, and resorted to writing native code for Android because the Unity audio stack is so laggy and limited (e.g. if you plug in a BT mic, you can't select an on-device mic).
Yes, Unity sound sucks a lot. We used OnAudioFilterRead before too; it is hopeless.
@ecsplendid
Btw, you should not use FinalResult; it is only for the end of the stream. Use Result / PartialResult while streaming:
```csharp
var result = _recognizer.FinalResult();
```
I'll reopen to track the change
Hello!
I have modified the Unity sample to do speaker diarization, but I can't seem to get the diarization information to come through.
I am on an M1 Mac, and to get it working at all I needed to take your M1 build from your Python distribution and add it to the Unity project.
I also note that your code for pulling the PCM audio from Unity doesn't seem to work, it just produces gibberish, for now I just read in some PCM shorts from a test wave file and that works:
I made some minor modifications:
There is no documentation on this, but I think this is the right setup, i.e. using:
vosk-model-spk-0.4
vosk-model-en-us-0.22
It runs and produces an English transcription, but I don't see any speaker embeddings in the final transcript.
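For reference, when the speaker model is wired in correctly (and MaxAlternatives is 0), a final result from Vosk carries the embedding in a `spk` field alongside the text, roughly like this. The numeric values below are purely illustrative and the vector is truncated; `spk_frames` is the number of frames the embedding was computed over:

```json
{
  "text": "hello world",
  "spk": [0.42, -1.13, 0.07],
  "spk_frames": 214
}
```

If a final result contains only `text` (and possibly `result`) with no `spk` field, the speaker model was not attached to the recognizer, or n-best output is enabled.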
Partials are coming through like this:
And finals are coming through like this:
I know I am probably being stupid here; what am I doing wrong?!
Help! Thanks!