SpeechRecognizer - Channel is always 0 during speech recognition when using stereo WAV file

JakubHolovsky commented 9 months ago

Describe the bug

When I use continuous speech recognition and inspect the "SpeechServiceResponse_JsonResult" property of the result, the channel is always 0 even though I am using a stereo WAV file. The speakers are separated by channel, so each utterance should be attributed to either the LEFT or the RIGHT channel.

var speechConfig = SpeechConfig.FromSubscription(/*api key*/, /*region*/);
var audioConfig = AudioConfig.FromWavFileInput(filePath);
var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

/* In recognized event handler */

var speechServiceResponseJsonResultJson = eventArgs.Result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult);

var channel = 0;
// Guard against an empty value as well as null, and reuse the JSON string fetched above.
if (!string.IsNullOrEmpty(speechServiceResponseJsonResultJson))
{
    var speechServiceResponseJsonResult =
        JsonConvert.DeserializeObject<JObject>(speechServiceResponseJsonResultJson);

    if (speechServiceResponseJsonResult.TryGetValue("Channel", StringComparison.InvariantCultureIgnoreCase, out var channelValue))
    {
        channel = channelValue.ToObject<int>();
    }
}

If relevant, a WAV file of your input audio. 993956cf-603a-4f7c-b5e0-37398e2b9df8.zip

To Reproduce

Steps to reproduce the behavior:

1. Create a speech config instance with default parameters.
2. Create an audio config from a WAV file by providing the path to the file.
3. Create a recognizer from the speech config and the audio config.
4. Start continuous recognition.
5. When the "Recognized" event fires, inspect PropertyId.SpeechServiceResponse_JsonResult: its nested "Channel" property is always 0, even though the speech is recognized from a stereo WAV file in which the speakers are separated into the LEFT and RIGHT channels.
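
For reference, the steps above could be put together roughly as follows. This is a minimal sketch, not the reporter's actual program: the key and region strings are placeholders, the WAV path is taken from the first command-line argument, and the class name ChannelRepro is made up for illustration.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Newtonsoft.Json.Linq;

class ChannelRepro
{
    static async Task Main(string[] args)
    {
        // Placeholders (assumptions): supply your own key, region, and stereo WAV path.
        var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
        using var audioConfig = AudioConfig.FromWavFileInput(args[0]);
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        // Completed when the session ends or recognition is canceled.
        var done = new TaskCompletionSource<bool>();

        recognizer.Recognized += (s, e) =>
        {
            if (e.Result.Reason != ResultReason.RecognizedSpeech) return;

            var json = e.Result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult);
            var channel = string.IsNullOrEmpty(json)
                ? null
                : JObject.Parse(json).GetValue("Channel", StringComparison.OrdinalIgnoreCase);

            // With a stereo input file this prints Channel=0 for every utterance.
            Console.WriteLine($"Channel={channel}: {e.Result.Text}");
        };
        recognizer.Canceled += (s, e) => done.TrySetResult(false);
        recognizer.SessionStopped += (s, e) => done.TrySetResult(true);

        await recognizer.StartContinuousRecognitionAsync();
        await done.Task;
        await recognizer.StopContinuousRecognitionAsync();
    }
}
```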

Expected behavior

The channel should be reported as either 0 (left) or 1 (right), depending on which channel the text was recognized from.

Version of the Cognitive Services Speech SDK

1.33.0

Platform, Operating System, and Programming Language

OS: Windows
Hardware: x64
Programming language: C#

jhakulin commented 9 months ago

@JakubHolovsky The feature you are asking about is not currently supported. The speech service downmixes multi-channel audio to mono, and the mono audio is then processed. What kind of user scenario are you working on? Maybe I can help further with more details on the scenario.

JakubHolovsky commented 9 months ago

@jhakulin in my scenario I have a conversation split between two speakers. Speaker 1 is in LEFT channel, speaker 2 is in RIGHT channel of the stereo .wav file. When I do the continuous transcription, I need to mark each sentence transcribed with the appropriate speaker.

Example:

When speaker 1 says "Hello, this is Mark", I expect the channel to be set to 0 for that sentence, as it comes from the LEFT channel. When speaker 2 says "Hello, this is Adam, thank you for calling", I expect the channel to be set to 1, as it comes from the RIGHT channel.

Is there a workaround I can use if I cannot rely on the channel property?

I saw that I could use the batch processing API, but that requires a somewhat different approach.

jhakulin commented 9 months ago

@JakubHolovsky For your scenario, you can use real-time speaker diarization https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=macos&pivots=programming-language-csharp or batch transcription https://learn.microsoft.com/en-us/azure/ai-services/speech-service/batch-transcription which supports speaker diarization as well. Real-time speaker diarization detects the speakers from mono audio; you can also input multichannel audio in WAV PCM format if you like.

Let us know whether those would fit your project and, if not, what kind of problems you see.
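
A rough sketch of the real-time diarization option mentioned above, modeled on the linked quickstart. The key, region, and language are placeholders, and the class name DiarizationSketch is made up for illustration. Note that the speaker labels come from the service's own speaker detection; this sketch assumes nothing about how they relate to the LEFT and RIGHT channels of the input.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;

class DiarizationSketch
{
    static async Task Main(string[] args)
    {
        // Placeholders (assumptions): supply your own key, region, language, and WAV path.
        var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
        speechConfig.SpeechRecognitionLanguage = "en-US";

        using var audioConfig = AudioConfig.FromWavFileInput(args[0]);
        using var transcriber = new ConversationTranscriber(speechConfig, audioConfig);

        // Completed when the session ends or transcription is canceled.
        var done = new TaskCompletionSource<bool>();

        transcriber.Transcribed += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                // SpeakerId is a label assigned by the service, not a channel index.
                Console.WriteLine($"{e.Result.SpeakerId}: {e.Result.Text}");
            }
        };
        transcriber.Canceled += (s, e) => done.TrySetResult(false);
        transcriber.SessionStopped += (s, e) => done.TrySetResult(true);

        await transcriber.StartTranscribingAsync();
        await done.Task;
        await transcriber.StopTranscribingAsync();
    }
}
```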

JakubHolovsky commented 9 months ago

@jhakulin Thank you, I decided to split the wav file into two mono L & R files and process them in parallel and combine the results at the end. That will give me exactly what I am after without many changes. Appreciate your help though.
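
The split described above could look something like the following. This is only a sketch under stated assumptions, not the code actually used in this thread: it handles uncompressed 16-bit PCM stereo WAV files only, reads the whole data chunk into memory, and the names StereoSplitter, Split, and WriteMono are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

static class StereoSplitter
{
    // Splits a 16-bit PCM stereo WAV file into two mono WAV files (left and right).
    public static void Split(string stereoPath, string leftPath, string rightPath)
    {
        using var reader = new BinaryReader(File.OpenRead(stereoPath));
        if (ReadTag(reader) != "RIFF") throw new InvalidDataException("Not a RIFF file.");
        reader.ReadInt32(); // overall RIFF size, not needed here
        if (ReadTag(reader) != "WAVE") throw new InvalidDataException("Not a WAVE file.");

        short format = 0, channels = 0, bits = 0;
        int sampleRate = 0;
        byte[] data = null;

        // Walk the chunk list until the "data" chunk is found.
        while (data == null && reader.BaseStream.Position + 8 <= reader.BaseStream.Length)
        {
            string id = ReadTag(reader);
            int size = reader.ReadInt32();
            int pad = size & 1; // chunks are padded to an even number of bytes
            if (id == "fmt ")
            {
                format = reader.ReadInt16();
                channels = reader.ReadInt16();
                sampleRate = reader.ReadInt32();
                reader.ReadInt32(); // byte rate
                reader.ReadInt16(); // block align
                bits = reader.ReadInt16();
                reader.BaseStream.Seek(size - 16 + pad, SeekOrigin.Current);
            }
            else if (id == "data")
            {
                data = reader.ReadBytes(size);
            }
            else
            {
                reader.BaseStream.Seek(size + pad, SeekOrigin.Current);
            }
        }

        if (data == null || format != 1 || channels != 2 || bits != 16)
            throw new NotSupportedException("Only 16-bit PCM stereo WAV files are handled.");

        // Each stereo frame is 4 bytes: 2 bytes left sample, then 2 bytes right sample.
        int frames = data.Length / 4;
        var left = new byte[frames * 2];
        var right = new byte[frames * 2];
        for (int i = 0; i < frames; i++)
        {
            left[2 * i] = data[4 * i];
            left[2 * i + 1] = data[4 * i + 1];
            right[2 * i] = data[4 * i + 2];
            right[2 * i + 1] = data[4 * i + 3];
        }

        WriteMono(leftPath, sampleRate, left);
        WriteMono(rightPath, sampleRate, right);
    }

    static string ReadTag(BinaryReader reader) => Encoding.ASCII.GetString(reader.ReadBytes(4));

    // Writes a canonical 44-byte header followed by 16-bit mono PCM samples.
    static void WriteMono(string path, int sampleRate, byte[] pcm)
    {
        using var writer = new BinaryWriter(File.Create(path));
        writer.Write(Encoding.ASCII.GetBytes("RIFF"));
        writer.Write(36 + pcm.Length);
        writer.Write(Encoding.ASCII.GetBytes("WAVE"));
        writer.Write(Encoding.ASCII.GetBytes("fmt "));
        writer.Write(16);             // fmt chunk size
        writer.Write((short)1);       // PCM
        writer.Write((short)1);       // one channel
        writer.Write(sampleRate);
        writer.Write(sampleRate * 2); // byte rate = sampleRate * blockAlign
        writer.Write((short)2);       // block align
        writer.Write((short)16);      // bits per sample
        writer.Write(Encoding.ASCII.GetBytes("data"));
        writer.Write(pcm.Length);
        writer.Write(pcm);
    }
}
```

Each resulting mono file can then be passed to AudioConfig.FromWavFileInput and recognized on its own, with the channel known from which file the result came from.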

manshar-ish commented 8 months ago

> @jhakulin Thank you, I decided to split the wav file into two mono L & R files and process them in parallel and combine the results at the end. That will give me exactly what I am after without many changes. Appreciate your help though.

Hi @jhakulin, could you please explain how you are merging the L & R files in sync?
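
The thread does not say how the results were combined. One possible approach, sketched here as an assumption rather than the method actually used: because both mono files are cut from the same stereo file, they share a timeline, so each recognized utterance can be tagged with its channel and its start offset (for example e.Result.OffsetInTicks in the Recognized handler) and the two lists interleaved by offset. The Utterance record and TranscriptMerger class are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record: one recognized utterance from one channel.
// OffsetTicks would be filled from e.Result.OffsetInTicks, Text from e.Result.Text.
public record Utterance(int Channel, long OffsetTicks, string Text);

public static class TranscriptMerger
{
    // Interleaves the two per-channel result lists by start offset.
    // Ties are broken by channel so the ordering is deterministic.
    public static List<Utterance> Merge(IEnumerable<Utterance> left, IEnumerable<Utterance> right) =>
        left.Concat(right)
            .OrderBy(u => u.OffsetTicks)
            .ThenBy(u => u.Channel)
            .ToList();
}
```

In this sketch the left-channel recognizer would add new Utterance(0, ...) entries to one list and the right-channel recognizer new Utterance(1, ...) entries to another. Merge is called once both recognizers have finished.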

Azure-Samples / cognitive-services-speech-sdk

SpeechRecognizer - Channel is always 0 during speech recognition when using stereo WAV file #2158