Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Diarization in Speech SDK for overlapping audio of multiple speakers speaking simultaneously #2615

Open ShyamalG97 opened 3 days ago

ShyamalG97 commented 3 days ago

To the Microsoft Support Team,

We have been using the ConversationTranscriber of the Azure Speech SDK to implement diarization in our project, and have encountered an issue with which we need your assistance.

In our project, the Transcriber works well when 2 or more speakers speak separately, i.e., their audio does not overlap. In that scenario, the separate speakers and their speech are recognized correctly. But when 2 or more speakers speak simultaneously, i.e., their audio overlaps, the Transcriber does not identify the speakers separately. Instead, it merges their speech and attributes it to a single speaker. Sometimes it picks up only parts of the different utterances, returning erroneous results.

Our project setup is as follows:

  1. We have a GStreamer C++ project in which we are integrating the Azure Speech SDK.
  2. The project receives an OPUS audio stream containing the audio of speakers speaking in real time.
  3. The OPUS audio stream is decoded into a raw audio stream (format: S16LE, rate: 16000, channels: mono). Please note that our audio is mono, single channel.
  4. Samples from this raw audio stream are pushed to a push stream whenever they become available. The push stream is configured as the audio input of the Transcriber.
  5. The transcription runs as an asynchronous process in the background and transcribes audio from the push stream. (A rough sketch of this setup is shown after this list.)
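
To make the setup concrete, here is a rough sketch of how we wire the push stream to the Transcriber. The subscription key, region, and the function/variable names around the GStreamer callback are placeholders, not our exact code:

#include <cstdint>
#include <iostream>
#include <memory>
#include <speechapi_cxx.h>

using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;
using namespace Microsoft::CognitiveServices::Speech::Transcription;

std::shared_ptr<PushAudioInputStream> pushStream;
std::shared_ptr<ConversationTranscriber> transcriber;

void StartTranscription()
{
    // Placeholders: replace with the real subscription key and region.
    auto speechConfig = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    speechConfig->SetSpeechRecognitionLanguage("en-US");

    // Matches the raw stream described above: 16 kHz, 16-bit, mono PCM (S16LE).
    auto format = AudioStreamFormat::GetWaveFormatPCM(16000, 16, 1);
    pushStream = AudioInputStream::CreatePushStream(format);
    auto audioConfig = AudioConfig::FromStreamInput(pushStream);

    transcriber = ConversationTranscriber::FromConfig(speechConfig, audioConfig);

    transcriber->Transcribed.Connect([](const ConversationTranscriptionEventArgs& e)
    {
        std::cout << "TRANSCRIBED: Text=" << e.Result->Text
                  << " Speaker ID=" << e.Result->SpeakerId << std::endl;
    });

    transcriber->StartTranscribingAsync().get();
}

// Called from the GStreamer sink whenever a decoded raw buffer becomes available.
void OnRawAudio(uint8_t* data, uint32_t size)
{
    pushStream->Write(data, size);
}

void StopTranscription()
{
    pushStream->Close();
    transcriber->StopTranscribingAsync().get();
}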

We have been using the following documentation as reference:

  1. https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/cpp/windows/console/samples/conversation_transcriber_samples.cpp (ConversationTranscriptionWithPushAudioStream())

  2. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=linux&pivots=programming-language-cpp

As mentioned above, we get correct results when speakers speak separately. But when they speak simultaneously, we get the following types of erroneous results:

  1. The voice of the speaker who is speaking more loudly is picked up, suppressing/ignoring the other speakers.
  2. The audio of the different speakers is merged together and attributed to a single speaker.
  3. Only parts of the different utterances are picked up and transcribed, so we do not get transcriptions of the complete utterances.

SAMPLE:

We have attached an audio file, the Speech SDK logs and the results in 3 separate files in a ZIP folder below.

The audio file contains 2 speakers speaking:

Speaker 1: "In other sports, it could be impossible to do it, but I think, in tennis, we have a really good relationship of the court as well." (This speaker speaks continuously in the background.)

Speaker 2: "Tell me about Echo and Charlie Series." (This speaker starts speaking after some delay, and the voices overlap.)

We got the following result:

TRANSCRIBED: Text=In other sports, it. Speaker ID=Guest-1
TRANSCRIBED: Text=Could be impossible to do it, but I. Speaker ID=Guest-2
TRANSCRIBED: Text=Tell you about Echo, Civilian and Charlie series. Speaker ID=Guest-3
TRANSCRIBED: Text=Of the court as well. Speaker ID=Guest-2

As you can see, some parts of the background speaker's audio were not transcribed; this happened where the audio overlapped.

Please note that our project setup does NOT read directly from an audio file. The Transcriber transcribes data from a push stream, which contains raw audio captured through the microphone, with the speakers talking directly to the microphone.

For the sample attached below, we recorded 2 speakers speaking (the attached audio file), played the recording to our microphone, and collected the Transcriber results (also attached).

Speakers Audio Overlap Sample and Results.zip

We would like to know whether diarization using ConversationTranscriber supports overlapping speakers. If so, could you kindly assist us in identifying what might be going wrong with our project setup or our approach to using the Transcriber? Are we using the correct functions from the Speech SDK for diarization of overlapping audio? Could you also point us to relevant documentation or working examples to help us further?

Thanks and regards, Shyamal Goel

danhalliday commented 2 hours ago

I also have some questions around how to actually interpret the responses sent back when using diarization of partial results. We are struggling to find any algorithm which sensibly orders and formats the responses. Here’s an example:

Transcribing <Unknown>: <S>
Transcribing <Unknown>: <so>
Transcribing <Unknown>: <so look>
Transcribing <Unknown>: <so look at that>
Transcribing <Unknown>: <at that>
Transcribing <Guest-1>: <so look at that hair>
Transcribing <Guest-1>: <look at that hair>
Transcribing <Guest-1>: <so look at that hair but>
Transcribing <Guest-1>: <so look at that hair but i>
Transcribing <Guest-1>: <so look at that hair but i think>
Transcribing <Guest-1>: <so look at that hair but i think so>
Transcribing <Unknown>: <but i think so tall>
Transcribing <Guest-2>: <but i think so tall what>
Transcribing <Guest-2>: <but i think so tall what a>
Transcribing <Guest-2>: <but i think so tall what a target>
Transcribing <Guest-1>: <so look at that hair but i think so tall what a target they>
Transcribing <Unknown>: <they can't seem to>
Transcribing <Unknown>: <they can't seem to hit>
Transcribing <Unknown>: <they can't seem to hit him>
Transcribed <Guest-1>: <Look at that hair.>
Transcribed <Guest-2>: <But I think so tall. What a target.>
Transcribed <Guest-1>: <They can't seem to hit him.>

First we see an unknown utterance build up and get identified as Guest-1. Then the in-flight result starts fresh, becomes unknown, and gets identified as Guest-2. Then we see a spurious result identified as Guest-1 containing the whole transcript so far. Eventually we get three correct finalised utterances.

What algorithm could you apply to sensibly display this in real time? The best I have is to bucket the responses by speaker, but this doesn’t always work well. Is the response stream intended to be interpreted in a serial way? Based on the above, it looks to me as if you’re meant to track responses by speaker and assume they will interleave until finalised. But then how do you interpret the various unknown responses?
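
For reference, the bucketing I've been trying amounts to roughly the following sketch. This is my own code, not anything from the SDK; the struct name and the "Unknown"-handling heuristic are just what I've experimented with:

#include <map>
#include <string>
#include <vector>

// Bucket responses by SpeakerId ("Unknown", "Guest-1", ...): keep one in-flight
// string per speaker, replace it on every Transcribing event, and move it to the
// finalised list on the matching Transcribed event.
struct LiveTranscriptView
{
    std::map<std::string, std::string> inFlight;   // speaker -> latest partial text
    std::vector<std::string> finalised;            // completed utterances, in arrival order

    void OnTranscribing(const std::string& speaker, const std::string& text)
    {
        inFlight[speaker] = text;                  // partials replace, never append
    }

    void OnTranscribed(const std::string& speaker, const std::string& text)
    {
        if (!text.empty())
            finalised.push_back(speaker + ": " + text);
        inFlight.erase(speaker);                   // this speaker's partials are now stale
        inFlight.erase("Unknown");                 // guess: dangling "Unknown" partials
                                                   // belonged to the utterance that just ended
    }
};

But in the first example above this leaves overlapping text ("but i think so") sitting in both the Guest-1 and Guest-2 buckets until finalisation, which is why I say it doesn't always work well.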

Here’s another example I can’t make sense of:

Transcribing <Unknown>: <amazing but>
Transcribing <Unknown>: <amazing but i>
Transcribing <Guest-1>: <but i you know>
Transcribing <Guest-1>: <but i you know laure>
Transcribing <Guest-1>: <but i you know lauren>
Transcribing <Guest-1>: <but i you know lauren has>
Transcribing <Guest-1>: <but i you know lauren has an interesting thing>
Transcribing <Guest-1>: <but i you know lauren has an interesting thing which is>
Transcribing <Guest-2>: <but i you know lauren has an interesting thing which is he>
Transcribing <Guest-2>: <but i you know lauren has an interesting thing which is he totally>
Transcribing <Unknown>: <totally put me>
Transcribing <Guest-1>: <but i you know lauren has an interesting thing which is he totally put me on the>
...

You can see that, as the initial identification happens, we again miss the first word, which seems to be a common issue. Then the speaker changes and we get a spurious unknown fragment, which later resolves.

I saw some indication in code samples that timestamp metadata might need to be used to filter out unwanted results:

// If the end timestamp for the previous result is later
// than the end timestamp for this result, drop the result.
// This sometimes happens when we receive a lot of Recognizing results close together.
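
If I'm reading that right, the rule amounts to something like the following sketch (C++ for consistency with the thread; I'm assuming the result's Offset() and Duration(), in 100-nanosecond ticks, are the timestamps the comment refers to, and the function name is just illustrative):

#include <cstdint>
#include <iostream>
#include <memory>
#include <speechapi_cxx.h>

using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Transcription;

// Drop Transcribing results whose end timestamp precedes the end timestamp of
// the previously accepted result, as described in the quoted comment.
void AttachFilteredTranscribingHandler(const std::shared_ptr<ConversationTranscriber>& transcriber)
{
    auto lastEndTicks = std::make_shared<uint64_t>(0);

    transcriber->Transcribing.Connect([lastEndTicks](const ConversationTranscriptionEventArgs& e)
    {
        uint64_t endTicks = e.Result->Offset() + e.Result->Duration();
        if (endTicks < *lastEndTicks)
            return;                                // stale partial: ends before the previous one
        *lastEndTicks = endTicks;

        std::cout << "Transcribing <" << e.Result->SpeakerId
                  << ">: <" << e.Result->Text << ">" << std::endl;
    });
}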

But based on my last sample, how would I, for example, filter out the short random 'unknown' fragment? Its end timestamp is later than that of the previous result.

Am I missing something? Where is the documentation for this feature?