ShyamalG97 opened this issue 1 month ago
I also have some questions about how to interpret the responses sent back when using diarization with partial results. We are struggling to find an algorithm that sensibly orders and formats them. Here’s an example:
Transcribing <Unknown>: <S>
Transcribing <Unknown>: <so>
Transcribing <Unknown>: <so look>
Transcribing <Unknown>: <so look at that>
Transcribing <Unknown>: <at that>
Transcribing <Guest-1>: <so look at that hair>
Transcribing <Guest-1>: <look at that hair>
Transcribing <Guest-1>: <so look at that hair but>
Transcribing <Guest-1>: <so look at that hair but i>
Transcribing <Guest-1>: <so look at that hair but i think>
Transcribing <Guest-1>: <so look at that hair but i think so>
Transcribing <Unknown>: <but i think so tall>
Transcribing <Guest-2>: <but i think so tall what>
Transcribing <Guest-2>: <but i think so tall what a>
Transcribing <Guest-2>: <but i think so tall what a target>
Transcribing <Guest-1>: <so look at that hair but i think so tall what a target they>
Transcribing <Unknown>: <they can't seem to>
Transcribing <Unknown>: <they can't seem to hit>
Transcribing <Unknown>: <they can't seem to hit him>
Transcribed <Guest-1>: <Look at that hair.>
Transcribed <Guest-2>: <But I think so tall. What a target.>
Transcribed <Guest-1>: <They can't seem to hit him.>
First we see an unknown utterance build up and get identified as Guest-1. Then the in-flight result starts fresh as unknown and is eventually identified as Guest-2. Then we see a spurious result, attributed to Guest-1, containing the whole transcript so far. Eventually we get three correct finalised utterances.
What algorithm could you apply to sensibly display this in real time? The best I have is to bucket the responses by speaker, but this doesn’t always work well. Is the response stream intended to be interpreted in a serial way? Based on the above, it looks to me as if you’re meant to track responses by speaker and assume they will interleave until finalised. But then how do you interpret the various unknown responses?
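For concreteness, this is roughly what I mean by bucketing (a minimal C++ sketch of my own; the event struct is a simplification of what the Transcribing/Transcribed callbacks deliver, not an SDK type):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for what the Transcribing/Transcribed callbacks deliver.
struct TranscriptionEvent {
    bool finalized;         // true for Transcribed, false for Transcribing
    std::string speakerId;  // "Unknown", "Guest-1", "Guest-2", ...
    std::string text;
};

class SpeakerBuckets {
public:
    void OnEvent(const TranscriptionEvent& e) {
        if (e.finalized) {
            // Commit the finalised utterance and drop any in-flight text for
            // that speaker, plus any stale "Unknown" fragment.
            committed_.push_back(e.speakerId + ": " + e.text);
            inFlight_.erase(e.speakerId);
            inFlight_.erase("Unknown");
        } else {
            // Each Transcribing result restates the whole hypothesis, so
            // replace (never append) the in-flight text for this speaker.
            inFlight_[e.speakerId] = e.text;
        }
        Render();
    }

private:
    void Render() const {
        std::cout << "---- committed ----\n";
        for (const auto& line : committed_) std::cout << line << "\n";
        std::cout << "---- in flight ----\n";
        for (const auto& [speaker, text] : inFlight_) {
            std::cout << speaker << ": " << text << "\n";
        }
    }

    std::map<std::string, std::string> inFlight_;
    std::vector<std::string> committed_;
};

Where this falls down is when the service moves text between buckets, as in the log above: the Unknown bucket and the Guest-1 bucket end up holding overlapping text, and the spurious whole-transcript result briefly duplicates everything.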
Here’s another example I can’t make sense of:
Transcribing <Unknown>: <amazing but>
Transcribing <Unknown>: <amazing but i>
Transcribing <Guest-1>: <but i you know>
Transcribing <Guest-1>: <but i you know laure>
Transcribing <Guest-1>: <but i you know lauren>
Transcribing <Guest-1>: <but i you know lauren has>
Transcribing <Guest-1>: <but i you know lauren has an interesting thing>
Transcribing <Guest-1>: <but i you know lauren has an interesting thing which is>
Transcribing <Guest-2>: <but i you know lauren has an interesting thing which is he>
Transcribing <Guest-2>: <but i you know lauren has an interesting thing which is he totally>
Transcribing <Unknown>: <totally put me>
Transcribing <Guest-1>: <but i you know lauren has an interesting thing which is he totally put me on the>
...
You can see that when the initial speaker identification happens we again lose the first word ("amazing"), which seems to be a common issue. Then the speaker changes and we get a spurious unknown fragment, which later resolves.
I saw some indication in code samples that timestamp metadata might need to be used to filter out unwanted results:
// If the end timestamp for the previous result is later
// than the end timestamp for this result, drop the result.
// This sometimes happens when we receive a lot of Recognizing results close together.
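My reading of that comment is a filter along these lines (my own sketch, not taken from the samples; the end ticks would be the result's offset plus its duration):

#include <cstdint>

// Keep only partial results whose end timestamp moves forward in time.
class EndTimestampFilter {
public:
    // endTicks = result offset + duration, in the service's 100 ns ticks.
    bool ShouldKeep(uint64_t endTicks) {
        if (endTicks < lastEndTicks_) {
            return false;  // this partial ends earlier than one already shown; drop it
        }
        lastEndTicks_ = endTicks;
        return true;
    }

private:
    uint64_t lastEndTicks_ = 0;
};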
But based on my last sample, how would I filter out, for example, the short spurious 'Unknown' fragment? Its end timestamp is later than that of the previous result.
Am I missing something? Where is the documentation for this feature?
Is there any update on this issue? At present I can’t see how you could build anything around the real-time diarization feature. There doesn’t seem to be any documented algorithm or state-machine logic for interpreting the order of responses from the service.
Hello @danhalliday, I too have not received a response from Microsoft regarding my query, and I have gone through your issue.
This is the question I had posted on the Microsoft Q&A forum, and I got this response from Microsoft (link below). They have addressed the issue you are facing and suggested using the property "PropertyId.SpeechServiceResponse_DiarizeIntermediateResults" to get recognized speakers for partial results, which might solve your problem. They also suggested raising an issue with the dev team on GitHub to ask for more such properties, as I think this property is undocumented. More details are in the link below.
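In case it helps, this is roughly how I would expect the property to be set (a sketch only; I have not tried it myself, and I am assuming the SDK version in use exposes this PropertyId):

#include <speechapi_cxx.h>
#include <iostream>

using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;
using namespace Microsoft::CognitiveServices::Speech::Transcription;

int main() {
    auto speechConfig = SpeechConfig::FromSubscription("<your-key>", "<your-region>");
    speechConfig->SetSpeechRecognitionLanguage("en-US");

    // Ask the service to attach speaker IDs to intermediate (Transcribing) results.
    speechConfig->SetProperty(
        PropertyId::SpeechServiceResponse_DiarizeIntermediateResults, "true");

    auto audioConfig = AudioConfig::FromDefaultMicrophoneInput();
    auto transcriber = ConversationTranscriber::FromConfig(speechConfig, audioConfig);

    transcriber->Transcribing.Connect([](const ConversationTranscriptionEventArgs& e) {
        std::cout << "Transcribing <" << e.Result->SpeakerId << ">: <" << e.Result->Text << ">\n";
    });
    transcriber->Transcribed.Connect([](const ConversationTranscriptionEventArgs& e) {
        std::cout << "Transcribed <" << e.Result->SpeakerId << ">: <" << e.Result->Text << ">\n";
    });

    transcriber->StartTranscribingAsync().get();
    std::cout << "Transcribing from the default microphone; press Enter to stop.\n";
    std::cin.get();
    transcriber->StopTranscribingAsync().get();
    return 0;
}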
Also, as a side note, the link below contains a flowchart indicating that Azure expects the speakers' audio not to overlap. I don't think Microsoft has released a solution for overlapping audio from multiple speakers yet: https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-general-availability-of-real-time-diarization/ba-p/4147556 In addition, none of the samples in Azure's online Transcriber documentation contain overlapping audio.
Do let me know the results you get, if you try this property.
This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.
To the Microsoft Support Team,
We have been using the ConversationTranscriber of the Azure Speech SDK to implement diarization in our project, and have encountered an issue with which we need your assistance.
In our project, the Transcriber works well when two or more speakers speak separately, i.e., their audio does not overlap; in that scenario, the separate speakers and their speech are recognized correctly. But when two or more speakers speak simultaneously, i.e., their audio overlaps, the Transcriber does not identify the speakers separately. Instead, it merges their speech and attributes it to a single speaker. Sometimes it picks up different parts of the different utterances and returns erroneous results.
Our project setup is as follows:
We have been using the following documentation as reference:
https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/cpp/windows/console/samples/conversation_transcriber_samples.cpp (ConversationTranscriptionWithPushAudioStream())
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=linux&pivots=programming-language-cpp
As mentioned above, we get correct results when the speakers speak separately. But when they speak simultaneously, we get the following types of erroneous results:
SAMPLE:
We have attached an audio file, the Speech SDK logs, and the results as three separate files in a ZIP folder below.
The audio file contains two speakers:
Speaker 1: "In other sports, it could be impossible to do it, but I think, in tennis, we have a really good relationship of the court as well." (This speaker speaks continuously in the background.)
Speaker 2: "Tell me about Echo and Charlie Series" (This speaker starts speaking after some delay, and the voices overlap.)
We got the following result:
TRANSCRIBED: Text=In other sports, it. Speaker ID=Guest-1
TRANSCRIBED: Text=Could be impossible to do it, but I. Speaker ID=Guest-2
TRANSCRIBED: Text=Tell you about Echo, Civilian and Charlie series. Speaker ID=Guest-3
TRANSCRIBED: Text=Of the court as well. Speaker ID=Guest-2
As you can see, some parts of the background speaker's audio were not transcribed, which happened when the audio overlapped.
Please note that our project setup does NOT read directly from an audio file. The Transcriber transcribes data from a push stream containing raw audio captured from the microphone, into which the speakers talk directly.
For the sample attached below, we recorded the two speakers (audio file attached), played the recording to our microphone, and collected the Transcriber results (also attached).
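For reference, a simplified sketch of our setup is shown below (keys, error handling, and our actual microphone capture loop are omitted; the audio we push is 16 kHz, 16-bit, mono PCM, following the ConversationTranscriptionWithPushAudioStream sample referenced above):

#include <speechapi_cxx.h>
#include <iostream>

using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;
using namespace Microsoft::CognitiveServices::Speech::Transcription;

int main() {
    auto speechConfig = SpeechConfig::FromSubscription("<key>", "<region>");
    speechConfig->SetSpeechRecognitionLanguage("en-US");

    // Raw PCM captured from the microphone is pushed into this stream.
    auto format = AudioStreamFormat::GetWaveFormatPCM(16000, 16, 1);
    auto pushStream = AudioInputStream::CreatePushStream(format);
    auto audioConfig = AudioConfig::FromStreamInput(pushStream);

    auto transcriber = ConversationTranscriber::FromConfig(speechConfig, audioConfig);

    transcriber->Transcribed.Connect([](const ConversationTranscriptionEventArgs& e) {
        std::cout << "TRANSCRIBED: Text=" << e.Result->Text
                  << " Speaker ID=" << e.Result->SpeakerId << std::endl;
    });

    transcriber->StartTranscribingAsync().get();

    // In the real application, a capture thread repeatedly calls
    //   pushStream->Write(audioBuffer, bytesRead);
    // and finally pushStream->Close() when the capture ends.

    std::cin.get();
    transcriber->StopTranscribingAsync().get();
    return 0;
}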
Speakers Audio Overlap Sample and Results.zip
We wished to know whether diarization using ConversationTranscriber has been implemented for overlapping speakers. If so, could you kindly assist us in identifying what might be going wrong with our project setup or our approach to implementing the Transcriber? Are we using the correct functions from the Speech SDK to implement diarization of overlapping audio? Could you also provide us with the relevant documentation or working examples to help us further?
Thanks and regards,
Shyamal Goel