MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International

conversation transcription - speech service #35466

Closed cassm199 closed 4 years ago

cassm199 commented 5 years ago

Hi, I'm trying to use the conversation transcription feature of the Speech service, but the service enters the Canceled event with reason EndOfStream, no transcript is returned, and the duration is 00:00:00.

First I generate the voice signature using CreateVoiceSignatureByUsingBody, then I pass the signature to create a new Participant object. Then, in ConversationWithPullAudioStreamAsync, I pass another audio file containing the participant's voice using Helper.OpenWavFile, which makes use of PullAudioInputStreamCallback. After all the events are initialized, a few seconds later it ends up in transcriber.Canceled with reason EndOfStream.

Is it mandatory to have the ROOBO dev kit, or can you still process audio recorded from other devices?



shashishailaj commented 5 years ago

@cassm199 Thank you for your feedback. We will investigate and update the thread further.

cassm199 commented 5 years ago

@shashishailaj I think the issue is more related to https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/cognitive-services/Speech-Service/how-to-use-conversation-transcription-service.md

shashishailaj commented 5 years ago

@cassm199 Thank you for letting us know. I have updated the reference and it has been assigned to the relevant team. They will further update the thread with their findings.

RohitMungi-MSFT commented 5 years ago

@cassm199 Yes, the ROOBO dev kit is the supported hardware for creating conversation transcriptions. The input audio WAV file for creating voice signatures should use 16-bit samples, a 16 kHz sample rate, and a single channel (mono), and the recommended length for each audio sample is between 30 seconds and two minutes.
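For anyone who wants to check their input offline, the format requirements above can be verified with Python's stdlib wave module. This is a hedged sketch, not part of the Speech SDK; the function and file names are illustrative.

```python
import wave

def meets_signature_format(path):
    """Check a WAV file against the voice-signature input format:
    16-bit PCM samples, 16 kHz sample rate, single (mono) channel."""
    with wave.open(path, "rb") as w:
        return (w.getsampwidth() == 2          # 16-bit samples
                and w.getframerate() == 16000  # 16 kHz
                and w.getnchannels() == 1)     # mono

def write_silent_wav(path, channels=1, rate=16000, seconds=1):
    """Write a silent 16-bit PCM WAV file, for demonstration only."""
    with wave.open(path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # 16-bit
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * channels * rate * seconds)

if __name__ == "__main__":
    write_silent_wav("sig_check.wav")
    print(meets_signature_format("sig_check.wav"))  # prints True
```

The duration recommendation (30 seconds to two minutes per sample) can be checked the same way via getnframes() / getframerate().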

@jhakulin Could you please confirm the same?

cassm199 commented 5 years ago

Thanks. I've tried with an 8-channel audio file and in fact it worked, although sometimes, rather than returning the correct user ID or Unidentified, it returns $ref$. Is that the same as an unidentified user?

RohitMungi-MSFT commented 5 years ago

@cassm199 Unfortunately, I did not encounter such a scenario. An unidentified user is returned if the result reason is a NoMatch. Adding more logs might help in understanding the returned response.

Weatwagon commented 5 years ago

I've experienced the same thing trying to create an 8-channel audio file, as we do not have the ROOBO dev kit at this time. With the file I crafted, it will transcribe the audio but will not identify the speakers. The transcription only comes back on the "Recognizing" event; the "Recognized" event returns blank text and no identified participant.

How was the live demo from the Build 2018 keynote achieved? Can we use the ROOBO dev kit to transcribe live audio, or only files recorded by the ROOBO dev kit? Will a sample of that demo be added to the GitHub demos?

RohitMungi-MSFT commented 5 years ago

@jhakulin Do we have plans to update the GitHub demos with the demo from Build 2018?

jhakulin commented 5 years ago

For conversation transcription, it is recommended to use the ROOBO dev kit or the Azure Kinect DK. See the conversation transcriber samples under the Speech Devices SDK samples for both devices (Android, Windows, and Linux): https://github.com/Azure-Samples/Cognitive-Services-Speech-Devices-SDK/tree/master/Samples

@Weatwagon, how did you create the voice signatures? How about the 8-channel audio? Could you share those?

Adding @sarahlume also

cassm199 commented 5 years ago

@jhakulin by any chance, have you ever encountered the user ID being $ref$? Kindly note that this doesn't always happen.

Weatwagon commented 5 years ago

@jhakulin I created the voice signature file with Audacity and used the example script located at https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-use-conversation-transcription-service, which produced the voice signature.

I created the 8-channel audio also using Audacity. I tried transcribing the audio using the same example given at the previous link. Attached are the signature file and the 8-channel file I was trying to get a positive match on: AudioSamples.zip

andrewvr99 commented 5 years ago

I am getting a similar problem: $ref$ for people for whom a signature was provided and Unknown for the other participants. Speech is being correctly transcribed so I don't think it is an issue with the input files.

I can't tell from the above what the status of this issue is - is this an acknowledged problem/is the issue still open?

cassm199 commented 5 years ago

Do you have any updates on this issue?

jhakulin commented 5 years ago

Apologies for the slow response,

Currently, only the following kits are supported for 8-channel audio capture with the conversation transcription feature: the ROOBO Smart Audio Circular 7-Mic DK (https://ddk.roobo.com/) and the Azure Kinect DK (https://azure.microsoft.com/en-in/services/kinect-dk/).

Are you using one of these, or building the 8-channel audio some other way?

Weatwagon commented 5 years ago

We have purchased an Azure Kinect DK. I would like to test the issue with audio recorded by the Azure Kinect. What tool do we use to capture an audio recording? The Azure Kinect DK recorder (https://docs.microsoft.com/en-us/azure/kinect-dk/azure-kinect-recorder) has a note that it does not record audio. Using Audacity with the Windows WASAPI host, the Azure Kinect is recognized as a 7-channel device. Do you have a sample file from an Azure Kinect I can compare? Do I need to add a silenced audio channel?

lei0706w commented 5 years ago

You can try to record the 7 channels using Audacity, and add an extra silenced channel of the same length as the other 7 channels.
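A minimal sketch of the "add an extra silenced channel" step, using Python's stdlib wave module (the helper and file names are illustrative, not from the Speech SDK):

```python
import wave

def add_silent_last_channel(src_path, dst_path):
    """Read a multi-channel 16-bit PCM WAV (e.g. a 7-channel Audacity
    recording) and write a copy with one extra, all-silence channel
    appended as the last channel of every frame."""
    with wave.open(src_path, "rb") as src:
        n_ch = src.getnchannels()
        width = src.getsampwidth()      # expect 2 bytes (16-bit)
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    silence = b"\x00" * width           # one silent sample
    frame_size = n_ch * width
    out = bytearray()
    for i in range(0, len(frames), frame_size):
        # samples are interleaved per frame, so appending silence
        # after each frame makes it the new last channel
        out += frames[i:i + frame_size] + silence

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(n_ch + 1)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        dst.writeframes(bytes(out))
```

For example, add_silent_last_channel("kinect_7ch.wav", "kinect_8ch.wav") would turn a 7-channel recording into the 8-channel layout described above, with channel 8 silent.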

Ideally, we suggest using the Speech Devices SDK for Windows for conversation transcription with the Azure Kinect DK. It automatically handles recording the 7-channel audio plus 1 silenced channel, as well as the speech recognition. You can find the Speech Devices SDK sample app for Windows here. Hope that is helpful to you.

Weatwagon commented 5 years ago

@lei0706w I will try to create the audio file and test with a silenced channel. Do you know if it needs to be on a specific channel?

I'm familiar with the sample app. It seems to function fine, though I prefer the .NET stack over Java development. I would like to use a recording of a previous conversation, because right now I'm driving my co-workers crazy asking them to "hey, say something" (open office floor plan) every time I want to test. I'm getting a lot of false identifications, which I assume is due to poor voice signatures. Is there a place to get more information on what a quality voice signature should be, or how to reduce incorrect identifications?

lei0706w commented 5 years ago

The silenced channel needs to be set as the last channel. Someone will answer your last question later.

sarahlume commented 5 years ago

@Weatwagon, currently CTS only supports the DDK provided by ROOBO (https://ddk.roobo.com/) because of the microphone array geometry restriction.

Let me know if you have any other questions.

Weatwagon commented 5 years ago

@sarahlume I'm getting conflicting information.

This how-to guide from the Microsoft docs here about the conversation transcription service references the Azure Kinect DK:

"Currently, only these kits are supported for 8 channel audio capture: ROOBO Smart Audio Circular 7-Mic DK, Azure Kinect DK."

The "Get the Speech Devices SDK" doc here shows that the scenarios for use include conversation transcription:

Azure Kinect DK (Setup / Quickstart): 7-mic array, RGB and depth cameras, Windows/Linux. "A developer kit with advanced artificial intelligence (AI) sensors for building sophisticated computer vision and speech models. It combines a best-in-class spatial microphone array and depth camera with a video camera and orientation sensor—all in one small device with multiple modes, options, and SDKs to accommodate a range of compute types." Scenarios: Conversation Transcription, Robotics, Smart Building.

And the GitHub samples page here also references both the ROOBO and the Azure Kinect.

I feel I have hijacked this issue enough, @cassm199 I hope you get a resolution to your issue. Please direct me on how to create an official support ticket with Microsoft on matters related to Speech Cognitive Services and the Azure Kinect DK.

andrewvr99 commented 5 years ago

"The silenced channel needs to be set as the last channel. Someone will answer your last question later." Thanks for following up, @lei0706w.

I took a mono WAV file and used Audacity to create an 8 channel, 16-bit 16 kHz PCM version. The Speech service was able to transcribe the text, but identified the speaker as $ref$.

I then took that same file and silenced the 8th track (so it was a flat line) but when I fed that into the Speech service, it was unable to transcribe any text at all.

I take it the Speech service works as advertised, so I must assume that it is not possible to take a non-compliant WAV file and convert it into an acceptable format for the Speech service. Or I am mangling the file somehow, but I have checked that quite carefully.

Has anyone else managed to get this to work without starting from a Roobo device?
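For reference, the mono-to-8-channel conversion described above can be sketched with Python's stdlib wave module: duplicate the mono signal into the first 7 channels and write silence into the 8th (last) channel. This is a hypothetical reconstruction of the Audacity workflow, not a supported path; names are illustrative.

```python
import wave

def mono_to_8ch(src_path, dst_path):
    """Convert a 16-bit PCM mono WAV into an 8-channel file with
    the mono signal in channels 1-7 and silence in channel 8."""
    with wave.open(src_path, "rb") as src:
        assert src.getnchannels() == 1 and src.getsampwidth() == 2
        rate = src.getframerate()
        data = src.readframes(src.getnframes())

    out = bytearray()
    for i in range(0, len(data), 2):     # one 16-bit sample per frame
        sample = data[i:i + 2]
        out += sample * 7                # channels 1-7: the mono signal
        out += b"\x00\x00"               # channel 8 (last): silence

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(8)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(bytes(out))
```

Note that, per the replies in this thread, files built this way may transcribe but are unlikely to yield correct speaker identification, since only the listed dev kits are supported.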

nimai-agarwal commented 5 years ago

I'm getting the exact same issue as andrewvr99. I edited the .wav file in Audacity to fit the specifications, and it's being transcribed successfully, but all speakers are being identified as $ref$. Is there a solution to this?

jhakulin commented 5 years ago

From https://docs.microsoft.com/en-us/azure/cognitive-services/Speech-Service/how-to-use-conversation-transcription-service: "Currently, only these kits are supported for 8 channel audio capture: ROOBO Smart Audio Circular 7-Mic DK, Azure Kinect DK."

@nimai-agarwal, @andrewvr99: if you are manually creating WAV files, that approach probably will not be successful. Current support is designed for the above devices only.

tchristiani commented 4 years ago

This has been identified as a product issue, not a doc bug or enhancement. We want you to get an answer as quickly as possible, so please take a look at this document and post your issue to the relevant forum.

please-close

danieljlevine commented 2 years ago

Quoting @jhakulin's earlier reply, citing https://docs.microsoft.com/en-us/azure/cognitive-services/Speech-Service/how-to-use-conversation-transcription-service: "Currently, only these kits are supported for 8 channel audio capture: ROOBO Smart Audio Circular 7-Mic DK, Azure Kinect DK. ... If you are manually creating wav files, that approach probably will not be successful. Current support is designed for the above devices only."

Hi, just starting to play with this capability in April 2022. I am currently playing with the conversation transcription quickstart project in C#. I have used a sample .wav file called KateAndSteve.wav (found in the Java Speech SDK quickstart) with my "eastus" region and corresponding key, with success. So I actually get what I'd expect from transcription. Yay.

So now I want to move the quickstart to something more like what I'm trying to do using a microphone. I actually have an Azure Kinect sitting on my desk and connected to my system. I have verified that it works as the default microphone using the simple speech to text quickstart with microphone. However, using it as the default microphone for speech transcription fails with the transcription being cancelled immediately:

CANCELED: Reason=Error
CANCELED: ErrorCode=RuntimeError
CANCELED: ErrorDetails=Exception with an error code: 0x1b (SPXERR_RUNTIME_ERROR) SessionId: 0cba61aa7a08426b82d87e77bd9e03b3

Now I'm not actually using the Azure Kinect DK SDK at all. I plugged the USB into my computer. Is there something I need to do with the Azure Kinect DK SDK to enable this setup to just work out of the box?

The only changes I made to the C# conversation transcription quickstart were:

// Line 83 in Program.cs
config.SetProperty("DifferentiateGuestSpeakers", "true"); // We only have guests; not trying to identify who was actually speaking, just differentiate speakers.

// Line 88:
// using (var audioInput = AudioStreamReader.OpenWavFile(conversationWaveFile))
using (var audioInput = AudioConfig.FromDefaultMicrophoneInput())  // Use the microphone instead of a .wav file

atyshka commented 1 year ago

@danieljlevine I am having the same issue. Did you ever resolve it?