Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Moving from Speech Recognition to Diarization #1470

Closed danieljlevine closed 2 years ago

danieljlevine commented 2 years ago

Hi,

I have inherited some C# code that does a nice job of using Azure to convert words spoken into a microphone to text using the SpeechRecognizer class. It makes calls to what I believe are callbacks, recognizer.Recognizing(s, e) and recognizer.Recognized(s, e), to report back intermediate results and successful speech recognition. The parameter value e has useful information in both cases, like e.Result.Text and e.Result.Duration. I am interested in evolving this existing capability into one that also distinguishes different speakers (as opposed to identifying specific speakers), so it's important to me to know when speaker1, speaker2, and speaker3 spoke and what they said, but I am not interested in identifying who speaker1, speaker2, and speaker3 are.

I was hoping that I could turn on diarizationEnabled and wordLevelTimestampsEnabled by setting them to true like this:

speechConfig.SetProperty("diarizationEnabled", "true");
speechConfig.SetProperty("wordLevelTimestampsEnabled", "true");

As far as I could tell, it didn't have any effect. I was also trying to figure out how to change the result format from simple to detailed, in case that made a difference, but I haven't been able to figure out how to do that yet either.

Suggestions?

danieljlevine commented 2 years ago

As a follow-up, I have the quickstart\csharp\dotnet\from-microphone\helloworld working in my environment. So perhaps we could talk in terms of this program as opposed to my much larger code base.

I added these lines after line 18 in Program.cs:

config.SetProperty("diarizationEnabled", "true");
config.SetProperty("wordLevelTimestampsEnabled", "true");

As far as I can tell, it had no effect. Perhaps I'm looking in the wrong place. It does hear my single sentence and does provide the correct text from my speech. I was hoping there might be something like Speaker1 in a field, since it doesn't know who I am.

pankopon commented 2 years ago

Hi, diarization is not supported with SpeechRecognizer in the Speech SDK API. The current options are:

danieljlevine commented 2 years ago

Thanks, I suspected as much.

How would I change the transcription quickstart to work with a single default microphone? When I attempted to do that, I got runtime errors that I assumed were from not having the right audio parameters on the mic. (I don’t have the errors handy, but could post tomorrow.) For that matter, I couldn’t point it at a sample .wav file either and make it work (probably for similar audio requirement reasons), so I wouldn’t mind starting there.

Is there some way to have Azure take the audio I provide (mic or .wav) and turn it into compatible audio (bit rate, PCM, monaural, 7-channel, etc.)?

Also, is there a microphone one can buy that provides what is needed (i.e., implements the conversation transcription mic reference spec)?

danieljlevine commented 2 years ago

I should add that I did get word-level timestamps enabled using the SDK’s function call. It’s a shame that diarizationEnabled doesn’t have such an SDK function. However, there might be a good reason for that; my guess is it’s because it would only remember the speakers within single audio snippets and then forget them. Perhaps that’s what ConversationTranscription works around?
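
For reference, the SDK call I mean is roughly this (quoting from memory, so the exact line in my code may differ slightly):

    // Enables word-level timestamps through the SpeechConfig API instead of the raw property string.
    config.RequestWordLevelTimestamps();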

danieljlevine commented 2 years ago

Ok, so I tried to get the conversation transcription quickstart working. I was initially using a regular microphone headset and made a small change to get input from the default microphone instead of the 7-channel .wav file I don't have lying around.

My only code changes to the quickstart (other than adding my key and setting the region to "eastus") were:

// Line 83 in Program.cs
config.SetProperty("DifferentiateGuestSpeakers", "true"); // We only have guests.  Not looking to identify who it actually was speaking, just differentiate speakers.

// Line 88:
//using (var audioInput = AudioStreamReader.OpenWavFile(conversationWaveFile))
using (var audioInput = AudioConfig.FromDefaultMicrophoneInput())  // Use microphone instead of .wav file

This found its way to the canceled callback:

CANCELED: Reason=Error
CANCELED: ErrorCode=RuntimeError
CANCELED: ErrorDetails=Exception with an error code: 0x1b (SPXERR_RUNTIME_ERROR) SessionId: 0cba61aa7a08426b82d87e77bd9e03b3

I also managed to get my hands on an Azure Kinect, which has the 7-mic array built in. It also had the same problem.

So, perhaps I haven't come up with the right code to use the microphone yet in the quickstart?

I should also mention that I have successfully used both the standard headset mic and the Azure Kinect with the from-microphone quickstart. So the hardware seems to be working just fine with the SpeechRecognizer.

pankopon commented 2 years ago

@danieljlevine Is your service region among the supported regions for conversation transcription (as of now: centralus, eastasia, eastus, westeurope)?

If yes, are you using the readily available conversation transcription quickstart project on GitHub, which uses files as input? Please start with this project as is and first verify you can get the default setup working using only your subscription key and service region, then gradually modify it. (As it appears, example wav files are not included with the C# sample, but you can find them in the corresponding quickstart project for Java. The file katiesteve.wav there has the expected input format.)

We are considering adding support for single channel input audio in conversation transcription, potentially later this year but not confirmed yet.

danieljlevine commented 2 years ago

Yes, I am using eastus and the corresponding key. Ok, I can try using that sample .wav first.

Given that the Azure Kinect has the 7-channel mic array built in, shouldn’t it just work once I get it coded right?

pankopon commented 2 years ago

It depends on what the actual format of audio from the Azure Kinect is; I'm not sure if the feature has been tested with that. The input to the conversation transcriber should actually have 7+1 channels, i.e. include a reference channel. The format of the example conversation wav file is

$ file katiesteve.wav
katiesteve.wav: RIFF (little-endian) data, WAVE audio, 8 channels 16000 Hz

danieljlevine commented 2 years ago

Ok, I have verified that my setup is capable of doing the transcription using the katiesteve.wav file. The output eventually identified one speaker (Katie) as Guest_0. It probably doesn't identify Steve as Guest_1 because he doesn't talk enough. But this is fine and exactly what I'd like to start trying to do, just with a microphone.

Here's my output from katiesteve.wav:

Session started event. SessionId=32d530b0b7f34ce492a4fd3b10a634a8
TRANSCRIBING: Text=good morning SpeakerId=Unidentified
TRANSCRIBING: Text=good morning steve SpeakerId=Unidentified
TRANSCRIBED: Text=Good morning, Steve. SpeakerId=Unidentified
TRANSCRIBING: Text=good morning SpeakerId=Unidentified
TRANSCRIBING: Text=good morning kate SpeakerId=Unidentified
TRANSCRIBING: Text=good morning katie SpeakerId=Unidentified
TRANSCRIBED: Text=Good morning, Katie. SpeakerId=Unidentified
TRANSCRIBING: Text=have you SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard of SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about that SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new conversation SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new conversation transcription SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new conversation transcription K SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new conversation transcription SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new conversation transcription capability SpeakerId=Unidentified
TRANSCRIBED: Text=Have you heard about the new conversation transcription capability? SpeakerId=Guest_0
TRANSCRIBED: Text= SpeakerId=Unidentified
TRANSCRIBING: Text=no tell SpeakerId=Unidentified
TRANSCRIBING: Text=no tell me SpeakerId=Unidentified
TRANSCRIBING: Text=no tell me more SpeakerId=Unidentified
TRANSCRIBED: Text=No, tell me more. SpeakerId=Unidentified
TRANSCRIBING: Text=it's the SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and let SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and let's SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who said SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who said what SpeakerId=Unidentified
TRANSCRIBED: Text=It's the new feature that transcribes our discussion and lets us know who said what. SpeakerId=Guest_0
TRANSCRIBED: Text= SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going to SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going to give SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going to give this SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going to give this a SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going to give this a try SpeakerId=Unidentified
TRANSCRIBED: Text=That sounds interesting. SpeakerId=Unidentified
TRANSCRIBED: Text=I'm going to give this a try. SpeakerId=Unidentified
TRANSCRIBED: Text= SpeakerId=Unidentified
TRANSCRIBED: Text= SpeakerId=Unidentified
TRANSCRIBED: Text= SpeakerId=Unidentified
TRANSCRIBED: Text=Good morning, Steve. SpeakerId=Guest_0
CANCELED: Reason=EndOfStream

Session stopped event. SessionId=32d530b0b7f34ce492a4fd3b10a634a8

Stop recognition.
Please press <Return> to continue.

So since this worked, I guess this implies that the mics I'm testing with (a single standard mic and the Azure Kinect) don't produce a suitable 8-channel, 16 kHz audio stream? Since that appears to be the case, how would one evolve this quickstart to support either mic? I've seen demos where multiple people appear to be speaking and the conversation is transcribed. Does .NET or Azure provide a way to massage the audio stream into the format needed? Is this done by setting something in AudioConfig?

pankopon commented 2 years ago

In the case of Kinect, it would require adding one (silent) channel in order to have 7+1 channels in the input. That means using AudioConfig.FromStreamInput with a PullAudioInputStreamCallback, and implementing the callback so that audio is captured from the microphone, the "missing" channel is added, and the data is then passed to the SDK. Unfortunately we probably don't have a good C# example for that at the moment.
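
A rough sketch of what such a callback could look like (this is not an official sample; the Stream used as the 7-channel capture source below is a placeholder for whatever API you use to read raw PCM from the device):

    using System;
    using System.IO;
    using Microsoft.CognitiveServices.Speech.Audio;

    // Pads 7-channel 16-bit PCM frames with a silent 8th (reference) channel.
    public class PaddedMicCallback : PullAudioInputStreamCallback
    {
        private const int BytesPerSample = 2;  // 16-bit PCM
        private const int InFrameBytes = 7 * BytesPerSample;
        private const int OutFrameBytes = 8 * BytesPerSample;
        private readonly Stream source;        // placeholder for the raw 7-channel capture

        public PaddedMicCallback(Stream source) { this.source = source; }

        public override int Read(byte[] dataBuffer, uint size)
        {
            int frames = (int)size / OutFrameBytes;
            var inBuffer = new byte[frames * InFrameBytes];
            int framesRead = source.Read(inBuffer, 0, inBuffer.Length) / InFrameBytes;

            Array.Clear(dataBuffer, 0, (int)size);  // the added reference channel stays silent
            for (int f = 0; f < framesRead; f++)
            {
                // Copy the 7 captured samples of each frame; the 8th sample remains zero.
                Buffer.BlockCopy(inBuffer, f * InFrameBytes, dataBuffer, f * OutFrameBytes, InFrameBytes);
            }
            return framesRead * OutFrameBytes;      // returning 0 signals end of stream
        }

        public override void Close() { source.Dispose(); }
    }

    // Usage: declare an 8-channel, 16 kHz, 16-bit format and feed the pull stream to AudioConfig:
    // var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 8);
    // var pullStream = AudioInputStream.CreatePullStream(new PaddedMicCallback(rawMicStream), format);
    // using (var audioInput = AudioConfig.FromStreamInput(pullStream)) { ... }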

However, there is already a way to try out conversation transcription with single-channel input, although the way the configuration is done is not final. I discussed this with our program manager and am allowed to share the information now. (In any case, we recommend you use the latest Speech SDK 1.21.0 release if you are not on it already.)

To try it, change the following in Program.cs

    // Join to the conversation.
    await conversationTranscriber.JoinConversationAsync(conversation);

    // Starts transcribing of the conversation. Uses StopTranscribingAsync() to stop transcribing when all participants leave.
    await conversationTranscriber.StartTranscribingAsync().ConfigureAwait(false);

to

    // Join to the conversation.
    await conversationTranscriber.JoinConversationAsync(conversation);

    // Enable single-channel conversation audio
    Connection connection = Connection.FromRecognizer(conversationTranscriber);
    connection.SetMessageProperty("speech.config", "DisableReferenceChannel", "\"True\"");
    connection.SetMessageProperty("speech.config", "MicSpec", "\"1_0_0\"");

    // Starts transcribing of the conversation. Uses StopTranscribingAsync() to stop transcribing when all participants leave.
    await conversationTranscriber.StartTranscribingAsync().ConfigureAwait(false);

i.e. create a Connection object and set properties as shown (note that the values must be written exactly as above).

This way the conversation audio can be single-channel, from a file or microphone. When the support is finalized there will be no need for the property settings, but we have no ETA for that yet.

I've attached katiesteve_mono.wav, which was downmixed from the 7+1 channel katiesteve.wav; please try with that first. The results should be similar. katiesteve_mono.zip

danieljlevine commented 2 years ago

Thanks for sharing! I’ll try that and report back. Then I might move on to mixing in the +1 channel.

danieljlevine commented 2 years ago

Can I use the Azure Kinect with just the DisableReferenceChannel option and expect it to work?

What is the purpose of the reference channel if it’s just a blank channel? If, for example, there was music playing in the room, would I put that into the reference channel so that you wouldn’t try to transcribe it?

Also, what happens if I don’t use this setting: config.SetProperty("ConversationTranscriptionInRoomAndOnline", "true");? What happens if I set it to false? Are there other options? My use case is in-room only.

danieljlevine commented 2 years ago

I copied and pasted these lines verbatim into the sample application where you said:

// Enable single-channel conversation audio
Connection connection = Connection.FromRecognizer(conversationTranscriber);
connection.SetMessageProperty("speech.config", "DisableReferenceChannel", "\"True\"");
connection.SetMessageProperty("speech.config", "MicSpec", "\"1_0_0\"");

Then I went into Solution Explorer to see what Speech SDK version we were using and discovered we were using 1.20.0, so I upgraded to 1.21.0 as you recommended, which succeeded without issue.

Then I also changed the code to now use katiesteve_mon.wav instead of the original katiesteve.wav.

It failed with:

Session started event. SessionId=944827301da848eebb1718acdf36cebf
CANCELED: Reason=Error
CANCELED: ErrorCode=RuntimeError
CANCELED: ErrorDetails=Exception with an error code: 0x1b (SPXERR_RUNTIME_ERROR) SessionId: 944827301da848eebb1718acdf36cebf
CANCELED: Did you update the subscription info?

Session stopped event. SessionId=944827301da848eebb1718acdf36cebf

Stop recognition.
Please press <Return> to continue.

I switched back to using katiesteve.wav and it now fails the same way as well. I commented out the 3 lines you gave me to enable the mono capability and it still fails in the same way.

So it would seem that upgrading to 1.21.0 made things stop working for me. I'm using "eastus" and the associated key that used to work with 1.20.0. Should I be using a different service region and key for 1.21.0 to work for American English transcription?

pankopon commented 2 years ago

This sounds odd; I personally tested it with 1.21.0 and centralus. So does it work if you switch back to the 1.20.0 release? (Hopefully you didn't misspell katiesteve_mono.wav in code.)

danieljlevine commented 2 years ago

Ok, I tried early in the morning and must have had the microphone line still in. I switched back to using the .wav file input and here's what I found:

  1. With the magical 3 lines and 1.21.0 installed, I was able to process katiesteve_mono.wav fine. (Yay!)
  2. With the magical 3 lines and 1.21.0 installed, I was able to process katiesteve.wav (Unexpected, but maybe it just uses the first stream and ignores the others).
  3. Without the magical 3 lines and 1.21.0 installed, I got an error message using katiesteve_mono.wav. (So those 3 lines are doing something good regarding monaural input).

Thanks, this is a step in the right direction!

So I put everything back to handle katiesteve_mono.wav and ran successfully. Then I looked at changing to use a mono mic instead. This is the change I made:

//using (var audioInput = AudioStreamReader.OpenWavFile(conversationWaveFile))
using (var audioInput = AudioConfig.FromDefaultMicrophoneInput())

This still gives me this error:

CANCELED: Reason=Error
CANCELED: ErrorCode=RuntimeError
CANCELED: ErrorDetails=Exception with an error code: 0x1b (SPXERR_RUNTIME_ERROR)

Am I not making the right change?

danieljlevine commented 2 years ago

I also verified that I could use Voice Recorder and my mono-mic setup to create a .m4a file of me speaking. Then I converted it to a .wav file and transcribed it successfully.

So, now I just want to process straight from the mic.

jhakulin commented 2 years ago

@danieljlevine Could you please let us know which platform you have, Windows, Linux or Mac?

danieljlevine commented 2 years ago

We’re currently using Windows 10, but I believe I’d also be targeting Linux, so portability is important as well. If I could get Mac to work, that would be fantastic too, but to start I believe Windows 10 and Linux would suffice, in that order of priority.

jhakulin commented 2 years ago

@danieljlevine First of all, the error code 0x1b (SPXERR_RUNTIME_ERROR) is not a clear error message, and I have created an internal work item to make it clearer. The error happens if, e.g., Microsoft.CognitiveServices.Speech.extension.mas.dll is not found.

I personally tried the https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/quickstart/csharp/dotnet/conversation-transcription quickstart and made the following modifications:

  1. use audioInput = AudioConfig.FromDefaultMicrophoneInput() instead of stream input
  2. configure the single-channel conversation (earlier mentioned 3 lines)

Transcription worked OK with the Jabra microphone I have. Are you using the Speech SDK 1.21.0 NuGet package? If it is installed correctly, Microsoft.CognitiveServices.Speech.extension.mas.dll should be found when running the application with the default microphone.

NOTE: Single-channel support for CTS is not yet officially supported, and the information given here is experimental, so your input is valuable. Thanks

danieljlevine commented 2 years ago

I agree, the error is not very helpful to figure out what's wrong. I believe I get the exact same error if my Azure credentials are wrong for this service.

Sounds like you did exactly what I was trying. I am using 1.21.0 and the magical 3 lines.

I'm using a different headset mic (iMicro SP-IM320). Also, I am using a remote desktop connection to a VM where all this speech stuff is actually running. But the speech recognition from-microphone quickstart works just fine with this setup. The mic plugs in via USB, and from what I can find online it is monaural.

danieljlevine commented 2 years ago

Perhaps there's a way to get more diagnostics?

danieljlevine commented 2 years ago

You changed this:

using (var audioInput = AudioStreamReader.OpenWavFile(conversationWaveFile))

to this:

using (var audioInput = AudioConfig.FromDefaultMicrophoneInput())

Right?

danieljlevine commented 2 years ago

Just tried again. It still fails the same way for me.

danieljlevine commented 2 years ago

Would adding a line like this provide useful information to you?

config.SetProperty(PropertyId.Speech_LogFilename, @"C:\Temp\SpeechSDK.log.txt");

danieljlevine commented 2 years ago

Ok, today I installed everything locally on my Windows 10 desktop system.

Using the SDK as it comes (no changes), the mono mic and the Azure Kinect 7-channel mic seem to work out of the box. As a matter of fact, I moved over to my application using Conversation Transcription and it also worked. So something is interfering with the mic when working via Remote Desktop to a VM for Conversation Transcription, but not for Speech to Text.

So now I guess I need to figure out how to build my .NET solution on Linux. Is that an easy thing to do?

jhakulin commented 2 years ago

@danieljlevine Thanks for the info. It would indeed be useful to see the log you proposed earlier, from when the failure happens. Could you still provide that?

danieljlevine commented 2 years ago

Sure. I'll see if I can generate one for you.

danieljlevine commented 2 years ago

Here is the log from the failed mono mic via remote desktop: SpeechSDK-mono-mic.log.txt

danieljlevine commented 2 years ago

Here is the log from the failed Azure Kinect (7-channel) mic via remote desktop. Note that I commented out the "magical 3 lines", since the code works perfectly with this device without them on my desktop system: SpeechSDK-azure-kinect.log.txt

danieljlevine commented 2 years ago

It seems like the Azure Kinect's extra channels don't find their way to the VM. I recorded something on the VM with Audacity and it came through as 2 channels as well, so that's consistent.

pankopon commented 2 years ago

To build a .NET solution locally on Linux, you should have a .NET Core project. Make a copy of e.g. https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/quickstart/csharp/dotnetcore/from-microphone to use it as a base, and

  1. Install the .NET Core CLI (dotnet) for your Linux distribution as instructed in the readme of the above
  2. Copy AudioStreamReader.cs and Program.cs from the conversation-transcription solution, overwrite the default Program.cs
  3. Edit helloworld.csproj and add <PackageReference Include="Newtonsoft.Json" Version="12.0.3" /> just below the existing PackageReference line (see the csproj sketch after this list)
  4. Build and run using the .NET Core CLI as per instructions in the readme:
    dotnet build helloworld/helloworld.csproj
    dotnet helloworld/bin/Debug/netcoreapp3.1/helloworld.dll

    Note that if you use wav files for input then by default they must reside in the same directory where you run the second command above.
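
For reference, a sketch of roughly what the package references in helloworld.csproj could look like after step 3 (the Speech SDK version shown is only illustrative; keep whatever the quickstart already specifies):

    <ItemGroup>
      <PackageReference Include="Microsoft.CognitiveServices.Speech" Version="1.21.0" />
      <PackageReference Include="Newtonsoft.Json" Version="12.0.3" />
    </ItemGroup>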

danieljlevine commented 2 years ago

Thanks, it turns out we have people here who already know how to do this. I imagine they did something equivalent to what you described. So now the application runs on our 64-bit Ubuntu Linux platform. We have verified that with katiesteve.wav it transcribes as expected. So everything but the mic has been checked out successfully.

However, using the Azure Kinect mic doesn't work yet. I am not using the magical 3 lines, since I didn't need them on Windows 10 with the Azure Kinect mic. arecord -L reports a number of devices we could use as mics, but none of the ones we tried seem to work. Have you ever tested an Azure Kinect with the conversation transcription demo on Linux? If so, what device did you specify? Perhaps we have been using the wrong one.

danieljlevine commented 2 years ago

I have yet to get this working on Linux with a microphone. The katiesteve.wav file works. I have tried with and without the 3 magical monaural lines, using the Azure Kinect and USB microphones, without success. I always get the 0x1b runtime error. It would be great if this error were more meaningful; then we could address the issue, as it appears to be the biggest stumbling block to getting this capability working. I'm not sure if the logs would provide more useful information.

pankopon commented 2 years ago

I'm afraid we haven't tried using Azure Kinect with the Speech SDK on Linux. Does your USB microphone work with any other Speech SDK samples on Linux? Depending on your system, it could require using FromMicrophoneInput with specific parameters (see https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/1346) if the default capture device is not the right one.

I tested the following with Ubuntu 20.04 LTS and Sennheiser PC 8 USB headset on Raspberry Pi 4 (ARM64):

$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 1: headset [Sennheiser USB headset], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

$ dotnet helloworld/bin/Debug/netcoreapp3.1/helloworld.dll
Session started event. SessionId=f7bc27899aaf40d4b1739d43c6f30322
TRANSCRIBING: Text=testing conversation SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation through SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation thread SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with the SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with a normal SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription within normal SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with the normal SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with a normal SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with a normal head SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with a normal SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with a normal headset SpeakerId=Unidentified
TRANSCRIBED: Text=Testing conversation transcription with a normal headset. SpeakerId=Unidentified

I've attached the test project in test.zip; it was created based on what I wrote earlier about .NET Core. Please check if it matches what you have tried.

danieljlevine commented 2 years ago

What micID value do you pass to FromMicrophoneInput()? I want to specify my micID rather than using the default one. Is conversation transcription extra picky about the microphone (compared to plain speech to text)? It seems like it is (hence those 3 magical lines that tend to get mono mics working).

hw:0,0
hw:CARD=0,DEVICE=0
hw:headset

Something else?

Do you require the 3 special mono lines to work with speech transcription?

pankopon commented 2 years ago

Please run a basic speech recognition example https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/quickstart/csharp/dotnetcore/from-microphone and see if it works on your Linux system with default settings.

DisableReferenceChannel and MicSpec only need to be set when trying to use a conversation transcriber with mono input.

The mic ID for FromMicrophoneInput depends on what microphone devices your system detects (see documentation https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-select-audio-input-devices).
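
For example, selecting a specific ALSA capture device on Linux could look like the following (the device name here is only an illustration; substitute one reported by your own system):

    // Select a specific capture device instead of the system default.
    // "plughw:CARD=Device,DEV=0" is an example ALSA device name.
    using (var audioInput = AudioConfig.FromMicrophoneInput("plughw:CARD=Device,DEV=0"))
    {
        // ... create the conversation transcriber with this audioInput as usual
    }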

Is your Linux machine local, not remote, not VM? Can you post the full output of arecord -l and arecord -L? Did you try the test project I attached and it did not work?

danieljlevine commented 2 years ago

Ok, I've been doing a lot of rechecking where I stand because I was no longer really sure.

  1. My application currently runs perfectly on Windows 10 and Linux using katiesteve.wav (multiple channel file)
  2. My application currently runs perfectly on Windows 10 using my Azure Kinect device's microphone array. It works when set as the default mic in properties and when I specify the device ID for this mic from the Control Panel. I have tried both with and without the magical 3 lines with no effect. This configuration just works.
  3. On Windows, the internal laptop microphone array does not crash with the 0x1B error, but it doesn't seem to hear any input, almost like it's listening to the "wrong" channel or the wrong microphone. I have set the default mic to this device specifically (and removed all other mics from the system), and I have also used the device ID for this mic from the Control Panel. I have tried both with and without the magical 3 lines, with no effect.
  4. No microphones seem to work on Linux. I have tried a number of possibilities from arecord -L. They either give me the 0x1b runtime error (I believe these are the hw: options, and I suspect it's because they produce a bit rate of 32 kbps) or act like the Windows platform and don't seem to "hear" anything (the plughw: devices seem to act this way).

Is there some sort of output log I can produce for you to help me get past this? My real target is Ubuntu Linux on AMD x86-64.

danieljlevine commented 2 years ago

Is your Linux machine local, not remote, not VM?

Local.

pankopon commented 2 years ago

For testing on Linux:

  1. Do not try to use Azure Kinect. Use an internal microphone, if available and verified to work (see the next steps), or a USB (headset) microphone. I've personally used a basic USB headset without problems.
  2. Verify that this microphone works for recording that's not Speech SDK specific, using e.g. Audacity or arecord, in 16kHz 16-bit mono PCM. For example: arecord -f S16_LE -c1 -r16000 -t wav test.wav to record (if a microphone is not found, try adding the device option e.g. -D hw:USB,0) and aplay test.wav to play.
  3. After the above, build and run https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/quickstart/csharp/dotnetcore/from-microphone. Does it work? This is the simplest speech recognition example (in C#); there is no point trying more complex cases on the system before this can be run successfully.
  4. Finally, please post the full unmodified output of both arecord -l and arecord -L.

danieljlevine commented 2 years ago

Ok, starting to collect this information for you.

danieljlevine commented 2 years ago

Ok, I used a USB headset with:

arecord -Dhw:CARD=Device,DEV=0 -f S16_LE -r16000 -t wav test.wav

It would have recorded, but it spat out a warning saying:

Warning: rate is not accurate (requested = 16000Hz, got = 44100Hz) please, try the plug plugin

So I took its suggestion and successfully recorded using:

arecord -Dplughw:CARD=Device,DEV=0 -f S16_LE -r16000 -t wav test.wav

It played back loud and clear.

So we have a working combination on Linux at the ALSA level. Now to build and test the quickstart you requested.

danieljlevine commented 2 years ago

I got the samples and received this error when I tried to build with: dotnet build helloworld.csproj

Microsoft (R) Build Engine version 17.1.1+a02f73656 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.

Determining projects to restore...
Restored /home/omni/levindj1/src/cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj (in 6.18 sec).
/usr/share/dotnet/sdk/6.0.202/Microsoft.Common.CurrentVersion.targets(1220,5): error MSB3644: The reference assemblies for .NETFramework,Version=v4.6.1 were not found. To resolve this, install the Developer Pack (SDK/Targeting Pack) for this framework version or retarget your application. You can download .NET Framework Developer Packs at https://aka.ms/msbuild/developerpacks

So I went here: https://aka.ms/msbuild/developerpacks. But I don't see anything for Linux to download; ultimately I see .exe files, which I'm pretty sure won't work on Linux. So I went into my helloworld.csproj and added this (as directed above):

<PackageReference Include="Newtonsoft.Json" Version="12.0.3" />

It didn't resolve the issue. So I added this:

<PackageReference Include="Newtonsoft.Json" Version="12.0.3">
  <IncludeAssets>runtime; build; native; contentfiles; analyzers</IncludeAssets>
  <PrivateAssets>all</PrivateAssets>
</PackageReference>

Build now produces this output:

.../cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj(109,3): warning MSB4011: "/usr/share/dotnet/sdk/6.0.202/Microsoft.CSharp.targets" cannot be imported again. It was already imported at ".../cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj (108,3)". This is most likely a build authoring error. This subsequent import will be ignored.
/usr/share/dotnet/sdk/6.0.202/Microsoft.CSharp.CurrentVersion.targets(130,9): warning MSB3884: Could not find rule set file "MinimumRecommendedRules.ruleset". [.../src/cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj]
.../cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/Program.cs(8,17): error CS0234: The type or namespace name 'CognitiveServices' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?) [.../cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj]

Build FAILED.

So, I'm stuck at the moment.

Here's the output from arecord -l:

**** List of CAPTURE Hardware Devices ****
card 0: PCH [HDA Intel PCH], device 0: ALC892 Analog [ALC892 Analog]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 2: Device [USB PnP Sound Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

And here's the output from arecord -L:

default
    Playback/recording through the PulseAudio sound server
surround21
    2.1 Surround output to Front and Subwoofer speakers
surround40
    4.0 Surround output to Front and Rear speakers
surround41
    4.1 Surround output to Front, Rear and Subwoofer speakers
surround50
    5.0 Surround output to Front, Center and Rear speakers
surround51
    5.1 Surround output to Front, Center, Rear and Subwoofer speakers
surround71
    7.1 Surround output to Front, Center, Side, Rear and Woofer speakers
null
    Discard all samples (playback) or generate zero samples (capture)
samplerate
    Rate Converter Plugin Using Samplerate Library
speexrate
    Rate Converter Plugin Using Speex Resampler
jack
    JACK Audio Connection Kit
oss
    Open Sound System
pulse
    PulseAudio Sound Server
upmix
    Plugin for channel upmix (4,6,8)
vdownmix
    Plugin for channel downmix (stereo) with a simple spacialization
sysdefault:CARD=PCH
    HDA Intel PCH, ALC892 Analog
    Default Audio Device
front:CARD=PCH,DEV=0
    HDA Intel PCH, ALC892 Analog
    Front speakers
dmix:CARD=PCH,DEV=0
    HDA Intel PCH, ALC892 Analog
    Direct sample mixing device
dsnoop:CARD=PCH,DEV=0
    HDA Intel PCH, ALC892 Analog
    Direct sample snooping device
hw:CARD=PCH,DEV=0
    HDA Intel PCH, ALC892 Analog
    Direct hardware device without any conversions
plughw:CARD=PCH,DEV=0
    HDA Intel PCH, ALC892 Analog
    Hardware device with all software conversions
usbstream:CARD=PCH
    HDA Intel PCH
    USB Stream Output
usbstream:CARD=NVidia
    HDA NVidia
    USB Stream Output
sysdefault:CARD=Device
    USB PnP Sound Device, USB Audio
    Default Audio Device
front:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    Front speakers
surround21:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    2.1 Surround output to Front and Subwoofer speakers
surround40:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    4.0 Surround output to Front and Rear speakers
surround41:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    4.1 Surround output to Front, Rear and Subwoofer speakers
surround50:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    5.0 Surround output to Front, Center and Rear speakers
surround51:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    5.1 Surround output to Front, Center, Rear and Subwoofer speakers
surround71:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    7.1 Surround output to Front, Center, Side, Rear and Woofer speakers
iec958:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    IEC958 (S/PDIF) Digital Audio Output
dmix:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    Direct sample mixing device
dsnoop:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    Direct sample snooping device
hw:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    Direct hardware device without any conversions
plughw:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    Hardware device with all software conversions
usbstream:CARD=Device
    USB PnP Sound Device
    USB Stream Output

danieljlevine commented 2 years ago

So, I'm trying to get you basically the same information you requested from running the SpeechRecognizer from-microphone demo...

We have a version of our application in our AMD x64 environment that works and is basically the SpeechRecognizer from-microphone demo. It works great with the USB microphone when I specify it with "plughw:CARD=Device,DEV=0". It also works great with the Azure Kinect using "plughw:CARD=Array,DEV=0". So I know which microphone device IDs to use with the SpeechRecognizer part of the Speech SDK.

However, if I take the code and basically transform it from the SpeechRecognizer capability to the Conversation Transcription capability and use these same microphone device IDs, it either occasionally hears things I don't actually say, hears nothing, or crashes with the 0x1b error. The katiesteve.wav file works, so I know I've implemented things pretty much right, or that wouldn't get transcribed; and the Azure Kinect microphone works on Windows when I specify its microphone device ID. It's almost like it's listening to the wrong channel or cancelling the voice channel on Linux somehow. So I've been playing with the AudioProcessingOptions to see if I can get something to work differently (i.e., better). I'm thinking that perhaps ALSA is not providing all the information that the Speech SDK gets from Windows, so perhaps I need to coach it a little with these options.

But there definitely seems to be a difference in how SpeechRecognizer and ConversationTranscriber use the microphone.

danieljlevine commented 2 years ago

I played with the AudioProcessingOptions on Linux and got it to work with all my microphones. I can't quite explain why it now works, as the options I'm using don't make much sense, but somehow they cause it to work. I need to head home now, but will be able to explain what I'm doing later. I believe these options cause it to fail when I use them on Windows.

pankopon commented 2 years ago

Great to hear you've got it working - please post details for analysis. I presume it is related to your system having multiple audio capture devices so that the proper device needs to be specified in options as you did with arecord.

Your sample build failed because you tried to build a .NET Framework project which is Windows only:

Restored /home/omni/levindj1/src/cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj (in 6.18 sec).

On Linux and other non-Windows systems you need to use .NET Core projects like quickstart/csharp/dotnetcore/from-microphone.

danieljlevine commented 2 years ago

Thanks for showing me the build issue. I may try that again with the dotnet core version later.

I can provide more details when I’m back at the terminal, but I’m still using the same microphone device IDs that didn’t crash but didn’t seem to “hear” anything.

Basically, I just provided AudioProcessingOptions to the process. I did some tests with all the audio processing disabled and with the defaults; I don’t believe this mattered. Since I have an Azure Kinect, I specified the 7-mic circular configuration. I was able to use this configuration with the Azure Kinect, another microphone array, and a mono USB mic successfully, despite it being the wrong microphone configuration in the latter two cases. I also turned off the feedback channel and sometimes left that parameter out. Leaving it out with audio processing off is probably like not having it on. I could swear that I had it on with default audio processing and the 7-channel circular geometry and it still worked for all 3 microphone configurations; I’ll have to check. This is why I was saying I couldn’t really explain why this did the trick on Linux.

I don’t believe this helped the Windows 10 version, and I believe it may have broken it.

danieljlevine commented 2 years ago

Here is my test code. I made it so I could prepend a number and # to the microphone ID, so our admins could build our environment once and I could then test a few different parameter sets with different microphones.

var audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT);

// A leading "N#" on the device ID selects one of four AudioProcessingOptions test configurations.
if ((microphoneDeviceId != null) && (microphoneDeviceId.Length >= 2))
{
    if (microphoneDeviceId.Substring(0, 2) == "1#")
    {
        audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_NONE, PresetMicrophoneArrayGeometry.Circular7);
        microphoneDeviceId = microphoneDeviceId.Substring(2);
    }
    else if (microphoneDeviceId.Substring(0, 2) == "2#")
    {
        audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_NONE, PresetMicrophoneArrayGeometry.Circular7, SpeakerReferenceChannel.None);
        microphoneDeviceId = microphoneDeviceId.Substring(2);
    }
    else if (microphoneDeviceId.Substring(0, 2) == "3#")
    {
        audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT, PresetMicrophoneArrayGeometry.Circular7);
        microphoneDeviceId = microphoneDeviceId.Substring(2);
    }
    else if (microphoneDeviceId.Substring(0, 2) == "4#")
    {
        audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT, PresetMicrophoneArrayGeometry.Circular7, SpeakerReferenceChannel.None);
        microphoneDeviceId = microphoneDeviceId.Substring(2);
    }
    _logger.LogInformation($"Really using microphoneDeviceId: {microphoneDeviceId}");
    audioConfig = AudioConfig.FromMicrophoneInput(microphoneDeviceId, audioProcessingOptions);
}

If I recall correctly, test 3 was the one that didn't work with the USB microphone on Linux. I believe it didn't crash, but it just didn't hear any speech. Looking at the options I'm using there, it turns on the default audio processing and sets the mic geometry to the Azure Kinect geometry, and since it omits SpeakerReferenceChannel.None, my guess is it's probably all about the speaker reference channel.

In case 1, it probably ignores the speaker reference channel because I told it not to do any audio processing. Case 2 is the same as case 1, but I told it I didn't have a speaker reference channel, which it probably didn't use anyway, since processing is disabled. In case 3, it's probably trying to use the speaker reference channel in the default processing and that's not working out (especially since the USB mic only has one channel, so it might be ignoring the only channel with audio). In case 4, we are enabling default audio processing but removing the secondary audio channel, which probably disables some of that processing. So perhaps automatic gain control is done, but echo cancellation and such are not. I believe this is what I'll change the default to in our application until I figure out how this can be handled more automatically.

pankopon commented 2 years ago

@danieljlevine Hi, just to check back on this, is diarization now working in your application on target platforms? What are the final settings you used with the microphone(s), if different from what you posted previously? Please confirm.

pankopon commented 2 years ago

Closed since no further updates received and it's understood a working solution was found. Please open a new issue if further support is needed.