Closed danieljlevine closed 2 years ago
As a follow-up, I have the quickstart\csharp\dotnet\from-microphone\helloworld working in my environment. So perhaps we could talk in terms of this program as opposed to my much larger code base.
I added these lines after line 18 in Program.cs:
config.SetProperty("diarizationEnabled", "true");
config.SetProperty("wordLevelTimestampsEnabled", "true");
As far as I can tell, it had no effect. Perhaps I'm looking in the wrong place. It does hear my single sentence and does provide the correct text from my speech. I was hoping there might be something like Speaker1 in a field, since it doesn't know who I am.
Hi, diarization is not supported with SpeechRecognizer in the Speech SDK API, so the diarizationEnabled and wordLevelTimestampsEnabled properties have no effect there. With the DifferentiateGuestSpeakers property set to true, you can differentiate between unidentified speakers when using the ConversationTranscriber class.
Thanks, I suspected as much.
How would I change the transcription quickstart to work with a single default microphone? When I attempted to do that, I got runtime errors that I assumed were from not having the right audio parameters on the mic. (I don’t have the errors handy, but could post tomorrow.) For that matter, I couldn’t point to a sample .wav file either and make it work (probably for similar audio-requirement reasons), so I wouldn’t mind starting there.
Is there some way to have Azure take the audio I provide (mic or .wav) and turn it into compatible audio (bit rate, PCM, monaural, 7-channel, etc.)?
Also, is there a microphone one can buy that provides what is needed (i.e., implements the conversation transcription mic reference spec)?
I should add that I did get word-level timestamps enabled using the SDK’s function call. It’s a shame diarizationEnabled doesn’t have such an SDK function. However, there might be a good reason for that. Guessing maybe it’s because it would only remember the speakers within single audio snippets and then forget them. Perhaps that’s what ConversationTranscription works around?
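For reference, the SDK call I used for word-level timestamps looks roughly like this (a sketch against the from-microphone quickstart; the key and region strings are placeholders):

```csharp
using Microsoft.CognitiveServices.Speech;

// Placeholders; use your real subscription key and service region.
var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

// SDK helper for word-level timestamps; there is no equivalent
// helper for diarizationEnabled on SpeechRecognizer.
config.RequestWordLevelTimestamps();

// The per-word offsets then appear in the detailed JSON result, e.g. via
// result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult)
```
This is configuration wiring only; the recognition loop from the quickstart stays unchanged.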
Ok, so I tried to get the Speech Transcription quickstart working. Initially I was using a regular microphone headset, and I made a small change to get input from the default microphone instead of the 7-channel .wav file I don't have lying around.
My only code changes to the quickstart (other than adding my key and setting the zone to "eastus") were:
// Line 83 in Program.cs
config.SetProperty("DifferentiateGuestSpeakers", "true"); // We only have guests. Not looking to identify who it actually was speaking, just differentiate speakers.
// Line 88:
//using (var audioInput = AudioStreamReader.OpenWavFile(conversationWaveFile))
using (var audioInput = AudioConfig.FromDefaultMicrophoneInput()) // Use microphone instead of .wav file
This found its way to the canceled callback:
CANCELED: Reason=Error
CANCELED: ErrorCode=RuntimeError
CANCELED: ErrorDetails=Exception with an error code: 0x1b (SPXERR_RUNTIME_ERROR) SessionId: 0cba61aa7a08426b82d87e77bd9e03b3
I also managed to get my hands on an Azure Kinect, which has the 7-mic array built in. It also had the same problem.
So, perhaps I haven't come up with the right code to use the microphone yet in the quickstart?
I should also mention that I have successfully used both the standard headset mic and the Azure Kinect with the from-microphone quickstart. So the hardware seems to be working just fine with the SpeechRecognizer.
@danieljlevine Is your service region among the supported regions for conversation transcription (as of now: centralus, eastasia, eastus, westeurope)?
If yes, are you using the readily available conversation transcription quickstart project on GitHub, which uses files as input? Please start with this project as is and first verify you can get the default setup working, using only your subscription key and service region, then gradually modify it. (As it appears, example wav files are not included with the C# sample, but you can find them in the corresponding quickstart project for Java. The file katiesteve.wav there has the expected input format.)
We are considering adding support for single channel input audio in conversation transcription, potentially later this year but not confirmed yet.
Yes, I am using eastus and the corresponding key. Ok, I can use that sample .wav first.
Given that the Azure Kinect has the built-in 7-channel mic array, shouldn’t it just work once I get it coded right?
It depends on what the actual format of audio from Azure Kinect is, I'm not sure if the feature has been tested with that. The input to conversation transcriber should actually have 7+1 channels i.e. include a reference channel. The format of the example conversation wav file is
$ file katiesteve.wav
katiesteve.wav: RIFF (little-endian) data, WAVE audio, 8 channels 16000 Hz
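For reference, when feeding audio in that format through a stream instead of a file, the matching Speech SDK stream format would be declared along these lines (a sketch; only the format values come from the `file` output above):

```csharp
using Microsoft.CognitiveServices.Speech.Audio;

// 16 kHz sample rate, 16-bit samples, 8 interleaved channels
// (7 mics + 1 reference), matching katiesteve.wav.
var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 8);
var pushStream = AudioInputStream.CreatePushStream(format);
var audioInput = AudioConfig.FromStreamInput(pushStream);
```
The capture loop would then write raw interleaved PCM frames into `pushStream` via its `Write` method.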
Ok, I have verified that my setup is capable of doing the transcription using the katiesteve.wav file. The output eventually identified one speaker (Katie) as Guest_0. It probably doesn't identify Steve as Guest_1 because he doesn't talk enough. But this is fine and exactly what I'd like to start trying to do, but with a microphone.
Here's my output from katiesteve.wav:
Session started event. SessionId=32d530b0b7f34ce492a4fd3b10a634a8
TRANSCRIBING: Text=good morning SpeakerId=Unidentified
TRANSCRIBING: Text=good morning steve SpeakerId=Unidentified
TRANSCRIBED: Text=Good morning, Steve. SpeakerId=Unidentified
TRANSCRIBING: Text=good morning SpeakerId=Unidentified
TRANSCRIBING: Text=good morning kate SpeakerId=Unidentified
TRANSCRIBING: Text=good morning katie SpeakerId=Unidentified
TRANSCRIBED: Text=Good morning, Katie. SpeakerId=Unidentified
TRANSCRIBING: Text=have you SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard of SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about that SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new conversation SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new conversation transcription SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new conversation transcription K SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new conversation transcription SpeakerId=Unidentified
TRANSCRIBING: Text=have you heard about the new conversation transcription capability SpeakerId=Unidentified
TRANSCRIBED: Text=Have you heard about the new conversation transcription capability? SpeakerId=Guest_0
TRANSCRIBED: Text= SpeakerId=Unidentified
TRANSCRIBING: Text=no tell SpeakerId=Unidentified
TRANSCRIBING: Text=no tell me SpeakerId=Unidentified
TRANSCRIBING: Text=no tell me more SpeakerId=Unidentified
TRANSCRIBED: Text=No, tell me more. SpeakerId=Unidentified
TRANSCRIBING: Text=it's the SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and let SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and let's SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who said SpeakerId=Unidentified
TRANSCRIBING: Text=it's the new feature that transcribes our discussion and lets us know who said what SpeakerId=Unidentified
TRANSCRIBED: Text=It's the new feature that transcribes our discussion and lets us know who said what. SpeakerId=Guest_0
TRANSCRIBED: Text= SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going to SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going to give SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going to give this SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going to give this a SpeakerId=Unidentified
TRANSCRIBING: Text=that sounds interesting i'm going to give this a try SpeakerId=Unidentified
TRANSCRIBED: Text=That sounds interesting. SpeakerId=Unidentified
TRANSCRIBED: Text=I'm going to give this a try. SpeakerId=Unidentified
TRANSCRIBED: Text= SpeakerId=Unidentified
TRANSCRIBED: Text= SpeakerId=Unidentified
TRANSCRIBED: Text= SpeakerId=Unidentified
TRANSCRIBED: Text=Good morning, Steve. SpeakerId=Guest_0
CANCELED: Reason=EndOfStream
Session stopped event. SessionId=32d530b0b7f34ce492a4fd3b10a634a8
Stop recognition.
Please press <Return> to continue.
So since this worked, I guess this implies that the mics I'm using to test with (single standard mic and Azure Kinect) don't produce a suitable 8-channel, 16 kHz audio stream? Since that appears to be the case, how would one evolve this quickstart to support either mic? I've seen demos where multiple people appear to be speaking and the conversation is transcribed. Does .NET or Azure provide a way to massage the audio stream into the format needed? Is this done by setting something in AudioConfig?
In case of Kinect, it would require adding one (silent) channel in order to have 7+1 channels in input. Which means using AudioConfig.FromStreamInput with PullAudioInputStreamCallback, and implementing the callback so that audio is captured from the microphone and the "missing" channel added, then the data is passed to the SDK. Unfortunately we probably don't have a good C# example for that at the moment.
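Since no official C# example exists for this, here is a minimal sketch of the channel-padding part (the helper class and method names are ours, not from the SDK; it assumes interleaved 16-bit PCM from the capture side):

```csharp
using System;

// Hypothetical helper: pads interleaved 7-channel 16-bit PCM with a silent
// 8th (reference) channel so the result has the 7+1 channel layout the
// conversation transcriber expects.
public static class ChannelPadder
{
    public static byte[] PadWithSilentChannel(byte[] sevenChannelPcm)
    {
        const int bytesPerSample = 2;               // 16-bit PCM
        const int inFrame = 7 * bytesPerSample;     // one 7-channel frame
        const int outFrame = 8 * bytesPerSample;    // one 8-channel frame
        int frames = sevenChannelPcm.Length / inFrame;
        var padded = new byte[frames * outFrame];   // extra channel stays zeroed
        for (int i = 0; i < frames; i++)
        {
            // Copy the 7 captured samples of this frame; the 8th sample
            // is left at zero (silence).
            Buffer.BlockCopy(sevenChannelPcm, i * inFrame,
                             padded, i * outFrame, inFrame);
        }
        return padded;
    }
}
```
A `PullAudioInputStreamCallback.Read` override (or a push stream fed from the capture loop) would call this on each captured buffer before handing the data to the SDK, with the stream format declared as 8 channels.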
However, there is a way to try out conversation transcription with single channel input already, although the way the configuration is done is not final. I discussed this with our program manager and it is allowed to share the information now. (In any case, we recommend you use the latest Speech SDK 1.21.0 release if not already so.)
To try it, change the following in Program.cs
// Join to the conversation.
await conversationTranscriber.JoinConversationAsync(conversation);
// Starts transcribing of the conversation. Uses StopTranscribingAsync() to stop transcribing when all participants leave.
await conversationTranscriber.StartTranscribingAsync().ConfigureAwait(false);
to
// Join to the conversation.
await conversationTranscriber.JoinConversationAsync(conversation);
// Enable single-channel conversation audio
Connection connection = Connection.FromRecognizer(conversationTranscriber);
connection.SetMessageProperty("speech.config", "DisableReferenceChannel", "\"True\"");
connection.SetMessageProperty("speech.config", "MicSpec", "\"1_0_0\"");
// Starts transcribing of the conversation. Uses StopTranscribingAsync() to stop transcribing when all participants leave.
await conversationTranscriber.StartTranscribingAsync().ConfigureAwait(false);
i.e. create a Connection object and set properties as shown (note that the values must be written exactly as above).
This way the conversation audio can be single-channel, from a file or microphone. When the support is finalized there will be no need for the property settings, but we have no ETA for that yet.
I've attached katiesteve_mono.wav, which was downmixed from the 7+1 channel katiesteve.wav; please try with that first. The results should be similar.
katiesteve_mono.zip
Thanks for sharing! I’ll try that and report back. Then I might move on to mixing in the +1 channel.
Can I use the Azure Kinect with just the DisableReferenceChannel option and expect it to work?
What is the purpose of the reference channel if it’s just a blank channel? If for example there was music being played in the room, would I put that into the reference channel so that you wouldn’t try to transcribe that?
Also what happens if I don’t use this setting: config.SetProperty("ConversationTranscriptionInRoomAndOnline", "true"); What happens if I set it to false? Are there other options? My use case is in-room only.
I copied and pasted these lines verbatim into the sample application where you said:
// Enable single-channel conversation audio
Connection connection = Connection.FromRecognizer(conversationTranscriber);
connection.SetMessageProperty("speech.config", "DisableReferenceChannel", "\"True\"");
connection.SetMessageProperty("speech.config", "MicSpec", "\"1_0_0\"");
Then I went into Solution Explorer to see what Speech SDK version we were using and discovered we were using 1.20.0, so I upgraded to 1.21.0 as you recommended, which succeeded without issue.
Then I also changed the code to use katiesteve_mono.wav instead of the original katiesteve.wav.
It failed with:
Session started event. SessionId=944827301da848eebb1718acdf36cebf
CANCELED: Reason=Error
CANCELED: ErrorCode=RuntimeError
CANCELED: ErrorDetails=Exception with an error code: 0x1b (SPXERR_RUNTIME_ERROR) SessionId: 944827301da848eebb1718acdf36cebf
CANCELED: Did you update the subscription info?
Session stopped event. SessionId=944827301da848eebb1718acdf36cebf
Stop recognition.
Please press <Return> to continue.
I switched back to using katiesteve.wav and it now fails as well the same way. I commented out the 3 lines you gave me to enable the mono capability and it fails in the same way.
So it would seem that upgrading to 1.21.0 made things stop working for me. I'm using "eastus" and the associated key that used to work with 1.20.0. Should I be using a different service region and key for 1.21.0 to work for American English transcription?
This sounds odd, I personally tested it with 1.21.0 and centralus. So does it work if you switch back to the 1.20.0 release? (Hopefully you didn't misspell katiesteve_mono.wav in code.)
Ok, I tried early in the morning and must have had the microphone line still in. I switched back to using the .wav file input, and here's what I found:
Thanks, this is a step in the right direction!
So I put everything back to handle katiesteve_mono.wav and it ran successfully. Then I looked at changing it to use a mono mic instead. This is the change I made:
//using (var audioInput = AudioStreamReader.OpenWavFile(conversationWaveFile))
using (var audioInput = AudioConfig.FromDefaultMicrophoneInput())
This still gives me this error:
CANCELED: Reason=Error
CANCELED: ErrorCode=RuntimeError
CANCELED: ErrorDetails=Exception with an error code: 0x1b (SPXERR_RUNTIME_ERROR)
Am I not making the right change?
I also verified that I could use Voice Recorder and my mono-mic setup to create a .m4a file with me speaking. Then I converted it to a .wav file and transcribed it successfully.
So, now I just want to process straight from the mic.
@danieljlevine Could you please let us know which platform you have, Windows, Linux or Mac?
We’re currently using Windows 10, but I believe I’d also be targeting Linux, so portability is important as well. If I could get Mac to work, that would be fantastic too, but to start, Windows 10 and Linux would suffice, in that order of priority.
@danieljlevine First of all, the error code 0x1b (SPXERR_RUNTIME_ERROR) is not a clear error message, and I have created an internal work item to make it clearer. The error happens if, e.g., Microsoft.CognitiveServices.Speech.extension.mas.dll is not found.
I personally tried the https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/quickstart/csharp/dotnet/conversation-transcription quickstart and made the modifications. Transcription worked OK with the Jabra microphone I have. Are you using the Speech SDK 1.21.0 NuGet package? If it is installed correctly, Microsoft.CognitiveServices.Speech.extension.mas.dll should be found when running the application using the default microphone.
NOTE: Single-channel support for CTS is not yet officially supported; the information given here is experimental, and your input is valuable. Thanks
I agree, the error is not very helpful to figure out what's wrong. I believe I get the exact same error if my Azure credentials are wrong for this service.
Sounds like you did exactly what I was trying. I am using 1.21.0 and the magical 3 lines.
I'm using a different headset mic (iMicro SP-IM320). Now, I am using Remote Desktop to a VM where all this speech stuff is really running. But the speech recognition from-microphone quickstart works just fine with this setup. It plugs in via USB, and the online specs indicate its microphone is monaural.
Perhaps there's a way to get more diagnostics?
You changed this:
using (var audioInput = AudioStreamReader.OpenWavFile(conversationWaveFile))
to this:
using (var audioInput = AudioConfig.FromDefaultMicrophoneInput())
Right?
Just tried again. It still fails the same way for me.
Would adding a line like this provide useful information to you?
config.SetProperty(PropertyId.Speech_LogFilename, @"C:\Temp\SpeechSDK.log.txt");
Ok, today I installed everything locally on my Windows 10 desktop system.
Using the SDK as it comes (no changes), the mono mic and the Azure Kinect 7-channel mic seem to work out of the box. As a matter of fact, I moved over to my application using Conversation Transcription and it also worked. So something is interfering with the mic working via Remote Desktop to a VM for Conversation Transcription but not for Speech to Text.
So now I guess I need to figure out how to build my DotNet solution on Linux. Is that an easy thing to do?
@danieljlevine Thanks for the info. It would indeed be useful to see the log you proposed earlier, from when the failure happens. Could you still provide that?
Sure. I'll see if I can generate one for you.
Here is the log from the failed mono mic via Remote Desktop: SpeechSDK-mono-mic.log.txt.
Here is the log from the failed Azure Kinect (7-channel) mic via Remote Desktop: SpeechSDK-azure-kinect.log.txt. Note I commented out the "magical 3 lines", since the code works perfectly using this device without them on my desktop system.
It seems like the Azure Kinect's extra channels don't find their way to the VM. I recorded something on the VM with Audacity and it came through as 2 channels as well, so that's consistent.
To build a .NET solution locally on Linux, you should have a .NET Core project. Make a copy of e.g. https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/quickstart/csharp/dotnetcore/from-microphone to use as a base, then:
1. Install the .NET Core SDK (dotnet) for your Linux distribution as instructed in the readme of the above.
2. Copy AudioStreamReader.cs and Program.cs from the conversation-transcription solution, overwriting the default Program.cs.
3. Edit helloworld.csproj and add <PackageReference Include="Newtonsoft.Json" Version="12.0.3" /> just below the existing PackageReference line.
4. Build with: dotnet build helloworld/helloworld.csproj
5. Run with: dotnet helloworld/bin/Debug/netcoreapp3.1/helloworld.dll
Note that if you use wav files for input then by default they must reside in the same directory where you run the second command above.
Thanks, it turns out we have people who already know how to do this here. I would imagine they did something equivalent to this. So now the application runs on our Ubuntu 64-bit linux platform. We have verified that using the katiesteve.wav it does transcribe as expected. So everything but the mic has been checked out successfully.
However, using the Azure Kinect mic doesn't work yet. I am not using the magical 3 lines, since I didn't need them on Windows 10 with the Azure Kinect mic. arecord -L reports a number of devices we could use as mics, but none of the ones we tried seem to work. Have you ever tested an Azure Kinect with the conversation transcription demo on Linux? If so, what device did you specify? Perhaps we have been using the wrong one.
I have yet to get this working on Linux with a microphone. The katiesteve.wav is working. I have tried with and without the 3 magical monaural lines using the Azure Kinect and USB microphones without success. I always get the 0x1b runtime error. It would be great if this error was more meaningful. Then we could address the issue, as it appears to be the biggest stumbling block to getting this capability working. I'm not sure if the logs would provide more useful information.
I'm afraid we haven't tried using Azure Kinect with the Speech SDK on Linux. Does your USB microphone work with any other Speech SDK samples on Linux? Depending on your system, it could require using FromMicrophoneInput with specific parameters (see https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/1346) if the default capture device is not proper.
I tested the following with Ubuntu 20.04 LTS and Sennheiser PC 8 USB headset on Raspberry Pi 4 (ARM64):
$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 1: headset [Sennheiser USB headset], device 0: USB Audio [USB Audio]
Subdevices: 1/1
Subdevice #0: subdevice #0
$ dotnet helloworld/bin/Debug/netcoreapp3.1/helloworld.dll
Session started event. SessionId=f7bc27899aaf40d4b1739d43c6f30322
TRANSCRIBING: Text=testing conversation SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation through SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation thread SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with the SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with a normal SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription within normal SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with the normal SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with a normal SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with a normal head SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with a normal SpeakerId=Unidentified
TRANSCRIBING: Text=testing conversation transcription with a normal headset SpeakerId=Unidentified
TRANSCRIBED: Text=Testing conversation transcription with a normal headset. SpeakerId=Unidentified
I've attached the test project in test.zip, this was created based on what I wrote about .NET Core earlier. Please check if it matches what you have tried.
What micID value do you pass to FromMicrophoneInput()? I want to specify my micID rather than using the default one. Is conversation transcription extra picky about the microphone (compared to just speech to text)? It seems like it is (hence those 3 magical lines that tend to get mono mics working).
Is it hw:0,0, hw:CARD=0,DEVICE=0, hw:headset, or something else?
Do you require the 3 special mono lines to work with speech transcription?
Please run a basic speech recognition example https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/quickstart/csharp/dotnetcore/from-microphone and see if it works on your Linux system with default settings.
DisableReferenceChannel and MicSpec only need to be set when trying to use a conversation transcriber with mono input.
The mic ID for FromMicrophoneInput depends on what microphone devices your system detects (see documentation: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-select-audio-input-devices).
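On Linux, the device ID can be an ALSA device name; a hedged sketch (the device string below is an example taken from typical arecord -L output, not a universal value):

```csharp
using Microsoft.CognitiveServices.Speech.Audio;

// "plughw" asks ALSA to do rate/format conversion in software, which helps
// when the hardware cannot capture 16 kHz 16-bit mono natively.
var audioInput = AudioConfig.FromMicrophoneInput("plughw:CARD=Device,DEV=0");
```
The rest of the transcriber setup is unchanged; only the AudioConfig construction differs from the default-microphone case.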
Is your Linux machine local, not remote, not VM?
Can you post the full output of arecord -l and arecord -L?
Did you try the test project I attached and it did not work?
Ok, I've been doing a lot of rechecking where I stand because I was no longer really sure.
Is there some sort of output logs I can produce for you to help me get past this? My real target is the Ubuntu AMDx86 Linux.
Is your Linux machine local, not remote, not VM?
Local.
For testing on Linux:
1. Record with arecord -f S16_LE -c1 -r16000 -t wav test.wav (if a microphone is not found, try adding the device option, e.g. -D hw:USB,0) and play it back with aplay test.wav.
2. Post the output of arecord -l and arecord -L.
Ok, starting to collect this information for you.
Ok, I used a USB headset with: arecord -Dhw:CARD=Device,DEV=0 -f S16_LE -r16000 -t wav test.wav. It would have recorded, but spat out a warning: "Warning: rate is not accurate (requested = 16000Hz, got = 44100Hz) please, try the plug plugin". So I took its suggestion and successfully recorded using: arecord -Dplughw:CARD=Device,DEV=0 -f S16_LE -r16000 -t wav test.wav. It played back loud and clear.
So we have a working combination on Linux at the ALSA level. Now to build and test the quickstart you requested.
I got the samples and received this error when I tried to build with dotnet build helloworld.csproj:
Microsoft (R) Build Engine version 17.1.1+a02f73656 for .NET Copyright (C) Microsoft Corporation. All rights reserved.
Determining projects to restore... Restored /home/omni/levindj1/src/cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj (in 6.18 sec).
/usr/share/dotnet/sdk/6.0.202/Microsoft.Common.CurrentVersion.targets(1220,5): error MSB3644: The reference assemblies for .NETFramework,Version=v4.6.1 were not found. To resolve this, install the Developer Pack (SDK/Targeting Pack) for this framework version or retarget your application. You can download .NET Framework Developer Packs at https://aka.ms/msbuild/developerpacks
So I went here: https://aka.ms/msbuild/developerpacks. But I don't see Linux downloads there; ultimately I see .exe files, which I'm pretty sure won't work on Linux. So I went into my helloworld.csproj and added this (was directed to do so above):
<PackageReference Include="Newtonsoft.Json" Version="12.0.3" />
It didn't resolve the issue. So I added this:
Build now produces this output:
.../cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj(109,3): warning MSB4011: "/usr/share/dotnet/sdk/6.0.202/Microsoft.CSharp.targets" cannot be imported again. It was already imported at ".../cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj (108,3)". This is most likely a build authoring error. This subsequent import will be ignored.
/usr/share/dotnet/sdk/6.0.202/Microsoft.CSharp.CurrentVersion.targets(130,9): warning MSB3884: Could not find rule set file "MinimumRecommendedRules.ruleset". [.../src/cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj]
.../cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/Program.cs(8,17): error CS0234: The type or namespace name 'CognitiveServices' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?) [.../cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj]
Build FAILED.
So, I'm stuck at the moment.
Here's the output from arecord -l:
**** List of CAPTURE Hardware Devices ****
card 0: PCH [HDA Intel PCH], device 0: ALC892 Analog [ALC892 Analog]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 2: Device [USB PnP Sound Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
And here's the output from arecord -L:
default
    Playback/recording through the PulseAudio sound server
surround21
    2.1 Surround output to Front and Subwoofer speakers
surround40
    4.0 Surround output to Front and Rear speakers
surround41
    4.1 Surround output to Front, Rear and Subwoofer speakers
surround50
    5.0 Surround output to Front, Center and Rear speakers
surround51
    5.1 Surround output to Front, Center, Rear and Subwoofer speakers
surround71
    7.1 Surround output to Front, Center, Side, Rear and Woofer speakers
null
    Discard all samples (playback) or generate zero samples (capture)
samplerate
    Rate Converter Plugin Using Samplerate Library
speexrate
    Rate Converter Plugin Using Speex Resampler
jack
    JACK Audio Connection Kit
oss
    Open Sound System
pulse
    PulseAudio Sound Server
upmix
    Plugin for channel upmix (4,6,8)
vdownmix
    Plugin for channel downmix (stereo) with a simple spacialization
sysdefault:CARD=PCH
    HDA Intel PCH, ALC892 Analog
    Default Audio Device
front:CARD=PCH,DEV=0
    HDA Intel PCH, ALC892 Analog
    Front speakers
dmix:CARD=PCH,DEV=0
    HDA Intel PCH, ALC892 Analog
    Direct sample mixing device
dsnoop:CARD=PCH,DEV=0
    HDA Intel PCH, ALC892 Analog
    Direct sample snooping device
hw:CARD=PCH,DEV=0
    HDA Intel PCH, ALC892 Analog
    Direct hardware device without any conversions
plughw:CARD=PCH,DEV=0
    HDA Intel PCH, ALC892 Analog
    Hardware device with all software conversions
usbstream:CARD=PCH
    HDA Intel PCH
    USB Stream Output
usbstream:CARD=NVidia
    HDA NVidia
    USB Stream Output
sysdefault:CARD=Device
    USB PnP Sound Device, USB Audio
    Default Audio Device
front:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    Front speakers
surround21:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    2.1 Surround output to Front and Subwoofer speakers
surround40:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    4.0 Surround output to Front and Rear speakers
surround41:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    4.1 Surround output to Front, Rear and Subwoofer speakers
surround50:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    5.0 Surround output to Front, Center and Rear speakers
surround51:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    5.1 Surround output to Front, Center, Rear and Subwoofer speakers
surround71:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    7.1 Surround output to Front, Center, Side, Rear and Woofer speakers
iec958:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    IEC958 (S/PDIF) Digital Audio Output
dmix:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    Direct sample mixing device
dsnoop:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    Direct sample snooping device
hw:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    Direct hardware device without any conversions
plughw:CARD=Device,DEV=0
    USB PnP Sound Device, USB Audio
    Hardware device with all software conversions
usbstream:CARD=Device
    USB PnP Sound Device
    USB Stream Output
So, I'm trying to give you basically the same information you requested from running the SpeechRecognizer from-microphone demo...
We have a version of our application in our AMD x64 environment that works and is basically the SpeechRecognizer from-microphone demo. It works great with the USB microphone when I specify it as "plughw:CARD=Device,DEV=0". It also works great with the Azure Kinect using "plughw:CARD=Array,DEV=0". So I know which microphone device IDs to use with the SpeechRecognizer part of the Speech SDK.
However, if I take the code and basically transform it from the SpeechRecognizer capability to the Conversation Transcription capability and use these same microphone device IDs, it either occasionally hears things I don't actually say, hears nothing, or crashes with the 0x1b error. The katiesteve.wav file works, so I know I've implemented things pretty much right or that wouldn't get transcribed, and the Azure Kinect microphone works on Windows when I specify its microphone device ID. It's almost as if, on Linux, it's listening to the wrong channel or cancelling the voice channel somehow. So, I've been playing with the AudioProcessingOptions to see if I can get something to work differently (i.e., better). I'm thinking that perhaps ALSA is not providing all the information that the Speech SDK gets from Windows, so perhaps I need to coach it a little with these options.
But there definitely seems to be a difference in how SpeechRecognizer and ConversationTranscriber use the microphone.
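For reference, here's a minimal sketch of what I mean by the ConversationTranscriber setup, modeled on the transcription quickstart. The key, region, and device ID are placeholders, and the exact conversation/join API may differ slightly by SDK version:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;

class TranscriptionSketch
{
    static async Task TranscribeFromMicAsync()
    {
        // "YourKey" and "YourRegion" are placeholders for your subscription values.
        var config = SpeechConfig.FromSubscription("YourKey", "YourRegion");
        config.SetProperty("DifferentiateGuestSpeakers", "true");

        // Same ALSA device ID that works with SpeechRecognizer on Linux.
        var audioConfig = AudioConfig.FromMicrophoneInput("plughw:CARD=Array,DEV=0");

        var conversation = await Conversation.CreateConversationAsync(
            config, Guid.NewGuid().ToString());
        using var transcriber = new ConversationTranscriber(audioConfig);
        await transcriber.JoinConversationAsync(conversation);

        // UserId distinguishes unidentified speakers (e.g. "Guest_0", "Guest_1").
        transcriber.Transcribed += (s, e) =>
            Console.WriteLine($"{e.Result.UserId}: {e.Result.Text}");

        await transcriber.StartTranscribingAsync();
        Console.ReadLine(); // transcribe until Enter is pressed
        await transcriber.StopTranscribingAsync();
    }
}
```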
I played with the AudioProcessingOptions on Linux and got it to work with all my microphones. I can't quite explain why it now works, as the options I'm using don't make much sense, but somehow they cause it to work. I need to head home now, but I'll be able to explain what I'm doing later. I believe these options cause it to fail when I use them on Windows.
Great to hear you've got it working - please post details for analysis. I presume it is related to your system having multiple audio capture devices, so that the proper device needs to be specified in options as you did with arecord.
Your sample build failed because you tried to build a .NET Framework project, which is Windows only:
Restored /home/omni/levindj1/src/cognitive-services-speech-sdk-master/quickstart/csharp/dotnet/from-microphone/helloworld/helloworld.csproj (in 6.18 sec).
On Linux and other non-Windows systems you need to use .NET Core projects, like quickstart/csharp/dotnetcore/from-microphone.
Thanks for showing me the build issue. I may try that again with the dotnet core version later.
I can provide more details when I'm back at the terminal, but I'm still using the same microphone device IDs; it didn't crash, but it didn't seem to "hear" anything before.
Basically, I just provided AudioProcessingOptions to the process. I did some tests with all the audio processing disabled and with the defaults. I don't believe this mattered. Since I have an Azure Kinect, I specified the 7-mic circular configuration. I was able to use this configuration successfully with the Azure Kinect, another microphone array, and a mono USB mic, despite it being the wrong microphone configuration in the latter 2 cases. I also turned off the feedback channel, and sometimes left that parameter out. Leaving it out with audio processing off is probably the same as not having it on. I could swear that I had it on with default audio processing and the 7-channel circular geometry and it still worked for all 3 microphone configurations. I'll have to check. This is why I was saying I couldn't really explain why this did the trick on Linux.
I don’t believe this helped the Windows 10 version, and I believe it may have broken it.
Here was my test code. I made it so I could prepend a number and # to the microphone device ID, so our admins could build our environment once and I could then test a few different parameter sets with different microphones.
var audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT);
if ((microphoneDeviceId != null) && (microphoneDeviceId.Length >= 2)) // && here; || would throw on a null ID
{
if (microphoneDeviceId.Substring(0, 2) == "1#")
{
audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_NONE, PresetMicrophoneArrayGeometry.Circular7);
microphoneDeviceId = microphoneDeviceId.Substring(2);
}
else if (microphoneDeviceId.Substring(0, 2) == "2#")
{
audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_NONE, PresetMicrophoneArrayGeometry.Circular7, SpeakerReferenceChannel.None);
microphoneDeviceId = microphoneDeviceId.Substring(2);
}
else if (microphoneDeviceId.Substring(0, 2) == "3#")
{
audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT, PresetMicrophoneArrayGeometry.Circular7);
microphoneDeviceId = microphoneDeviceId.Substring(2);
}
else if (microphoneDeviceId.Substring(0, 2) == "4#")
{
audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT, PresetMicrophoneArrayGeometry.Circular7, SpeakerReferenceChannel.None);
microphoneDeviceId = microphoneDeviceId.Substring(2);
}
_logger.LogInformation($"Really using microphoneDeviceId: {microphoneDeviceId}");
audioConfig = AudioConfig.FromMicrophoneInput(microphoneDeviceId, audioProcessingOptions);
If I recall correctly, test 3 was the one that didn't work with the USB microphone on Linux. I believe it didn't crash, but it just didn't hear any speech. Looking at the options I'm using there: it turns on the default audio processing and sets the mic geometry to the Azure Kinect geometry, but since it omits SpeakerReferenceChannel.None, my guess is it's probably all about the speaker reference channel.
In case 1, it probably ignores the speaker reference channel because I told it not to do any audio processing. Case 2 is the same as case 1, but I told it I didn't have a speaker reference channel, which it probably didn't use anyway, since the processing is disabled. In case 3, the default processing is probably trying to use the speaker reference channel and it's not working out (especially since the USB mic only has one channel, so it might be ignoring the only channel with audio). In case 4, we are enabling default audio processing but removing the secondary audio channel, which probably disables the ability to do some of that processing. So perhaps Automatic Gain Control is done, but echo cancellation and such are not. I believe this is what I'll probably change the default to in our application until I figure out how this can be done more automatically.
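For clarity, here is the case-4 option set pulled out as the standalone default I'm planning to use (same SDK calls as in my test code above; the device ID is just an example, and my interpretation of what the options do is a guess):

```csharp
using Microsoft.CognitiveServices.Speech.Audio;

// Case 4: keep default audio processing enabled, but declare that there is no
// loopback/reference channel. My guess: echo cancellation then has nothing to
// (wrongly) subtract, while processing like gain control can still run.
var audioProcessingOptions = AudioProcessingOptions.Create(
    AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT,
    PresetMicrophoneArrayGeometry.Circular7,
    SpeakerReferenceChannel.None);

var audioConfig = AudioConfig.FromMicrophoneInput(
    "plughw:CARD=Device,DEV=0", audioProcessingOptions);
```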
@danieljlevine Hi, just to check back on this, is diarization now working in your application on target platforms? What are the final settings you used with the microphone(s), if different from what you posted previously? Please confirm.
Closed since no further updates received and it's understood a working solution was found. Please open a new issue if further support is needed.
Hi,
I have inherited some C# code that does a nice job of using Azure to convert words spoken into a microphone to text using the SpeechRecognizer class. It makes nice calls to what I believe are callbacks, recognizer.Recognizing(s, e) and recognizer.Recognized(s, e), to report back intermediate results and successful speech recognition. The parameter value e has useful information in both cases, like e.Result.Text and e.Result.Duration. I am interested in evolving this existing capability into one that also distinguishes different speakers (as opposed to identifying specific speakers): it's important to me to know when speaker1, speaker2, and speaker3 spoke and what they said, but I am not interested in identifying who speaker1, speaker2, and speaker3 are.
I was hoping that I could turn on diarizationEnabled and wordLevelTimestampsEnabled by setting them to true like this:
speechConfig.SetProperty("diarizationEnabled", "true");
speechConfig.SetProperty("wordLevelTimestampsEnabled", "true");
As far as I could tell, it didn't have any effect. I was trying to figure out how to change the result format from simple to detailed, in case that made a difference, but I wasn't able to figure out how to do that yet either.
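(For what it's worth, I think the switch I was looking for may be the OutputFormat property on SpeechConfig, along with RequestWordLevelTimestamps(); I haven't verified this yet, so treat it as a guess:)

```csharp
using Microsoft.CognitiveServices.Speech;

// "YourKey" and "YourRegion" are placeholders.
var speechConfig = SpeechConfig.FromSubscription("YourKey", "YourRegion");

// Switch from the default simple result format to the detailed (N-best) format,
// and ask for word-level timestamps in the detailed results.
speechConfig.OutputFormat = OutputFormat.Detailed;
speechConfig.RequestWordLevelTimestamps();

// The raw detailed JSON can then be read from a recognition result, e.g.:
// result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult)
```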
Suggestions?