Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

SpeakerRecognition fails to RecognizeOnceAsync with Enrolled VoiceProfile 0x5 (SPXERR_INVALID_ARG) #1096

Closed: roninstar closed this issue 3 years ago

roninstar commented 3 years ago

Could someone please help me? I am getting this exact same error in version 1.16 on an ARMv7 Android 8.1 device with SpeakerRecognition, using exactly the example that comes with it. I can enroll my speaker, but whenever I run verification with the same VoiceProfile that was just enrolled, it fails with error 0x5 (SPXERR_INVALID_ARG). I believe this was working in older versions (1.12), but now it is not. It fails on RecognizeOnceAsync. The same stream works fine for every other Speech function, like STT.

    byte channels = 1;
    byte bitsPerSample = 16;
    uint samplesPerSecond = 16000;
    var audioStreamFormat = AudioStreamFormat.GetWaveFormatPCM(samplesPerSecond, bitsPerSample, channels);

    _pushAudioInputStream = new PushAudioInputStream(audioStreamFormat);
    var config = SpeechConfig.FromSubscription(AzureSpeechResourceKey, AzureRegionWestUs);
    using (var audioInput = AudioConfig.FromStreamInput(_pushAudioInputStream))
    {
        var speakerRecognizer = new SpeakerRecognizer(config, audioInput);

        var model = SpeakerVerificationModel.FromProfile(profile);

        // "Speak the passphrase to verify: My voice is my passport, please verify me."

        var result = await speakerRecognizer.RecognizeOnceAsync(model);   // <-- throws 0x5 (SPXERR_INVALID_ARG) here
    }

To Reproduce: Take the sample code for enrolling a speaker using a PushStream that streams audio in Android minimum-buffer-size chunks, and enroll a user with a VoiceProfile of type TextIndependentVerification. Enroll the user from a WAV file. Once enrollment is successful, verify the speaker using the same VoiceProfile. It will immediately fail on RecognizeOnceAsync.

Expected behavior: It should have returned a result with a score, indicating recognized or not.

Version of the Cognitive Services Speech SDK: 1.16

Platform, Operating System, and Programming Language: Xamarin Android (native)
Development Environment: Windows 10, Visual Studio Community 2019 16.9.2, Xamarin Android SDK 11.2.2.1, Xamarin 16.9.000.273

Rooted Android device, Android 8.1; audio is PCM 16-bit, 16000 Hz, mono

Additional context: speechlog.txt

glharper commented 3 years ago

@roninstar Thanks for including the log! It looks like something about the SpeakerVerificationModel is causing the RecognizeOnceAsync call to throw. Could you include more of your code around how the profile variable gets created and updated?

roninstar commented 3 years ago

Sure. I am capturing all the spoken audio to a PCM 16-bit WAV file and feeding that WAV file into the code below. Then I verify the enrolled profile by passing the created profile into the speaker verification code above. The audio itself is clear, but there are gaps of silence between phrases because the user is prompted to speak a handful of different sentences, adding up to about 30 seconds of audio.

    _config = SpeechConfig.FromSubscription(CognitiveServicesSettings.AzureSpeechResourceKey, CognitiveServicesSettings.AzureRegionWestUs);
    using (var client = new VoiceProfileClient(_config))
    using (var profile = await client.CreateProfileAsync(VoiceProfileType.TextIndependentVerification, "en-us"))
    {
        using (_audioInput = AudioConfig.FromWavFileInput(wavFilePath))
        {
            VoiceProfileEnrollmentResult result = null;
            while (result is null || result.RemainingEnrollmentsSpeechLength > TimeSpan.Zero)
            {
                result = await client.EnrollProfileAsync(profile, _audioInput);
                Console.WriteLine("Remaining enrollment audio time needed: " + result?.RemainingEnrollmentsSpeechLength);
            }
            if (result.Reason == ResultReason.EnrolledVoiceProfile)
            {
                Console.WriteLine("VoiceProfile is Enrolled " + profile.Id);
                VerifyVoiceTraining(profile, profileMapping);
            }
            else if (result.Reason == ResultReason.Canceled)
            {
                var cancellation = VoiceProfileEnrollmentCancellationDetails.FromResult(result);
                Console.WriteLine($"CANCELED {profile.Id}: ErrorCode={cancellation.ErrorCode} ErrorDetails={cancellation.ErrorDetails}");
            }
        }
    }

glharper commented 3 years ago

@roninstar Thanks for the response. Could you also share the .wav files you're using to enroll and then verify?

roninstar commented 3 years ago

Sure, thanks. I also started creating larger files to see if more spoken speech helped, and it still throws the same error. voicetraining_audio.zip

glharper commented 3 years ago

@roninstar When verifying with that audio file, I'm getting a Canceled result: "ErrorDetails= Invalid audio length. Maximum allowed length is 10 seconds. Bad request (400). Please verify the provided subscription details and language information." For training and verification files there is a maximum length of 10 seconds per file. Could you try this with files less than 10 seconds long?
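As a quick sanity check on your side (just basic PCM math, not an SDK call; the 16 kHz / 16-bit / mono format is taken from your description), you can estimate the duration of a raw capture buffer before writing the WAV and stop short of the 10-second cap:

    // Estimate how many seconds of audio a raw PCM buffer holds.
    // Assumes 16 kHz, 16-bit, mono => 32,000 bytes per second of audio.
    static TimeSpan EstimatePcmDuration(int pcmByteCount,
        int samplesPerSecond = 16000, int bitsPerSample = 16, int channels = 1)
    {
        int bytesPerSecond = samplesPerSecond * (bitsPerSample / 8) * channels;
        return TimeSpan.FromSeconds((double)pcmByteCount / bytesPerSecond);
    }

    // EstimatePcmDuration(320000) == 10 seconds, so stay below ~320,000 bytes per file.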

roninstar commented 3 years ago

Yes I will try that now.

roninstar commented 3 years ago

I am getting the same error with a file just under 10 seconds long. It does have some gaps between phrases; maybe that is causing a problem. An odd thing I noticed in the logs is that when enrollment starts it says 14 seconds remaining, but when I open my file in a WAV editor it matches up with 16-bit, 16 kHz, mono PCM of 10 seconds. I set the limit of bytes to capture at 312000.
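(Sanity-checking my own numbers, assuming 16-bit mono PCM at 16 kHz: 16,000 samples/s × 2 bytes/sample = 32,000 bytes/s, so 312,000 bytes ≈ 9.75 seconds of audio, which does come out just under 10 seconds.)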

glharper commented 3 years ago

@roninstar That remaining time is how much more spoken audio the service needs to completely enroll a voice profile. A 10-second audio file may contain only 3-4 seconds of spoken audio, which is not enough on its own. Try using more samples until the remaining time is 0 seconds (or you can re-enroll the same sample file multiple times, but the voice profile recognition may not be as accurate).

roninstar commented 3 years ago

I'm struggling to get this just right, and I am still failing with this audio. I am prompting users to speak different sentences for voice training, and I am using Speech-to-Text recognition to know when to capture the audio and when to stop, but this still leaves little gaps in between. It's hard to get people to speak one phrase that lasts 10 seconds, or to remember a 10-second phrase, which is why I am trying to string phrases together. Is there anything in the SpeechRecognizer that could fine-tune exactly when to start capturing the audio and when to stop? When I use the SpeechStartDetected event it cuts off too much audio, and SpeechEndDetected never seems to fire, but Recognized does seem to be a good spot to stop capturing. Do you have any suggestions, or is there a way to either get the audio right on the way in, or clean up the audio gaps before we submit the WAV file for enrollment?

    recognizer.Recognized += (s, e) =>
    {
        switch (e.Result.Reason)
        {
            case ResultReason.RecognizedSpeech:
                resultStr = $"RECOGNIZED: '{e.Result.Text}'";
                // STOP CAPTURING AUDIO ON FIRST RECOGNITION
                _captureSpeech = false;
                break;
        }
    };

roninstar commented 3 years ago

To give you the use case behind my problem: we have no user interface, so we use Text to Speech to prompt the user to speak each phrase, and we use the SpeechRecognizer to know when to capture audio and when not to. If I could capture just the user's speech each time, I could accurately fill 10 seconds with audio, create the WAV file, and feed it into the enrollment process; then I hope the speaker verification would work without failing.
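Roughly, the capture gating on my side looks like the sketch below (simplified; `_minBufferSize`, `_audioRecord`, and `_capturedPcm` are placeholders for my own fields, and error handling is omitted):

    // Read from the Android AudioRecord and keep bytes only while the
    // recognizer has told us the user is speaking (_captureSpeech is
    // toggled off in the Recognized handler shown earlier).
    var buffer = new byte[_minBufferSize];
    int read = _audioRecord.Read(buffer, 0, buffer.Length);
    if (read > 0 && _captureSpeech)
    {
        _capturedPcm.Write(buffer, 0, read);   // MemoryStream of PCM that later becomes the WAV
    }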

glharper commented 3 years ago

@roninstar One approach would be to record 10 seconds of audio, then send that 10-second file multiple times until you receive an EnrolledVoiceProfile result. Another would be to prompt the user multiple times, as in SpeakerVerificationWithMicrophone here, but using voice prompts instead of the console output in the sample.
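A minimal sketch of the first approach (my own example, not from the samples: assumes `config` is your SpeechConfig and `enrollSample.wav` is a 16 kHz, 16-bit, mono file under 10 seconds; error handling omitted):

    using (var client = new VoiceProfileClient(config))
    using (var profile = await client.CreateProfileAsync(VoiceProfileType.TextIndependentVerification, "en-us"))
    {
        VoiceProfileEnrollmentResult result = null;
        // Keep re-sending the same short sample until the service has enough speech.
        while (result is null || result.RemainingEnrollmentsSpeechLength > TimeSpan.Zero)
        {
            using (var audioInput = AudioConfig.FromWavFileInput("enrollSample.wav"))
            {
                result = await client.EnrollProfileAsync(profile, audioInput);
            }
            Console.WriteLine($"Reason: {result.Reason}, remaining speech needed: {result.RemainingEnrollmentsSpeechLength}");
        }
        // At this point result.Reason should be EnrolledVoiceProfile (or Canceled on error).
    }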

roninstar commented 3 years ago

That sounds promising! I will give these a try.

roninstar commented 3 years ago

This latest WAV file passes enrollment; it has no audio gaps and is just under 10 seconds. But as soon as I go to verify my voice, RecognizeOnceAsync fails immediately with the same SPXERR_INVALID_ARG.
voicetraining_usersguid_audio.zip

glharper commented 3 years ago

@roninstar If you could zip and attach your solution files, I'll take a look. My best guess is that something about the profile isn't right (wrong type, wrong id). On your end, I'd check in the debugger that the details of the profile at the point you get EnrolledVoiceProfile exactly match the profile you pass to SpeakerVerificationModel.FromProfile.
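For example, something as simple as this (a sketch; `enrolledProfile` stands for whatever VoiceProfile came back from your enrollment path) would rule out an Id mismatch:

    // The profile handed to SpeakerVerificationModel.FromProfile must be the same
    // TextIndependentVerification profile that reached EnrolledVoiceProfile.
    Console.WriteLine($"Enrolled profile id:  {enrolledProfile.Id}");
    Console.WriteLine($"Verifying profile id: {profile.Id}");

    var model = SpeakerVerificationModel.FromProfile(profile);
    var result = await speakerRecognizer.RecognizeOnceAsync(model);
    Console.WriteLine($"Verification reason: {result.Reason}, score: {result.Score}");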

roninstar commented 3 years ago

Sure, can I email you the two classes? They are part of a very large project, so I can't send the whole thing. The audio is raised from the mic using the Android minimum buffer size, via OnMarkerReached on an AudioRecord.

glharper commented 3 years ago

(at)microsoft(dot)com

roninstar commented 3 years ago

thanks, email is sent.

glharper commented 3 years ago

Resolved via email