Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Microsoft Entra Authentication for Speech Service #2552

cecheta closed this issue 3 weeks ago

cecheta commented 4 weeks ago

I have been trying to configure Microsoft Entra authentication for the Speech Service, as explained in these docs, yet it only seems to work from the browser.

My setup is a Speech Service resource with public network access disabled and a private endpoint in a VNet, and I am connected to that VNet. The custom domain for the Speech Service has the same name as the resource itself.
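As a rough sanity check of this kind of setup, one could confirm from a machine on the VNet that the custom domain resolves to a private address. This is only a sketch using the Python standard library; the helper names `resolved_addresses` and `all_private` are made up for illustration and are not part of any Azure SDK.

```python
import ipaddress
import socket


def resolved_addresses(hostname, port=443):
    """Return the distinct IP addresses that `hostname` resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def all_private(addresses):
    """True when the list is non-empty and every address is in a private range."""
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_private for address in addresses
    )
```

Calling `all_private(resolved_addresses(f"{SPEECH_SERVICE_NAME}.cognitiveservices.azure.com"))` from inside the VNet would be expected to return `True` if the private endpoint's DNS is in effect. A public address would suggest the request is not going through the private endpoint at all.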

I have run this sample, which uses the custom domain and Microsoft Entra for authentication, and this works well.

I then tried to use the same form of authentication in a Node example that runs speech-to-text on a file, but it doesn't work: I get no output.

const fs = require("fs");
const { DefaultAzureCredential } = require("@azure/identity");
const sdk = require("microsoft-cognitiveservices-speech-sdk");

(async () => {
  const SUBSCRIPTION_ID = ""
  const RESOURCE_GROUP = ""
  const SPEECH_SERVICE_NAME = ""
  const SPEECH_SERVICE_KEY = ""
  const AUDIO_FILE_NAME = "audio1.wav"

  const token = (await new DefaultAzureCredential().getToken("https://cognitiveservices.azure.com/.default")).token;
  const resourceId = `/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RESOURCE_GROUP}/providers/Microsoft.CognitiveServices/accounts/${SPEECH_SERVICE_NAME}`;
  const speechToken = `aad#${resourceId}#${token}`;

  const speechConfig = sdk.SpeechConfig.fromEndpoint(new URL(`wss://${SPEECH_SERVICE_NAME}.cognitiveservices.azure.com/stt/speech/universal/v2`));
  speechConfig.authorizationToken = speechToken;

  speechConfig.speechRecognitionLanguage = "en-US";

  const pushStream = sdk.AudioInputStream.createPushStream();

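  fs.createReadStream(AUDIO_FILE_NAME).on('data', function (chunk) {
    // Pass only this chunk's bytes: chunk.buffer is the backing ArrayBuffer, which can be larger than the chunk.
    pushStream.write(chunk.buffer.slice(chunk.byteOffset, chunk.byteOffset + chunk.byteLength));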
  }).on('end', function () {
    pushStream.close();
  });

  const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);

  const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

  recognizer.recognized = function (s, e) {
    console.log(`Recognised: ${e.result.text}`);
  };

  recognizer.recognizeOnceAsync(() => {
    recognizer.close();
  });
})();

If I change

const speechConfig = sdk.SpeechConfig.fromEndpoint(new URL(`wss://${SPEECH_SERVICE_NAME}.cognitiveservices.azure.com/stt/speech/universal/v2`));
speechConfig.authorizationToken = speechToken;

to

const speechConfig = sdk.SpeechConfig.fromEndpoint(new URL(`wss://${SPEECH_SERVICE_NAME}.cognitiveservices.azure.com/stt/speech/universal/v2`), SPEECH_SERVICE_KEY);

then it works, and I get the transcription.

I also tried in Python:

import azure.cognitiveservices.speech as speechsdk
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = ""
RESOURCE_GROUP = ""
SPEECH_SERVICE_NAME = ""
AUDIO_FILE_NAME = "audio1.wav"

token = DefaultAzureCredential().get_token("https://cognitiveservices.azure.com/.default").token
resource_id = f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.CognitiveServices/accounts/{SPEECH_SERVICE_NAME}"
speech_token = f"aad#{resource_id}#{token}"

speech_config = speechsdk.SpeechConfig(
    endpoint=f"wss://{SPEECH_SERVICE_NAME}.cognitiveservices.azure.com/stt/speech/universal/v2",
)
speech_config.authorization_token = speech_token
speech_config.speech_recognition_language = "en-US"

audio_config = speechsdk.audio.AudioConfig(filename=AUDIO_FILE_NAME)

recognizer = speechsdk.SpeechRecognizer(speech_config, audio_config)

result = recognizer.recognize_once_async().get()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognised: {result.text}")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print(f"Cancelled: {cancellation_details.reason}")
    print(f"Error: {cancellation_details.error_details}")

This gives the following output:

Cancelled: CancellationReason.Error
Error: WebSocket upgrade failed: Authentication error (401). Please check subscription information and region name. SessionId: 178b631d2deb4f10bf7d69de0dc7de1c

Why does it work from the browser but not when using Node/Python? Is it documented where it does and doesn't work?
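One way to narrow down a 401 like this is to look at what is actually inside the Entra token before it is handed to the Speech SDK. The sketch below assumes the access token is a JWT and decodes its payload without verifying the signature, so it is only suitable for debugging. `jwt_claims` and `summarize` are hypothetical helper names, not part of `azure.identity` or the Speech SDK.

```python
import base64
import json
import time


def jwt_claims(token):
    """Decode the payload segment of a JWT. The signature is NOT verified."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the base64 padding JWTs strip
    return json.loads(base64.urlsafe_b64decode(payload))


def summarize(token, now=None):
    """Return the audience claim and whether the token has already expired."""
    claims = jwt_claims(token)
    now = time.time() if now is None else now
    return {
        "aud": claims.get("aud"),
        "expired": claims.get("exp", 0) <= now,
    }
```

Running `summarize(token)` on the value returned by `get_token(...)` shows whether the audience matches the scope that was requested and whether the token is still valid. That would at least separate a bad token from a problem with how the token is passed to the service.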

cecheta commented 3 weeks ago

To add to this, I've done some more testing, particularly with enabling and disabling the disableLocalAuth property on the speech service resource, and I'm getting mixed results.

For example, it works initially. I then set disableLocalAuth: true and it stops working. When I set disableLocalAuth: false again, it still doesn't work.

Also, it seems that if you use just the Entra token as the authorisation token, instead of f"aad#{resource_id}#{token}", then it works?
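To compare the two forms side by side, the token construction from the Python snippet above can be pulled into small helpers. This is a minimal sketch. The function names are invented for illustration, and it makes no claim about which form the service accepts in a given configuration.

```python
def speech_resource_id(subscription_id, resource_group, account_name):
    """Build the ARM resource ID of the Speech resource, as in the snippets above."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.CognitiveServices/accounts/{account_name}"
    )


def speech_authorization_token(entra_token, resource_id=None):
    """Return the bare Entra token, or the aad#<resource id>#<token> form."""
    if resource_id is None:
        return entra_token  # bare token, the variant that seemed to work here
    return f"aad#{resource_id}#{entra_token}"
```

Setting `speech_config.authorization_token` to `speech_authorization_token(token)` in one run and to `speech_authorization_token(token, resource_id)` in the next keeps everything else identical, which makes the mixed results easier to attribute to the token format.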

cecheta commented 3 weeks ago

Closing, as I'm getting mixed and inconsistent results.