compulim / web-speech-cognitive-services

Polyfill Web Speech API with Cognitive Services Bing Speech for both speech-to-text and text-to-speech services.
https://compulim.github.io/web-speech-cognitive-services
MIT License

Integration with React Speech Recognition #124

Open · JamesBrill opened this issue 3 years ago

JamesBrill commented 3 years ago

Hi @compulim ! I'm the author of the React Speech Recognition hook. I recently made a new release that supports polyfills such as yours. Indeed, yours is currently the first and only polyfill that works (more or less) with react-speech-recognition. You've done a great job - for the most part, it worked smoothly while I was testing the two together.

Some feedback on wrinkles I encountered while testing the integration between the two libraries:

Thanks for making this polyfill and I hope some of the above is useful. If you want to donate more of your speech recognition polyfill-making skills, there is a similar WIP project for AWS Transcribe that I'd love to be able to integrate with. There's also a general discussion about web speech recognition polyfills here.

compulim commented 3 years ago

Thanks @JamesBrill. Love to see so many people investing in W3C APIs rather than building their own.

Love to see many people converting cloud-based services into the W3C Web Speech API. However, my real-world job doesn't leave me time for another hobby project. I will try to join if I have spare time. But do let me know when your ponyfill is ready, and I will try it out in my real-world project. 😄

Here are my tips for writing a polyfill for similar systems:

I would recommend this signature for ponyfilling more of the dependencies:

function createAWSTranscribeSpeechRecognition({
  audioContext,
  credentials,
  ponyfill: {
    AudioContext, // in case the caller did not pass an "audioContext" instance, create a new one from this class
    fetch,
    WebSocket
  } = window
}) {
  // …
}

In this way, you can enable Node.js developers to use your package as long as they provide the needed ponyfills (without polluting globals). It will also be easier for you to write tests, as you can easily mock the external system.
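
For illustration only, a sketch of how a Node.js consumer might call the factory above. node-fetch and ws are assumed dependencies; MockAudioContext and the returned SpeechRecognition shape are hypothetical:

// Hypothetical Node.js usage of the signature sketched above.
const fetch = require('node-fetch'); // assumed dependency
const WebSocket = require('ws'); // assumed dependency
const { MockAudioContext } = require('./mocks/audio-context'); // hypothetical test double

const { SpeechRecognition } = createAWSTranscribeSpeechRecognition({
  credentials: { region: 'us-east-1', accessKeyId: '…', secretAccessKey: '…' },
  ponyfill: {
    AudioContext: MockAudioContext, // no "audioContext" instance was passed, so one is created from this class
    fetch,
    WebSocket
  }
});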

In my production system, we use TTS to test against STT. I.e., we use TTS to generate a waveform from textual test data, then feed the waveform into STT for assertion, and vice versa. We mocked AudioContext in a limited fashion and cross-check to make sure both STT and TTS work correctly.
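
As an illustration of this round-trip idea (not the actual production test code), a Jest-style sketch; synthesizeToWaveform and recognizeWaveform are hypothetical helpers wrapping the TTS and STT ponyfills:

// A sketch of the TTS -> STT round-trip test described above.
// synthesizeToWaveform and recognizeWaveform are hypothetical helpers
// wrapping the text-to-speech and speech-to-text ponyfills.
test('STT transcribes what TTS synthesized', async () => {
  const expectedText = 'hello world';

  // TTS turns deterministic textual test data into a waveform…
  const waveform = await synthesizeToWaveform(expectedText);

  // …and STT transcribes the waveform back for assertion.
  const actualText = await recognizeWaveform(waveform);

  expect(actualText.toLowerCase()).toBe(expectedText);
});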

JamesBrill commented 2 years ago

Hi @compulim, sorry for the massive delay in replying - your message got lost amongst my other GitHub notifications. Unfortunately, this means I've forgotten a lot of the context from my original message, but I'll do my best to address your questions. It looks like most of my original issues no longer apply.

We can't polyfill "subscription key -> authorization token".

Curiously, I've been able to authenticate by just passing a subscription key to credentials. This definitely wasn't the case earlier this year, when I was forced to convert it to an authorization token like this:

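// TOKEN_ENDPOINT is presumably Azure's token-issuing endpoint for the region,
// e.g. https://<REGION>.api.cognitive.microsoft.com/sts/v1.0/issueToken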
const response = await fetch(TOKEN_ENDPOINT, {
  method: 'POST',
  headers: { 'Ocp-Apim-Subscription-Key': SUBSCRIPTION_KEY }
});
const authorizationToken = await response.text();
const {
  SpeechRecognition: AzureSpeechRecognition
} = createSpeechServicesPonyfill({
  credentials: {
    region: REGION,
    authorizationToken,
  }
});

So my pain point around doing the subscription key -> auth token conversion is no longer valid - maybe my Speech Service instance was misconfigured back then. It makes sense for consumers to take on the burden of performing this conversion in production: it should happen on their backend to avoid leaking the subscription key. I shall update my docs in react-speech-recognition accordingly, as I'm currently suggesting consumers perform this conversion inside the component (i.e. in the browser), which is not the appropriate place for it.
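
For illustration, a minimal sketch of such a backend token exchange. Express is assumed, the route path and environment variable names are illustrative, and the global fetch requires Node.js 18+:

// A sketch of a backend endpoint that exchanges the subscription key for a
// short-lived authorization token, keeping the key off the browser.
const express = require('express');
const app = express();

app.get('/api/speech-token', async (req, res) => {
  const response = await fetch(
    `https://${process.env.AZURE_REGION}.api.cognitive.microsoft.com/sts/v1.0/issueToken`,
    {
      method: 'POST',
      headers: { 'Ocp-Apim-Subscription-Key': process.env.AZURE_SUBSCRIPTION_KEY }
    }
  );

  // The token is short-lived, so it is comparatively safe to hand to the browser.
  res.send(await response.text());
});

app.listen(3000);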

Do you mean the result event with isFinal === true was emitted twice? It would be great if you could give more information on how to increase the chance to repro it, e.g. with shorter/longer phrases, etc.

I'm afraid I'm not able to repro this any more. There is another bug I've noticed in stop, which I'll raise an issue for shortly.

If lang is not set, other than guessing the value from navigator.language, what do you think a better default value should be?

navigator.language seems to return the language in the locale format that Azure requires. The solution may be as simple as preferring it over the lang attribute when computing the default language here (currently, the lang attribute is preferred).
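
A one-line sketch of that suggested ordering (lang stands for the recognizer's lang attribute; the 'en-US' fallback is illustrative):

// Prefer the browser locale over the lang attribute, as suggested above.
const defaultLanguage =
  (typeof navigator !== 'undefined' && navigator.language) || lang || 'en-US';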

I checked that createSpeechRecognitionPonyfill is a sync function

I think my point here is also no longer valid - I can see this is indeed the case (perhaps I was confused by the Promise-like then property it returns) and that this polyfill can be set up synchronously.

Here are my tips for writing a polyfill for similar systems

Thanks for these! I've not had to implement one of these polyfills myself yet, but I will share this with the person who's working on the AWS polyfill.

In my production system, we use TTS to test against STT. I.e., we use TTS to generate a waveform from textual test data, then feed the waveform into STT for assertion.

This is really cool - I thought of making some end-to-end tests like this using pre-recorded audio files, but using TTS is a good way of generating deterministic audio inputs.

paschaldev commented 11 months ago

@compulim @JamesBrill I'm hoping someone can help with error handling. If the authorization token is invalid, I'd like to catch the error and refresh the token, but the error never enters the catch block.

I checked the source code and found that createSpeechServicesPonyfill is an async operation that uses fetch for the network call.

I tried to wrap it in a Promise then/catch, but that's not allowed:

console.warn('web-speech-cognitive-services: This function no longer need to be called in an asynchronous fashion. Please update your code. We will remove this Promise.then function on or after 2020-08-10.');

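// Note: the catch below never fires for auth failures, because the token is
// validated by a fetch that runs asynchronously inside the ponyfill, after
// this synchronous call has already returned.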
try {
  const { SpeechRecognition: AzureSpeechRecognition } =
    createSpeechServicesPonyfill({
      credentials: {
        region: azureRegion,
        authorizationToken: azureToken,
      },
    });
  SpeechRecognition.applyPolyfill(AzureSpeechRecognition);
} catch (e) {
  console.log("Error Azure", e);
}

I also can't do this:

await createSpeechServicesPonyfill({
  credentials: {
    region: azureRegion,
    authorizationToken: azureToken,
  },
})
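
For illustration, a sketch of one possible workaround under the assumption above (that the failure happens in the ponyfill's internal async fetch): obtain or refresh the token with an explicit fetch first, so auth failures surface in a catchable await before the ponyfill is constructed. fetchAzureToken, azureSubscriptionKey and the wrapper function are illustrative, not library APIs:

// A sketch: validate the credentials up front with an explicit, awaitable fetch.
// fetchAzureToken is illustrative, not part of web-speech-cognitive-services.
async function fetchAzureToken(region, subscriptionKey) {
  const response = await fetch(
    `https://${region}.api.cognitive.microsoft.com/sts/v1.0/issueToken`,
    { method: 'POST', headers: { 'Ocp-Apim-Subscription-Key': subscriptionKey } }
  );

  if (!response.ok) {
    throw new Error(`Token request failed: ${response.status}`);
  }

  return response.text();
}

async function setUpAzureRecognition() {
  try {
    const authorizationToken = await fetchAzureToken(azureRegion, azureSubscriptionKey);
    const { SpeechRecognition: AzureSpeechRecognition } =
      createSpeechServicesPonyfill({
        credentials: { region: azureRegion, authorizationToken }
      });
    SpeechRecognition.applyPolyfill(AzureSpeechRecognition);
  } catch (e) {
    // Failures from the explicit fetch land here, where the token can be refreshed.
    console.log('Error Azure', e);
  }
}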