JamesBrill / react-speech-recognition

💬Speech recognition for your React app
https://webspeechrecognition.com/
MIT License

Feature Request: Add Google Speech-To-Text API polyfill option #113

Open marharyta opened 2 years ago

marharyta commented 2 years ago

First of all, thank you for your work. This component is 🔥

Feature request: I like using it for my app, but I would love to be able to use the Google Speech-to-Text API, as this project does, for example. Would it be possible to create a polyfill or integration similar to the one you have with Speechly, but backed by the Google Speech-to-Text API?

Thank you!

JamesBrill commented 2 years ago

Hi @marharyta Thanks for raising this!

Yes absolutely, a Google Cloud speech-to-text polyfill would be awesome - nobody has raised interest in it until now. Ensuring there are Web Speech polyfills based on all the major cloud providers is a goal of mine, even if it's not me implementing the polyfills. We've got Microsoft and Speechly so far, with AWS support on its way. Adding Google would make the provider support even more complete.

Making these polyfills is quite hard, but it appears that the author of that project has already done the hard work of interfacing with the Google API and processing the audio streams in this file. I'm happy to look into adapting that code into a polyfill that can be used more widely, but you're welcome to set that up yourself too - it seems like an easy win in terms of OSS contributions. I can do a quick spike soon to get a sense of what's required.

There's a discussion about Web Speech polyfills here - you can use that to either poke me into making the Google polyfill more quickly or get feedback on any polyfill you make yourself. I think I mention in that thread the parts of the W3C specification that need to be implemented in a polyfill for it to work with react-speech-recognition. You can also find them referenced here.
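As a rough sketch of what such a polyfill has to expose, here is a minimal class shaped like the W3C `SpeechRecognition` interface. The property and handler names (`continuous`, `interimResults`, `lang`, `onresult`, `onend`, `onerror`, `start`, `stop`, `abort`) come from the Web Speech specification, but the exact subset react-speech-recognition relies on should be checked against the references above. The class name `SketchSpeechRecognition` and the `emitTranscript` hook are hypothetical, made up for illustration, and the sketch is not tied to any real cloud provider.

```typescript
// Simplified shapes of the Web Speech result objects.
interface Alternative { transcript: string; confidence: number }
interface Result { isFinal: boolean; length: number; [index: number]: Alternative }
interface ResultEvent { resultIndex: number; results: Result[] }

class SketchSpeechRecognition {
  // Properties and handlers named after the W3C SpeechRecognition interface.
  continuous = false;
  interimResults = false;
  lang = "en-US";
  onresult: (event: ResultEvent) => void = () => {};
  onend: () => void = () => {};
  onerror: (event: { error: string }) => void = () => {};

  private listening = false;

  start(): void {
    // A real polyfill would open the microphone and a provider connection here.
    this.listening = true;
  }

  stop(): void {
    this.finish();
  }

  abort(): void {
    // A real polyfill would also discard any audio that is still pending.
    this.finish();
  }

  // Hypothetical hook: the provider integration calls this with each transcript.
  emitTranscript(transcript: string, isFinal: boolean): void {
    if (!this.listening) return;
    // Interim results are only surfaced when the consumer asked for them.
    if (!isFinal && !this.interimResults) return;
    // Confidence is a placeholder value; a real provider would supply it.
    const result: Result = { isFinal, length: 1, 0: { transcript, confidence: 1 } };
    this.onresult({ resultIndex: 0, results: [result] });
    // In non-continuous mode, recognition ends after the first final result.
    if (isFinal && !this.continuous) this.finish();
  }

  private finish(): void {
    if (!this.listening) return;
    this.listening = false;
    this.onend();
  }
}
```

The design choice here is that the provider-specific code only ever calls `emitTranscript`, so the Web Speech-shaped surface stays the same regardless of which cloud API sits behind it.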

marharyta commented 2 years ago

Thanks a lot for the detailed answer! This seems to be a very well-maintained project and it does what I need in terms of functionality. I will take a look at it when I have time and maybe even contribute! Have a lovely day!

JamesBrill commented 2 years ago

A little update on this. I was able to create a working polyfill that replicates the behaviour in react-hook-speech-to-text. However, it seems the implementation uses the REST API to upload an audio file for each utterance made by the user. This is okay, but does not enable responsive voice-driven applications. There is quite a long pause between the utterance and the transcription due to the client converting the audio file into Base64 and then uploading the whole thing only after the user finishes speaking. On devices with limited CPU and network speeds, this could be quite a laggy experience. This differs from the approach taken by other APIs where audio is streamed directly to the API with "interim" results coming back while the user is speaking, enabling much more responsive transcriptions.
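To make the latency concern concrete, the "record, Base64-encode, then upload" flow might look roughly like the sketch below. It assumes Google's v1 `speech:recognize` REST endpoint called with an API key, and 16 kHz LINEAR16 audio. The helpers `toBase64` and `buildRecognizeRequest` are hypothetical names for illustration only, not code from the spike or from react-hook-speech-to-text.

```typescript
const B64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode raw audio bytes as Base64. This runs over the whole recording,
// which is part of the delay on low-powered devices.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += B64_ALPHABET[b0 >> 2];
    out += B64_ALPHABET[((b0 & 3) << 4) | (b1 >> 4)];
    out += has1 ? B64_ALPHABET[((b1 & 15) << 2) | (b2 >> 6)] : "=";
    out += has2 ? B64_ALPHABET[b2 & 63] : "=";
  }
  return out;
}

// Build the request for one complete utterance. Nothing can be sent
// until the user has finished speaking, so no interim results are possible.
function buildRecognizeRequest(audio: Uint8Array, apiKey: string, languageCode: string) {
  return {
    url: "https://speech.googleapis.com/v1/speech:recognize?key=" + encodeURIComponent(apiKey),
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      // Assumed audio format; it must match how the recording was captured.
      config: { encoding: "LINEAR16", sampleRateHertz: 16000, languageCode },
      audio: { content: toBase64(audio) },
    }),
  };
}
```

The returned object can be passed to `fetch(request.url, request)` once recording stops. Both the encoding step and the upload scale with the length of the utterance, which is where the pause comes from.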

Indeed, Google does offer a streaming option via its gRPC API (I suspect this is what Chrome uses under the hood). However, this endpoint does not accept a simple API key and requires authentication as a service account, for which credentials should not be exposed in a browser. Using this article as an example, it seems the common practice for streaming audio to Google is to proxy it through your own backend. Ironic, given that Google streams audio directly from the browser without authentication or any charge in Chrome.

The "upload a file" approach is not ideal for the user experience, but it's better than nothing. If you think it's useful, I can tidy up the spike and publish it for public consumption, with the above caveat documented.

marharyta commented 2 years ago

This actually was one of the problems I had - the latency between stopping recording and getting the final transcript. Since this seems to be the only option for us anyway, I would be happy to have even that.

Ideally, I would need to be able to subscribe to transcript updates (via a promise?) and later clear the transcript to restart recording from a clean slate.
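To illustrate the kind of interface I have in mind, here is a small sketch of a subscribable transcript with a reset. `TranscriptStore` and its methods are hypothetical names, not the library's API. It uses a callback rather than a promise, since a promise resolves only once while a transcript changes many times.

```typescript
type TranscriptListener = (transcript: string) => void;

// Hypothetical store: collects final transcripts and notifies subscribers.
class TranscriptStore {
  private transcript = "";
  private listeners: TranscriptListener[] = [];

  // Register a listener and return a function that unsubscribes it.
  subscribe(listener: TranscriptListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Append the text of one finished utterance and notify subscribers.
  append(finalText: string): void {
    this.transcript = this.transcript ? this.transcript + " " + finalText : finalText;
    this.notify();
  }

  // Clear the transcript so the next recording starts from a clean slate.
  reset(): void {
    this.transcript = "";
    this.notify();
  }

  private notify(): void {
    this.listeners.forEach((listener) => listener(this.transcript));
  }
}
```

Usage would be something like `const unsubscribe = store.subscribe(text => render(text))`, then `store.reset()` before starting a new recording, where `render` stands in for whatever the app does with the text.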

Again, thank you for being responsive!