Open Pehrsons opened 5 years ago
@Pehrsons Not entirely certain what is being proposed here relevant to `SpeechRecognition`? There is no special handling needed for speech to be recorded using `getUserMedia()`. `MediaRecorder` `start()`, `pause()`, `resume()` and `stop()` should suffice for audio input. Processing the input locally instead of sending the input to an undisclosed remote web service is what is needed.
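For illustration, a minimal sketch of that flow (assuming an async context; only the existing `getUserMedia()` and `MediaRecorder` APIs are used):

```js
// Minimal sketch (assumes an async context): capture speech with getUserMedia()
// and control the recording with MediaRecorder start()/pause()/resume()/stop().
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(stream);
const chunks = [];

recorder.ondataavailable = (e) => chunks.push(e.data);
recorder.onstop = () => {
  // The recording stays local as a Blob; no remote service is involved.
  const blob = new Blob(chunks, { type: recorder.mimeType });
  console.log('Recorded', blob.size, 'bytes of', blob.type);
};

recorder.start();
// recorder.pause(); recorder.resume(); // optional mid-recording control
setTimeout(() => recorder.stop(), 5000); // stop after five seconds
```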
Currently Chrome/Chromium records the user's voice (without notice or permission), then sends that recording to a remote web service (https://github.com/w3c/speech-api/issues/56; https://webwewant.fyi/wants/55/). The response is a transcript (text) of the input; depending on the input words, heavily censored. It is unclear what happens to the users' input (potentially biometric data; their voice).
For output from `speechSynthesis.speak()` to be set to a `MediaStreamTrack` would suggest re-writing the algorithm from scratch to configure the API to communicate directly with `speechd`.
Related https://github.com/WebAudio/web-audio-api/issues/1764#issuecomment-536007392.
Is this issue proposing to specify `MediaStreamTrack` as input to `SpeechRecognition`?
I don't understand. getUserMedia cannot currently be used with SpeechRecognition, and MediaRecorder is not even remotely related.
This proposal is about adding a MediaStreamTrack argument to SpeechRecognition's start method.
Avoiding using an online service for recognition is completely unrelated to this; please use a separate issue for that.
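Roughly, the proposed usage would look something like this (a sketch only; `start(track)` is the proposal under discussion, not a shipped signature, and Chrome currently exposes the constructor with a `webkit` prefix):

```js
// Sketch of the proposed usage; start(track) is the proposal under discussion,
// not a shipped signature.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const [track] = stream.getAudioTracks();

const recognition = new SpeechRecognition();
recognition.onresult = (e) => console.log(e.results[0][0].transcript);
recognition.start(track); // proposed: recognize audio from this track only
```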
Is the proposal that when the argument to `SpeechRecognition` is a `MediaStreamTrack`, the current implementation which captures the microphone would be overridden by the `MediaStreamTrack` input argument? Or is the idea that capturing the microphone input would be replaced entirely by the `MediaStreamTrack` input?
`MediaRecorder` is related to the degree that, AFAIK, there is no implementation which performs the STT procedure in the browser, meaning that even if a `MediaStreamTrack` is added as an argument, that `MediaStreamTrack` would still currently need to be recorded internally to be processed; there is no browser implementation that processes the audio input in "real-time" and produces output using code shipped in the browser (perhaps besides certain handhelds). Therefore, if a `MediaStreamTrack` argument is added to `SpeechRecognition`, there might as well be the entire `MediaRecorder` implementation added as well, to control all aspects of the input to be recorded.
AFAICT Mozilla does not implement `SpeechRecognition`, even when setting `media.webspeech.recognition.enable` to `true`.
There is no benefit in adding `MediaStreamTrack` to `SpeechRecognition` without the corresponding control over the entire recording, and ideally without recording anything at all, instead processing STT in "real-time". It would be much more advantageous to focus on actually implementing `SpeechRecognition` in the browser without any external service than to spend time on incorporating `MediaStreamTrack` as an argument to an API that does not work at Mozilla and relies on an external service for its result. What is there to gain in adding `MediaStreamTrack` as an argument to `SpeechRecognition` when the underlying specification and implementation of `SpeechRecognition` need work?
> I don't understand. getUserMedia cannot currently be used with SpeechRecognition, and MediaRecorder is not even remotely related.
Actually, `MediaRecorder` can currently be used with `SpeechRecognition`. Either a live user voice or `SpeechSynthesis` output can be recorded and then played back as input to `SpeechRecognition` (https://stackoverflow.com/a/46383699; https://stackoverflow.com/a/47113924).
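A minimal sketch of that record-then-replay approach, assuming `recordedBlob` already holds a `MediaRecorder` recording (of the user's voice, or of `speechSynthesis` output captured through the speakers):

```js
// Sketch of the record-then-replay approach. `recordedBlob` is assumed to hold
// an earlier MediaRecorder recording; the playback is picked up by the
// microphone that SpeechRecognition currently captures unconditionally.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

const recognition = new SpeechRecognition();
recognition.onresult = (e) => console.log(e.results[0][0].transcript);
recognition.onerror = (e) => console.error(e.error);

const playback = new Audio(URL.createObjectURL(recordedBlob));
recognition.start(); // current API: recognition listens to the microphone
playback.play();     // the speakers replay the recording into the microphone
```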
Since this is obviously something you have thought about, I would only suggest, when proceeding, not to omit the opportunity to concurrently or consecutively add `AudioBuffer`, `Float32Array`, and static file input (`.mp3`, `.ogg`, `.opus`, `.webm`, `.wav`) as well.
> Is the proposal that when the argument to `SpeechRecognition` is a `MediaStreamTrack`, the current implementation which captures the microphone would be overridden by the `MediaStreamTrack` input argument? Or is the idea that capturing the microphone input would be replaced entirely by the `MediaStreamTrack` input?
I read your question as "Should the `MediaStreamTrack` argument to `start()` be required or optional?"

Preferably required, since if it's optional we cannot get rid of any of the language that I claimed we can in the proposal.
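To illustrate the difference (a sketch only; `someTrack` is a placeholder `MediaStreamTrack` and `start(track)` is the proposed signature):

```js
// Sketch of the two shapes; `someTrack` is a placeholder MediaStreamTrack and
// start(track) is the proposed signature.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

if (someTrack) {
  recognition.start(someTrack); // proposed: recognize the supplied track
} else {
  // If the argument stays optional, this trackless path must remain, along
  // with all of the spec's microphone-capture and permission language for it.
  recognition.start();
}
```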
> `MediaRecorder` is related to the degree that, AFAIK, there is no implementation which performs the STT procedure in the browser, meaning that even if a `MediaStreamTrack` is added as an argument, that `MediaStreamTrack` would still currently need to be recorded internally to be processed; there is no browser implementation that processes the audio input in "real-time" and produces output using code shipped in the browser (perhaps besides certain handhelds). Therefore, if a `MediaStreamTrack` argument is added to `SpeechRecognition`, there might as well be the entire `MediaRecorder` implementation added as well, to control all aspects of the input to be recorded.
Are you suggesting allowing SpeechRecognition to run on a buffer of data in non-realtime? That seems like an orthogonal proposal; please file your own issue if you want to argue for that.
Giving SpeechRecognition the controls of MediaRecorder because the implementation happens to encode and send audio data to a server doesn't make sense. The server surely only allows specific settings for container and codec. It also locks out any future implementations that do not rely on a server, because there'd be no reason to support MediaRecorder configurations for them, yet they would have to.
> AFAICT Mozilla does not implement `SpeechRecognition`, even when setting `media.webspeech.recognition.enable` to `true`.
This issue is about the spec, not Mozilla's implementation.
> There is no benefit in adding `MediaStreamTrack` to `SpeechRecognition` without the corresponding control over the entire recording
See my first post in this issue again; there are lots of benefits.
> , and ideally without recording anything at all, instead processing STT in "real-time". It would be much more advantageous to focus on actually implementing `SpeechRecognition` in the browser without any external service than to spend time on incorporating `MediaStreamTrack` as an argument to an API that does not work at Mozilla and relies on an external service for its result. What is there to gain in adding `MediaStreamTrack` as an argument to `SpeechRecognition` when the underlying specification and implementation of `SpeechRecognition` need work?
It's part of improving the spec, so you seem to have answered your own question.
@Pehrsons
> Are you suggesting allowing SpeechRecognition to run on a buffer of data in non-realtime? That seems like an orthogonal proposal; please file your own issue if you want to argue for that.
That is what occurs now in Chromium/Chrome.
Reading the specification, the term "real-time" does not appear at all in the document. The term would need to be clearly defined anyway, as "real-time" can have different meanings in different domains or be interpreted differently by different individuals.
`SpeechRecognition()` should be capable of accepting either a live `MediaStreamTrack` or a static file or JavaScript object.
Downloaded https://github.com/guest271314/mozilla-central/tree/libdeep yesterday. Will try to build and test.
It should not be any issue setting `MediaStreamTrack` as a possible parameter to `start()`. It should also not be an issue allowing that parameter to be a `Float32Array` or `ArrayBuffer`.
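For what it's worth, raw samples can already be turned into a `MediaStreamTrack` with Web Audio; a sketch of that conversion (the `Float32Array` here is just a generated tone standing in for real PCM data):

```js
// Sketch: turning raw Float32Array samples into a MediaStreamTrack with Web
// Audio. The samples here are a generated 440 Hz tone standing in for real
// PCM data.
const ctx = new AudioContext();
const samples = new Float32Array(ctx.sampleRate); // one second of audio
for (let i = 0; i < samples.length; i++) {
  samples[i] = Math.sin((2 * Math.PI * 440 * i) / ctx.sampleRate);
}

const buffer = ctx.createBuffer(1, samples.length, ctx.sampleRate);
buffer.copyToChannel(samples, 0);

const source = ctx.createBufferSource();
source.buffer = buffer;
const destination = ctx.createMediaStreamDestination();
source.connect(destination);
source.start();

const [track] = destination.stream.getAudioTracks();
// `track` could then be handed to the proposed recognition.start(track).
```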
Am particularly interested in how you intend to test what you are proposing to be specified?
@Pehrsons Re use of static files: for TTS, `espeak-ng` currently has the functionality to process static files in several ways, including `espeak-ng -x -m -f input.txt`, where `-f` names a text file to speak and `-x` writes phoneme mnemonics to stdout.
> @Pehrsons
>
> > Are you suggesting allowing SpeechRecognition to run on a buffer of data in non-realtime? That seems like an orthogonal proposal; please file your own issue if you want to argue for that.
>
> That is what occurs now in Chromium/Chrome.
Then they're not implementing the spec. Or are they, but they're using a buffer internally? Well, then you're conflating their implementation with the spec.
> Reading the specification, the term "real-time" does not appear at all in the document. The term would need to be clearly defined anyway, as "real-time" can have different meanings in different domains or be interpreted differently by different individuals.
And it doesn't have to be if we use a MediaStreamTrack, since mediacapture-streams defines what we need.
> `SpeechRecognition()` should be capable of accepting either a live `MediaStreamTrack` or a static file or JavaScript object. Downloaded https://github.com/guest271314/mozilla-central/tree/libdeep yesterday. Will try to build and test.
> It should not be any issue setting `MediaStreamTrack` as a possible parameter to `start()`. It should also not be an issue allowing that parameter to be a `Float32Array` or `ArrayBuffer`.
Again, file a separate issue if you think that is the right way to go.
> Am particularly interested in how you intend to test what you are proposing to be specified?
Give the API some input that you control, and observe that it gives you the expected output.
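A sketch of what such a test could look like under the proposed `start(track)` signature; the fixture file name and expected transcript are placeholders:

```js
// Sketch of a test using controlled input; recognition.start(track) is the
// proposed signature, and the fixture file and expected transcript are
// placeholders.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

async function testRecognition() {
  const ctx = new AudioContext();
  const response = await fetch('known-phrase.wav'); // hypothetical fixture
  const audioBuffer = await ctx.decodeAudioData(await response.arrayBuffer());

  const source = ctx.createBufferSource();
  source.buffer = audioBuffer;
  const destination = ctx.createMediaStreamDestination();
  source.connect(destination);

  const recognition = new SpeechRecognition();
  recognition.onresult = (e) => {
    const transcript = e.results[0][0].transcript;
    console.assert(transcript.trim() === 'hello world', // expected output of the fixture
      `unexpected transcript: ${transcript}`);
  };

  recognition.start(destination.stream.getAudioTracks()[0]); // proposed API
  source.start();
}
```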
> Then they're not implementing the spec.
Are you referring to the following language?
https://w3c.github.io/speech-api/#speechreco-methods

> When the speech input is streaming live through the input media stream

which does not necessarily exclusively mean a "real-time" input media stream.
> Or are they, but they're using a buffer internally?
The last time I checked, the audio was compiled into a single buffer (cannot recollect at the moment whether conversion to a `WAV` file occurs in the browser or not; I do recollect locating some code to convert audio to a `WAV` in the source code), then sent to an undisclosed remote web service in a single request.
> Again, file a separate issue if you think that is the right way to go.
Filed.
> > Then they're not implementing the spec.
>
> Are you referring to the following language?
>
> https://w3c.github.io/speech-api/#speechreco-methods
>
> > When the speech input is streaming live through the input media stream
Not necessarily. Yes, the spec is bad and hand-wavy, so it's hard to find definitions for the terms used. But it's fairly easy to understand the spec authors' intent. In this case it is that `start()` works a lot like `getUserMedia({audio: true})` but with special UI bits to tell the user that speech recognition is in progress (note that this spec originates from a time when getUserMedia was not yet available; Chrome now seems to use the same UI as for getUserMedia). Read chapter 3, "Security and privacy considerations", and this becomes abundantly clear.
> which does not necessarily exclusively mean a "real-time" input media stream.
I'm afraid it does. Otherwise, where do you pass in this non-realtime media stream? If you're suggesting the UA would provide the means through a user interface picker or the like; well, the spec repeatedly refers to "speech input", and by reading through all those references "speech input" can only be interpreted as speech input from the user, i.e., through a microphone which the user is speaking into, i.e., real-time.
> > Or are they, but they're using a buffer internally?
>
> The last time I checked, the audio was compiled into a single buffer (cannot recollect at the moment whether conversion to a `WAV` file occurs in the browser or not; I do recollect locating some code to convert audio to a `WAV` in the source code), then sent to an undisclosed remote web service in a single request.
Implementation detail and irrelevant to the spec.
> > Again, file a separate issue if you think that is the right way to go.
>
> Filed.
> I'm afraid it does. Otherwise, where do you pass in this non-realtime media stream? If you're suggesting the UA would provide the means through a user interface picker or the like; well, the spec repeatedly refers to "speech input", and by reading through all those references "speech input" can only be interpreted as speech input from the user, i.e., through a microphone which the user is speaking into, i.e., real-time.
Do not agree with that assessment. Have not noticed any new UI at Chromium. The specification as-is permits gleaning or amassing various plausible interpretations from the language, none of which are necessarily conclusive and binding, or at least not unambiguous; thus the current state of the art is to attempt to "incubate" the specification. A user can schedule an audio file (or the reading of a `Float32Array` at an `AudioWorkletNode`) to play in the future. That is not "real-time" "speech input" from the user.
Since there is current interest in amending and adding to the specification, the question must be asked: why would `SpeechRecognition` not be specified to accept both non-"real-time" audio input (e.g., from an audio file or buffer) and a `MediaStreamTrack`?
> Since there is current interest in amending and adding to the specification, the question must be asked: why would `SpeechRecognition` not be specified to accept both non-"real-time" audio input (e.g., from an audio file or buffer) and a `MediaStreamTrack`?
IMO because that would make the spec very complicated. Unnecessarily so, since there are other specs already allowing conversions between the two.
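For example, mediacapture-fromelement already lets a page turn playback of a static file into a live track (a sketch, assuming an async context; the file name is a placeholder, and Firefox has historically exposed this as `mozCaptureStream()`):

```js
// Sketch (assumes an async context): mediacapture-fromelement's captureStream()
// turns playback of a static file into a live MediaStreamTrack. The file name
// is a placeholder; Firefox has historically exposed mozCaptureStream().
const audio = new Audio('speech.ogg');
await audio.play();

const stream = audio.captureStream
  ? audio.captureStream()
  : audio.mozCaptureStream();
const [track] = stream.getAudioTracks();
// `track` is a real-time track even though its source is a static file.
```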
Hi There!
I apologize for resurrecting this discussion almost five years later.
I was wondering if a conclusion has been reached regarding whether the start() method should take an input. I am trying to allow my users to select the microphone they want to use for recognition within our app. Currently, I am forcing them to change their default device, but it would be much easier if we could let them decide in-app.
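For context, a sketch of how in-app device selection could combine with the proposed track argument (assuming an async context; `start(track)` is the proposed, not yet shipped, signature):

```js
// Sketch (assumes an async context): pick a microphone with enumerateDevices(),
// open it via a deviceId constraint, then hand the track to the proposed
// recognition.start(track).
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

const devices = await navigator.mediaDevices.enumerateDevices();
const mics = devices.filter((d) => d.kind === 'audioinput');

const stream = await navigator.mediaDevices.getUserMedia({
  audio: { deviceId: { exact: mics[0].deviceId } }, // e.g. the user's in-app choice
});

const recognition = new SpeechRecognition();
recognition.start(stream.getAudioTracks()[0]); // proposed optional argument
```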
Thank you in advance for your time!
Hello! Chrome is planning on adding MediaStreamTrack as an optional parameter to the start() method. Does anyone have any objections to this change? If not, I'll work on sending out a PR with the proposed changes.
There is an old issue in Bugzilla but it doesn't discuss much.

We should revive this, to give the application control over the source of audio.

Not letting the application be in control has several issues:

Letting the application be in control has several advantages:

To support a `MediaStreamTrack` argument to `start()`, we need to:

What to throw and what to fire I leave unsaid for now.