Open Pehrsons opened 5 years ago
@Pehrsons Not entirely certain what is being proposed here relevant to `SpeechRecognition`? There is no special handling needed for speech to be recorded using `getUserMedia()`. `MediaRecorder` `start()`, `pause()`, `resume()` and `stop()` should suffice for audio input. Processing the input locally instead of sending the input to an undisclosed remote web service is what is needed.
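For illustration, a minimal sketch of that flow (assuming an async context; only the existing `getUserMedia()` and `MediaRecorder` APIs are used):

```js
// Minimal sketch (assumes an async context): capture speech with getUserMedia()
// and control the recording with MediaRecorder start()/pause()/resume()/stop().
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(stream);
const chunks = [];

recorder.ondataavailable = (e) => chunks.push(e.data);
recorder.onstop = () => {
  // The recording stays local as a Blob; no remote service is involved.
  const blob = new Blob(chunks, { type: recorder.mimeType });
  console.log('Recorded', blob.size, 'bytes of', blob.type);
};

recorder.start();
// recorder.pause(); recorder.resume(); // optional mid-recording control
setTimeout(() => recorder.stop(), 5000); // stop after five seconds
```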
Currently Chrome/Chromium records the user's voice (without notice or permission), then sends that recording to a remote web service (https://github.com/w3c/speech-api/issues/56; https://webwewant.fyi/wants/55/). The response is a transcript (text) of the input; depending on the input words, heavily censored. It is unclear what happens to the users' input (potentially biometric data; their voice).
For output from `speechSynthesis.speak()` to be set to a `MediaStreamTrack` would suggest re-writing the algorithm from scratch to configure the API to communicate directly with `speechd`.
Related https://github.com/WebAudio/web-audio-api/issues/1764#issuecomment-536007392.
Is this issue proposing to specify `MediaStreamTrack` as input to `SpeechRecognition`?
I don't understand. getUserMedia cannot currently be used with SpeechRecognition, and MediaRecorder is not even remotely related.
This proposal is about adding a MediaStreamTrack argument to SpeechRecognition's start method.
Avoiding using an online service for recognition is completely unrelated to this; please use a separate issue for that.
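Roughly, the proposed usage would look something like this (a sketch only; `start(track)` is the proposal under discussion, not a shipped signature, and Chrome currently exposes the constructor with a `webkit` prefix):

```js
// Sketch of the proposed usage; start(track) is the proposal under discussion,
// not a shipped signature.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const [track] = stream.getAudioTracks();

const recognition = new SpeechRecognition();
recognition.onresult = (e) => console.log(e.results[0][0].transcript);
recognition.start(track); // proposed: recognize audio from this track only
```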
Is the proposal that when the argument to `SpeechRecognition` is a `MediaStreamTrack`, the current implementation which captures the microphone would be overridden by the `MediaStreamTrack` input argument? Or is the idea that capturing the microphone input would be replaced entirely by the `MediaStreamTrack` input?
`MediaRecorder` is related to the degree that, AFAIK, there is no implementation which performs the STT procedure in the browser, meaning that even if a `MediaStreamTrack` is added as an argument, that `MediaStreamTrack` would still currently need to be recorded internally to be processed; there is no browser implementation that processes the audio input in "real-time" and produces output using code shipped in the browser (perhaps besides certain handhelds). Therefore, if a `MediaStreamTrack` argument is added to `SpeechRecognition`, there might as well be the entire `MediaRecorder` implementation added as well, to control all aspects of the input to be recorded.
AFAICT Mozilla does not implement `SpeechRecognition`, even when setting `media.webspeech.recognition.enable` to `true`.
There is no benefit in adding `MediaStreamTrack` to `SpeechRecognition` without the corresponding control over the entire recording, and ideally without recording anything at all, instead processing STT in "real-time". It would be much more advantageous to focus on actually implementing `SpeechRecognition` in the browser without any external service than to spend time on incorporating `MediaStreamTrack` as an argument to an API that does not work at Mozilla and relies on an external service for its result. What is there to gain in adding `MediaStreamTrack` as an argument to `SpeechRecognition` when the underlying specification and implementation of `SpeechRecognition` need work?
> I don't understand. getUserMedia cannot currently be used with SpeechRecognition, and MediaRecorder is not even remotely related.
Actually, `MediaRecorder` can currently be used with `SpeechRecognition`. Either a live user voice or `SpeechSynthesis` output can be recorded and then played back as input to `SpeechRecognition` (https://stackoverflow.com/a/46383699; https://stackoverflow.com/a/47113924).
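A minimal sketch of that record-then-replay approach, assuming `recordedBlob` already holds a `MediaRecorder` recording (of the user's voice, or of `speechSynthesis` output captured through the speakers):

```js
// Sketch of the record-then-replay approach. `recordedBlob` is assumed to hold
// an earlier MediaRecorder recording; the playback is picked up by the
// microphone that SpeechRecognition currently captures unconditionally.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

const recognition = new SpeechRecognition();
recognition.onresult = (e) => console.log(e.results[0][0].transcript);
recognition.onerror = (e) => console.error(e.error);

const playback = new Audio(URL.createObjectURL(recordedBlob));
recognition.start(); // current API: recognition listens to the microphone
playback.play();     // the speakers replay the recording into the microphone
```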
Since this is obviously something you have thought about, I would only suggest, when proceeding, not to omit the opportunity to concurrently or consecutively add `AudioBuffer`, `Float32Array`, and static file input (`.mp3`, `.ogg`, `.opus`, `.webm`, `.wav`) as well.
> Is the proposal that when the argument to `SpeechRecognition` is a `MediaStreamTrack`, the current implementation which captures the microphone would be overridden by the `MediaStreamTrack` input argument? Or is the idea that capturing the microphone input would be replaced entirely by the `MediaStreamTrack` input?
I read your question as "Should the `MediaStreamTrack` argument to `start()` be required or optional?"

Preferably required, since if it's optional we cannot get rid of any of the language that I claimed we can in the proposal.
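To illustrate the difference (a sketch only; `someTrack` is a placeholder `MediaStreamTrack` and `start(track)` is the proposed signature):

```js
// Sketch of the two shapes; `someTrack` is a placeholder MediaStreamTrack and
// start(track) is the proposed signature.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

if (someTrack) {
  recognition.start(someTrack); // proposed: recognize the supplied track
} else {
  // If the argument stays optional, this trackless path must remain, along
  // with all of the spec's microphone-capture and permission language for it.
  recognition.start();
}
```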
> `MediaRecorder` is related to the degree that, AFAIK, there is no implementation which performs the STT procedure in the browser, meaning that even if a `MediaStreamTrack` is added as an argument, that `MediaStreamTrack` would still currently need to be recorded internally to be processed; there is no browser implementation that processes the audio input in "real-time" and produces output using code shipped in the browser (perhaps besides certain handhelds). Therefore, if a `MediaStreamTrack` argument is added to `SpeechRecognition`, there might as well be the entire `MediaRecorder` implementation added as well, to control all aspects of the input to be recorded.
Are you suggesting allowing SpeechRecognition to run on a buffer of data in non-realtime? That seems like an orthogonal proposal; please file your own issue if you want to argue for that.
Giving SpeechRecognition the controls of MediaRecorder because the implementation happens to encode and send audio data to a server doesn't make sense. The server surely only allows specific settings for container and codec. It also locks out any future implementations that do not rely on a server, because there'd be no reason to support MediaRecorder configurations for them, yet they would have to.
> AFAICT Mozilla does not implement `SpeechRecognition`, even when setting `media.webspeech.recognition.enable` to `true`.
This issue is about the spec, not Mozilla's implementation.
> There is no benefit in adding `MediaStreamTrack` to `SpeechRecognition` without the corresponding control over the entire recording
See my first post in this issue again; there are lots of benefits.
> , and ideally without recording anything at all, instead processing STT in "real-time". It would be much more advantageous to focus on actually implementing `SpeechRecognition` in the browser without any external service than to spend time on incorporating `MediaStreamTrack` as an argument to an API that does not work at Mozilla and relies on an external service for its result. What is there to gain in adding `MediaStreamTrack` as an argument to `SpeechRecognition` when the underlying specification and implementation of `SpeechRecognition` need work?
It's part of improving the spec, so you seem to have answered your own question.
@Pehrsons
> Are you suggesting allowing SpeechRecognition to run on a buffer of data in non-realtime? That seems like an orthogonal proposal; please file your own issue if you want to argue for that.
That is what occurs now in Chromium/Chrome.
Reading the specification, the term "real-time" does not appear at all in the document. The term would need to be clearly defined anyway, as "real-time" can have different meanings in different domains or be interpreted differently by different individuals.
`SpeechRecognition()` should be capable of accepting either a live `MediaStreamTrack` or a static file or JavaScript object.
Downloaded https://github.com/guest271314/mozilla-central/tree/libdeep yesterday. Will try to build and test.
It should not be any issue setting `MediaStreamTrack` as a possible parameter to `start()`. It should also not be an issue allowing that parameter to be a `Float32Array` or `ArrayBuffer`.
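For what it's worth, raw samples can already be turned into a `MediaStreamTrack` with Web Audio; a sketch of that conversion (the `Float32Array` here is just a generated tone standing in for real PCM data):

```js
// Sketch: turning raw Float32Array samples into a MediaStreamTrack with Web
// Audio. The samples here are a generated 440 Hz tone standing in for real
// PCM data.
const ctx = new AudioContext();
const samples = new Float32Array(ctx.sampleRate); // one second of audio
for (let i = 0; i < samples.length; i++) {
  samples[i] = Math.sin((2 * Math.PI * 440 * i) / ctx.sampleRate);
}

const buffer = ctx.createBuffer(1, samples.length, ctx.sampleRate);
buffer.copyToChannel(samples, 0);

const source = ctx.createBufferSource();
source.buffer = buffer;
const destination = ctx.createMediaStreamDestination();
source.connect(destination);
source.start();

const [track] = destination.stream.getAudioTracks();
// `track` could then be handed to the proposed recognition.start(track).
```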
Am particularly interested in how you intend to test what you are proposing to be specified?
@Pehrsons Re use of static files: for TTS, `espeak-ng` currently has the functionality to process static files in several ways, including `espeak-ng -x -m -f input.txt`, where `-f` names a text file to speak and `-x` writes phoneme mnemonics to stdout.
> @Pehrsons
>
> > Are you suggesting allowing SpeechRecognition to run on a buffer of data in non-realtime? That seems like an orthogonal proposal; please file your own issue if you want to argue for that.
>
> That is what occurs now in Chromium/Chrome.
Then they're not implementing the spec. Or are they, but they're using a buffer internally? Well, then you're conflating their implementation with the spec.
> Reading the specification, the term "real-time" does not appear at all in the document. The term would need to be clearly defined anyway, as "real-time" can have different meanings in different domains or be interpreted differently by different individuals.
And it doesn't have to be if we use a MediaStreamTrack, since mediacapture-streams defines what we need.
> `SpeechRecognition()` should be capable of accepting either a live `MediaStreamTrack` or a static file or JavaScript object. Downloaded https://github.com/guest271314/mozilla-central/tree/libdeep yesterday. Will try to build and test.
> It should not be any issue setting `MediaStreamTrack` as a possible parameter to `start()`. It should also not be an issue allowing that parameter to be a `Float32Array` or `ArrayBuffer`.
Again, file a separate issue if you think that is the right way to go.
> Am particularly interested in how you intend to test what you are proposing to be specified?
Give the API some input that you control, and observe that it gives you the expected output.
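A sketch of what such a test could look like under the proposed `start(track)` signature; the fixture file name and expected transcript are placeholders:

```js
// Sketch of a test using controlled input; recognition.start(track) is the
// proposed signature, and the fixture file and expected transcript are
// placeholders.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

async function testRecognition() {
  const ctx = new AudioContext();
  const response = await fetch('known-phrase.wav'); // hypothetical fixture
  const audioBuffer = await ctx.decodeAudioData(await response.arrayBuffer());

  const source = ctx.createBufferSource();
  source.buffer = audioBuffer;
  const destination = ctx.createMediaStreamDestination();
  source.connect(destination);

  const recognition = new SpeechRecognition();
  recognition.onresult = (e) => {
    const transcript = e.results[0][0].transcript;
    console.assert(transcript.trim() === 'hello world', // expected output of the fixture
      `unexpected transcript: ${transcript}`);
  };

  recognition.start(destination.stream.getAudioTracks()[0]); // proposed API
  source.start();
}
```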
> Then they're not implementing the spec.
Are you referring to the following language?
https://w3c.github.io/speech-api/#speechreco-methods

> When the speech input is streaming live through the input media stream

which does not necessarily exclusively mean a "real-time" input media stream.
> Or are they, but they're using a buffer internally?
The last time I checked, the audio was compiled into a single buffer (cannot recollect at the moment whether conversion to a `WAV` file occurs in the browser or not; I do recollect locating some code to convert audio to a `WAV` in the source code), then sent to an undisclosed remote web service in a single request.
> Again, file a separate issue if you think that is the right way to go.
Filed.
> > Then they're not implementing the spec.
>
> Are you referring to the following language?
>
> https://w3c.github.io/speech-api/#speechreco-methods
>
> > When the speech input is streaming live through the input media stream
Not necessarily. Yes, the spec is bad and hand-wavy, so it's hard to find definitions for the terms used. But it's fairly easy to understand the spec authors' intent. In this case it is that `start()` works a lot like `getUserMedia({audio: true})` but with special UI bits to tell the user that speech recognition is in progress (note that this spec originates from a time when getUserMedia was not yet available; Chrome now seems to use the same UI as for getUserMedia). Read chapter 3, "Security and privacy considerations", and this becomes abundantly clear.
> which does not necessarily exclusively mean a "real-time" input media stream.
I'm afraid it does. Otherwise, where do you pass in this non-realtime media stream? If you're suggesting the UA would provide the means through a user interface picker or the like; well, the spec repeatedly refers to "speech input", and by reading through all those references "speech input" can only be interpreted as speech input from the user, i.e., through a microphone which the user is speaking into, i.e., real-time.
> > Or are they, but they're using a buffer internally?
>
> The last time I checked, the audio was compiled into a single buffer (cannot recollect at the moment whether conversion to a `WAV` file occurs in the browser or not; I do recollect locating some code to convert audio to a `WAV` in the source code), then sent to an undisclosed remote web service in a single request.
Implementation detail and irrelevant to the spec.
> > Again, file a separate issue if you think that is the right way to go.
>
> Filed.
> I'm afraid it does. Otherwise, where do you pass in this non-realtime media stream? If you're suggesting the UA would provide the means through a user interface picker or the like; well, the spec repeatedly refers to "speech input", and by reading through all those references "speech input" can only be interpreted as speech input from the user, i.e., through a microphone which the user is speaking into, i.e., real-time.
Do not agree with that assessment. Have not noticed any new UI at Chromium. The specification as-is permits gleaning or amassing various plausible interpretations from the language, none of which are necessarily conclusive and binding, or at least not unambiguous; thus the current state of the art is to attempt to "incubate" the specification. A user can schedule an audio file (or the reading of a `Float32Array` at an `AudioWorkletNode`) to play in the future. That is not "real-time" "speech input" from the user.
Since there is current interest in amending and adding to the specification, the question must be asked: why would `SpeechRecognition` not be specified to accept both non-"real-time" audio input (e.g., from an audio file or buffer) and a `MediaStreamTrack`?
> Since there is current interest in amending and adding to the specification, the question must be asked: why would `SpeechRecognition` not be specified to accept both non-"real-time" audio input (e.g., from an audio file or buffer) and a `MediaStreamTrack`?
IMO because that would make the spec very complicated. Unnecessarily so, since there are other specs already allowing conversions between the two.
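For example, mediacapture-fromelement already lets a page turn playback of a static file into a live track (a sketch, assuming an async context; the file name is a placeholder, and Firefox has historically exposed this as `mozCaptureStream()`):

```js
// Sketch (assumes an async context): mediacapture-fromelement's captureStream()
// turns playback of a static file into a live MediaStreamTrack. The file name
// is a placeholder; Firefox has historically exposed mozCaptureStream().
const audio = new Audio('speech.ogg');
await audio.play();

const stream = audio.captureStream
  ? audio.captureStream()
  : audio.mozCaptureStream();
const [track] = stream.getAudioTracks();
// `track` is a real-time track even though its source is a static file.
```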
Hi There!
I apologize for resurrecting this discussion almost five years later.
I was wondering if a conclusion has been reached regarding whether the start() method should take an input. I am trying to allow my users to select the microphone they want to use for recognition within our app. Currently, I am forcing them to change their default device, but it would be much easier if we could let them decide in-app.
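For context, a sketch of how in-app device selection could combine with the proposed track argument (assuming an async context; `start(track)` is the proposed, not yet shipped, signature):

```js
// Sketch (assumes an async context): pick a microphone with enumerateDevices(),
// open it via a deviceId constraint, then hand the track to the proposed
// recognition.start(track).
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

const devices = await navigator.mediaDevices.enumerateDevices();
const mics = devices.filter((d) => d.kind === 'audioinput');

const stream = await navigator.mediaDevices.getUserMedia({
  audio: { deviceId: { exact: mics[0].deviceId } }, // e.g. the user's in-app choice
});

const recognition = new SpeechRecognition();
recognition.start(stream.getAudioTracks()[0]); // proposed optional argument
```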
Thank you in advance for your time!
Hello! Chrome is planning on adding MediaStreamTrack as an optional parameter to the start() method. Does anyone have any objections to this change? If not, I'll work on sending out a PR with the proposed changes.
There is an old issue in Bugzilla but it doesn't discuss much.

We should revive this, to give the application control over the source of audio.

Not letting the application be in control has several issues:

Letting the application be in control has several advantages:

To support a `MediaStreamTrack` argument to `start()`, we need to:

What to throw and what to fire I leave unsaid for now.