Open gregwhitworth opened 4 years ago
Should probably mention that this is about the "Audio Stream Category" explainer document (I'm not sure how [] tags work here).
There was a long discussion about whether content-hint should be a constraint or an attribute when it was initially defined; the people who argued for an attribute won.
Perhaps ‘speech’ isn't the best label. This is debatable, but perhaps that is part of the point. As far as I know, Google is the only world-class human speech recognition platform that sends raw audio to the cloud to preserve all signal (I know of two that do not, but there may be others that do).
From a high level, here are the two buckets for processing:

- Human ear targets – MOS scoring favors certain types of noise over others; complete removal of signal is preferred to certain types of artifacting which disturb the listener.
- Speech recognition system targets – the goal is to reduce as much noise as possible while preserving signal. The echo canceller should leave in residual echo leakage to prevent signal erosion (for example).
  - An additional target is to reduce the amount of audio data sent over the wire.
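The two buckets above imply different processing trade-offs. A minimal sketch of that distinction, with all names and parameter values purely illustrative (nothing here comes from a spec or platform API):

```javascript
// Hypothetical sketch: choose processing parameters per target bucket.
// "human-ear" and "speech-recognition" label the two buckets discussed
// above; the returned fields are invented for illustration.
function pipelineFor(target) {
  switch (target) {
    case "human-ear":
      // MOS-optimized: dropping signal is preferable to artifacts
      // that disturb a human listener.
      return {
        echoCancellation: "full",
        noiseSuppression: "aggressive",
        allowSignalLoss: true,
      };
    case "speech-recognition":
      // Preserve signal: tolerate residual echo leakage rather than
      // erode signal, and aim to reduce data sent over the wire.
      return {
        echoCancellation: "residual-leakage-ok",
        noiseSuppression: "conservative",
        allowSignalLoss: false,
      };
    default:
      throw new Error(`unknown target: ${target}`);
  }
}
```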
On web platforms today, there is no way to differentiate between these buckets (both have components for NS and AEC). Asserting that there is no other bucket seems a bit dismissive – we are providing a way to differentiate between them and the underlying platform is not required to provide a unique implementation if none is deemed warranted.
For Windows there are unique implementations and there are applications wanting to differentiate.
What is the recommendation?
I agree with hta@ but I would like to add a consideration.
While it might fit with the way echoCancellation and noiseSuppression are used, I feel this might cause confusion. For example, suppose a user wants a "speech" type of stream but wants to disable echoCancellation. In this case, assuming speech is for human understanding, disabling echo cancellation results in the session not being of "speech" type any longer. I have the impression this might result in a slippery slope for what concerns interaction between different constraints.
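To make the confusion concrete: a sketch of the combination described above, where a caller wants a "speech" stream but disables echoCancellation. The request object and consistency check are invented for illustration (in the real API, contentHint is an attribute set on the MediaStreamTrack after capture, not a constraint):

```javascript
// Hypothetical model of a capture request combining a content hint
// with an explicit echoCancellation constraint.
const request = {
  audio: { echoCancellation: false, noiseSuppression: true },
  contentHint: "speech", // in practice, set on the track after getUserMedia
};

// If "speech" is taken to mean "processed for human understanding",
// disabling AEC contradicts the hint, and the UA must decide which
// signal wins -- the slippery slope of interacting constraints.
function isConsistent(req) {
  if (req.contentHint === "speech" && req.audio.echoCancellation === false) {
    return false;
  }
  return true;
}
```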
To come back here to touch base: I'm a bit concerned that there is a lack of understanding around the needs of a speech recognition system vs. the needs of the human ear. Whatever the category is named (speech was proposed simply because it is consistent with Windows), it is about targeting a recognition system (one that recognizes human speech). The human ear prefers to listen to signal from which 'unpleasant' information has been discarded, whereas a recognition system does not want any signal discarded at all. Some recognition systems prefer raw input to avoid loss of signal, but the remainder prefer to remove as much of the noise (like echo) as possible and isolate the talker of interest (beamforming, shared with other techniques). Some noise suppression (CTR, for instance) can be destructive to recognition systems.
The result: echoCancellation (very likely) and noiseSuppression (possibly) may no longer be sufficient to describe the effects pipeline, and a qualifier is required.
The question here: what should the mechanism be for describing the specific pipeline that is needed? Windows as a platform supports the various qualifiers; other OSs/platforms may not, but on those OSs/platforms the qualifiers can collapse, as there is only a single AEC and a single NS available.
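The collapse described above can be sketched as a lookup: on a platform with distinct implementations the qualifier selects between them, while on a platform with a single AEC/NS every qualifier resolves to the same pipeline. All table entries and names below are invented for illustration:

```javascript
// Hypothetical mapping from (platform, qualifier) to a concrete
// processing pipeline. On "other" platforms all qualifiers collapse
// to the one available implementation.
const platformPipelines = {
  windows: {
    "human-ear": "communications-aec",
    "speech-recognition": "recognition-aec",
  },
  other: {
    "human-ear": "default-aec",
    "speech-recognition": "default-aec",
  },
};

function resolvePipeline(platform, qualifier) {
  const table = platformPipelines[platform] || platformPipelines.other;
  return table[qualifier];
}
```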
A bit of history.
In his review of the content-hints specification, Jan-Ivar Bruaroey of Mozilla objected to the specification as it then existed, and laid out several requirements for content-hints to behave similarly across browsers and operating systems.
Harald has since updated the content-hints specification to address these issues.
Given this history, PR 664 was labelled "Submitter Action Needed", with the next steps being modifications to address Issues 1-3 above, and presentation to the WebRTC WG (next meeting on March 30).
Thanks, that makes sense. Looking forward to next steps :)
Filing this issue on behalf of HTA from Google, from a side thread, to have the discussion here.