Open gregwhitworth opened 4 years ago
Should probably mention that this is about the "Audio Stream Category" explainer document (I'm not sure how [] tags work here).
There was a long discussion about whether content-hint should be a constraint or an attribute when it was initially defined; the people who argued for an attribute won.
Perhaps ‘speech’ isn't the best label. This is debatable, but perhaps that is part of the point. As far as I know, Google is the only world-class human speech recognition platform that sends raw audio to the cloud to preserve all signal (I know of two that do not, but there may be others that do).
From a high level, here are the two buckets for processing:

- Human ear targets – MOS scoring favors certain types of noise over others; complete removal of signal is preferred to certain types of artifacting which disturb the listener.
- Speech recognition system targets – the goal is to reduce as much noise as possible while preserving signal. The echo canceller should leave in residual echo leakage to prevent signal erosion (for example).
  - An additional target is to reduce the amount of audio data sent over the wire.
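The two buckets above imply different processing trade-offs. A minimal sketch of that distinction, with all names and parameter values purely illustrative (nothing here comes from a spec or platform API):

```javascript
// Hypothetical sketch: choose processing parameters per target bucket.
// "human-ear" and "speech-recognition" label the two buckets discussed
// above; the returned fields are invented for illustration.
function pipelineFor(target) {
  switch (target) {
    case "human-ear":
      // MOS-optimized: dropping signal is preferable to artifacts
      // that disturb a human listener.
      return {
        echoCancellation: "full",
        noiseSuppression: "aggressive",
        allowSignalLoss: true,
      };
    case "speech-recognition":
      // Preserve signal: tolerate residual echo leakage rather than
      // erode signal, and aim to reduce data sent over the wire.
      return {
        echoCancellation: "residual-leakage-ok",
        noiseSuppression: "conservative",
        allowSignalLoss: false,
      };
    default:
      throw new Error(`unknown target: ${target}`);
  }
}
```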
On web platforms today, there is no way to differentiate between these buckets (both have components for NS and AEC). Asserting that there is no other bucket seems a bit dismissive – we are providing a way to differentiate between them and the underlying platform is not required to provide a unique implementation if none is deemed warranted.
For Windows there are unique implementations and there are applications wanting to differentiate.
What is the recommendation?
I agree with hta@ but I would like to add a consideration.
While it might fit with the way echoCancellation and noiseSuppression are used, I feel this might cause confusion. For example, suppose a user wants a "speech" type of stream but wants to disable echoCancellation. In this case, assuming speech is for human understanding, disabling echo cancellation results in the session not being of "speech" type any longer. I have the impression this might result in a slippery slope for what concerns interaction between different constraints.
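To make the confusion concrete: a sketch of the combination described above, where a caller wants a "speech" stream but disables echoCancellation. The request object and consistency check are invented for illustration (in the real API, contentHint is an attribute set on the MediaStreamTrack after capture, not a constraint):

```javascript
// Hypothetical model of a capture request combining a content hint
// with an explicit echoCancellation constraint.
const request = {
  audio: { echoCancellation: false, noiseSuppression: true },
  contentHint: "speech", // in practice, set on the track after getUserMedia
};

// If "speech" is taken to mean "processed for human understanding",
// disabling AEC contradicts the hint, and the UA must decide which
// signal wins -- the slippery slope of interacting constraints.
function isConsistent(req) {
  if (req.contentHint === "speech" && req.audio.echoCancellation === false) {
    return false;
  }
  return true;
}
```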
To come back here to touch base: I'm a bit concerned that there is a lack of understanding around the needs of a speech recognition system vs. the needs of the human ear. Whatever the category is named (speech was proposed simply because it is consistent with Windows), it is about targeting a recognition system (one that recognizes human speech). The human ear prefers to listen to signal from which 'unpleasant' information has been discarded, whereas a recognition system does not want any signal discarded at all. Some recognition systems prefer raw input to avoid loss of signal, but the remainder prefer to remove as much of the noise (like echo) as possible and isolate the talker of interest (beamforming, shared with other techniques). Some noise suppression (CTR, for instance) can be destructive to recognition systems.
The result: echoCancellation (very likely) and noiseSuppression (possibly) may no longer be sufficient to describe the effects pipeline, and a qualifier is required.
The question here: what should the mechanism be for describing the specific pipeline that is needed? Windows as a platform supports the various qualifiers; other OSs/platforms may not, but on those OSs/platforms the qualifiers can collapse, as there is only a single AEC and a single NS available.
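The collapse described above can be sketched as a lookup: on a platform with distinct implementations the qualifier selects between them, while on a platform with a single AEC/NS every qualifier resolves to the same pipeline. All table entries and names below are invented for illustration:

```javascript
// Hypothetical mapping from (platform, qualifier) to a concrete
// processing pipeline. On "other" platforms all qualifiers collapse
// to the one available implementation.
const platformPipelines = {
  windows: {
    "human-ear": "communications-aec",
    "speech-recognition": "recognition-aec",
  },
  other: {
    "human-ear": "default-aec",
    "speech-recognition": "default-aec",
  },
};

function resolvePipeline(platform, qualifier) {
  const table = platformPipelines[platform] || platformPipelines.other;
  return table[qualifier];
}
```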
A bit of history.
In his review of the content-hints specification, Jan-Ivar Bruaroey of Mozilla objected to the specification as it then existed, and laid out several requirements for content-hints to behave similarly across browsers and operating systems.
Harald has since updated the content-hints specification to address these issues.
Given this history, PR 664 was labelled "Submitter Action Needed", with the next steps being modifications to address Issues 1-3 above, and presentation to the WebRTC WG (next meeting on March 30).
Thanks, that makes sense. Looking forward to next steps :)
Filing this issue on behalf of HTA from Google, from a side thread, to have the discussion here.