Aggregation of multiple audio I/O devices

hoch commented 5 years ago

One of the key advantages of ADC is a single callback function for input and output. This is possible by combining the input and output streams and serving them to the user together. As shown in the example, the user can specify two different device IDs, one for input and one for output.

const constraints = {
  inputDeviceId: inputId,
  outputDeviceId: outputId
};

const client = await navigator.mediaDevices.getAudioDeviceClient(constraints);

It is common for the two devices to be physically separate (i.e. different clocks, sample rates and threads). To serve these isolated streams together, the system needs to re-clock and resample the audio data before sending it to the callback function. This is the so-called "device aggregation" in ADC.
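As a rough illustration of the resampling half of that step, here is a minimal sketch using plain linear interpolation. The function name `resampleLinear` and the `ratio` parameter are hypothetical and not part of any ADC proposal; a real aggregation layer would most likely use a higher-quality filter and would adjust the ratio continuously to follow clock drift between the devices.

```javascript
// Hypothetical helper, for illustration only: resample one mono block by
// linear interpolation. `ratio` is inputRate / outputRate. A re-clocker
// would nudge this value over time to track drift between device clocks.
function resampleLinear(input, ratio) {
  if (input.length === 0) {
    return new Float32Array(0);
  }
  // Number of output samples whose read position stays inside the input.
  const outLength = Math.floor((input.length - 1) / ratio) + 1;
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio; // fractional read position in the input
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    // Weighted average of the two neighbouring input samples.
    output[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return output;
}
```

Linear interpolation is the cheapest option and sits at the "speed" end of the speed/quality trade-off; it is shown here only because it fits in a few lines.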

Problem 1. The scope of aggregation

1. The aggregation should only include 1 input and 1 output device.
2. The aggregation should be free-for-all (multi-input and multi-output).

For option 2 (which is quite similar to macOS's aggregate device), we can think of something like this:

const constraints = {
  inputDeviceId: [inputId0, inputId1, inputId2],
  outputDeviceId: [outputId0, outputId2],
};

Problem 2. The configurability of aggregation layer

Aggregation by the system involves many parameters: resampling quality, re-clocker options, the speed/quality trade-off, etc. Should ADC expose these options at all? Or should we just say this is up to the UA? Or should it be somewhere in the middle?

NOTE: @padenot mentioned at TPAC 2018 that Firefox uses this "re-clocking" mechanism to aggregate and align audio data from multiple devices.

hoch commented 5 years ago

In the teleconference today, we agreed that multi-input/output aggregation should be provided by the OS. The group is in favor of option 1 from Problem 1. The DJ app use case was brought up as a counterexample, but developers can use multiple outputs to separate audience outs and monitor outs.

For the configurability, the collective thought was to have some sort of control, but we have not agreed on its degree or scope.

rtoy commented 5 years ago

Let me also add that we discussed exposing the resampler (if needed) so that the developer can trade off quality against latency. There was no decision to do anything about this, but it is something that we might want to think about.

Also want to give the rationale for doing option 1 from Problem 1:

- People doing multi-input and multi-output devices are already sophisticated users.
- Mac OSX already provides OS level ways to aggregate devices into one virtual device.
  - But other OSes may not. We're expecting sophisticated users to be able to get the necessary software to aggregate devices.
- It simplifies the API and the implementation.

Please correct me if I got these things wrong.

pmlt commented 5 years ago

There are several use cases in which an audio device client would be created with only a single input or output device, but not the other. For these use cases, having to pay for the additional latency of clock synchronization without reaping any benefits would be unfortunate.

Any sound-generating application that is sensitive to latency (like a game) will have this issue. These apps rarely need audio input, and if they do, they do not require clock synchronization. They would sooner create two separate ADC contexts, one for input and one for output, if doing so would bypass clock synchronization. Mixing input and generated audio would eventually be done via SharedArrayBuffers and Atomics instead, and only when needed (e.g. when the player enables voice chat in a multiplayer match).
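To make the SharedArrayBuffer and Atomics approach concrete, here is a minimal sketch of a single-producer/single-consumer ring buffer that an input context could write into and an output context could read from. The class name `SampleRing` and its methods are hypothetical and are not part of ADC; the sketch assumes exactly one writer and one reader, and that both sides receive the same two buffers (for example via `postMessage`).

```javascript
// Hypothetical SPSC ring buffer for audio samples, for illustration only.
// One context (e.g. the input callback) calls push(); the other calls pull().
const READ = 0;
const WRITE = 1;

class SampleRing {
  constructor(
    capacity,
    indexBuffer = new SharedArrayBuffer(2 * Int32Array.BYTES_PER_ELEMENT),
    sampleBuffer = new SharedArrayBuffer((capacity + 1) * Float32Array.BYTES_PER_ELEMENT)
  ) {
    this.indices = new Int32Array(indexBuffer); // [read index, write index]
    this.samples = new Float32Array(sampleBuffer);
    // One slot is always left empty so that "full" and "empty" differ.
    this.size = this.samples.length;
  }

  // Producer side: copies as many samples as fit, returns the count written.
  push(block) {
    const read = Atomics.load(this.indices, READ);
    const write = Atomics.load(this.indices, WRITE);
    const free = (read - write - 1 + this.size) % this.size;
    const n = Math.min(free, block.length);
    for (let i = 0; i < n; i++) {
      this.samples[(write + i) % this.size] = block[i];
    }
    // Publish the new write index only after the samples are in place.
    Atomics.store(this.indices, WRITE, (write + n) % this.size);
    return n;
  }

  // Consumer side: fills `block` with what is available, returns the count read.
  pull(block) {
    const read = Atomics.load(this.indices, READ);
    const write = Atomics.load(this.indices, WRITE);
    const available = (write - read + this.size) % this.size;
    const n = Math.min(available, block.length);
    for (let i = 0; i < n; i++) {
      block[i] = this.samples[(read + i) % this.size];
    }
    // Release the consumed slots back to the producer.
    Atomics.store(this.indices, READ, (read + n) % this.size);
    return n;
  }
}
```

Neither side ever blocks: when the ring is full, `push` writes fewer samples, and when it is empty, `pull` reads fewer, so the caller decides how to handle overruns and underruns. Note that nothing here compensates for clock drift between the two devices, which is exactly the cost this use case is willing to avoid.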

Or perhaps what I'm describing is the 'raw mode' briefly mentioned in the code example? It is not entirely clear to me what this feature does.

padenot commented 5 years ago

It sounds like you want to use two AudioDeviceClients, one for input, one for output.

padenot commented 4 years ago

One thing that is important and that is not being talked about here is the fact that browsers have to have another IPC boundary between the system audio input/output code and the "content" code, which runs scripts, etc., to be able to properly sandbox "content" code. This is in contrast to native programs that do the audio IO directly.

Aggregating input and output streams, re-clocking in the process that does the audio IO, and doing only a single IPC transaction to the content process is far superior to doing multiple context switches and buffering. Doing so allows using lower buffer sizes, not the opposite: more threads mean more real-time threads and more context switches, which increases scheduling hazards and scheduler pressure, and leads to needing bigger buffer sizes to have solid audio.

The high-level nature of AudioContext and MediaStreams allows easily implementing this today: for example, round-trip latency in Firefox on OSX is limited by the fact that the Web Audio API requires doing block processing with 128-frame buffers: we're currently sub-10ms round trip on OSX without special hardware, but the limit is arbitrary.
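For a sense of scale, the duration of one 128-frame block can be worked out directly. The sample rates below are assumptions chosen for illustration (the thread does not state one), and `blockDurationMs` is a hypothetical helper name.

```javascript
// Duration of one block of `frames` sample frames, in milliseconds.
function blockDurationMs(frames, sampleRate) {
  return (frames / sampleRate) * 1000;
}

blockDurationMs(128, 48000); // about 2.67 ms
blockDurationMs(128, 44100); // about 2.90 ms
```

So each 128-frame processing block accounts for a little under 3 ms at common sample rates.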

WebAudio / web-audio-cg

Aggregation of multiple audio I/O devices #4
