WebAudio / web-audio-api

The Web Audio API v1.0, developed by the W3C Audio WG
https://webaudio.github.io/web-audio-api/

Planar-only considered harmful #2458

Open cmuratori opened 2 years ago

cmuratori commented 2 years ago

As far as I could tell from the spec and the API, the design of the WebAudio API is such that it is always "planar" rather than interleaved, no matter what part of the pipeline is in play. While the efficiency of this design for a filter graph is a separate concern, the WebAudio design creates a more serious issue because it does not distinguish between the final output format and the graph processing format.

As WASM becomes more prevalent, more people will be writing their own audio subsystems. These subsystems will have to output to the browser at some point. At the moment, the only viable option is to use WebAudio. Because WebAudio only supports planar data formats, this means people's internal audio subsystems must output planar data.

This creates a substantial inefficiency. A large installed base of CPUs does not have planar scatter hardware. Since modern mixers must use SIMD to be fast, this means that scattering channel output 2-wide (stereo) or 8-wide (spatial) is extremely costly, as several instructions must be used to manually deinterleave and scatter each sample.
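
To make the cost concrete, here is a minimal scalar sketch of the scatter that a planar-only output forces on an interleaved mixer (the function and buffer names are illustrative, not part of any API):

function deinterleave(interleaved, planar) {
  // planar is an array of per-channel Float32Arrays; interleaved is
  // frame-major: c0 c1 ... cN-1, c0 c1 ... cN-1, ...
  const channels = planar.length;
  const frames = planar[0].length;
  for (let f = 0; f < frames; f++) {
    for (let c = 0; c < channels; c++) {
      // one strided load/store per sample instead of a straight wide copy
      planar[c][f] = interleaved[f * channels + c];
    }
  }
}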

To add insult to injury, most hardware expects to receive interleaved sample data. This means that in many cases, after the WASM code has taken a large performance hit to deinterleave to planar, the browser will then turn around and take another large performance hit to reinterleave the samples, often (again) without gather hardware, meaning it will require several instructions to manually reinterleave.

I would like to recommend that serious consideration be given to supporting interleaved float as an output format. It could be a separate path just for direct audio output, and does not have to be part of the graph specification, if that reduces the cost of adding it to the specification. It could even be made as part of a WASM audio output specification only, with no JavaScript support, if necessary, since it would presumably only be relevant to people writing their own audio subsystems. I believe there is already consideration of WASM-specific use cases, as I have seen mentions of the need to avoid cloning memory from the WASM memory array into the JavaScript audio worklet, etc.

If I have misunderstood the intention here in some way, I would welcome explanations as to how to avoid the substantial performance penalties inherent in the current design.

- Casey

tklajnscek commented 2 years ago

Just wanted to drop a quick note here that I wholeheartedly agree with Casey on this.

We have an Audio Worklet backend live in multiple products and this was one of the things that felt wrong when I was coding it.

All our internal processing (C++ compiled to WASM) uses interleaved data and we deinterleave when filling the buffers in the worklet processor.
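
For reference, the worklet-side conversion looks roughly like the following; a simplified sketch assuming a stereo output, where renderInterleavedMix() is a hypothetical function returning (a view onto) the interleaved WASM mix buffer:

class InterleavedSourceProcessor extends AudioWorkletProcessor {
  process(inputs, outputs) {
    const [left, right] = outputs[0];         // planar Float32Arrays, one per channel
    const frames = left.length;
    const mix = renderInterleavedMix(frames); // hypothetical: interleaved L R L R ... floats
    for (let i = 0; i < frames; i++) {
      // the per-sample deinterleave this thread is about
      left[i] = mix[2 * i];
      right[i] = mix[2 * i + 1];
    }
    return true;
  }
}
registerProcessor('interleaved-source', InterleavedSourceProcessor);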

As all native / low-level interfaces I've used so far expect the data interleaved, this seems very counterintuitive and bad for performance, as Casey said.

So, if we can't have interleaved everywhere due to some need of the graph processing system, at least a configurable option to bypass this conversion for the simple case would be great.

meshula commented 2 years ago

The API fundamentally makes you deal with channels (input/output indices). https://developer.mozilla.org/en-US/docs/Web/API/AudioNode/connect
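
For example, routing individual channels means going through index-based connect() calls and splitter/merger nodes; a minimal sketch (stereoSource stands in for any stereo AudioNode, e.g. a MediaElementAudioSourceNode):

const ctx = new AudioContext();
const splitter = new ChannelSplitterNode(ctx, { numberOfOutputs: 2 });
const merger = new ChannelMergerNode(ctx, { numberOfInputs: 2 });
stereoSource.connect(splitter);
splitter.connect(merger, 0, 1); // splitter output 0 (left)  -> merger input 1
splitter.connect(merger, 1, 0); // splitter output 1 (right) -> merger input 0
merger.connect(ctx.destination);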

At a basic level, I agree that an interleaved input on a destination node, and interleaved output from media sources, would be a fantastic addition.

At the graph level, I've always wished that nodes' inputs and outputs were some kind of Signal object, rather than planar PCM. In particular, I wish the Signal object could carry planar data, fused spatial or quasi-spatial data such as ambisonic signals, or spectral data, and that the signal rate was the natural signal rate for the data. Furthermore, I'd hope that there'd be some interface to test whether signal-outs are compatible with signal-ins, and that in the case of incompatible signals, explicit adaptor nodes might be available, so that deplanarization, spectralization, rate conversion, or up- and down-mixing would never be heuristically applied, but explicitly supplied to support the dataflow.

padenot commented 2 years ago

The answer to most of the questions here is "for historical reasons". The Web Audio API was shipped by web browsers without having been fully specified, and without enough consideration for really advanced use cases and high performance. The alternative proposal at the time was direct PCM playback in interleaved format, but it didn't get picked; this was about 10 years or so ago. The Web Audio API's native nodes will never work in interleaved mode, because the design is fundamentally planar (as noted in previous messages here), but this doesn't mean the problem cannot be solved so that folks with demanding workloads can use the Web.

Another API was briefly considered a few years back, but it didn't feel important enough to continue investigating, in light of the performance numbers gathered at the time. It was essentially just an audio device callback, but this is implementable today with just a single AudioWorkletNode.

That said, the first rule of any performance discussion is to gather performance data. In particular, are we talking here about:

(a) software that uses a hybrid of native audio nodes and AudioWorkletProcessor
(b) multiple AudioWorkletProcessor, with the audio routed via regular AudioNode.connect calls
(c) a single AudioWorkletProcessor, with a custom audio rendering pipeline or graph in it

or some other setup, or maybe something hybrid?

(a) and (b) suffer from lots of copies to/from the WASM heap; (c) doesn't: a single copy from the WASM heap happens at the end.
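
A sketch of what (c) can look like, assuming a stereo output and that the whole custom pipeline lives in WASM, with hypothetical renderQuantum()/leftPtr()/rightPtr() exports plus an exported memory; the only copy out of the WASM heap is the two set() calls at the end:

class WasmPipelineProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.port.onmessage = async ({ data }) => {
      // the compiled WebAssembly.Module is assumed to be posted from the main thread
      this.wasm = (await WebAssembly.instantiate(data.module, {})).exports;
    };
  }
  process(inputs, outputs) {
    if (!this.wasm) return true;           // not ready yet: leave the outputs silent
    const frames = outputs[0][0].length;
    this.wasm.renderQuantum(frames);       // the entire custom graph runs inside WASM
    const heap = this.wasm.memory.buffer;
    // single copy from the WASM heap into the planar output, once per render quantum
    outputs[0][0].set(new Float32Array(heap, this.wasm.leftPtr(), frames));
    outputs[0][1].set(new Float32Array(heap, this.wasm.rightPtr(), frames));
    return true;
  }
}
registerProcessor('wasm-pipeline', WasmPipelineProcessor);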

(a) and (b) also suffer from a minimum processing block size (referred to in the spec as a "render quantum") of (for now) 128 frames, but this will change in https://github.com/WebAudio/web-audio-api/issues/2450. (c) doesn't have this issue.

For (a) and (b), the interleaving/deinterleaving operations can be folded into the copy to the WASM heap (with possible sample type conversion if e.g. the DSP works in int16 or fixed point). This lowers the real cost of the interleaving/deinterleaving operations without eliminating it. (See links at the end for the elimination of this copy.)
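
A sketch of that folding, for a DSP that wants interleaved int16 input; heapI16 is assumed to be an Int16Array view onto the WASM memory and inPtr an element offset into it (both hypothetical names):

function copyInputToWasm(input, heapI16, inPtr) {
  // input is the planar worklet input: an array of Float32Arrays, one per channel
  const channels = input.length;
  const frames = input[0].length;
  for (let c = 0; c < channels; c++) {
    const src = input[c];
    for (let f = 0; f < frames; f++) {
      const s = Math.max(-1, Math.min(1, src[f]));
      // interleaving and float -> int16 conversion folded into the one copy
      heapI16[inPtr + f * channels + c] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
  }
}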

For (c), the interleaving / deinterleaving operations will happen exactly twice (as noted): going into the AudioContext.destination, and from the AudioContext.destination to the underlying OS API. My guess is that this is wasteful but negligible with any meaningful DSP load, at least with all workloads I've measured over the years.

Again, what is needed first and foremost is real performance numbers. Thankfully, if there is already running code implementing the approaches above ((a), (b) and (c), possibly others), it's not particularly hard to get them, using https://blog.paul.cx/post/profiling-firefox-real-time-media-workloads/ and https://web.dev/profiling-web-audio-apps-in-chrome/. I assume the people in this discussion can skip most of the prose in both those articles, because they are familiar with real-time-safe code, and can skip to the part about getting/sharing the data.

Here we're mostly interested in the difference between the total time it took to render n frames of audio (i.e. the AudioNode DSP methods, plus the calls to the process() method of each AudioWorkletProcessor instantiated in the graph) and the time it took for the callback to run, which would essentially be the routing overhead. It's going to be possible to get rather precise information about the per-AudioNode or per-AudioWorkletProcessor overhead; I'm happy to help if anybody is interested. Then there's the overhead of the additional audio IPC that browsers have to implement because of sandboxing and other security concerns, which can be another copy depending on the architecture, but this is a fixed cost per audio device.

Some assorted links for context:

guest271314 commented 2 years ago

At the moment, the only viable option is to use WebAudio.

Furthermore, I'd hope that there'd be some interface to test whether signal-outs are compatible with signal-ins, and that in the case of incompatible signals, explicit adaptor nodes might be available, so that deplanarization, spectralization, rate conversion, or up- and down-mixing would never be heuristically applied, but explicitly supplied to support the dataflow.

You might be looking for MediaStreamTrack API for Insertable Streams of Media (also known as Breakout Box) https://github.com/alvestrand/mediacapture-transform.
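
A minimal sketch of that pattern (following the Chromium implementation of the proposal; inputTrack is assumed to be an existing audio MediaStreamTrack):

const processor = new MediaStreamTrackProcessor({ track: inputTrack });
const generator = new MediaStreamTrackGenerator({ kind: 'audio' });
processor.readable
  .pipeThrough(new TransformStream({
    transform(audioData, controller) {
      // inspect or rewrite the WebCodecs AudioData here, then pass it along
      controller.enqueue(audioData);
    },
  }))
  .pipeTo(generator.writable);
// generator is itself a MediaStreamTrack, so it can be played back or recorded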

a large performance hit to deinterleave to planar, the browser will then turn around and take another large performance hit to reinterleave the samples

I have not observed performance issues streaming raw 2-channel S16LE PCM from parec.

From https://wiki.multimedia.cx/index.php/PCM, under "Channels And Interleaving":

If the PCM type is monaural, each sample will belong to that one channel. If there is more than one channel, the channels will almost always be interleaved: left sample, right sample, left, right, etc.

I convert between interleaved data and per-channel (planar) data using minimally modified versions of the JavaScript functions in https://stackoverflow.com/a/35248852:

// Converts 32-bit float samples in [-1, 1] to 16-bit integer samples.
// The first startIndex items are skipped.
function floatTo16Bit(inputArray, startIndex){
    var output = new Uint16Array(inputArray.length - startIndex);
    for (var i = startIndex; i < inputArray.length; i++){
        var s = Math.max(-1, Math.min(1, inputArray[i]));
        output[i - startIndex] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    return output;
}

// This is passed an unsigned 16-bit integer array. It is converted to a 32-bit float array.
// The first startIndex items are skipped, and items up to the 'length' index are converted.
function int16ToFloat32(inputArray, startIndex, length) {
    var output = new Float32Array(length - startIndex);
    for (var i = startIndex; i < length; i++) {
        var int = inputArray[i];
        // If the high bit is on, then it is a negative number, and actually counts backwards.
        var float = (int >= 0x8000) ? -(0x10000 - int) / 0x8000 : int / 0x7FFF;
        output[i - startIndex] = float;
    }
    return output;
}

The pipeline uses MediaStreamTrackGenerator, MediaStreamTrackProcessor (solely to piggy-back on "silence" to get timestamps), MediaStreamAudioDestinationNode, and WebCodecs AudioData.

https://github.com/guest271314/captureSystemAudio/blob/e7454dd04750633a33868f63d50a2552224b7226/native_messaging/capture_system_audio/audioStream.js#L132-L232

 async captureSystemAudio() { 
   this.recorder.start(1); 
   let channelData = []; 
   try { 
     await Promise.allSettled([ 
       this.stdout 
         .pipeTo( 
           new WritableStream({ 
             write: async (value, c) => { 
               let i = 0; 
               for (; i < value.buffer.byteLength; i++, this.readOffset++) { 
                 if (channelData.length === 441 * 4) { 
                   this.inputController.enqueue([...channelData]); 
                   channelData.length = 0; 
                 } 
                 channelData.push(value[i]); 
               } 
             }, 
             ...
           }) 
         ) 
         .catch(console.warn), 
       this.audioReadable 
         .pipeTo( 
           new WritableStream({ 
             abort(e) { 
               console.error(e.message); 
             }, 
             write: async ({ timestamp }) => { 
               const uint8 = new Int8Array(441 * 4); 
               const { value, done } = await this.inputReader.read(); 
               if (!done) uint8.set(new Int8Array(value)); 
               const uint16 = new Uint16Array(uint8.buffer); 
               // https://stackoverflow.com/a/35248852 
               const channels = [new Float32Array(441), new Float32Array(441)]; 
               for (let i = 0, j = 0, n = 1; i < uint16.length; i++) { 
                 const int = uint16[i]; 
                 // If the high bit is on, then it is a negative number, and actually counts backwards. 
                 const float = 
                   int >= 0x8000 ? -(0x10000 - int) / 0x8000 : int / 0x7fff; 
                  // de-interleave into the planar channel arrays 
                 channels[(n = ++n % 2)][!n ? j++ : j - 1] = float; 
               } 
               const data = new Float32Array(882); 
               data.set(channels.shift(), 0); 
               data.set(channels.shift(), 441); 
               const frame = new AudioData({ 
                 timestamp, 
                 data, 
                 sampleRate: 44100, 
                 format: 'f32-planar', 
                 numberOfChannels: 2, 
                 numberOfFrames: 441, 
               }); 
               this.duration += frame.duration; 
               await this.audioWriter.write(frame); 
             }, 
             close: () => { 
               console.log('Done reading input stream.'); 
             }, 
           }) 
         ) 
         .catch(console.warn), 
       this.ac.resume(), 
     ]); 
     return this.promise; 
   } catch (err) { 
     console.error(err); 
   } 
 } 

o0101 commented 1 year ago

I'm currently working on something like this for remote browser isolation audio streaming, and I'm having to resort to using a mono stream from parec, because the format on the client is planar (instead of interleaved) and I don't know a performant way to produce the planar format AudioContext expects.

I think the use case of real-time processing / playing-from-stream of audio is pretty important.

Is there a way to do this with stereo?

guest271314 commented 1 year ago

@crisdosyago

I'm currently working on something like this for remote browser isolation audio streaming and having to resort to using a mono stream from parec

Note, I am also getting 2-channel audio from parec and processing/recording that audio in real time, converting interleaved to planar.

guest271314 commented 1 year ago

The algorithm that I use to convert interleaved PCM to planar - which does not impact performance in the least when run on the main thread - is as follows:

  1. Create an array to store raw PCM as Uint8Array data, typically the data from a fetch() request, e.g., let channelData = [];
  2. Fill the array from 1. until it holds 441 * 4 bytes, then enqueue and reset it, e.g.,
     for (; i < value.buffer.byteLength; i++, this.readOffset++) {
       if (channelData.length === 441 * 4) {
         this.inputController.enqueue([...channelData]);
         channelData.length = 0;
       }
       channelData.push(value[i]);
     }
  3. Process the data in the array from 2. Set that data in a new Uint8Array or Int8Array (alternatively, use the subarray() method of the original Uint8Array, if that is how you are getting the data). Create an array containing two (2) Float32Arrays reflecting the channels that will be written to in the modified function from https://stackoverflow.com/a/35248852, pass the buffer from the Uint8Array to a new Uint16Array, then iterate the values therein to convert them to floats:
     const uint8 = new Int8Array(441 * 4);
     const { value, done } = await this.inputReader.read();
     if (!done) uint8.set(new Int8Array(value));
     const uint16 = new Uint16Array(uint8.buffer);
     // https://stackoverflow.com/a/35248852
     const channels = [new Float32Array(441), new Float32Array(441)];
     for (let i = 0, j = 0, n = 1; i < uint16.length; i++) {
       const int = uint16[i];
       // If the high bit is on, then it is a negative number, and actually counts backwards.
       const float = int >= 0x8000 ? -(0x10000 - int) / 0x8000 : int / 0x7fff;
       // de-interleave into the planar channel arrays
       channels[(n = ++n % 2)][!n ? j++ : j - 1] = float;
     }

     The PCM, converted from interleaved to planar, is now stored in the two elements of the channels array.

o0101 commented 1 year ago

Thank you, @guest271314! This looks awesome. I might try to use your code at some point 🙂