WebAudio / web-audio-api

The Web Audio API v1.0, developed by the W3C Audio WG
https://webaudio.github.io/web-audio-api/

Make AudioBuffer Transferable #2390

Closed: chcunningham closed this issue 10 months ago

chcunningham commented 3 years ago

Describe the feature
Follow-up stemming from WebAudio/web-audio-api-v2#111. Once AudioBuffer is exposed to DedicatedWorker, we'll want to transfer AudioBuffers created elsewhere into the DedicatedWorker.

Is there a prototype?
Chromium would like to prototype ASAP and ship this alongside the rest of the WebCodecs API.

Describe the feature in more detail
The pressing use case is using WebCodecs to encode, in a worker, audio that originated from the user's microphone (getUserMedia). The audio will be sent to the worker by transferring the MediaStreamTrackProcessor's readable ReadableStream. The individual AudioFrames in the stream will themselves be transferred, which would trigger transfer of their nested AudioBuffers.
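
A minimal sketch of that flow, assuming the MediaStreamTrackProcessor shape used later in this thread, a hypothetical worker script name, and surrounding async context:

    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const [track] = stream.getAudioTracks();
    const processor = new MediaStreamTrackProcessor(track);
    const worker = new Worker('audio-encode-worker.js'); // hypothetical worker file
    // Transfer the ReadableStream of AudioFrames; each AudioFrame read in the
    // worker would in turn require its nested AudioBuffer to be transferable.
    worker.postMessage({ readable: processor.readable }, [processor.readable]);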

Transferring is a move operation, so we must consider what happens to the object that is left behind. I propose that we follow the model of ArrayBuffer.
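
For reference, the ArrayBuffer model referred to above detaches the source object, leaving a zero-length buffer behind (worker here is assumed to be an existing DedicatedWorker):

    // Transferring an ArrayBuffer is a move: the sender's copy is detached.
    const buffer = new ArrayBuffer(1024);
    worker.postMessage(buffer, [buffer]);
    console.log(buffer.byteLength); // 0 - detached after the transfer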

Doing the same for AudioBuffer would probably entail

chcunningham commented 3 years ago

I'll take a stab at a PR for this shortly.

padenot commented 3 years ago

The AudioBuffer is slightly more complex in that it has a way to skip copies and allocations in the majority of scenarios, and this has implications for where the memory lives and what owns it.

Say you're setting the same AudioBuffer on two distinct AudioBufferSourceNodes, and start() those AudioNodes. This doesn't copy. You can also set the same buffer on a convolver, etc. The copy only happens if one calls getChannelData(n) to actually see the audio frames and the buffer has been sent to the rendering thread. https://webaudio.github.io/web-audio-api/#dom-audiobuffer-getchanneldata has some info and background.
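
For illustration, the no-copy sharing described above looks like this in ordinary Web Audio API usage:

    const ctx = new AudioContext();
    const buffer = new AudioBuffer({ numberOfChannels: 1, length: 48000, sampleRate: 48000 });
    // The same AudioBuffer backs two source nodes without any copy.
    const a = new AudioBufferSourceNode(ctx, { buffer });
    const b = new AudioBufferSourceNode(ctx, { buffer });
    a.connect(ctx.destination);
    b.connect(ctx.destination);
    a.start();
    b.start();
    // Only a getChannelData() call made after the content has been acquired by
    // the rendering thread forces a copy.
    const samples = buffer.getChannelData(0);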

Here, transferring the AudioBuffer will work, but it will first allocate storage, copy to this new storage, and then transfer, because the memory is being used by the audio rendering thread.

padenot commented 3 years ago

I'll also note that, in addition to or instead of doing this, we can also allow the creation of AudioBuffer from already-allocated storage (but not from SharedArrayBuffer, only a regular ArrayBuffer).

This would allow transferring the memory owned by the AudioBuffer (which is the expensive bit), and then communication of the rate and channel count could be made "manually". I believe this would also be useful for other scenarios.
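
No such constructor exists today; purely to illustrate the idea, a hypothetical shape could be:

    // Hypothetical - not in the Web Audio API. An AudioBuffer adopting
    // already-allocated (possibly just-transferred) storage instead of
    // allocating and copying.
    const channelData = [new Float32Array(48000), new Float32Array(48000)];
    const adopted = new AudioBuffer({
      numberOfChannels: 2,
      length: 48000,
      sampleRate: 48000,
      channelData, // hypothetical member backed by regular ArrayBuffers
    });
    // sampleRate and channel count would still be communicated "manually"
    // alongside the transferred ArrayBuffers.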

chcunningham commented 3 years ago

> Here, transferring the AudioBuffer will work, but it will first allocate storage, copy to this new storage, and then transfer, because the memory is being used by the audio rendering thread.

Concept SGTM. I'm having trouble connecting the dots from AudioBuffer's "acquire the content" to how we should implement the transfer steps. Say we've sent an AudioBuffer into an AudioBufferSourceNode and its data is now being sent to the rendering thread. Is there some state set on the AudioBuffer when this occurs such that getChannelData() will now always copy? For now I'll assume we have some state, [[must copy]] = true/false.

Related: if the [[internal data]] is being used on the rendering thread as in that example, does the spec indicate that the rendering thread takes a strong reference such that it would be safe for us to detach the [[internal data]] from the AudioBuffer? For now I'll assume yes, it's always safe to detach.

Given my assumptions above, here's how I imagine the transfer steps (loosely modeled on those for ImageBitmap)

Their transfer steps, given value and dataHolder, are:
1. If [[must copy]] is true, assign a copy of value's [[internal data]] to dataHolder.[[internal data]].
2. Otherwise, assign a reference to value's [[internal data]] to dataHolder.[[internal data]].
3. Release value's reference to [[internal data]].
4. Assign true to value's [[detached]].
5. Assign 0 to value's [[number of channels]].
6. Assign 0 to value's [[length]].
7. Assign 0 to value's [[sample rate]].
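
If adopted, the observable effect on the sending side would roughly be the following (hypothetical usage, since AudioBuffer is not transferable today; worker is assumed to exist):

    // Hypothetical usage if AudioBuffer became transferable.
    const audioBuffer = new AudioBuffer({ numberOfChannels: 1, length: 48000, sampleRate: 48000 });
    worker.postMessage(audioBuffer, [audioBuffer]);
    // The object left behind is detached per the steps above:
    console.log(audioBuffer.numberOfChannels); // 0
    console.log(audioBuffer.length);           // 0
    console.log(audioBuffer.sampleRate);       // 0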

guest271314 commented 3 years ago

> The pressing use case is using WebCodecs to encode, in a worker, audio that originated from the user's microphone (getUserMedia)

Technically that can already be done by passing the MediaStream from getUserMedia() to a MediaStreamAudioSourceNode, connecting that node to an AudioWorkletNode, then using Transferable Streams in the AudioWorklet to transfer the Float32Arrays from inputs to the main thread, or any other thread. A minimal, complete, working example: https://github.com/microphone-stream/microphone-stream/pull/54/commits/8660971284cdcc950c48a5e12c1ba4d3e4db1567.
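
A simplified sketch of the worklet side of that approach, using the processor's MessagePort rather than the transferable-stream variant in the linked example (file name hypothetical):

    // mic-processor.js (sketch) - posts each 128-frame input block out of the
    // AudioWorkletGlobalScope, transferring the copied block's ArrayBuffer.
    class MicProcessor extends AudioWorkletProcessor {
      process(inputs) {
        const [input] = inputs;
        if (input.length) {
          const block = input[0].slice(); // channel 0, 128 samples
          this.port.postMessage(block, [block.buffer]);
        }
        return true; // keep the processor alive
      }
    }
    registerProcessor('mic-processor', MicProcessor);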

guest271314 commented 3 years ago

We can already stream from main thread to Worker to other threads without using AudioBuffer at all, e.g., https://github.com/guest271314/AudioWorkletStream/blob/master/worker.js

let port;
onmessage = async e => {
  'use strict';
  if (!port) {
    [port] = e.ports;
    port.onmessage = event => postMessage(event.data);
  }
  const { urls } = e.data;
  // https://github.com/whatwg/streams/blob/master/transferable-streams-explainer.md
  const { readable, writable } = new TransformStream();
  (async _ => {
    for await (const _ of (async function* stream() {
      while (urls.length) {
        yield (await fetch(urls.shift(), {cache: 'no-store'})).body.pipeTo(writable, {
          preventClose: !!urls.length,
        });
      }
    })());
  })();
  port.postMessage(
    {
      readable,
    },
    [readable]
  );
};

where, since GitHub restricts file size, I sliced a single WAV file into several parts, request the files, transfer them to the AudioWorklet, process, and output to headphones or speakers, or store or stream the data.

One issue this appears to omit is that timestamp is not defined at all in the WebCodecs specification. So while AudioBuffer could be specified as transferable, that does nothing for the user who is already transferring raw PCM (from the microphone if required) yet now has to attempt to divine how to generate a timestamp for the AudioFrames, which neither the specification nor the implementations indicate how to do.

At https://wc-audio-gen.glitch.me/ this is used

    let base_time = outputCtx.currentTime + 0.3;
    let buffers = splitBuffer(music_buffer, sampleRate / 2);
    for (let buffer of buffers) {
      let frame = new AudioFrame({
        timestamp: base_time * 1000000,
        buffer: buffer
      });  
      base_time += buffer.duration;
      encoder.encode(frame);
    }

however, we do not know what the algorithm is actually trying to produce, because no algorithm exists; and using that pattern, or variations thereof, for creation of user-defined AudioFrames that are not generated by MediaStreamTrackProcessor.readable.read() can result in a varying playback rate at output for live streams mid-stream, with MediaStreamTrackGenerator not being capable of producing quality, consistent output. For example, when I do this experiment

          let bt = ac.currentTime;
          //... 
          const frame = new AudioFrame({ timestamp: (bt  + ac.baseLatency) * 10**6, buffer });
          bt += buffer.duration;

at https://github.com/guest271314/webtransport/blob/main/webTransportBreakoutBox.js, so that I can omit creating a MediaStreamAudioDestinationNode and OscillatorNode solely to get an implementation-produced timestamp in an AudioFrame at read(), the output has a variable playback rate for a live stream.

Again, I would suggest either defining timestamp in the WebCodecs specification - with an accompanying method to produce said timestamp, or whatever name the attribute will be settled on re "microseconds" - or simply removing timestamp from AudioFrame altogether, which will render AudioFrame useless altogether; then we just have a single AudioBuffer to work with across APIs.

guest271314 commented 3 years ago

Another option is adding timestamp (https://github.com/WICG/web-codecs/issues/156) to AudioBuffer (and removing AudioFrame from WebCodecs, similar to how MediaStreamTracks described in other specifications refer back to MediaStreamTrack from Media Capture main), which, too, renders AudioFrame useless, as AudioFrame is currently, from the user's perspective, just an AudioBuffer with a timestamp attribute.

In either case timestamp needs to be demystified, clearly defined, and capable of being consistently generated by the user without the need to create additional audio nodes solely to get an internally created implementation timestamp from MediaStreamTrackProcessor.readable.read().

chcunningham commented 3 years ago

> In either case timestamp needs to be demystified, clearly defined, and capable of being consistently generated by the user without the need to create additional audio nodes solely to get an internally created implementation timestamp from MediaStreamTrackProcessor.readable.read().

I think you can create the timestamp simply by deciding some starting point (e.g. 0) for the first packet, and then setting the next packets' timestamps using the duration (established by AudioBuffer length and sampleRate) delta from the first packet.
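
A minimal sketch of that bookkeeping, assuming microsecond timestamps as discussed elsewhere in this thread:

    // Start at 0 and advance by each buffer's duration, in microseconds.
    let nextTimestampUs = 0;
    function timestampFor(audioBuffer) {
      const ts = nextTimestampUs;
      nextTimestampUs += (audioBuffer.length / audioBuffer.sampleRate) * 1e6;
      return ts;
    }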

Please file a separate issue if you'd like to discuss this further. Let's keep this issue focused on transferability.

guest271314 commented 3 years ago

I do not find transferability of AudioBuffer problematic: just transfer the Float32Array (or Int8Array or Int16Array representation), write the data to WebAssembly.Memory, or use Transferable Streams. Using TypedArrays is considerably faster than constructing an AudioBuffer and accessing the underlying data with getChannelData(): https://github.com/WebAudio/web-audio-api-v2/issues/118#issuecomment-808970057. WICG and W3C banned me, thus I am restricted from addressing this concern at the WebCodecs repository. I experiment with WebAudio to a modest extent. This appears to be the cart before the horse. AudioBuffer is useless outside of its underlying Float32Array, and timestamp is the real concern.

> I think you can create the timestamp simply by deciding some starting point (e.g. 0) for the first packet, and then setting the next packets' timestamps using the duration (established by AudioBuffer length and sampleRate) delta from the first packet.

That does not work in practice. The clicks between the start and end of each created AudioBufferSourceNode are audible (when the tab does not crash), and eventually the drift due to inexactness increases the frequency of audible clicks between the start and stop of audio nodes. The AudioFrame output at decode() can be passed to a write() on a MediaStreamTrackGenerator; however, the AudioBuffer length there is always greater than 2000, an AudioBuffer from MediaStreamTrackProcessor.readable.read() is 220 to 400, and an AudioWorklet expects Float32Arrays of length 128, while an AudioBuffer in an AudioFrame output by AudioDecoder can have length 2568 (2568/128 = 20.0625, which means we will need to store the overflow to avoid writing 0s, and try to avoid fragmentation of ArrayBuffers). The APIs are incompatible (see the re-chunking sketch after the example below).

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <title>MediaStreamTrackGenerator Workaround</title>
  </head>
  <body>
    <script>
      (async () => {
        const ac = new AudioContext();
        const msd = new MediaStreamAudioDestinationNode(ac, {
          channelCount: 1,
          channelCountMode: 'explicit',
          channelInterpretation: 'discrete',
        });
        const osc = new OscillatorNode(ac, {
          channelCount: 1,
          channelCountMode: 'explicit',
          channelInterpretation: 'discrete',
        });
        osc.connect(msd);
        osc.start(ac.currentTime);
        const track = msd.stream.getTracks()[0];
        const settings = track.getSettings();
        const processor = new MediaStreamTrackProcessor(track);
        const reader = processor.readable.getReader();
        const el = document.createElement("audio");
        document.body.appendChild(el);

        let firstFrame;
        const decoder = new AudioDecoder({
          error() {},
          async output(frame) {
            if (!firstFrame) {
              firstFrame = true;
              console.log(frame.buffer, frame.buffer.length / 128, frame.buffer.length / 10);
            }
            const source = ac.createBufferSource();
            source.buffer = frame.buffer;
            source.connect(ac.destination);
            source.start(frame.timestamp / 1000000); 
            frame.close();
          }
        });

        const encoder = new AudioEncoder({
          error() {},
          output(chunk, metadata) {
            if (metadata.decoderConfig) {
              decoder.configure(metadata.decoderConfig);
            }
            decoder.decode(chunk);
          }
        });

        const config = {
          numberOfChannels: 1,
          sampleRate: settings.sampleRate,
          codec: "opus",
          bitrate: 48000
        };

        encoder.configure(config);

        let lastTimestamp;
        let baseTimestamp = ac.currentTime + 0.3;

        while (true) {
          const { value } = await reader.read();

          if (!baseTimestamp) {
            baseTimestamp = value.timestamp;
          }

          encoder.encode(
            new AudioFrame({
              timestamp: baseTimestamp * 10**6,
              buffer: value.buffer
            })
          );
          baseTimestamp += value.buffer.duration;
        };

      })();
    </script>
  </body>
</html>
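
To make the 2568-versus-128 length mismatch concrete, here is a minimal re-chunking sketch (single channel, hypothetical helper, not part of the workaround above) that carries the overflow into the next block instead of padding with zeros:

    const BLOCK = 128;
    let pending = new Float32Array(0);
    // channelData: a Float32Array, e.g. from getChannelData(0) of a decoded buffer.
    function toBlocks(channelData) {
      const data = new Float32Array(pending.length + channelData.length);
      data.set(pending);
      data.set(channelData, pending.length);
      const blocks = [];
      let i = 0;
      for (; i + BLOCK <= data.length; i += BLOCK) {
        blocks.push(data.slice(i, i + BLOCK)); // complete 128-frame blocks
      }
      pending = data.slice(i); // remainder carried into the next call
      return blocks;
    }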

I would place priority on making sure the fundamentals work, and on compatibility with the APIs in the domain - not merely an "I think" suggestion when no documentation exists to support that claim in the actual specification - before focusing on transferability of a broken API with regard to WebCodecs AudioEncoder and AudioDecoder.

chcunningham commented 3 years ago

@rtoy helped me to better understand AudioBuffer's "acquire the content", and this led to an epiphany: we should make AudioFrame "acquire the content" of its member AudioBuffer. This is important because we want all Frame and Chunk types in WebCodecs to be immutable to avoid TOCTOU security bugs when encoding/decoding. VideoFrame is already immutable and the Chunk types will be soon. AudioBuffer is very much mutable, but we can use "acquire the content" upon construction of an AudioFrame to make it immutable from the POV of WebCodecs.

It is a small wart that the getChannelData() and copyToChannel() methods will still cause it to appear mutable, but I can accept that (the same subtlety already exists in other uses of "acquire the content"). We can add console warnings if folks use these methods on an AudioBuffer whose content has been acquired by an AudioFrame.

With this in mind, I now strongly favor @padenot's second proposal:

> I'll also note that, in addition to or instead of doing this, we can also allow the creation of AudioBuffer from already-allocated storage (but not from SharedArrayBuffer, only a regular ArrayBuffer).

My idea being: when transferring an AudioFrame, we would transfer the "acquired content" and use it to create a new AudioBuffer at the destination. I don't know that this even requires a spec change from WebAudio. For example, AudioBuffer is created in the decodeAudioData() steps as follows:

> Let buffer be an AudioBuffer containing the final result (after possibly performing sample-rate conversion).

So perhaps we can write something similar in the AudioFrame transfer steps, substituting "final result ..." with something like "transferred acquired data"...

guest271314 commented 3 years ago

What you are really concerned about transferring here are the Float32Array buffer(s). You can assign the numberOfChannels, sampleRate, and length to the AudioFrame after the transfer of the buffer.

"final result ..." with ~ "transferred underlying buffer from Float32Array(s) representing the channel data, copying numberOfChannels, sampleRate, length from original AudioBuffer, set Float32Array(s) length in original AudioBuffer to 0"....

would be a complete description of what is intended to occur; "data" is a generic term that we do not have to repeat from Web Audio API wording.
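
With today's APIs that amounts to something like the following sketch, assuming an existing audioBuffer and worker, and assuming AudioBuffer is exposed on the receiving side (the prerequisite issue referenced above); the conservative version copies each channel before transferring:

    // Sender: transfer only the raw channel storage plus the metadata.
    const channels = [];
    for (let c = 0; c < audioBuffer.numberOfChannels; c++) {
      channels.push(audioBuffer.getChannelData(c).slice()); // copy, then transfer
    }
    worker.postMessage(
      {
        channels,
        numberOfChannels: audioBuffer.numberOfChannels,
        length: audioBuffer.length,
        sampleRate: audioBuffer.sampleRate,
      },
      channels.map((f32) => f32.buffer)
    );

    // Receiver (worker.js, sketch): rebuild an AudioBuffer from the pieces.
    onmessage = ({ data }) => {
      const rebuilt = new AudioBuffer({
        numberOfChannels: data.numberOfChannels,
        length: data.length,
        sampleRate: data.sampleRate,
      });
      data.channels.forEach((f32, c) => rebuilt.copyToChannel(f32, c));
    };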

The problem you face, again, is where the timestamp gets generated in that algorithm - ostensibly in some way from the AudioBuffer, or when the AudioBuffer is transferred - unless that algorithm is simply omitted from the documentation deliberately?

padenot commented 3 years ago

This is not priority-1 anymore, because Web Codecs doesn't need it as much (cc @chcunningham).

hoch commented 1 year ago

TPAC 2022 action items:

  1. Make AudioBuffer Transferable.
  2. Expose AudioBuffer to WorkerGlobalScope and WorkletGlobalScope.

hoch commented 10 months ago

2023 TPAC Audio WG Discussion:

The WG will not pursue this since the need from WebCodecs side has been resolved.