WebAudio / web-audio-api

The Web Audio API v1.0, developed by the W3C Audio WG
https://webaudio.github.io/web-audio-api/

When exactly does stop(time) stop? #2452

Open rtoy opened 5 years ago

rtoy commented 5 years ago

Consider you have an offline context with sample rate F and with an ABSN scheduled to stop at time t. Where exactly does the ABSN stop?

If F*t lies between sample frames, it's pretty clear that the last non-zero output happens at the frame just before F*t, i.e., floor(F*t). But what happens if F*t is exactly on a frame boundary? The spec doesn't really say, but I think the output value at that frame must be 0.

Doing it this way makes stop consistent with ABSN duration, I think. If the ABSN has a duration d, the number of frames is F*d, which is a whole number of frames, so the output is zero at frame F*d. If we did stop(d) as defined above, then this would be exactly equivalent to letting the ABSN run without calling stop.
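A minimal sketch of the frame arithmetic described above, assuming the reading where the frame at an aligned stop time is silent. The helper names `firstSilentFrame` and `lastAudibleFrame` are hypothetical and not part of the Web Audio API; the sample times are chosen so the floating-point products are unambiguous.

```javascript
// Hypothetical helper: index of the first output frame that is silent
// after stop(stopTime), under the interpretation proposed above.
function firstSilentFrame(sampleRate, stopTime) {
  // Every frame strictly before sampleRate * stopTime still plays, so the
  // first silent frame is the smallest integer >= sampleRate * stopTime.
  return Math.ceil(sampleRate * stopTime);
}

// Hypothetical helper: index of the last frame that still carries output.
function lastAudibleFrame(sampleRate, stopTime) {
  return firstSilentFrame(sampleRate, stopTime) - 1;
}

const F = 44100;

// Aligned stop time: F * t is exactly 44100, so frame 44100 is already 0.
console.log(lastAudibleFrame(F, 1), firstSilentFrame(F, 1)); // 44099 44100

// Unaligned stop time: F * t is about 100.5, so the last audible frame
// is floor(F * t) = 100.
const t = 100.5 / F;
console.log(lastAudibleFrame(F, t), firstSilentFrame(F, t)); // 100 101
```

With this convention, a buffer of duration d = 1 and a longer buffer stopped with stop(1) both produce their last sample at frame 44099.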

karlt commented 5 years ago

When start, stop, and duration are whole numbers of frames, my expectation is similar about stop being consistent with duration.

Re the boundaries for non-aligned times, I've aimed to avoid the situation where a frame count converted to double would be at risk of being interpreted differently according to whether rounding due to loss of precision in double is up or down.

In the same way that an ABSN output sample at currentTime = 0 describes the intensity of a band-limited impulse response centred at currentTime = 0, the zeroth sample frame from an AudioBuffer represents a band-limited impulse response at the very start of the buffer.

My expectation is that we should aim to centre the band-limited impulse for a sample from a single-frame buffer (or the zeroth frame from any buffer) on the start time.

This is simple enough when start time is frame aligned. The band-limited impulse from the sample is centred on the start time by playing the sample at the output frame corresponding to the start time.

When the start time is not frame aligned, the sample from the buffer can be interpolated. Choosing a sample from the buffer to play in an ABSN output frame (representing a slightly different time) is a zero-order interpolation. Rounding start time to the nearest output frame provides a better zero-order interpolation than floor(F*t).

"A starting offset, which can be expressed with sub-sample precision" implies that better interpolation methods are preferred, but the algorithm chooses a very basic interpolation of start and stop times (and duration). See https://webaudio.github.io/web-audio-api/#playback-AudioBufferSourceNode

For consistency with start time, an aligned stop time would describe the centre of an impulse that is not played. Unaligned stop times can be interpolated consistently with the start time. If one ABSN is starting at the same time as another is stopping, then the expectation is that the second could take over seamlessly from the first.

rtoy commented 5 years ago

Sorry, I'm thoroughly confused. How does a band-limited impulse come into play here?

For the start time, I think rounding the time to the nearest frame is wrong. It should "start" exactly where I say so that the frame just before the start time must be 0 and the frame after the start time is not. We implemented this approach for Chrome's AudioParams and it fixed a huge number of issues. Previously AudioParams would round to the nearest frame, but it's much easier to reason about if params started exactly where the time said. See also WebAudio/web-audio-api#915
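To make the two positions in this thread concrete, here is a small sketch comparing them for a start time that falls a quarter of a frame past a frame boundary. Both function names are hypothetical, and the sample rate of 48000 is only an assumption for illustration.

```javascript
// Policy A (round to nearest): the first buffer sample is played at the
// output frame closest to the requested start time.
function firstFrameRounded(sampleRate, startTime) {
  return Math.round(sampleRate * startTime);
}

// Policy B (never before the start time): the first buffer sample is played
// at the first output frame that is at or after the requested start time.
function firstFrameNotBeforeStart(sampleRate, startTime) {
  return Math.ceil(sampleRate * startTime);
}

const sampleRate = 48000;

// Frame-aligned start: both policies agree on frame 24000.
const aligned = 0.5;
console.log(firstFrameRounded(sampleRate, aligned));        // 24000
console.log(firstFrameNotBeforeStart(sampleRate, aligned)); // 24000

// Start a quarter of a frame after frame 24000: the policies differ by one.
const unaligned = 0.5 + 0.25 / sampleRate;
console.log(firstFrameRounded(sampleRate, unaligned));        // 24000
console.log(firstFrameNotBeforeStart(sampleRate, unaligned)); // 24001
```

The disagreement only shows up for unaligned times; for frame-aligned times the two policies pick the same frame.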

karlt commented 5 years ago

Consider

let context = new AudioContext();
let buffer = new AudioBuffer({length: 1, sampleRate: context.sampleRate});
buffer.getChannelData(0)[0] = 1.0;
let source = new AudioBufferSourceNode(context, {buffer: buffer});
source.loop = true;
source.start((n + epsilon) / context.sampleRate);

Assume n is a whole number and ε ≪ 1.

The samples in the buffer are there to represent a continuous function. A band limited impulse or sinc function is just a means to interpolate a series of samples to produce a continuous signal. I guess the precise interpolation mechanism is not critical here. Playing the buffer involves interpolation of buffer sample frames and then sampling at output (AudioContext) frames (with pre-sampling band-limiting as appropriate).

For subsample accuracy in start time, the zeroth sample in the buffer will be played at time (n + ε)/F. One can imagine buffer sample frames before this time that are not played and so are equivalent to playing samples of value 0. The last of these corresponds to a time (n - 1 + ε)/F. The continuous function represented by the looping buffer would be initially 0, 0 at (n - 1 + ε)/F, 1 at (n + ε)/F, and finally 1, with some transitions along the way. The details around the transition in the function, particularly between 0 at (n - 1 + ε)/F and 1 at (n + ε)/F, depend on how the samples in the buffer are interpolated to convert to a continuous signal. Let's say this interpolation is accurate enough that the continuous signal represented by the looping buffer is like a band-limited Heaviside step function centred between the sample points having values of 0 and 1. i.e. H(t - (n - 0.5 + ε)/F).

When generating the ABSN output, the continuous signal represented by the buffer is sampled. The precise output would depend on the band-limiting and sampling algorithms, but sample frame n would have a value something like 1 - ε.
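One way to see where a value like 1 - ε comes from is to swap the ideal band-limited interpolation for plain linear interpolation, which is an assumption made here only to keep the arithmetic simple. The helper `lerpAt` is hypothetical, and ε = 0.25 is used instead of a very small value so that the result is exact in floating point.

```javascript
// Linear interpolation through (x0, y0) and (x1, y1), evaluated at x.
function lerpAt(x, x0, y0, x1, y1) {
  return y0 + ((y1 - y0) * (x - x0)) / (x1 - x0);
}

const n = 8;          // whole output frame index
const epsilon = 0.25; // sub-frame offset of the start time

// Positions are measured in output frames. The implied "not played" sample
// of value 0 sits at n - 1 + epsilon, and the first buffer sample of
// value 1 sits at n + epsilon.
const valueAtFrameN = lerpAt(n, n - 1 + epsilon, 0, n + epsilon, 1);

console.log(valueAtFrameN); // 0.75, which equals 1 - epsilon
```

A band-limited interpolator would give a somewhat different number, but the point is the same: the value at frame n is close to 1, not 0.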

What is counter-intuitive is that this is not like trimming off the leading part of the continuous function H() at t = (n + ε)/F. Doing that would generate something like (1 - ε)/2 at output frame n. If you consider the ε = 0 case, it is clear that is not what we want. It would represent cutting off half the first sample from the buffer. IOW the start time indicates when the first frame from the buffer is played in full, not when half the first sample is played.

My point was that rounding the start time to the nearest output frame would generate output of 1 at sample frame n. Setting the output to 0 at frame n because it is before the start time would be a much worse approximation.

rtoy commented 5 years ago

Thanks for the detailed explanation. I understand what you're saying. However, your argument kind of assumes the ABSN is bandlimited (because you are bandlimiting the step function). But that's not a requirement for an ABSN.

My expectation is that with epsilon > 0, then at time n/F, the output is zero and at time (n+1)/F, it is not zero, with the actual value depending on how the interpolation is done. If epsilon is zero, then I would expect a value of 1 would be output at time n/F, and 0 at time (n-1)/F. This isn't band-limited, but that's not a problem here.

karlt commented 5 years ago

The output of ABSN is band-limited because it has a finite sample rate. (e.g. it cannot precisely represent a sub-sample start time.) There is the option of not band-limiting before sampling, which will produce aliasing during the band-limiting that occurs during sampling.

But it is not really the step function that I was choosing to band-limit. The band-limited step function is just the ideal interpolation of the buffer samples.

I found https://www.psaudio.com/article/cardinal-sinc/ a helpful resource.

rtoy commented 5 years ago

I still stand by my original comments in https://github.com/WebAudio/web-audio-api/issues/1749#issue-360102561.

If an ABSN has 44100 samples in it and the sample rate is 44100, the duration is exactly 1. And the output has exactly 44100 samples, so if we started the source at time 0, output frame 44099 will have the last sample in the ABSN and frames 44100 and after are 0.

So if a stop time lies on an exact frame boundary, the value at that frame should be 0. If this is not the case, consider an ABSN with 50000 samples. I call stop(1). Conceptually this is the same as the original ABSN above. I would expect frame 44100 to have a value of 0. If we don't do this, then you'll get a glitch if you started another ABSN at time 1 because you have a non-zero value from the ABSN that was stopped.
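The hand-off argument can be checked with a toy model that does not use the Web Audio API at all. It assumes a source that outputs 1 on every frame k with F*start <= k < F*stop; the function `renderConstant` is hypothetical and exists only for this sketch.

```javascript
// Toy model of a constant source: 1 on every frame k with
// sampleRate * startTime <= k < sampleRate * stopTime, and 0 elsewhere.
function renderConstant(sampleRate, length, startTime, stopTime) {
  const out = new Float32Array(length);
  const first = Math.ceil(sampleRate * startTime);
  // The frame at an aligned stop time is treated as already silent.
  const end = Math.min(length, Math.ceil(sampleRate * stopTime));
  for (let k = first; k < end; k++) {
    out[k] = 1;
  }
  return out;
}

const sampleRate = 44100;
const length = 88200;

// Source A plays from 0 and is stopped at 1; source B starts at 1.
const a = renderConstant(sampleRate, length, 0, 1);
const b = renderConstant(sampleRate, length, 1, 2);
const mix = a.map((value, k) => value + b[k]);

// Around the hand-off the mix stays at 1, with no doubled frame at 44100.
console.log(Array.from(mix.slice(44098, 44102))); // [ 1, 1, 1, 1 ]
console.log(a[44099], a[44100]);                  // 1 0
```

If the stopped source still produced output at frame 44100, the mix would contain a 2 at that frame, which is the glitch described above.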

rtoy commented 5 years ago

Teleconf: not important enough to do for v1. Move to v.next.

chrisguttandin commented 4 years ago

I made a quick test and it looks like Chrome and Firefox already do what @rtoy said in the last comment.

const offlineAudioContext = new OfflineAudioContext({ length: 88200, sampleRate: 44100 });
const constantSourceNode = new ConstantSourceNode(offlineAudioContext);

constantSourceNode.start(0);
constantSourceNode.stop(1);

constantSourceNode.connect(offlineAudioContext.destination);

offlineAudioContext
    .startRendering()
    .then((renderedBuffer) => {
        console.log(Array.from(renderedBuffer.getChannelData(0)).slice(44099, 44101));
        // This will log [ 1, 0 ].
    });

rtoy commented 3 years ago

TPAC 2020:

Based on https://github.com/WebAudio/web-audio-api-v2/issues/38#issuecomment-642793385, both Chrome and Firefox interpret stop(t) in the same way where the sample at time t is 0. We just need to make this clear in the spec.

rtoy commented 3 years ago

Virtual F2F 2021: https://github.com/WebAudio/web-audio-api-v2/issues/38#issuecomment-713729212 still holds. Now that V1 is basically done, we can start updating the text with these changes.