Original comment by Ehsan Akhgari [:ehsan] on W3C Bugzilla. Thu, 01 Aug 2013 17:24:34 GMT
I think it probably makes sense to specify a 0 playbackRate to produce silence, and for negative values to play the buffer backwards, that is, from duration to offset (or from loopEnd to loopStart).
Also, note that there is another way that this node can perform resampling, that is, when there is a doppler shift applied to it in the face of a PannerNode. I think it makes sense to specify what needs to happen based on the multiplication of these two ratios.
Another point which was brought up on today's call was handling of values larger than one, but I think that is probably non-controversial by specifying that the final computed sampling rate ratio should be multiplied by the sampling rate of the buffer for the AudioBufferSourceNode in order to determine the target sampling rate that the resampler should use.
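A minimal sketch of that arithmetic follows; dopplerRate is a hypothetical stand-in for the ratio a PannerNode's doppler shift would contribute, and none of these names come from the spec:

// Sketch: combine the playbackRate ratio with a doppler ratio, then
// scale by the buffer's own sample rate to get the rate the resampler
// should target. All names here are illustrative.
function targetResamplerRate(buffer, playbackRate, dopplerRate) {
  const computedRatio = playbackRate * dopplerRate;
  return buffer.sampleRate * computedRatio;
}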
Original comment by Olivier Thereaux on W3C Bugzilla. Fri, 02 Aug 2013 09:12:14 GMT
Per our meeting on 2013-08-01 (http://www.w3.org/2013/08/01-audio-minutes.html), this is a call for volunteers to suggest a patch to the web audio API spec to define the expected behaviour when setting negative values for AudioBufferSourceNode.playbackRate.
Original comment by Ehsan Akhgari [:ehsan] on W3C Bugzilla. Fri, 02 Aug 2013 14:25:44 GMT
(In reply to comment #2)
Per our meeting on 2013-08-01 (http://www.w3.org/2013/08/01-audio-minutes.html), this is a call for volunteers to suggest a patch to the web audio API spec to define the expected behaviour when setting negative values for AudioBufferSourceNode.playbackRate.
Hmm, this is what I was hoping to do in comment 1. :-) Do we absolutely need a patch here? I think it probably makes sense to have the basic discussion first and then move to the exact prose once everybody is on the same page about what we want. Do you agree?
Original comment by Joe Berkovitz / NF on W3C Bugzilla. Fri, 02 Aug 2013 15:21:38 GMT
I was thinking about volunteering a patch but reached the same conclusion as Ehsan: we need to have a basic discussion first.
The basic outline of my proposal is different from Ehsan's but similar in spirit (I think):
(NB I would not include the effect of downstream resampling-like effects such as doppler shifts as I think this may lead to confusion over the behavior of graphs with branched routing. It seems harder for developers to predict what will happen.)
Original comment by Ehsan Akhgari [:ehsan] on W3C Bugzilla. Fri, 02 Aug 2013 19:18:05 GMT
Doesn't this assume a linear interpolating resampler? The resampler that we use in Gecko is much more complicated (and of higher quality as a result) than that! (It's the libspeex resampler.)
If we're going to make it possible for implementations to compete on the resampler quality, assuming the resampling algorithm seems like a mistake.
Original comment by Joe Berkovitz / NF on W3C Bugzilla. Fri, 02 Aug 2013 19:34:03 GMT
@Ehsan: I had no intention of assuming any particular algorithm (and tried to call this out -- sorry if it was unclear). Of course linear interpolation is not a preferred choice. I suppose that a literal interpretation of my proposal could suggest linear interpolation, but that was not the intention.
The proposal specifies a sequence of {data window, effective sampling rate} pairs with fractional sample-offset boundaries that form the input to an arbitrary interpolation algorithm. How the interpolator makes use of this sequence is not a concern. A nonlinear interpolator can work with as much of the sequence as it likes, processing arbitrarily large batches of data points at a time.
Of course in practice an implementor would probably not accumulate such a sequence and apply an interpolation algorithm to it, this is an idealized behavior for specification purposes.
If this approach turns out to be too naïve I welcome an improved recasting of it. I think the important aspect of it has to do with the way that playback progress through the buffer is affected by a time-varying playback rate, and I found an idealized cursor the easiest way to express this progress.
Original comment by Chris Wilson on W3C Bugzilla. Mon, 05 Aug 2013 01:49:06 GMT
+1 to Joe's general idea - I also do NOT agree that playbackRate < 0 should change where the cursor starts; other than that, I think we're all on the same page.
Original comment by Joe Berkovitz / NF on W3C Bugzilla. Mon, 05 Aug 2013 17:51:36 GMT
Just to amplify Chris's comment: apart from my attempt to tease out a more detailed spec of playbackRate, the main behavioral difference in my proposal from Ehsan's is that a negative playbackRate does not cause playback to start at a different point than it would have otherwise. playbackRate determines the time derivative of a "playback path" through the buffer, but not the origin of that path, which remains the buffer offset as specified in the start() call (which defaults to 0).
If we want the ability to start playing a buffer from the end, I think there's a clearer and more explicit way to do that: attach that interpretation to a negative "offset" parameter passed to AudioBufferSourceNode.start(). I don't feel strongly that we need that feature but I do think we should avoid overloading the meaning of playbackRate w/r/t start offsets.
Original comment by Ehsan Akhgari [:ehsan] on W3C Bugzilla. Thu, 08 Aug 2013 03:21:06 GMT
I think I was unclear about what I meant, sorry about that. In the first paragraph of comment 1, I meant to describe the cursor jump boundaries, not that the playback should start at `duration'. In other words, I meant to propose exactly the same thing as Joe described better in terms of the cursor concept. In light of comment 6, I believe we're mostly proposing the same thing (with my proposal intentionally not talking about the details of the resampling, and with Joe's proposal doing a much better job describing the cursor concept, etc.)
1) I think playbackRate of 0 (presuming the buffer is playing) should result in the output of the current sample, not "silence" (which implies zero). Otherwise, ramping to zero and then back to a very small value will cause a click.
2) I think negative rates should be treated as zero. Playing backwards complicates the model.
It came up a few times at the Web Audio Conference that page authors were going to crazy lengths to get negative playbackRate AudioBufferSourceNodes working. Whether or not "playing backwards complicates the model", this is something the API should provide.
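For reference, here is a minimal sketch of the kind of workaround being described: manually copying the channel data in reverse into a second buffer. An existing AudioContext ctx and a decoded AudioBuffer buffer are assumed inputs.

// Sketch of the manual "reverse the buffer" workaround. Assumes an
// existing AudioContext `ctx` and a decoded AudioBuffer `buffer`.
function reversedBuffer(ctx, buffer) {
  const out = ctx.createBuffer(buffer.numberOfChannels, buffer.length, buffer.sampleRate);
  for (let ch = 0; ch < buffer.numberOfChannels; ch++) {
    const src = buffer.getChannelData(ch);
    const dst = out.getChannelData(ch);
    for (let i = 0; i < src.length; i++) {
      dst[i] = src[src.length - 1 - i]; // copy samples back-to-front
    }
  }
  return out;
}

The reversed buffer is then played with an ordinary positive playbackRate.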
It could be specified, and I'm fine doing so if the vendors on the WG would sign up to implement it. If so, I'd suggest this as a rough draft of rules:
If looping=false, then playing backward ceases playback when it reaches the start point. So if playbackRate is negative, buffer.start(0,0) will immediately cease playback. If the offset is >0, I think playback ceases when the start offset is reached. (Clearly, the interesting scenarios here are when the playbackRate is set to negative after proceeding forward for some time.) Or, conversely (and maybe this is more interesting), playback proceeds backward past the starting offset to the beginning of the buffer. Note that duration will need to be slightly redefined to account for negative playbackRate. (I think it just doesn't apply when playbackRate<0.)
If looping is true, then playing backward will work differently depending on whether playback is in the "lead-in" portion (i.e. before loopStart) or in the looping portion when playbackRate is set to negative. I'd suggest if it's in the lead-in portion it proceeds backward until it hits the starting offset (or buffer begin, see above), then stops (despite it being "looping"). If it's in the looping portion, it should proceed to the loop begin, then wrap to the loop end and keep going (in reverse), as sketched below.
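A tiny sketch of that reverse wrap rule, purely illustrative (no name here comes from any spec); position is the playhead in seconds and rate is negative:

// Sketch: advancing a playhead by a negative rate inside the looping
// portion, wrapping from loopStart back to loopEnd as described above.
function advanceReverse(position, rate, loopStart, loopEnd, inLoopBody) {
  position += rate; // rate is negative, so this moves backward
  if (inLoopBody && position < loopStart) {
    position += loopEnd - loopStart; // wrap to the loop end, keep reversing
  }
  return position;
}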
Marked as "Needs WG Review" so we'll discuss.
:+1:
"It could be specified, and I'm fine doing so if the vendors on the WG would sign up to implement it."
:+1: :+1:
I agree this is useful, and I agree with cwilso that we need to discuss the exact behavior.
There is also an issue with the fact that the nature of playbackRate's interpolation is not fully specified. This is especially important when working with speeds substantially slower than 1.
Resolution: spec negative playbackRate as having exactly mirrored behaviour from the positive playbackRate behaviour.
I'm going to do this by speccing the algorithm used to compute a block of audio with an AudioBufferSourceNode, so that we can kill all ambiguities in one shot.
I've started drafting this. For now, this only handles a positive playbackRate, and is not super elegant. I'd be interested in feedback.

To convert a value from seconds to sample-frames, multiply it by the relevant sample-rate and round it to the nearest integer.

Let readIndex be the value of the offset parameter of the start() method, converted to sample-frame time in the sample-rate of the AudioContext.

Let startOffset be the value of the when parameter of the start() method, converted to sample-frame time in the sample-rate of the AudioContext.

Let stopPoint be the value of the when parameter of the stop() method converted to sample-frame time in the sample-rate of the AudioContext, or +Infinity if stop() has not been called.

Let duration be the value of the duration attribute of the AudioBuffer, converted to sample-frame time in the sample-rate of the AudioContext.

Let currentTime be the value of AudioContext.currentTime converted to sample-frame time at the sample-rate of the AudioContext.

Let loopStart be the value of the loopStart attribute converted to sample-frame time at the sample-rate of the AudioContext.

Let loopEnd be the value of the loopEnd attribute converted to sample-frame time at the sample-rate of the AudioContext.

Writing silence to a sequence s from index start to index end means setting the elements between s[start] and s[end] to 0.0.

Rendering a block of audio for an AudioBufferSourceNode means executing the following steps:

1. Let s be a sequence of 128 floats.
2. Let writeIndex be 0.
3. Let input rate be the sample-rate of the AudioBuffer.
4. Let output rate be the sample-rate of the AudioContext divided by the value of computedPlaybackRate at currentTime.
5. Let resampling ratio be input rate divided by output rate.
6. While writeIndex is not 128:
   1. If currentTime is less than startOffset:
      - Write silence to s from index writeIndex to index startOffset - currentTime.
      - Increment writeIndex by startOffset - currentTime.
   2. If looping is True, let outputFrames be the minimum of count, stopPoint - currentTime, and loopEnd - currentTime.
   3. Else, let outputFrames be the minimum of count, stopPoint - currentTime, and duration.
   4. Resample from input rate to output rate to produce outputFrames frames, and copy them to s starting at writeIndex.
   5. Increment readIndex by the number of frames consumed by the resampler.
   6. If looping is True and readIndex is equal to loopEnd, set readIndex to loopStart.
   7. Increment writeIndex by the number of frames produced by the resampler.
   8. If currentTime + writeIndex is equal to stopPoint:
      - Set the ended flag to true.
      - Write silence to s from index writeIndex to index endOffset.
7. If ended is True:
   - Fire an event named "ended" at the AudioBufferSourceNode, and remove the self-reference.

This algorithm is intentionally vague on the resampling process, in terms of input and output frame count, as different resampling techniques have different characteristics.
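A one-line check of the conversion rule above, assuming a 44100 Hz context (values illustrative):

// Seconds to rounded sample-frame time, per the conversion rule above.
const toSampleFrames = (seconds, sampleRate) => Math.round(seconds * sampleRate);
console.log(toSampleFrames(0.5, 44100)); // 22050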
Thanks @padenot! Here are some of the content issues and questions I see. I could try to take a crack at addressing them, but thought it would be better to surface these questions first.

- loopStart and loopEnd are positions within the AudioBuffer, not context times, and so it's not clear how these are to be converted into times.
- count is never defined in the render algorithm.
- stopPoint - currentTime could be negative, since we could be past the stop point.
- loopEnd - currentTime is not a number of frames, because loopEnd is not a context time but a position within the AudioBuffer (see my comment above).
- looping should be loop.
Also, recall https://github.com/WebAudio/web-audio-api/issues/915#issuecomment-248930220 where we decided that we should actually do sub-sample accurate sampling. Although that was for the oscillator, I think we need to do the same for a buffer source because I think the same issues will show up.
I think the algorithm would be simpler if we just first described the case where the buffer rate matched the context rate. When the rates don't match, we can say that the same algorithm applies if the buffer behaved as if it were first resampled and used in this algorithm.
I think you also need a step 8 for the case where ended is not true, so that you go back to step 2. Also, you probably want to reorder the initial steps so that step 2 is the step just before step 6. All of these steps are just initializations that only need to happen once.

I also don't think we need to make this process 128 frames at a time. To describe the algorithm, we don't really need the finite-sized s buffer. We can assume unlimited length; writeIndex keeps track of what we're doing, and it can increment forever.
Another iteration on this, with comments addressed and negative playbackRate support. Note the inner loop: depending on whether we're looping and have reached the end point of the loop, we can loop multiple times; the same applies if the loop is very small. I tried to perform sub-sample accurate start, but it's probably not correct for the loop end; I need to double check.

To convert a value from seconds to sample-frames, multiply it by the relevant sample-rate. Sample-frame time is a fractional time value in frames, in the sample-rate of the AudioContext. It MUST be rounded to access the samples themselves. This is used to implement sub-sample accurate start points, stop points and loop points.

This algorithm is described for a playbackRate of 1.0, and when the sample-rate of the AudioBuffer is the same as the sample-rate of the AudioContext, i.e. when no resampling is necessary. If this is not the case, execute those steps before producing a block of audio:

1. Let input rate be the sample-rate of the AudioBuffer.
2. Let output rate be the sample-rate of the AudioContext divided by the value of computedPlaybackRate at currentTime.
3. Let resampling ratio be input rate divided by output rate.
4. If resampling ratio is negative, reverse the data of the AudioBuffer and let resampling ratio be -resampling ratio.
5. If resampling ratio is not 1.0, resample the AudioBuffer data, and use this resampled audio data as the source data. Else, use the data of the AudioBuffer as the source data.

Let readIndex be the value of the offset parameter of the start() method, converted to sample-frame time.

Writing silence to a sequence s from index start to index end means setting the elements between s[start] and s[end] to 0.0.

Rendering a block of audio for an AudioBufferSourceNode means executing the following steps:

1. Let source be the buffer containing the source data, to be indexed by readIndex.
2. Let startOffset be the value of the when parameter of the start() method, converted to sample-frame time.
3. Let stopPoint be the value of the when parameter of the stop() method converted to sample-frame time, or +Infinity if stop() has not been called.
4. Let duration be the value of the duration attribute of the AudioBuffer, converted to sample-frame time.
5. Let currentTime be the value of AudioContext.currentTime converted to sample-frame time.
6. Let loopStart be the value of the loopStart attribute converted to sample-frame time.
7. Let loopEnd be the value of the loopEnd attribute converted to sample-frame time.
8. Let s be a sequence of 128 floats.
9. Let writeIndex be 0.
10. While writeIndex is not 128:
    1. If currentTime is less than startOffset:
       - Write silence to s from index writeIndex to the minimum of startOffset - currentTime and 128.
       - Increment writeIndex by startOffset - currentTime.
    2. Let count be 128 - writeIndex.
    3. If loop is True, let outputFrames be the minimum of count, stopPoint - currentTime, and startOffset + loopEnd - currentTime.
    4. Else, let outputFrames be the minimum of count, stopPoint - currentTime, and startOffset + duration.
    5. If readIndex is not an integer, perform sub-sample interpolation:
       1. Let left be readIndex - ceil(readIndex) and right be 1 - left.
       2. Set s[writeIndex] to source[readIndex] * left.
       3. Increment writeIndex by 1. If writeIndex is 128, jump to the beginning of this loop.
       4. Set s[writeIndex] to source[readIndex] * right.
       5. Let readIndex be ceil(readIndex). If readIndex is greater than loopEnd and loop is True, subtract loopEnd - loopStart from readIndex and jump to the beginning of this loop.
    6. Copy outputFrames frames of audio from the source data starting at index readIndex to s starting at index writeIndex.
    7. Increment readIndex by outputFrames.
    8. Increment writeIndex by outputFrames.
    9. If loop is True and readIndex is greater or equal to loopEnd, subtract loopEnd - loopStart from readIndex.
    10. If currentTime + writeIndex is greater or equal to stopPoint or startOffset + duration:
        - Set the ended flag to true.
        - Write silence to s from index writeIndex to index endOffset.
11. If ended is True:
    - Fire an event named "ended" at the AudioBufferSourceNode, and remove the self-reference.
- If resampling ratio is not 1.0, resample the AudioBuffer data, and use this resampled audio data as the source data.
Specify that the resampling results in new audio data whose sample rate is now output rate.
- Let readIndex be the value of the offset parameter of the start() method, converted to sample-frame time.
The phrase "sample-frame time" is ambiguous with at least 3 different rates in play. In this case I believe offset
would be specified by the caller as time units that assume the buffer's own sample rate, i.e. input rate
. But in other cases, the phrase clearly refers to AudioContext rate. I think it would be clearer always to say something explicit like, "let readIndex
be offset
multiplied by input rate
". This issue comes up a bunch of times in the algorithm.
Increment writeIndex by startOffset - currentTime.
I think this needs to increment writeIndex by the lesser of the given expression, or 128 (the same amount as the silence that was written in the previous step).
Else, let outputFrames be the minimum of count, stopPoint - currentTime, and startOffset + duration.
The last expression I think should be startOffset + duration - currentTime.
Regarding the sub-sample interpolation step:
I see a few different issues here, although perhaps I've missed something important:

- An interpolated value should presumably look like s[i]*left + s[i+1]*right. It looks like you are setting either a left-weighted or right-weighted value into the output sample frame, rather than a sum of both.
- left would be readIndex - floor(readIndex), wouldn't it? Subtracting the ceiling value will yield a negative interpolation weight for the left value.
- Where readIndex is fractional, surely it should remain fractional but be incremented by 1 as it advances through a fractionally-shifted region of the buffer. To avoid glitches, interpolation must occur throughout a shifted region, not just at the endpoints of that region.

Let me propose a somewhat different expression of the algorithm. I think it's a bit simpler, but a bit harder to visualize. It would be best if I drew some simple diagrams to show what happens at the buffer start, the loop end and the loop start points. Anyway, without further ado:
Let:

- t = context.currentTime
- dt = 1 / context.sampleRate
- tb = ABSN buffer index
- dtb = playbackRate
- ts = start time for ABSN
- tls, tle = loop start and end time
- b = ABSN buffer, resampled if necessary to the context sample rate such that b[0] still represents the very first value.

If the ABSN is started with start(when), set ts = when, duration = length of buffer in seconds, and toff = 0. If the ABSN is started with start(when, offset, duration), set ts = when, toff = offset, and duration = duration if given, or infinity.

Let bufferSample(k) be a function that interpolates values from the ABSN buffer such that if k is an integer, b[k] is the result. If k is not an integer, compute an interpolated value using b[n] and b[n+1] where n = floor(k). The interpolation method is not specified, and more samples of the buffer are allowed to be used.

Let output(x) be a function that outputs the value x as the next output sample of the ABSN.
t = 0;
while (1) {
  if (t <= ts < t+dt) {
    // Buffer is starting. Compute an offset into the buffer based on
    // when the current sample is being taken.
    tb = (t + dt - ts + toff) * sampleRate;
    total = 0;
    while (1) {
      if (loop == false) {
        if (total > duration) {
          // We've reached the end of the buffer and/or duration, so
          // stop.
          break;
        }
      } else {
        // We're looping
        if (total > duration) {
          // We've output duration seconds of audio. Time to stop.
          break;
        }
        if (tle < tb) {
          // We're trying to output the first sample PAST the end of
          // the loop. Rewind the buffer pointer back to the loop
          // start.
          tb = ceil(tls / sampleRate) * sampleRate;
        }
      }
      output(bufferSample(tb));
      tb = tb + dtb;
      // Wrap tb if tb goes past the end of the buffer. This wraps tb
      // back near the beginning of the buffer or near the start of
      // the loop point.
      tb = wrapPointer(tb);
      t = t + dt;
      total = total + dt;
    }
  }
  output(0);
  t = t + dt;
}
@rtoy I agree that this way of expressing the algorithm is easier to understand, although it would need some preamble explaining that it ignores rendering quanta and instead describes an idealized sequence of sample frames generated by an ABSN as if nothing else were going on.
Also, this algorithm can deal with computed playback rate more gracefully. The previous version appears to treat computedPlaybackRate as a constant that can be eliminated via resampling; this version only resamples to compensate for the AudioBuffer's own sample rate, and treats computedPlaybackRate as truly dynamic (since it is k-rate).
I did identify a couple of problems with the algorithm and also felt the control structures and variable names made it a bit more obscure, so I've taken the liberty of trying to do another iteration below -- but it's really just a restatement of @rtoy's version.
I also attempted to incorporate explicit logic for the wrapping of loops as I believe this needs to be spelled out and not left to the UA.
Here we go:
while (1) {
  if (currentTime <= start && start < currentTime + dt) {
    // Buffer is starting. Compute an offset into the buffer based on
    // when the current sample is being taken.
    sampleIndex = (currentTime + dt - start + offset) * sampleRate;
    total = 0;
    while (1) {
      if (total > duration) {
        // We've reached the end of the buffer and/or duration, so
        // stop.
        break;
      }
      if (loop && sampleIndex > loopEndIndex) {
        // We're trying to output the first sample PAST the end of
        // the loop. Rewind the buffer pointer back to the loop
        // start.
        sampleIndex = loopStartIndex;
        // @rtoy: Didn't understand rationale for the following
        // sampleIndex = ceil(loopStart / sampleRate) * sampleRate;
      }
      output(bufferSample(sampleIndex));
      sampleIndex += dtb * computedPlaybackRate; // @rtoy did not see any reference to playback rate
      // Wrap the index if it goes past the end of the buffer. This wraps it
      // back near the beginning of the buffer or near the start of
      // the loop point.
      while (loop && sampleIndex > loopEndIndex) {
        sampleIndex -= loopEndIndex - loopStartIndex;
      }
      currentTime += dt;
      total += dt;
    }
  }
  output(0);
  currentTime += dt;
}
Thanks for cleaning up the code. I was lazy about typing things out, so this looks much better. I think this algorithm works fine if computedPlaybackRate is non-negative. If it's negative, we'll have to do something about the various tests because they'll have to change direction or maybe swap loop start and loop end. (I assume that's how loop points would work for negative playbackRate.)
You don't describe how loopStartIndex and loopEndIndex are computed from loopStart and loopEnd. I think loopStartIndex is probably ceil(loopStart / sampleRate) * sampleRate and loopEndIndex is similar but uses floor instead of ceil. But maybe loopStartIndex = loopStart * sampleRate.
If loopStart * sampleRate lies between two sample points, what should we output? I guess I was thinking we should output the buffer sample just past that. But maybe it should be the interpolated value between the sample points?
If it's negative, we'll have to do something about the various tests because they'll have to change direction or maybe swap loop start and loop end. (I assume that's how loop points would work for negative playbackRate.)
Correct, probably that while loop should work in the opposite sense if computedPlaybackRate is negative.
You don't describe how loopStartIndex and loopEndIndex are computed from loopStart and loopEnd. I think loopStartIndex is probably ceil(loopStart / sampleRate) * sampleRate and loopEndIndex is similar but uses floor instead of ceil. But maybe loopStartIndex = loopStart * sampleRate.
Two points:
1. We can't use ceil() or floor() because we've agreed that loop points are not quantized. These are "indices" only in the sense that they are expressed in units of sample frames, but they can have a nonzero fractional part. So loopStart(End)Index = loopStart(End) * buffer sample rate.

2. We need to note in the algorithm that sampleRate has to be the sample rate of the buffer, not the sample rate of the context or the computed playback rate.
If loopStart * sampleRate lies between two sample points, what should we output? I guess I was thinking we should output the buffer sample just past that. But maybe it should be the interpolated value between the sample points?
Since loopStart is allowed to lie between frames, on any sort of looping back to a start point we should just assign loopStartIndex (which can be fractional, as noted above) to sampleIndex, and not try to force it to the frame to the left or to the right.

A similar argument applies to the loop endpoint in a negative playback rate situation.
On Mon, Nov 28, 2016 at 10:53 AM, Joe Berkovitz notifications@github.com wrote:
If it's negative, we'll have to do something about the various tests because they'll have to change direction or maybe swap loop start and loop end. (I assume that's how loop points would work for negative playbackRate.)
Correct, probably that while loop should work in the opposite sense if computedPlaybackRate is negative.
You don't describe how loopStartIndex and loopEndIndex are computed from loopStart and loopEnd. I think loopStartIndex is probably ceil(loopStart / sampleRate) * sampleRate and loopEndIndex is similar but uses floor instead of ceil. But maybe loopStartIndex = loopStart * sampleRate.
Two points:

We can't use ceil() or floor() because we've agreed that loop points are not quantized. These are "indices" only in the sense that they are expressed in units of sample frames, but they cannot be quantized. So loopStart(End)Index = loopStart(End) * buffer sample rate.
I think I've confused myself. I need to draw a picture and study it to figure out what we really want.
We need to note in the algorithm that sampleRate has to be the sample rate of the buffer, not the sample rate of the context or the computed playback rate.
This is also confusing because we have two sampleRates: one is for the audio context and one is for the buffer. I'd really prefer to state the algorithm as if the buffer was already resampled to the context rate. The implementation can then do whatever optimizations it wants, including doing the resampling during playback (via linear interpolation). Or actually doing the resampling first with a high quality resampler.
I think of resampling as really orthogonal to specifying how loops and playbackRate works.
I'd really prefer to state the algorithm as if the buffer was already resampled to the context rate.
Yes, that makes sense to me -- my real point was that computed playback rate (i.e. modulating the rate via the playbackRate AudioParam) should have no influence on loop point location. Loop points are a stable way of referencing the buffer content.
As discussed on yesterday's call I'm going to take a crack at refining this further.
After quite a bit of thinking and much careful checking, I offer the following algorithm, which makes heavy use of all the work done so far. I believe it works fine for all these requirements: start, stop, offset, loopStart, loopEnd.

Probably the key new feature of this proposal is its treatment of loops, in which a looped buffer is considered to be an infinite sequence of half-open ranges [0..loopStart), [loopStart..loopEnd), [loopStart+loopLength..loopEnd+loopLength), ... This greatly clarifies the way that interpolation behaves in the neighborhood of loop boundaries, even when these do not lie at exact sample indices; the comments explain in detail.
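To illustrate the folding implied by these half-open ranges, here is a small sketch (names illustrative, positions in fractional frames) that maps an arbitrary playhead position back into the first loop iteration:

// Sketch: fold a playhead position into the first iteration of the
// half-open loop body [loopStartPos, loopEndPos). Illustrative only.
function foldIntoLoop(bufferPos, loopStartPos, loopEndPos) {
  const loopLength = loopEndPos - loopStartPos;
  while (bufferPos >= loopStartPos + loopLength) {
    bufferPos -= loopLength;
  }
  return bufferPos;
}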
I've adopted the idiom of an AudioWorkletProcessor
to describe the algorithm because this helped me understand it and check it. It is thus a hybrid of @padenot's original block-rendering-oriented approach and @rtoy's approach. However, we could reduce this to a narrative description if desired.
// Description of the algorithm employed by
// an AudioBufferSourceNode to render successive
// blocks of audio. The description is framed as if
// the node was implemented using an AudioWorkletProcessor.
//
// It is designed for clarity only, not efficiency.
class AudioBufferSourceProcessor extends AudioWorkletProcessor {
  static get parameterDescriptors() {
    return [{
      name: 'detune',
      defaultValue: 0
    }, {
      name: 'playbackRate',
      defaultValue: 1
    }];
  }

  // Initialize processing
  constructor(options) {
    super(options);
    this.readTime = 0;
    this.buffer = this.originalBuffer = options.buffer;
    this.loop = options.loop;
    this.loopStart = options.loopStart;
    this.loopEnd = options.loopEnd;
  }

  // To be called when start() is applied to the node
  start(when, offset, duration) {
    this.start = when;
    this.offset = 0;
    this.stop = Infinity;
    this.bufferTime = -1;
    if (arguments.length > 1) {
      this.offset = offset;
    }
    if (arguments.length > 2) {
      this.stop = when + duration;
    }
  }

  // To be called when stop() is applied to the node
  stop(when) {
    this.stop = when;
  }

  process(inputs, outputs, parameters) {
    // Set up working variables
    let currentTime = contextInfo.currentTime; // current context time of next output frame
    let dt = 1 / contextInfo.sampleRate; // context time per sample
    let index = 0; // index of next frame to be output
    let computedPlaybackRate = parameters.playbackRate[0] * Math.pow(2, parameters.detune[0] / 1200);
    // Optionally resample buffer to reduce need for interpolation
    this.optimizeBuffer(computedPlaybackRate);
    // Render next block
    while (index < outputs[0][0].length) {
      if (currentTime < this.start || currentTime >= this.stop) {
        for (let channel = 0; channel < this.buffer.numberOfChannels; channel++) {
          outputs[0][channel][index] = 0;
        }
      }
      else {
        // set up bufferTime the first time we have a signal to put out
        if (this.bufferTime < 0) {
          this.bufferTime = this.offset + (currentTime - this.start) * computedPlaybackRate;
        }
        for (let channel = 0; channel < this.buffer.numberOfChannels; channel++) {
          outputs[0][channel][index] = this.bufferSample(this.bufferTime, channel);
        }
        this.bufferTime += dt * computedPlaybackRate; // advance read time of buffer, independent of loop points
      }
      currentTime += dt; // advance output time
      index += 1; // advance output index
    }
    // Consider the node to be active if we have not reached stop time yet.
    return currentTime < this.stop;
  }

  // Returns a channel's signal value at the time offset "effectiveTime" in
  // the buffer. This function takes care of enforcing silence before and after
  // the buffer contents, and also folds the body of any loop.
  bufferSample(effectiveTime, channel) {
    // Convert time to a playhead position with fractional part
    let bufferPos = effectiveTime * this.buffer.sampleRate;
    // Now get an interpolated value.
    var leftFrame, rightFrame, interval;
    if (this.loop) {
      let loopStartPos = this.loopStart * this.buffer.sampleRate; // playhead position for loop start
      let loopEndPos = this.loopEnd * this.buffer.sampleRate; // playhead position for loop end
      let loopLength = loopEndPos - loopStartPos; // loop length in frame units
      // Determine the first and last exact sample frame positions that lie
      // within the body of the loop, which is the half-open range
      // [loopStartPos, loopEndPos) -- that is, loopEndPos is exclusive.
      let loopFirstFrame = Math.ceil(loopStartPos);
      let loopLastFrame = Math.ceil(loopEndPos - 1); // N => N-1; (N + epsilon) => N
      // At any loop wrap point, the mapping between a playhead position in units of frames
      // and fractional sample indices looks like this (where N is a nonnegative loop iteration count)
      //
      // playhead position sample index interpolated?
      //
      // loopLastFrame + N*loopLength loopLastFrame no
      // loopEndPos + N*loopLength - epsilon loopEndPos - epsilon yes
      // loopStartPos + (N+1)*loopLength loopStartPos yes
      // loopFirstFrame + (N+1)*loopLength loopFirstFrame no
      //
      // In any region covered by this range of values, we are potentially interpolating
      // between a "left sequence" of frames ending in loopLastFrame, and a "right sequence"
      // of frames beginning with loopFirstFrame.
      // Fold the loop to bring the value of N to zero, by requiring that
      // the playhead position not exceed (loopFirstFrame + loopLength).
      while (bufferPos >= loopFirstFrame + loopLength) {
        bufferPos -= loopLength;
      }
      if (bufferPos >= loopLastFrame) {
        // If after folding the playhead is after the last exact frame in the loop,
        // then we'll be interpolating at the wrap boundary between a sequence of frames
        // ending in loopLastFrame, and a (wrapped) sequence of frames beginning with
        // loopFirstFrame.
        // The time interval between left and right frames may be less than 1
        // frame in this case, because of fractional loop points.
        leftFrame = loopLastFrame;
        rightFrame = loopFirstFrame;
        interval = (loopFirstFrame + loopLength) - loopLastFrame;
      }
      else {
        leftFrame = Math.floor(bufferPos);
        rightFrame = leftFrame + 1;
        interval = 1;
      }
    }
    else {
      leftFrame = Math.floor(bufferPos); // get the exact frame index before the time of interest
      rightFrame = leftFrame + 1; // and the frame after that
      interval = 1; // interval in sample frames between left and right
    }
    return this.interpolateValue(bufferPos, channel, leftFrame, rightFrame, interval);
  }

  // Return an interpolated value for the playhead position bufferPos,
  // working from a sequence of exact frames at:
  // leftFrame-M ... leftFrame, rightFrame ... rightFrame+N
  // that map onto playhead positions:
  // leftFrame-M ... leftFrame, leftFrame + interval ... leftFrame + interval + N
  // with the constraint that positions outside the buffer content may not be included.
  interpolateValue(bufferPos, channel, leftFrame, rightFrame, interval) {
    // The UA may employ any desired algorithm.
    // It may also elect to use bufferPos as an exact index and not interpolate,
    // if the difference between bufferPos and leftFrame is sufficiently small.
    /*
    This sample implementation uses linear interpolation between two adjacent frames.
    let weight = (bufferPos - leftFrame) / interval;
    let leftValue = leftFrame >= 0 ? this.buffer.getChannelData(channel)[leftFrame] : 0;
    let rightValue = rightFrame < this.buffer.length ? this.buffer.getChannelData(channel)[rightFrame] : 0;
    return leftValue + weight * (rightValue - leftValue);
    */
  }

  optimizeBuffer(playbackRate) {
    // This function may resample the buffer contents, entirely or partially,
    // as often as desired for reasons of quality or computational efficiency.
    // The results of the resampling, if carried out, are available to the
    // bufferSample() method via this.buffer.
    /*
    Example implementation that ensures buffer is at optimal rate for the first-rendered block:
    let bestSampleRate = contextInfo.sampleRate / playbackRate;
    if (!this.resampled && this.buffer.sampleRate != bestSampleRate) {
      this.buffer = cubicResample(this.originalBuffer, bestSampleRate);
      this.resampled = true;
    }
    */
  }
}
I thought it would be useful to abstract a narrative definition that's a bit more rigorous. Here it is.
Definitions

1. Let start correspond to the when parameter to start() if supplied, else 0.
2. Let offset correspond to the offset parameter to start() if supplied, else 0.
3. Let stop correspond to the sum of the when and duration parameters to start() if supplied, or the when parameter to stop(), otherwise Infinity.
4. Let buffer be the AudioBuffer employed by the node.
5. Let loop, loopStart and loopEnd be the loop-related attributes of the node, with the loop body clamped to the range [0, buffer.length].
6. A playhead position for buffer is any quantity representing an unquantized time offset in seconds, relative to the time coordinate of the first sample frame in the buffer. Playback rate and AudioContext sample rate are not relevant: this offset is expressed purely in terms of the buffer's audio content.
7. Let framePosition(index) be a many-valued function yielding a set of one or more playhead positions for the sample frame within the buffer at index. These represent the idealized times at which the frame would be rendered, given a start time of 0, a playback rate of 1, and an infinite sample rate. The function is as follows:
   - Let bufferTime be index / buffer.sampleRate.
   - If loop is false, the result is bufferTime.
   - If loop is true,
     - If bufferTime < loopStart, the result is bufferTime.
     - If bufferTime >= loopStart and bufferTime < loopEnd, the function has multiple results given by bufferTime + (count * (loopEnd - loopStart)), where count takes on non-negative integer values.
     - If bufferTime >= loopEnd, the position is not defined.
8. Let frameValue(index) correspond to the vector of actual signal values in buffer at the given index, one component per channel.
9. Let the playback frame sequence for buffer be the set of all tuples [framePosition(index), index, frameValue(index)], ordered by increasing values of framePosition(index). This sequence describes a signal whose values are known only at discrete, monotonically increasing playhead positions that correspond exactly to specific samples in the buffer.

For an unlooped buffer, this sequence is finite (frame values are omitted here, since they are not relevant):

framePosition(index) | index
---|---
0 | 0
1 / buffer.sampleRate | 1
2 / buffer.sampleRate | 2
... | ...
(length - 1) / buffer.sampleRate | length - 1

For a looped buffer, this sequence is infinite. Let loopStartFrame be ceil(loopStart * buffer.sampleRate) (the first exact frame index within the loop body), and loopEndFrame be ceil(loopEnd * buffer.sampleRate - 1) (the last exact frame index within the loop body). The sequence is as follows:

framePosition(index) | index
---|---
0 | 0
1 / buffer.sampleRate | 1
2 / buffer.sampleRate | 2
... | ...
loopStartFrame / buffer.sampleRate | loopStartFrame
... | ...
loopEndFrame / buffer.sampleRate | loopEndFrame
loopStartFrame / buffer.sampleRate + (loopEnd - loopStart) | loopStartFrame
... | ...
loopEndFrame / buffer.sampleRate + (loopEnd - loopStart) | loopEndFrame
loopStartFrame / buffer.sampleRate + 2 * (loopEnd - loopStart) | loopStartFrame
... | ...

Interpolation and Buffer Optimization

1. Let the function interpolateFrame(pos) yield a vector which estimates the channel values at the given playhead position pos, which need not map onto an exact frame position in the playback sequence. The interpolation method MUST obey all of the following constraints:
   - It relies ONLY on the relationship between playhead positions and channel values supplied by the playback frame sequence.
   - For pos equal to framePosition(index), the result is exactly frameValue(index).
   - For pos < 0, the result is a zero vector (silence).
   - For pos >= buffer.length / buffer.sampleRate where loop is false, the result is a zero vector (silence).

   Note that this definition ignores loop points, since these are already embodied in the definition of the playback frame sequence.

2. Let the operation optimize the buffer be any operation that alters both the buffer contents and sample rate, while attempting to minimize changes to the value of interpolateFrame(pos). The nature of the operation is up to the UA. Examples of such operations might include upsampling, downsampling, applying a subsample offset, or loop unrolling.

Initialization

1. Let bufferTime be the playhead position within buffer of the next output sample frame. Assign it the initial value -1 to indicate that the position has not yet been determined.
2. Optimize the buffer prior to rendering, if desired.

Rendering a Block of Audio

1. Let currentTime be the current time of the AudioContext.
2. Let dt be 1 / (context sample rate).
3. Let index be 0.
4. Let computedPlaybackRate be playbackRate * pow(2, detune / 1200).
5. Optimize the buffer during rendering, if desired.
6. While index is less than the length of the audio block to be rendered:
   - If currentTime < start or currentTime >= stop, emit silence for the output frame at index.
   - Else:
     - If bufferTime < 0, set bufferTime to offset + (currentTime - start) * computedPlaybackRate.
     - Emit the result of interpolateFrame(bufferTime) as the output frame at index.
     - Increase bufferTime by dt * computedPlaybackRate.
   - Increase index by 1.
   - Increase currentTime by dt.
7. If currentTime < stop, consider the node still to be playing; otherwise, consider playback to have ended.
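As a quick check of the computedPlaybackRate formula in step 4 of the rendering steps, a detune of ±1200 cents should move the rate by exactly one octave:

// computedPlaybackRate = playbackRate * 2^(detune / 1200)
const computedPlaybackRate = (playbackRate, detune) =>
  playbackRate * Math.pow(2, detune / 1200);
console.log(computedPlaybackRate(1, 1200));  // 2   -- one octave up
console.log(computedPlaybackRate(1, -1200)); // 0.5 -- one octave down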
I have not yet read through everything, but the code looks really nice and is quite clear. Now for a few comments.
I think this is wrong:
this.bufferTime = offset + (currentTime - this.start) * computedPlaybackRate;
If computedPlaybackRate were, say, a zillion, and currentTime > this.start, the bufferTime would be huge, so the first output would be very far along the actual buffer. I think instead of computedPlaybackRate we want just dt.
Having bufferSample keep track of looping seems a bit complicated. I did like your original idea of the buffer index being a time value. Then you can say I want the value of the buffer at time t, and that would implicitly take into account the sample rate of the buffer and allow the implementation to do some appropriate resampling and interpolation as needed.
So, I guess what I'm saying is let the main loop maintain the bufferIndex time value, including looping and computedPlaybackRate, and let bufferSample(t) produce the appropriate buffer value at time t.
Not sure we want to define this in terms of AudioWorklet. We haven't done that anywhere else and we don't yet have any practical experience with a working AudioWorklet implementation.
I had some misgivings about that too and perhaps the narrative description would be better (did you read it yet?) Or we could just keep the JS description and move it away from AudioWorklet.
I wonder if optimizeBuffer(computedPlaybackRate) is really needed. That seems to be an implementation optimization that isn't really needed for the algorithm.
Agreed, and this is a point that is perhaps clearer in the narrative version. I do think it's important to state that optimization (resampling, etc) at the UA's discretion is allowed.
I think this is wrong: this.bufferTime = offset + (currentTime - this.start) * computedPlaybackRate; If computedPlaybackRate were, say, a zillion, and currentTime > this.start, the bufferTime would be huge so the first output would be very far along the actual buffer. I think instead of computedPlaybackRate we want just dt.
I still believe this expression is correct -- if computedPlaybackRate is a crazy-large value and currentTime > start, then bufferTime could well be past the end of the buffer. Why is that wrong? The bufferSample() function (or equivalent language in the narrative version) takes care of checking the buffer limits and enforcing silent output in this case.
Having bufferSample keep track of looping seems a bit complicated. I did like your original idea of the buffer index being an time value. Then you can say I want the value of the buffer at time t, and that would implicitly take into account the sample rate of the buffer and allow the implementation to do some appropriate resampling and interpolation as needed.
I agree - this is the approach adopted in the narrative version of the algorithm. I think it is cleaner.
Perhaps you could go over the narrative piece and see how you find it. I think it makes the approach clearer than the code, and it avoids unnecessary prescriptions about how things work.
On Tue, Dec 6, 2016 at 10:28 AM, Joe Berkovitz notifications@github.com wrote:
Not sure we want to define this in terms of AudioWorklet. We haven't done that anywhere else and we don't yet have any practical experience with a working AudioWorklet implementation.
I had some misgivings about that too and perhaps the narrative description would be better (did you read it yet?) Or we could just keep the JS description and move it away from AudioWorklet.
I've skimmed the narrative, but not studied it in detail.
I wonder if optimizeBuffer(computedPlaybackRate) is really needed. That seems to be an implementation optimization that isn't really needed for the algorithm.
Agreed, and this is a point that is perhaps clearer in the narrative version. I do think it's important to state that optimization (resampling, etc) at the UA's discretion is allowed.
I think this is wrong: this.bufferTime = offset + (currentTime - this.start) * computedPlaybackRate; If computedPlaybackRate were, say, a zillion, and currentTime > this.start, the bufferTime would be huge so the first output would be very far along the actual buffer. I think instead of computedPlaybackRate we want just dt.
I still believe this expression is correct -- if computedPlaybackRate is a crazy-large value and currentTime > start, then bufferTime could well be past the end of the buffer. Why is that wrong? The bufferSample() function (or equivalent language in the narrative version) takes care of checking the buffer limits and enforcing silent output in this case.
Let's say bufferTime ends up being near the end of the buffer (using your computation). Why should the very first output from the buffer come from the end of the buffer? I would think the very first output should be very near the beginning. The second output sample would then be near the end, which makes sense to me.
Having bufferSample keep track of looping seems a bit complicated. I did like your original idea of the buffer index being an time value. Then you can say I want the value of the buffer at time t, and that would implicitly take into account the sample rate of the buffer and allow the implementation to do some appropriate resampling and interpolation as needed.
I agree - this is the approach adopted in the narrative version of the algorithm. I think it is cleaner.
Perhaps you could go over the narrative piece and see how you find it. I think it makes the approach clearer than the code and it avoids unnecessarily prescriptions about how things work.
I'm going to read over that very soon....
Let's say bufferTime ends up being near the end of the buffer (using your computation). Why should the very first output from the buffer come from the end of the buffer? I would think the very first output should be very near the beginning. The second output sample would then be near the end, which makes sense to me.
The answer has to do with sample-accurate start times as required by #915. If start occurs at an exact sample frame, then currentTime will equal start on the first-rendered frame of output. Hence (currentTime - start) * computedPlaybackRate is going to be zero, as you expect.
On Tue, Dec 6, 2016 at 11:40 AM, Joe Berkovitz notifications@github.com wrote:
Let's say bufferTime ends up being near the end of the buffer (using your computation). Why should the very first output from the buffer come from the end of the buffer? I would think the very first output should be very near the beginning. The second output sample would then be near the end, which makes sense to me.
The answer has to do with sample-accurate start times as required by #915 https://github.com/WebAudio/web-audio-api/issues/915. If start occurs at an exact sample frame, then currentTime will equal start on the first-rendered frame of output. Hence (currentTime - start) * computedPlaybackRate is going to be zero, as you expect.
Sorry, I still don't understand. If the start time isn't on a sample boundary, currentTime - start is non-zero as expected. Why would I want the first output sample to come from very near the beginning of the buffer? If computedPlaybackRate were huge, the offset will also be huge and probably not produce a sample near the beginning of the buffer.
Some initial comments. I like this narrative, but I think the code is easier to follow.
On Sun, Dec 4, 2016 at 10:28 AM, Joe Berkovitz notifications@github.com wrote:

I thought it would be useful to abstract a narrative definition that's a bit more rigorous. Here it is.

Let start correspond to the when parameter to start() if supplied, else 0.

I think it's better to say "is equal to" instead of "correspond to", here and below.

Let stop correspond to the sum of the when and duration parameters to start() if supplied, or the when parameter to stop(), otherwise Infinity.

This needs to be defined a bit better, but isn't really key to the overall algorithm. It's not clear which takes precedence (sum or stop), if there is any.

Let loop, loopStart and loopEnd be the loop-related attributes of the node, with the loop body clamped to the range [0, buffer.length].

As mentioned about the code, I really like your idea of just forgetting about buffer indices and treating everything in terms of the buffer time. I think the buffer time (playback head) is well-defined and is independent of the sample rates.
If the start time isn't on a sample boundary, currentTime - start is non-zero as expected. Why would I want the first output sample to come from very near the beginning of the buffer? If computedPlaybackRate were huge, the offset will also be huge and probably not produce a sample near the beginning of the buffer.
Note that if the playback rate is 1, then this means the first output sample is interpolated, to estimate the buffer value at delta * computedPlaybackRate, which of course in this case is delta. That's your sample-accurate start behavior right there. If start = 0.99 and currentTime = 1, we want to estimate the buffer value at a time offset of 0.01.

Now assume that the playback rate is greater than 1 (and maybe a huge number). Given that at the first sample, currentTime is some quantity delta past the start time, don't we have to respect that delta and multiply it by the playback rate? If start = 0.99 and currentTime = 1 and computedPlaybackRate = 2, we need to estimate the buffer value at an offset of 0.02.
Sorry for being so dense. You are, of course, exactly right.
So, back to the original discussion. I like the code version quite a bit, but I'm not sure I want it described in terms of an AudioWorklet. The text version is also quite nice. I think either would work.

I do think the main algorithm should keep track of the buffer playback head (as a float time value) for looping, and let a bufferSample method (mostly unspecified except for what it produces) produce the correct sample based on the playback head value.
Relying on the code version is OK w me -- but I think we do need to say how interpolation at loop points works, and that requires a bit of narrative footwork inside the body of something like bufferSample(), presumably using comments. If we don't address what "given points" we are interpolating between, this opens the door to many possible interpretations of how subsample-accurate loop points work.

I'll think about how we could do this.
In the meantime, what do other implementors in the group think of these contrasting approaches?
The more I think about this, the more I think that the narrative version does a better job of describing the trickiest part of the deal, which is the way that sub-sample start, stop and loop points behave. If we moved to a code version, we'd have to somehow put this narrative into comments in order to constrain the way that UAs implement certain operations like interpolation. And if we are using code to embody a specification, it feels inappropriate to rely on comments for a key part of that specification.
Here is my latest revision to this algorithm based on the feedback I've received. It is the same in substance but I think it's more condensed and easier to understand.
**Definitions**

1. Let `start` correspond to the `when` parameter to `start()` if supplied, else 0.
2. Let `offset` correspond to the `offset` parameter to `start()` if supplied, else 0.
3. Let `stop` correspond to the sum of the `when` and `duration` parameters to `start()` if supplied, or the `when` parameter to `stop()`, otherwise Infinity.
4. Let `buffer` be the `AudioBuffer` employed by the node.
5. Let `loop`, `loopStart` and `loopEnd` be the loop-related attributes of the node, with the loop body clamped to the range [0, buffer.length].
6. Let `loopStartFrame` be `ceil(loopStart * buffer.sampleRate)` (the first exact frame index within the loop body), and `loopEndFrame` be `ceil(loopEnd * buffer.sampleRate - 1)` (the last exact frame index within the loop body).
7. A *playhead position* for `buffer` is any quantity representing an unquantized time offset in seconds, relative to the time coordinate of the first sample frame in the buffer. Playback rate and `AudioContext` sample rate are not relevant: this offset is expressed purely in terms of the buffer's audio content and sample rate.
8. Let the function `playbackSignal(position)` be the *playback signal function* for `buffer`, which is a function that maps from a playhead position to a set of output signal values, one for each output channel. This function is only specified at a set of discrete playhead positions which correspond exactly to specific sample frames in the buffer. At all other positions, its value is determined by a UA-supplied algorithm that performs interpolation based on these well-defined values.
For an unlooped buffer, the specified values of this function are as follows (note that channels are ignored for purposes of clarity):

position | signal value |
---|---|
0 | channelData[0] |
1 / sampleRate | channelData[1] |
2 / sampleRate | channelData[2] |
... | ... |
(length - 1) / sampleRate | channelData[length - 1] |

For a looped buffer, this sequence is infinite:

position | signal value |
---|---|
0 | channelData[0] |
1 / sampleRate | channelData[1] |
2 / sampleRate | channelData[2] |
... | ... |
loopStartFrame / sampleRate | channelData[loopStartFrame] |
... | ... |
loopEndFrame / sampleRate | channelData[loopEndFrame] |
loopStartFrame / sampleRate + (loopEnd - loopStart) | channelData[loopStartFrame] |
... | ... |
loopEndFrame / sampleRate + (loopEnd - loopStart) | channelData[loopEndFrame] |
loopStartFrame / sampleRate + (2 * (loopEnd - loopStart)) | channelData[loopStartFrame] |
... | ... |
loopEndFrame / sampleRate + (2 * (loopEnd - loopStart)) | channelData[loopEndFrame] |
loopStartFrame / sampleRate + (3 * (loopEnd - loopStart)) | channelData[loopStartFrame] |
... | ... |
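To make the role of these tables concrete, here is a minimal single-channel sketch in JavaScript that linearly interpolates between the specified points. Linear interpolation is purely illustrative (the proposal deliberately leaves the interpolation algorithm to the UA), and the function assumes positions within the buffer's valid range:

```js
// A sketch of playbackSignal() for one channel, assuming a linear
// interpolator. Only the tabulated points above are normative; the
// interpolation between them is a UA choice.
function playbackSignal(pos, data, sampleRate, loop, loopStart, loopEnd) {
  const lerp = (p0, v0, p1, v1, p) => v0 + (v1 - v0) * (p - p0) / (p1 - p0);

  if (!loop) {
    // Unlooped case: specified points are simply frame / sampleRate.
    // (Positions past the last frame are not handled in this sketch.)
    const f = Math.min(Math.floor(pos * sampleRate), data.length - 2);
    return lerp(f / sampleRate, data[f], (f + 1) / sampleRate, data[f + 1], pos);
  }

  const LSF = Math.ceil(loopStart * sampleRate);   // loopStartFrame
  const LEF = Math.ceil(loopEnd * sampleRate - 1); // loopEndFrame
  const L = loopEnd - loopStart;                   // loop duration in seconds

  // Which loop iteration does pos fall in? Iteration 0 is the initial pass
  // through frames 0..LEF; iteration n >= 1 repeats frames LSF..LEF,
  // shifted later in time by n * L seconds.
  const n = pos <= LEF / sampleRate ? 0
          : Math.floor((pos - LSF / sampleRate) / L);
  const local = pos - n * L; // position relative to iteration n's frames
  const f = Math.floor(local * sampleRate);

  if (f < LEF) {
    // Between two adjacent specified frames of the same iteration.
    return lerp(f / sampleRate + n * L, data[f],
                (f + 1) / sampleRate + n * L, data[f + 1], pos);
  }
  // Bridge: between the last frame of iteration n and the first frame of
  // iteration n + 1.
  return lerp(LEF / sampleRate + n * L, data[LEF],
              LSF / sampleRate + (n + 1) * L, data[LSF], pos);
}
```

For the worked example later in this thread (a 10-frame buffer with values 0..9, sampleRate = 1, loopStart = 3.25, loopEnd = 7.5), this sketch evaluates playbackSignal(8, ...) to 4.6, matching the interpolation walkthrough below.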
**Buffer Optimization**

Let the operation *optimize the buffer* be any operation that alters both the buffer contents and sample rate in a way that increases the efficiency or quality of rendering, while minimizing changes to the `playbackSignal()` function. The nature of this operation is up to the UA. Examples of such operations might include upsampling, downsampling, applying a subsample offset, or loop unrolling.

**Initialization**

1. Let `bufferTime` be the playhead position within `buffer` of the next output sample frame. Assign it the initial value -1 to indicate that the position has not yet been determined.
2. Optimize the buffer prior to rendering, if desired.
**Rendering a Block of Audio**

1. Let `currentTime` be the current time of the `AudioContext`.
2. Let `dt` be 1 / (context sample rate).
3. Let `index` be 0.
4. Let `computedPlaybackRate` be `playbackRate * pow(2, detune / 1200)`.
5. Optimize the buffer during rendering, if desired.
6. While `index` is less than the length of the audio block to be rendered:
   - If `currentTime < start` or `currentTime >= stop`, emit silence for the output frame at `index`.
   - Else,
     - If `bufferTime < 0`, set `bufferTime` to `offset + (currentTime - start) * computedPlaybackRate`.
     - Emit the result of `playbackSignal(bufferTime)` as the output frame at `index`.
     - Increase `bufferTime` by `dt * computedPlaybackRate`.
   - Increase `index` by 1.
   - Increase `currentTime` by `dt`.
7. If `currentTime` < `stop`, consider playback to have ended.
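Read as code, and setting aside the AudioWorklet question raised earlier, these steps amount to something like the following sketch. The single-channel output, the `state` object, and the `playbackSignal` callback are illustrative assumptions, not spec text:

```js
// Sketch of the block-rendering steps above, in plain JavaScript.
// state.bufferTime starts at -1, per the Initialization section.
function renderBlock(state, playbackSignal, output) {
  const dt = 1 / state.contextSampleRate;
  // detune is in cents: +1200 cents doubles the playback rate.
  const computedPlaybackRate =
      state.playbackRate * Math.pow(2, state.detune / 1200);

  for (let index = 0; index < output.length; index++) {
    if (state.currentTime < state.start || state.currentTime >= state.stop) {
      output[index] = 0; // emit silence
    } else {
      if (state.bufferTime < 0) {
        // First audible frame: establish the initial playhead position.
        state.bufferTime = state.offset +
            (state.currentTime - state.start) * computedPlaybackRate;
      }
      output[index] = playbackSignal(state.bufferTime);
      state.bufferTime += dt * computedPlaybackRate;
    }
    state.currentTime += dt;
  }
  // (As the follow-up below notes, the end-of-playback test in step 7
  // should really be currentTime >= stop.)
}
```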
Let loopStartFrame be ceil(loopStart * buffer.sampleRate) (the first exact frame index within the loop body), and loopEndFrame be ceil(loopEnd * buffer.sampleRate - 1) (the last exact frame index within the loop body).

I think loopEndFrame should be floor(loopEnd*buffer.sampleRate)

For an unlooped buffer, the specified values of this function are as follows...

If this is going to be part of the spec, you need to say this is only true if playbackRate = 1. Also for sub-sample start, I'm not quite sure this is correct either.

For a looped buffer, this sequence is infinite...

I'm not sure this is correct for sub-sample looping. I need to think about this a bit more.

Let the operation optimize the buffer be any operation that alters both the buffer contents and sample rate...

I don't think we need to say anything about optimizing. I think this is all implied by the playbackSignal function. It can do anything it wants so long as it provides the correct value. We probably do want to give some constraints on what playbackSignal does when the buffer rate and context rate are the same and the start time is on a sample boundary. In that case playbackSignal should produce exactly the samples in the buffer.

Also, bufferTime is not initialized.

If currentTime < stop, consider playback to have ended.

I think you meant currentTime >= stop.

This also seems to be missing the looping points. Assuming playbackRate > 0, we can say something like

if loop == true && bufferTime > loopEnd then bufferTime = loopStart

I think by doing this, we get sub-sample looping. We don't really need loopStartFrame or loopEndFrame.
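Transcribed literally into JavaScript, this check would sit at the point in the rendering loop where bufferTime is advanced each frame. The commented alternative is an aside about preserving the fractional overshoot, not part of the suggestion itself:

```js
// Wrap-around check after advancing bufferTime (positive playbackRate assumed).
if (loop === true && bufferTime > loopEnd) {
  bufferTime = loopStart;
  // A variant that preserves the fractional overshoot past loopEnd would be:
  // bufferTime -= (loopEnd - loopStart);
}
```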
I still need to write an example of sub-sample accurate start for ABSN like I did for Oscillators. Probably need one for ConstantSource when there are automations; it doesn't matter otherwise for a constant source.
-- Ray
I think loopEndFrame should be floor(loopEnd*buffer.sampleRate)
Hmmm... I still don't think so. Recall that loopEnd is exclusive of the loop body.
Let's pretend that sampleRate is 1000 Hz, for clarity, and that we have a 10-sample buffer (of duration 0.010 seconds). Take loopStart as 0 (although this is not relevant).
Take loopEnd as 0.010. The loop clearly includes the entire buffer (indices 0 through 9). floor(loopEnd*buffer.sampleRate) will be floor(10) or 10, which can't be the right answer: there is no frame with index 10. ceil(loopEnd*sampleRate - 1) will be ceil(9) which is 9.
Or take loopEnd as 0.0095. The loop still includes buffer indices 0 through 9, although its duration is less than 0.010. floor(loopEnd*buffer.sampleRate) will be floor(9.5) or 9. ceil(loopEnd*sampleRate - 1) will be ceil(8.5) which is also 9.
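Both cases can be checked directly (plain JavaScript; these particular values happen to round exactly in IEEE doubles, so the printed results match the prose):

```js
// Comparing the two candidate formulas on the two loopEnd values above
// (sampleRate = 1000 Hz, 10-frame buffer, valid frame indices 0..9).
const sampleRate = 1000;
for (const loopEnd of [0.010, 0.0095]) {
  console.log(loopEnd,
      Math.floor(loopEnd * sampleRate),     // 10, then 9
      Math.ceil(loopEnd * sampleRate - 1)); // 9, then 9
}
```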
If this is going to be part of the spec, you need to say this is only true if playbackRate = 1. Also for sub-sample start, I'm not quite sure this is correct either.
I have already defined the concept of "playback position" as independent of playback rate. The accounting for computed playback rate occurs at the end of this mini-spec, and happens prior to interpolating this function.
I don't think we need to say anything about optimizing. I think this is all implied by the playbackSignal function. It can do anything it wants so long as it provides the correct value.
The optimization does have an effect on where frames fall within the buffer, so I think it's important to say that it can happen at various prescribed points, because this can result in interpolation results differing during the course of playback.
We probably do want to give some constraints on what playbackSignal does when the buffer rate and context rate are the same and the start time is on a sample boundary. In that case playbackSignal should produce exactly the samples in the buffer.
I agree that we need to say this, although I think it already follows from the definitions given.
If currentTime < stop, consider playback to have ended.
I think you meant currentTime >= stop.
Yes I did! Thanks.
This also seems to be missing the looping points. Assuming playbackRate > 0, we can say something like if loop == true && bufferTime > loopEnd then bufferTime = loopStart
I think by doing this, we get sub-sample looping. We don't really need loopStartFrame or loopEndFrame.
I disagree. This approach of "wrapping around" the buffer time that you suggested caused all kinds of spec problems in the past. It requires a lot of gymnastics for playback rates that can go to zero or negative, and recall that the first iteration of the loop is preceded by the "prefix" between bufferTime == 0 and loopStart, which makes running time backwards non-trivial if you just keep wrapping bufferTime around. The current approach takes care of all that, because bufferTime always increases or decreases continuously as per the computed playback rate, without jumps.
Also, we still very much need the definition of loop start/end frames in order to prescribe what data points are being interpolated by playbackSignal(). Without these definitions, it becomes very difficult to say what the buffer's contents actually mean to the interpolation algorithm.
At least, that's my current 2-cent opinion :-)
Appreciate all of your comments and especially your effort on this.
I think I've confused myself many times, so can we start over with a simple example so we can agree on what should happen.
Let's assume a context and source with a sample rate of 1 Hz to keep things simple. Let the source have a buffer that has 10 samples, with values 0, 1, 2,...,9.
Let loopStart = 3.25, loopEnd = 7.5 (arbitrarily chosen).
For this case, source.start(0) and currentTime = 0. Let out[t] be the output value at time t.
Then out[0] = 0, out[1] = 1, out[2] = 2, and so on up to out[7] = 7. Since loopEnd is 7.5, out[8] can't be 8 because that would be past the end of the loop. To get the output for time 8, we need to go back to loopStart.
The question here is what the value of out[8] should be. Should out[8] = buffer[loopStart]? Since loopStart is not on a sampling boundary, do we interpolate (somehow) and say that out[8] = buffer[3.5] = 3.5? If so, the output would then be buffer[3.5], buffer[4.5], buffer[5.5], buffer[6.5], buffer[7.5] and loop back to 3.5 again. But do we actually output buffer[7.5], since loopEnd = 7.5?
Or should out[8] = buffer[4] = 4? Then we continue with buffer[5], buffer[6], buffer[7] and then go back to loopStart since 8 > loopEnd.
I think if we can answer these questions we'll have an appropriate algorithm that we should be able to extend easily to arbitrary source start time and playback rate.
-- Ray
Let's assume a context and source with a sample rate of 1 Hz to keep things simple. Let the source have a buffer that has 10 samples, with values 0, 1, 2,...,9.
Let loopStart = 3.25, loopEnd = 7.5 (arbitrarily chosen).
I so love examples!!! Please read on...
For this case, source.start(0) and currentTime = 0. Let out[t] be the output value at time t.
Then out[0] = 0, out[1] = 1, out[2] = 2, and so on up to out[7] = 7. Since loopEnd is 7.5, out[8] can't be 8 because that would be past the end of the loop. To get the output for time 8, we need to go back to loopStart.
The question here is what the value of out[8] should be. Should out[8] = buffer[loopStart]? Since loopStart is not on a sampling boundary, do we interpolate (somehow) and say that out[8] = buffer[3.5] = 3.5? If so, the output would then be buffer[3.5], buffer[4.5], buffer[5.5], buffer[6.5], buffer[7.5] and loop back to 3.5 again. But do we actually output buffer[7.5], since loopEnd = 7.5?
Or should out[8] = buffer[4] = 4? Then we continue with buffer[5], buffer[6], buffer[7] and then go back to loopStart since 8 > loopEnd.
I think if we can answer these questions we'll have an appropriate algorithm that we should be able to extend easily to arbitrary source start time and playback rate.
This is a great question and I believe that the above spec language actually does answer it. In fact this is the exact variety of question that led me to the proposed approach. Let me walk through the answer in detail.
First, let me use this example to fill out the values of the table I spec'd, that maps from playbackPosition to signalValue -- the second table, which handles the case of loops. I will leave out all of the sampleRate divisors since sampleRate is 1 in this world. Note that in this case the signal value is the frame index in our imaginary world (i.e. the signal value at channelData[N] is N).
Let loopStartFrame = ceil(loopStart) = ceil(3.25) = 4
Let loopEndFrame = ceil(loopEnd - 1) = ceil(7.5 - 1) = 7
Note: loopEnd - loopStart = 7.5 - 3.25 = 4.25
position expression | actual position | signal value |
---|---|---|
0 | 0 | 0 |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
loopStartFrame | 4 | 4 |
... | ... | ... |
loopEndFrame | 7 | 7 |
loopStartFrame + (loopEnd - loopStart) | 8.25 | 4 |
... | ... | ... |
loopEndFrame + (loopEnd - loopStart) | 11.25 | 7 |
loopStartFrame + (2 * (loopEnd - loopStart)) | 12.5 | 4 |
... | ... | ... |
The position value of 8.25 might seem surprising to you, but here's the rationale (which is encoded in the formula loopStartFrame + (loopEnd - loopStart)): the end of the loop is at pos=7.5 (0.5 after the last loop sample at pos=7), and it wraps back to the start of the loop at 3.25 (0.75 before the first loop sample at pos=4). So after [7] we have a time interval of 0.5 to the loop wraparound point, and then another time interval of 0.75 to [4], the first sample in the loop body. That takes us to pos=8.25.
So... what is the value of out[8] (or, playbackSignal(8) in spec language)?
Let's assume that the UA is using linear interpolation. We have two adjacent data points in the neighborhood of 8 that give the signal value for exact sample frame positions: out[7] = 7, and out[8.25] = 4. The answer is therefore 7 + ((4 - 7) * (8 - 7) / (8.25 - 7)), which comes to 4.6. Makes sense: 8 is almost (but not quite) at the position which would yield exactly 4.
Doing the exact same interpolation at the positions of 7 and 8.25 gives the expected signal values of 7 and 4 respectively.
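The same computation as straight-line JavaScript, interpolating between the two neighboring specified points (position 7, value 7) and (position 8.25, value 4):

```js
// Linear interpolation between specified data points (p0, v0) and (p1, v1).
const lerp = (p0, v0, p1, v1, p) => v0 + (v1 - v0) * (p - p0) / (p1 - p0);

console.log(lerp(7, 7, 8.25, 4, 8));    // ≈ 4.6
console.log(lerp(7, 7, 8.25, 4, 7));    // 7  (endpoints reproduce exactly)
console.log(lerp(7, 7, 8.25, 4, 8.25)); // 4
```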
@rtoy By the way I noticed later that you used two different values in your example: it begins with loopStart = 3.25, but then later you use the value loopStart = 3.5.
Let's look at this other case of loopStart = 3.5, loopEnd = 7.5 because it's also instructive (and maybe it's what you meant all along). In this case, the loop spans an exact number of samples and so it yields a simpler result.
All of the formulas I gave before apply, but the results are different. The table looks like this:
Let loopStartFrame = ceil(loopStart) = ceil(3.5) = 4
Let loopEndFrame = ceil(loopEnd - 1) = ceil(7.5 - 1) = 7
Note: loopEnd - loopStart = 7.5 - 3.5 = 4
position expression | actual position | signal value |
---|---|---|
0 | 0 | 0 |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
loopStartFrame | 4 | 4 |
... | ... | ... |
loopEndFrame | 7 | 7 |
loopStartFrame + (loopEnd - loopStart) | 8 | 4 |
... | ... | ... |
loopEndFrame + (loopEnd - loopStart) | 11 | 7 |
loopStartFrame + (2 * (loopEnd - loopStart)) | 12 | 4 |
... | ... | ... |
So the value of out[8] is simply 4. No interpolation required.
In fact, this result is what you get whenever 3 < loopStart <= 4 and loopEnd = loopStart + 4. Which makes sense: when the loop encompasses an exact number of sample frames, it doesn't matter exactly where it starts and ends so long as the same sample frames are included.
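A small spot-check of that invariant (the loopStart values are chosen to be exactly representable in binary, so the floating-point arithmetic is exact):

```js
// For any loopStart with 3 < loopStart <= 4 and loopEnd = loopStart + 4
// (sampleRate = 1 Hz), the same frames 4..7 are looped and the first
// repeat of loopStartFrame lands exactly at position 8.
for (const loopStart of [3.25, 3.5, 3.75, 4]) {
  const loopEnd = loopStart + 4;
  const loopStartFrame = Math.ceil(loopStart); // 4 every time
  const loopEndFrame = Math.ceil(loopEnd - 1); // 7 every time
  console.log(loopStartFrame, loopEndFrame,
              loopStartFrame + (loopEnd - loopStart)); // 4 7 8
}
```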
Audio-ISSUE-93 (AudioBufferSourceNodePlaybackRate): AudioBufferSourceNode.playbackRate not strictly defined [Web Audio API]
http://www.w3.org/2011/audio/track/issues/93
Raised by: Philip Jägenstedt
On product: Web Audio API
While it's fairly easy to guess what it should do, playbackRate is not actually well defined. In particular, it should be clear what a playbackRate of 0 means and whether or not negative rates are allowed. (Given an AudioParam oscillating between -1 and 1 it would be possible to remain in PLAYING_STATE perpetually.)