WebAudio / web-audio-api

The Web Audio API v1.0, developed by the W3C Audio WG
https://webaudio.github.io/web-audio-api/

(AudioBufferSourceNodePlaybackRate): AudioBufferSourceNode.playbackRate not strictly defined #95

Closed olivierthereaux closed 7 years ago

olivierthereaux commented 11 years ago

Originally reported on W3C Bugzilla ISSUE-17378 Tue, 05 Jun 2012 12:00:25 GMT Reported by Philip Jägenstedt Assigned to

Audio-ISSUE-93 (AudioBufferSourceNodePlaybackRate): AudioBufferSourceNode.playbackRate not strictly defined [Web Audio API]

http://www.w3.org/2011/audio/track/issues/93

Raised by: Philip Jägenstedt On product: Web Audio API

While it's fairly easy to guess what it should do, playbackRate is not actually well defined. In particular, it should be clear what playbackRate 0 means and whether or not negative rates are allowed. (Given an AudioParam oscillating between -1 and 1 it would be possible to remain in PLAYING_STATE perpetually.)
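
For concreteness, the oscillating-rate scenario could be set up like this (a minimal sketch; someAudioBuffer is a hypothetical, already-decoded buffer):

// An LFO drives playbackRate through the range [-1, 1], so the
// effective rate keeps crossing zero and the node could remain in
// PLAYING_STATE indefinitely.
const ctx = new AudioContext();
const source = ctx.createBufferSource();
source.buffer = someAudioBuffer;

const lfo = ctx.createOscillator();
lfo.frequency.value = 0.5; // one full -1..1 sweep every two seconds

// A signal connected to an AudioParam sums with the param's intrinsic
// value, so zero that out and let the oscillator supply the whole range.
source.playbackRate.value = 0;
lfo.connect(source.playbackRate);

lfo.start();
source.start();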

olivierthereaux commented 11 years ago

Original comment by Ehsan Akhgari [:ehsan] on W3C Bugzilla. Thu, 01 Aug 2013 17:24:34 GMT

I think it probably makes sense to specify a 0 playbackRate to produce silence, and for negative values to play the buffer backwards, that is, from duration to offset (or from loopEnd to loopStart).

Also, note that there is another way that this node can perform resampling: when a doppler shift is applied to it by a PannerNode. I think it makes sense to specify what needs to happen based on the multiplication of these two ratios.

Another point brought up on today's call was the handling of values larger than one, but I think that is probably non-controversial: specify that the final computed sampling rate ratio is multiplied by the sampling rate of the AudioBufferSourceNode's buffer in order to determine the target sampling rate that the resampler should use.
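
A sketch of that rate arithmetic, with hypothetical names (the spec text would define the exact terms):

// The final ratio combines playbackRate with any doppler shift; the
// buffer's own sample rate is then folded in to get the rate the
// resampler should target.
function targetSampleRate(bufferSampleRate, playbackRate, dopplerShift) {
  const ratio = playbackRate * dopplerShift;
  return bufferSampleRate * ratio; // e.g. 44100 * (2.0 * 1.5) = 132300
}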

olivierthereaux commented 11 years ago

Original comment by Olivier Thereaux on W3C Bugzilla. Fri, 02 Aug 2013 09:12:14 GMT

Per our meeting on 2013-08-01 (http://www.w3.org/2013/08/01-audio-minutes.html), this is a call for volunteers to suggest a patch to the web audio API spec to define the expected behaviour when setting negative values for AudioBufferSourceNode.playbackRate.

olivierthereaux commented 11 years ago

Original comment by Ehsan Akhgari [:ehsan] on W3C Bugzilla. Fri, 02 Aug 2013 14:25:44 GMT

(In reply to comment #2)

Per our meeting on 2013-08-01 (http://www.w3.org/2013/08/01-audio-minutes.html), this is a call for volunteers to suggest a patch to the web audio API spec to define the expected behaviour when setting negative values for AudioBufferSourceNode.playbackRate.

Hmm, this is what I was hoping to do in comment 1. :-) Do we absolutely need a patch here? I think it probably makes sense to have the basic discussion first and then move to the exact prose once everybody is on the same page about what we want. Do you agree?

olivierthereaux commented 11 years ago

Original comment by Joe Berkovitz / NF on W3C Bugzilla. Fri, 02 Aug 2013 15:21:38 GMT

I was thinking about volunteering a patch but reached the same conclusion as Ehsan: we need to have a basic discussion first.

The basic outline of my proposal is different from Ehsan's but similar in spirit (I think):

(NB: I would not include the effect of downstream resampling-like effects such as doppler shifts, as I think this may lead to confusion over the behavior of graphs with branched routing. It seems harder for developers to predict what will happen.)

olivierthereaux commented 11 years ago

Original comment by Ehsan Akhgari [:ehsan] on W3C Bugzilla. Fri, 02 Aug 2013 19:18:05 GMT

Doesn't this assume a linear interpolating resampler? The resampler that we use in Gecko is much more complicated (and of higher quality as a result) than that! (It's the libspeex resampler.)

If we're going to make it possible for implementations to compete on the resampler quality, assuming the resampling algorithm seems like a mistake.

olivierthereaux commented 11 years ago

Original comment by Joe Berkovitz / NF on W3C Bugzilla. Fri, 02 Aug 2013 19:34:03 GMT

@Ehsan: I had no intention of assuming any particular algorithm (and tried to call this out -- sorry if it was unclear). Of course linear interpolation is not a preferred choice. I suppose that a literal interpretation of my proposal could suggest linear interpolation, but that was not the intention.

The proposal specifies a sequence of {data window, effective sampling rate} pairs with fractional sample-offset boundaries that form the input to an arbitrary interpolation algorithm. How the interpolator makes use of this sequence is not a concern. A nonlinear interpolator can work with as much of the sequence as it likes, processing arbitrarily large batches of data points at a time.

Of course, in practice an implementor would probably not accumulate such a sequence and apply an interpolation algorithm to it; this is an idealized behavior for specification purposes.

If this approach turns out to be too naïve I welcome an improved recasting of it. I think the important aspect of it has to do with the way that playback progress through the buffer is affected by a time-varying playback rate, and I found an idealized cursor the easiest way to express this progress.

olivierthereaux commented 11 years ago

Original comment by Chris Wilson on W3C Bugzilla. Mon, 05 Aug 2013 01:49:06 GMT

+1 to Joe's general idea - I also do NOT agree that playbackRate < 0 should change where the cursor starts; other than that, I think we're all on the same page.

olivierthereaux commented 11 years ago

Original comment by Joe Berkovitz / NF on W3C Bugzilla. Mon, 05 Aug 2013 17:51:36 GMT

Just to amplify Chris's comment: apart from my attempt to tease out a more detailed spec of playbackRate, the main behavioral difference in my proposal from Ehsan's is that a negative playbackRate does not cause playback to start at a different point than it would have otherwise. playbackRate determines the time derivative of a "playback path" through the buffer, but not the origin of that path, which remains the buffer offset as specified in the start() call (which defaults to 0).

If we want the ability to start playing a buffer from the end, I think there's a clearer and more explicit way to do that: attach that interpretation to a negative "offset" parameter passed to AudioBufferSourceNode.start(). I don't feel strongly that we need that feature but I do think we should avoid overloading the meaning of playbackRate w/r/t start offsets.
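
A hypothetical usage sketch of that suggestion (this negative-offset interpretation is only a proposal, not spec behavior):

// Proposed semantics: a negative offset to start() counts back from
// the buffer's end, making "play from the end, backwards" explicit
// instead of overloading playbackRate.
source.playbackRate.value = -1;
source.start(ctx.currentTime, -0.25); // begin 0.25 s before the end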

olivierthereaux commented 11 years ago

Original comment by Ehsan Akhgari [:ehsan] on W3C Bugzilla. Thu, 08 Aug 2013 03:21:06 GMT

I think I was unclear about what I meant, sorry about that. In the first paragraph of comment 1, I meant to describe the cursor jump boundaries, not that the playback should start at `duration'. In other words, I meant to propose exactly the same thing as Joe described better in terms of the cursor concept. In light of comment 6, I believe we're mostly proposing the same thing (with my proposal intentionally not talking about the details of the resampling, and with Joe's proposal doing a much better job describing the cursor concept, etc.)

cwilso commented 9 years ago

1) I think playbackRate of 0 (presuming the buffer is playing) should result in the output of the current sample, not "silence" (which implies zero). Otherwise, ramping to zero and then back to a very small value will cause a click.

2) I think negative rates should be treated as zero. Playing backwards complicates the model.
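
The ramp in point 1 might look like this (a sketch; ctx and source as in a typical setup). Under that proposal, rate 0 holds the current sample rather than outputting zeros, so nothing jumps when the ramp comes back up:

const now = ctx.currentTime;
source.playbackRate.setValueAtTime(1, now);
source.playbackRate.linearRampToValueAtTime(0, now + 1);    // slow to a stop
source.playbackRate.linearRampToValueAtTime(0.01, now + 2); // creep forward again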

jernoble commented 9 years ago

It came up a few times at the Web Audio Conference that page authors were going to crazy lengths to get negative playbackRate AudioBufferSourceNodes working. Whether or not "playing backwards complicates the model", this is something the API should provide.
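
One common workaround, sketched here: reversing the buffer's channel data by hand and playing the result forward at a positive rate.

// Fake negative playbackRate by reversing every channel in place.
function reverseBuffer(buffer) {
  for (let ch = 0; ch < buffer.numberOfChannels; ch++) {
    buffer.getChannelData(ch).reverse();
  }
  return buffer;
}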

cwilso commented 9 years ago

It could be specified, and I'm fine doing so if the vendors on the WG would sign up to implement it. If so, I'd suggest this as a rough draft of rules:

If looping=false, then playing backward ceases playback when it reaches the start point. So if playbackRate is negative, buffer.start(0,0) will immediately cease playback. If the offset is >0, I think playback ceases when the start offset is reached. (Clearly, the interesting scenarios here are when the playbackRate is set to negative after proceeding forward for some time.) Or, conversely (and maybe this is more interesting), playback proceeds backward past the starting offset to the beginning of the buffer. Note that duration will need to be slightly redefined to account for negative playbackRate. (I think it just doesn't apply when playbackRate < 0.)

If looping is true, then playing backward works differently depending on whether playback is in the "lead-in" portion (i.e. before loopStart) or in the looping portion when playbackRate is set to negative. I'd suggest that if it's in the lead-in portion it proceeds backward until it hits the starting offset (or the buffer's beginning, see above), then stops (despite it being "looping"). If it's in the looping portion, it should proceed to the loop start, then wrap to the loop end and keep going (in reverse).
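
The wrap rule for the looping portion could be sketched like this (names hypothetical, positions in sample frames):

// Reverse playback inside the loop body: on reaching loopStart, wrap
// the read position back toward loopEnd and keep moving backward.
if (loop && playbackRate < 0 && readPos <= loopStartPos) {
  readPos += loopEndPos - loopStartPos;
}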

Marked as "Needs WG Review" so we'll discuss.

notthetup commented 9 years ago

👍

jernoble commented 9 years ago

"It could be specified, and I'm fine doing so if the vendors on the WG would sign up to implement it."

See: https://bugs.webkit.org/show_bug.cgi?id=140955

notthetup commented 9 years ago

👍 👍

padenot commented 9 years ago

I agree this is useful, and I agree with cwilso that we need to discuss the exact behavior.

joeberkovitz commented 9 years ago

There is also an issue with the fact that the nature of playbackRate's interpolation is not fully specified. This is especially important when working with speeds substantially slower than 1.

padenot commented 8 years ago

Resolution: spec negative playbackRate as having exactly mirrored behaviour from the positive playbackRate behaviour.

padenot commented 8 years ago

I'm going to do this by speccing the algorithm used to compute a block of audio with an AudioBufferSourceNode so that we can kill all ambiguities in one shot.

padenot commented 8 years ago

I've started drafting this. For now, this only handles positive playback rates, and is not super elegant. I'd be interested in feedback.

To convert a value from seconds to sample-frames, multiply it by the nominal sample-rate and round to the nearest integer.

  1. Let readIndex be the value of the offset parameter of the start method, converted to sample-frame time in the sample-rate of the AudioContext.
  2. Let startOffset be the value of the when parameter of the start method, converted to sample-frame time in the sample-rate of the AudioContext.
  3. Let stopPoint be the value of the when parameter of the stop method converted to sample-frame time in the sample-rate of the AudioContext, or +Infinity if stop has not been called.
  4. Let duration be the value of the duration attribute of the AudioBuffer, converted to sample-frame time, in the sample-rate of the AudioContext.
  5. Let currentTime be the value of AudioContext.currentTime converted to sample-frame time at the sample-rate of the AudioContext.
  6. Let loopStart be the value of the loopStart attribute converted to sample-frame time at the sample-rate of the AudioContext.
  7. Let loopEnd be the value of the loopEnd attribute converted to sample-frame time at the sample-rate of the AudioContext.

Writing silence to a sequence s from index start to index end means setting the elements from s[start] to s[end] to 0.0.
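
Restated as hypothetical helpers, the two definitions above are:

// Seconds-to-sample-frame conversion, per the definition above.
function toSampleFrames(seconds, sampleRate) {
  return Math.round(seconds * sampleRate);
}

// "Writing silence" to a sequence s from index start to index end.
function writeSilence(s, start, end) {
  s.fill(0.0, start, end);
}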

Rendering a block of audio for an AudioBufferSourceNode means executing the following steps:

  1. Let s be a sequence of 128 floats.
  2. Let writeIndex be 0.
  3. Let input rate be the sample-rate of the AudioBuffer.
  4. Let output rate be the sample-rate of the AudioContext divided by the value of computedPlaybackRate at currentTime.
  5. Let resampling ratio be input rate divided by output rate.
  6. While writeIndex is not 128
    1. If currentTime is less than startOffset:
      1. Write silence to s from index writeIndex to index startOffset - currentTime.
      2. Increment writeIndex by startOffset - currentTime.
    2. If looping is True, let outputFrames be the minimum of count, stopPoint - currentTime and loopEnd - currentTime.
    3. Else, let outputFrames be the minimum of count, stopPoint - currentTime, and duration.
    4. Resample a portion of the input buffer from input rate to output rate to produce outputFrames frames, and copy them to s starting at writeIndex.
    5. Increment readIndex by the number of frames consumed by the resampler.
    6. If looping is True and readIndex is equal to loopEnd, set readIndex to loopStart.
    7. Increment writeIndex by the number of frames produced by the resampler.
    8. If currentTime + writeIndex is equal to stopPoint:
      1. Set an ended flag to true.
      2. Write silence to s from index writeIndex to index endOffset.
  7. If ended is True:
    1. Post a task to fire a simple event named "ended" at the AudioBufferSourceNode, and remove the self-reference.

This algorithm is intentionally vague on the resampling process, in terms of input and output frame count, as different resampling techniques have different characteristics.

joeberkovitz commented 8 years ago

Thanks @padenot! Here are some of the content issues and questions I see. I could try to take a crack at addressing them but thought it would be better to surface these questions first.

rtoy commented 7 years ago

Also, recall https://github.com/WebAudio/web-audio-api/issues/915#issuecomment-248930220 where we decided that we should actually do sub-sample accurate sampling. Although that was for the oscillator, I think we need to do the same for a buffer source because I think the same issues will show up.

rtoy commented 7 years ago

I think the algorithm would be simpler if we just first described the case where the buffer rate matched the context rate. When the rates don't match, we can say that the same algorithm applies if the buffer behaved as if it were first resampled and used in this algorithm.

I think you also need a step 8 for the case where ended is not true so that you go back to step 2. Also probably want to reorder the initial steps so that step 2 is the step just before step 6. All of these steps are just initializations that only need to happen once.

I also don't think we need to make this process 128 frames at a time. To describe the algorithm, we don't really need the finite-sized s buffer. We can assume unlimited length and writeIndex keeps track of what we're doing, and it can increment forever.

padenot commented 7 years ago

Another iteration on this, with comments addressed and negative playbackRate support. Note the loop: depending on whether we're looping and have reached the end point of the loop, we may loop multiple times; the same applies if the loop is very small. I tried to perform sub-sample accurate start, but it's probably not correct for the loop end; I need to double check.

To convert a value from seconds to sample-frames, multiply it by the nominal sample-rate.

Sample-frame time is a fractional time value in frames, in the sample-rate of the AudioContext. It MUST be rounded to access the samples themselves. This is used to implement sub-sample accurate start points, stop points and loop points.

This algorithm is described for a playbackRate of 1.0, and when the sample-rate of the AudioBuffer is the same as the sample-rate of the AudioContext, i.e. when no resampling is necessary. If this is not the case, execute those steps before producing a block of audio:

  1. Let input rate be the sample-rate of the AudioBuffer.
  2. Let output rate be the sample-rate of the AudioContext divided by the value of computedPlaybackRate at currentTime.
  3. Let resampling ratio be input rate divided by output rate.
  4. If resampling ratio is negative, reverse the data of the AudioBuffer and let resampling ratio be -resampling ratio.
  5. If resampling ratio is not 1.0, resample the AudioBuffer data, and use this resampled audio data as the source data.
  6. Else, the source data is the raw data of the AudioBuffer.
  7. Let readIndex be the value of the offset parameter of the start method, converted to sample-frame time.

Writing silence to a sequence s from index start to index end means setting the elements between s[start] to s[end] to 0.0.

Rendering a block of audio for an AudioBufferSourceNode means executing the following steps:

  1. Let source be the buffer containing the source data, to be indexed by readIndex.
  2. Let startOffset be the value of the when parameter of the start method, converted to sample-frame time.
  3. Let stopPoint be the value of the when parameter of the stop method converted to sample-frame time, or +Infinity if stop has not been called.
  4. Let duration be the value of the duration attribute of the AudioBuffer, converted to sample-frame time.
  5. Let currentTime be the value of AudioContext.currentTime converted to sample-frame time.
  6. Let loopStart be the value of the loopStart attribute converted to sample-frame time.
  7. Let loopEnd be the value of the loopEnd attribute converted to sample-frame time.
  8. Let s be a sequence of 128 floats.
  9. Let writeIndex be 0.
  10. While writeIndex is not 128
    1. If currentTime is less than startOffset:
      1. Write silence to s from index writeIndex to the minimum of startOffset - currentTime and 128.
    2. Increment writeIndex by startOffset - currentTime.
    3. Jump to the beginning of this loop.
    4. Let count be 128 - writeIndex.
    5. If loop is True, let outputFrames be the minimum of count, stopPoint - currentTime and startOffset + loopEnd - currentTime.
    6. Else, let outputFrames be the minimum of count, stopPoint - currentTime, and startOffset + duration.
    7. If readIndex is not an integer, perform sub-sample interpolation:
      1. Let left be readIndex - ceil(readIndex) and right be 1 - left.
      2. Set s[writeIndex] to source[readIndex] * left.
      3. Increment writeIndex by 1. If writeIndex is 128, jump to the beginning of this loop.
      4. Set s[writeIndex] to source[readIndex] * right.
    5. Let readIndex be ceil(readIndex). If readIndex is greater than loopEnd and loop is True, subtract loopEnd - loopStart from readIndex and jump to the beginning of this loop.
    8. Copy outputFrames frames of audio from the source data starting at index readIndex to s starting at index writeIndex.
    9. Increment readIndex by outputFrames.
    10. Increment writeIndex by outputFrames.
    11. If loop is True and readIndex is greater than or equal to loopEnd, subtract loopEnd - loopStart from readIndex.
    12. If currentTime + writeIndex is greater than or equal to stopPoint or startOffset + duration:
      1. Set an ended flag to true.
      2. Write silence to s from index writeIndex to index endOffset.
  11. If ended is True:
    1. Post a task to fire a simple event named "ended" at the AudioBufferSourceNode, and remove the self-reference.
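
One possible reading of the sub-sample access in step 7, as a sketch (linear interpolation shown; as noted above, the handling around the loop end is still being checked):

// When readIndex is fractional, interpolate between the two
// neighboring source frames.
function sampleAt(source, readIndex) {
  const i = Math.floor(readIndex);
  const frac = readIndex - i;
  const left = source[i];
  const right = i + 1 < source.length ? source[i + 1] : 0;
  return left + frac * (right - left);
}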

joeberkovitz commented 7 years ago
  1. If resampling ratio is not 1.0, resample the AudioBuffer data, and use this resampled audio data as the source data.

Specify that the resampling results in new audio data whose sample rate is now output rate.

  1. Let readIndex be the value of the offset parameter of the start method, converted to sample-frame time.

The phrase "sample-frame time" is ambiguous with at least 3 different rates in play. In this case I believe offset would be specified by the caller as time units that assume the buffer's own sample rate, i.e. input rate. But in other cases, the phrase clearly refers to AudioContext rate. I think it would be clearer always to say something explicit like, "let readIndex be offset multiplied by input rate". This issue comes up a bunch of times in the algorithm.

ii. Increment writeIndex by startOffset - currentTime.

I think this needs to increment writeIndex by the lesser of the given expression, or 128 (the same amount as the silence that was written in the previous step).

vi. Else, let outputFrames be the minimum of count, stopPoint - currentTime, and startOffset + duration.

The last expression I think should be startOffset + duration - currentTime.

vii (sub sample interpolation)

I see a few different issues here although perhaps I've missed something important:

rtoy commented 7 years ago

Let me propose a somewhat different expression of the algorithm. I think it's a bit simpler, but a bit harder to visualize. It would be best if I drew some simple diagrams to show what happens at the buffer start, the loop end and the loop start points. Anyway, without further ado:

Let

t = context.currentTime
dt = 1 / context.sampleRate

tb = ABSN buffer index
dtb = playbackRate
ts = start time for ABSN
tls, tle = loop start and end time

b = ABSN buffer, resampled if necessary to the context sample rate such that b[0] still represents the very first value.

If ABSN is started with start(when), set ts = when, duration = length of buffer in seconds, and toff = 0.

If ABSN is started with start(when, offset, duration), set ts = when, toff = offset and duration = duration if given or infinity.

Let bufferSample(k) be a function that interpolates values from the ABSN buffer such that if k is an integer, b[k] is the result. If k is not an integer, compute an interpolated value using b[n] and b[n+1] where n = floor(k). Interpolation method is not specified and more samples of the buffer are allowed to be used.

Let output(x) be a function that outputs the value x as the next output sample of the ABSN.

t = 0;
while (1) {
  if (t <= ts && ts < t + dt) {
    // Buffer is starting. Compute an offset into the buffer based on
    // when the current sample is being taken.
    tb = (t + dt - ts + toff) * sampleRate

    total = 0;
    while (1) {
      if (loop == false) {
        if (total > duration) {
          // We've reached the end of the buffer and/or duration, so
          // stop
          break;
        }
      } else {
        // We're looping
        if (total > duration) {
          // We've output duration seconds of audio. Time to stop.
          break;
        }
        if (tle < tb) {
          // We're trying to output the first sample PAST the end of
          // the loop.  Rewind the buffer pointer back to the loop
          // start.
          tb = ceil(tls / sampleRate) * sampleRate;
        }
      }

      output(bufferSample(tb))

      tb = tb + dtb;
      // Wrap tb if tb goes past the end of the buffer.  This wraps tb
      // back near the beginning of the buffer or near the start of
      // the loop point.
      tb = wrapPointer(tb);

      t = t + dt;
      total = total + dt;
    }
  }

  output(0);
  t = t + dt;
}

joeberkovitz commented 7 years ago

@rtoy I agree that this way of expressing the algorithm is easier to understand, although it would need some preamble explaining that it ignores rendering quanta and instead describes an idealized sequence of sample frames generated by an ABSN as if nothing else were going on.

Also, this algorithm can deal with computed playback rate more gracefully. The previous version appears to treat computedPlaybackRate as a constant that can be eliminated via resampling; this version only resamples to compensate for the AudioBuffer's own sample rate, and treats computedPlaybackRate as truly dynamic (since it is k-rate).

I did identify a couple of problems with the algorithm and also felt the control structures and variable names made it a bit more obscure, so I've taken the liberty of trying to do another iteration below -- but it's really just a restatement of @rtoy's version.

I also attempted to incorporate explicit logic for the wrapping of loops as I believe this needs to be spelled out and not left to the UA.

Here we go:

while (1) {
  if (currentTime <= start && start < currentTime + dt) {
    // Buffer is starting. Compute an offset into the buffer based on
    // when the current sample is being taken.
    sampleIndex = (currentTime + dt - start + offset) * sampleRate

    total = 0;
    while (1) {
      if (total > duration) {
        // We've reached the end of the buffer and/or duration, so
        // stop
        break;
      }

      if (loop && sampleIndex > loopEndIndex) {
        // We're trying to output the first sample PAST the end of
        // the loop.  Rewind the buffer pointer back to the loop
        // start.
        sampleIndex = loopStartIndex;

        // @rtoy: Didn't understand rationale for the following
        // sampleIndex = ceil(loopStart / sampleRate) * sampleRate;
      }

      output(bufferSample(sampleIndex))

      sampleIndex += dtb * computedPlaybackRate;  // @rtoy did not see any reference to playback rate 

      // Wrap tb if tb goes past the end of the buffer.  This wraps tb
      // back near the beginning of the buffer or near the start of
      // the loop point.
      while (loop && sampleIndex > loopEndIndex) {
        sampleIndex -= loopEndIndex - loopStartIndex;
      }

      currentTime += dt;
      total += dt;
    }
  }

  output(0);
  currentTime += dt;
}

rtoy commented 7 years ago

Thanks for cleaning up the code. I was lazy about typing things out, so this looks much better. I think this algorithm works fine if computedPlaybackRate is non-negative. If it's negative, we'll have to do something about the various tests because they'll have to change direction or maybe swap loop start and loop end. (I assume that's how loop points would work for negative playbackRate.)

You don't describe how loopStartIndex and loopEndIndex are computed from loopStart and loopEnd. I think loopStartIndex is probably ceil(loopStart / sampleRate) * sampleRate and loopEndIndex is similar but uses floor instead of ceil. But maybe loopStartIndex = loopStart * sampleRate.

If loopStart * sampleRate lies between two sample points, what should we output? I guess I was thinking we should output the buffer sample just past that. But maybe it should be the interpolated value between the sample points?

joeberkovitz commented 7 years ago

If it's negative, we'll have to do something about the various tests because they'll have to change direction or maybe swap loop start and loop end. (I assume that's how loop points would work for negative playbackRate.)

Correct, probably that while loop should work in the opposite sense if computedPlaybackRate is negative.

You don't describe how loopStartIndex and loopEndIndex are computed from loopStart and loopEnd. I think loopStartIndex is probably ceil(loopStart / sampleRate) * sampleRate and loopEndIndex is similar but uses floor instead of ceil. But maybe loopStartIndex = loopStart * sampleRate.

Two points:

If loopStart * sampleRate lies between two sample points, what should we output? I guess I was thinking we should output the buffer sample just past that. But maybe it should be the interpolated value between the sample points?

Since loopStart is allowed to lie between frames, on any sort of looping back to a start point we should just assign loopStartIndex (which can be fractional as noted above) to sampleIndex and not try to force to the frame to the left or to the right.

A similar argument applies to the loop endpoint in a negative playback rate situation.

rtoy commented 7 years ago


If it's negative, we'll have to do something about the various tests because they'll have to change direction or maybe swap loop start and loop end. (I assume that's how loop points would work for negative playbackRate.)

Correct, probably that while loop should work in the opposite sense if computedPlaybackRate is negative.

You don't describe how loopStartIndex and loopEndIndex are computed from loopStart and loopEnd. I think loopStartIndex is probably ceil(loopStart / sampleRate) * sampleRate and loopEndIndex is similar but uses floor instead of ceil. But maybe loopStartIndex = loopStart * sampleRate.

Two points:

- We can't use ceil() or floor() because we've agreed that loop points are not quantized. These are "indices" only in the sense that they are expressed in units of sample frames, but they cannot be quantized. So loopStart(End)Index = loopStart(End) * buffer sample rate.

I think I've confused myself. I need to draw a picture and study it to figure out what we really want.

- We need to note in the algorithm that sampleRate has to be the sample rate of the buffer, not the sample rate of the context or the computed playback rate.

This is also confusing because we have two sampleRates: one is for the audio context and one is for the buffer. I'd really prefer to state the algorithm as if the buffer was already resampled to the context rate. The implementation can then do whatever optimizations it wants, including doing the resampling during playback (via linear interpolation). Or actually doing the resampling first with a high quality resampler.

I think of resampling as really orthogonal to specifying how loops and playbackRate works. ​


joeberkovitz commented 7 years ago

I'd really prefer to state the algorithm as if the buffer was already resampled to the context rate.

Yes, that makes sense to me -- my real point was that computed playback rate (i.e. modulating the rate via the playbackRate AudioParam) should have no influence on loop point location. Loop points are a stable way of referencing the buffer content.

joeberkovitz commented 7 years ago

As discussed on yesterday's call I'm going to take a crack at refining this further.

joeberkovitz commented 7 years ago

After quite a bit of thinking and much careful checking, I offer the following algorithm which makes heavy use of all the work done so far. I believe it works fine for all these requirements:

Probably the key new feature of this proposal is its treatment of loops, in which a looped buffer is considered to be an infinite sequence of half-open ranges [0..loopStart), [loopStart..loopEnd), [loopStart+loopLength..loopEnd+loopLength), .... This greatly clarifies the way that interpolation behaves in the neighborhood of loop boundaries, even when these do not lie at exact sample indices; the comments explain in detail.
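
The folding implied by that sequence of ranges can be sketched as follows (positions in frame units; note the full algorithm below folds at loopFirstFrame + loopLength rather than at loopEnd, to keep the wrap-boundary interpolation region intact):

// Fold a playhead position back into the first loop iteration,
// following the half-open ranges described above.
function foldIntoLoop(pos, loopStartPos, loopEndPos) {
  const loopLength = loopEndPos - loopStartPos;
  while (pos >= loopEndPos) {
    pos -= loopLength;
  }
  return pos;
}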

I've adopted the idiom of an AudioWorkletProcessor to describe the algorithm because this helped me understand it and check it. It is thus a hybrid of @padenot's original block-rendering-oriented approach and @rtoy's approach. However, we could reduce this to a narrative description if desired.

// Description of the algorithm employed by
// an AudioBufferSourceNode to render successive
// blocks of audio. The description is framed as if
// the node was implemented using an AudioWorkletProcessor.
//
// It is designed for clarity only, not efficiency.

class AudioBufferSourceProcessor extends AudioWorkletProcessor {
  static get parameterDescriptors() {
    return [{
      name: 'detune',
      defaultValue: 0
    }, {
      name: 'playbackRate',
      defaultValue: 1
    }];
  }

  // Initialize processing
  constructor(options) {
    super(options);
    this.readTime = 0;
    this.buffer = this.originalBuffer = options.buffer;
    this.loop = options.loop;
    this.loopStart = options.loopStart;
    this.loopEnd = options.loopEnd;
  }

  // To be called when start() is applied to the node
  start(when, offset, duration) {
    this.start = when;
    this.offset = 0;
    this.stop = Infinity;
    this.bufferTime = -1;

    if (arguments.length > 1) {
      this.offset = offset;
    }
    if (arguments.length > 2) {
      this.stop = when + duration;
    }
  }

  // To be called when stop() is applied to the node
  stop(when) {
    this.stop = when;
  }

  process(inputs, outputs, parameters) {
    // Set up working variables
    let currentTime = contextInfo.currentTime;  // current context time of next output frame
    let dt = 1 / contextInfo.sampleRate;  // context time per sample
    let index = 0;  // index of next frame to be output
    let computedPlaybackRate = parameters.playbackRate[0] * Math.pow(2, parameters.detune[0] / 1200);

    // Optionally resample buffer to reduce need for interpolation
    this.optimizeBuffer(computedPlaybackRate);

    // Render next block
    while (index < outputs[0][0].length) {
      if (currentTime < this.start || currentTime >= this.stop) {
        for (let channel = 0; channel < this.buffer.numberOfChannels; channel++) {
          outputs[0][channel][index] = 0;
        }
      }
      else {
        // set up bufferTime the first time we have a signal to put out
        if (this.bufferTime < 0) {
          this.bufferTime = this.offset + (currentTime - this.start) * computedPlaybackRate;
        }
        for (let channel = 0; channel < this.buffer.numberOfChannels; channel++) {
          outputs[0][channel][index] = this.bufferSample(this.bufferTime, channel);
        }
        this.bufferTime += dt * computedPlaybackRate;  // advance read time of buffer, independent of loop points
      }

      currentTime += dt;   // advance output time
      index += 1;  // advance output index
    }

    // Consider the node to be active if we have not reached stop time yet.
    return currentTime < this.stop;
  }

  // Returns a channel's signal value at the time offset "effectiveTime" in
  // the buffer. This function takes care of enforcing silence before and after
  // the buffer contents, and also folds the body of any loop.
  bufferSample(effectiveTime, channel) {
    // Convert time to a playhead position with fractional part
    let bufferPos = effectiveTime * this.buffer.sampleRate;

    // Now get an interpolated value.
    var leftFrame, rightFrame, interval;

    if (this.loop) {
      let loopStartPos = this.loopStart * this.buffer.sampleRate;  // playhead position for loop start
      let loopEndPos = this.loopEnd * this.buffer.sampleRate;  // playhead position for loop end
      let loopLength = loopEndPos - loopStartPos; // loop length in frame units

      // Determine the first and last exact sample frame positions that lie
      // within the body of the loop, which is the half-open range
      // [loopStartPos, loopEndPos) -- that is, loopEndPos is exclusive.
      let loopFirstFrame = Math.ceil(loopStartPos);
      let loopLastFrame = Math.ceil(loopEndPos - 1);   // N => N-1; (N + epsilon) => N

      // At any loop wrap point, the mapping between a playhead position in units of frames
      // and fractional sample indices looks like this (where N is a nonnegative loop iteration count)
      //
      //   playhead position                        sample index            interpolated?
      //
      //   loopLastFrame + N*loopLength             loopLastFrame           no
      //   loopEndPos + N*loopLength - epsilon      loopEndPos - epsilon    yes
      //   loopStartPos + (N+1)*loopLength          loopStartPos            yes
      //   loopFirstFrame + (N+1)*loopLength        loopFirstFrame          no
      //
      // In any region covered by this range of values, we are potentially interpolating
      // between a "left sequence" of frames ending in loopLastFrame, and a "right sequence"
      // of frames beginning with loopFirstFrame.

      // Fold the loop to bring the value of N to zero, by requiring that
      // the playhead position not exceed (loopFirstFrame + loopLength).
      while (bufferPos >= loopFirstFrame + loopLength) {
        bufferPos -= loopLength;
      }

      if (bufferPos >= loopLastFrame) {
        // If after folding the playhead is after the last exact frame in the loop,
        // then we'll be interpolating at the wrap boundary between a sequence of frames
        // ending in loopLastFrame, and a (wrapped) sequence of frames beginning with
        // loopFirstFrame.

        // The time interval between left and right frames may be less than 1
        // frame in this case, because of fractional loop points.

        leftFrame = loopLastFrame;
        rightFrame = loopFirstFrame;
        interval = (loopFirstFrame + loopLength) - loopLastFrame;
      }
      else {
        leftFrame = Math.floor(bufferPos);
        rightFrame = leftFrame + 1;
        interval = 1;
      }
    }
    else {
      leftFrame = Math.floor(bufferPos);  // get the exact frame index before the time of interest
      rightFrame = leftFrame + 1; // and the frame after that
      interval = 1; // interval in sample frames between left and right
    }

    return this.interpolateValue(bufferPos, channel, leftFrame, rightFrame, interval);
  }

  // Return an interpolated value for the playhead position bufferPos,
  // working from a sequence of exact frames at:
  //   leftFrame-M ... leftFrame, rightFrame ... rightFrame+N
  // that map onto playhead positions:
  //   leftFrame-M ... leftFrame, leftFrame + interval ... leftFrame + interval + N
  // with the constraint that positions outside the buffer content may not be included.
  interpolateValue(bufferPos, channel, leftFrame, rightFrame, interval) {
    // The UA may employ any desired algorithm.
    // It may also elect to use bufferPos as an exact index and not interpolate,
    // if the difference between bufferPos and leftFrame is sufficiently small.

    /*
      This sample implementation uses linear interpolation between two adjacent frames.

      let weight = (bufferPos - leftFrame) / interval;
      let leftValue = leftFrame >= 0 ? this.buffer.getChannelData(channel)[leftFrame] : 0;
      let rightValue = rightFrame < this.buffer.length ? this.buffer.getChannelData(channel)[rightFrame] : 0;
      return leftValue + weight * (rightValue - leftValue);
    */
  }

  optimizeBuffer(playbackRate) {
    // This function may resample the buffer contents, entirely or partially,
    // as often as desired for reasons of quality or computational efficiency.
    // The results of the resampling, if carried out, are available to the
    // bufferSample() method via this.buffer.

    /*
      Example implementation that ensures buffer is at optimal rate for the first-rendered block:

      let bestSampleRate = contextInfo.sampleRate / playbackRate;
      if (!this.resampled && this.buffer.sampleRate != bestSampleRate ) {
        this.buffer = cubicResample(this.originalBuffer, bestSampleRate);
        this.resampled = true;
      }
    */
  }
}

joeberkovitz commented 7 years ago

I thought it would be useful to abstract a narrative definition that's a bit more rigorous. Here it is.

Definitions

  1. Let start correspond to the when parameter to start() if supplied, else 0.

  2. Let offset correspond to the offset parameter to start() if supplied, else 0.

  3. Let stop correspond to the sum of the when and duration parameters to start() if supplied, or the when parameter to stop(), otherwise Infinity.

  4. Let buffer be the AudioBuffer employed by the node.

  5. Let loop, loopStart and loopEnd be the loop-related attributes of the node, with the loop body clamped to the range [0, buffer.length].

  6. A playhead position for buffer is any quantity representing an unquantized time offset in seconds, relative to the time coordinate of the first sample frame in the buffer. Playback rate and AudioContext sample rate are not relevant: this offset is expressed purely in terms of the buffer's audio content.

  7. Let framePosition(index) be a many-valued function yielding a set of one or more playhead positions for the sample frame within the buffer at index. These represent the idealized times at which the frame would be rendered, given a start time of 0, a playback rate of 1 and an infinite sample rate. The function is as follows:

    1. Let bufferTime be index / buffer.sampleRate.
    2. If loop is false, the result is bufferTime.
    3. If loop is true,
      1. If bufferTime < loopStart, the result is bufferTime.
      2. If bufferTime >= loopStart and bufferTime < loopEnd, the function has multiple results given by bufferTime + (count * (loopEnd - loopStart)), where count takes on non-negative integer values.
      3. If bufferTime >= loopEnd, the position is not defined.
  8. Let frameValue(index) correspond to the vector of actual signal values in buffer at the given index, one component per channel.

  9. Let the playback frame sequence for buffer be the set of all tuples [framePosition(index), index, frameValue(index)], ordered by increasing values of framePosition(index). This sequence describes a signal whose values are known only at discrete, monotonically increasing playhead positions that correspond exactly to specific samples in the buffer.

For an unlooped buffer, this sequence is finite (frame values are omitted here, since they are not relevant):

| framePosition(index) | index |
| --- | --- |
| 0 | 0 |
| 1 / buffer.sampleRate | 1 |
| 2 / buffer.sampleRate | 2 |
| ... | ... |
| (length - 1) / buffer.sampleRate | length - 1 |

For a looped buffer, this sequence is infinite. Let loopStartFrame be ceil(loopStart * buffer.sampleRate) (the first exact frame index within the loop body), and loopEndFrame be ceil(loopEnd * buffer.sampleRate - 1) (the last exact frame index within the loop body). The sequence is as follows:

| framePosition(index) | index |
| --- | --- |
| 0 | 0 |
| 1 / buffer.sampleRate | 1 |
| 2 / buffer.sampleRate | 2 |
| ... | ... |
| loopStartFrame / buffer.sampleRate | loopStartFrame |
| ... | ... |
| loopEndFrame / buffer.sampleRate | loopEndFrame |
| loopStartFrame / buffer.sampleRate + (loopEnd - loopStart) | loopStartFrame |
| ... | ... |
| loopEndFrame / buffer.sampleRate + (loopEnd - loopStart) | loopEndFrame |
| loopStartFrame / buffer.sampleRate + 2 * (loopEnd - loopStart) | loopStartFrame |
| ... | ... |

Interpolation and Buffer Optimization

  1. Let the function interpolateFrame(pos) yield a vector which estimates the channel values at the given playhead position of pos, which need not map onto an exact index in the buffer. The interpolation method MUST obey all of the following constraints:
    1. The method relies ONLY on the relationship between playhead positions and signal values supplied by the playback frame sequence. (Typically, only a limited range of tuples in the neighborhood of pos are considered.)
    2. For pos equal to framePosition(index), the result is exactly frameValue(index).
    3. For pos < 0, the result is a zero vector (silence).
    4. For pos >= buffer.length / buffer.sampleRate where loop is false, the result is a zero vector (silence).

Note that this definition ignores loop points, since these are already embodied in the definition of the playback frame sequence.

  2. Let the operation optimize the buffer be any operation that alters both the buffer contents and sample rate, while attempting to minimize changes to the value of interpolateFrame(pos). The nature of the operation is up to the UA. Examples of such operations might include upsampling, downsampling, applying a subsample offset, or loop unrolling.

Initialization

  1. Let bufferTime be the playhead position within buffer of the next output sample frame. Assign it the initial value -1 to indicate that the position has not yet been determined.

  2. Optimize the buffer prior to rendering, if desired.

Rendering a Block of Audio

  1. Let currentTime be the current time of the AudioContext.

  2. Let dt be 1 / (context sample rate).

  3. Let index be 0.

  4. Let computedPlaybackRate be playbackRate * pow(2, detune / 1200).

  5. Optimize the buffer during rendering, if desired.

  6. While index is less than the length of the audio block to be rendered:

    1. If currentTime < start or currentTime >= stop, emit silence for the output frame at index.
    2. Else,
      1. If bufferTime < 0, set bufferTime to offset + (currentTime - start) * computedPlaybackRate.
      2. Emit the result of interpolateFrame(bufferTime) as the output frame at index.
      3. Increase bufferTime by dt * computedPlaybackRate.
    3. Increase index by 1.
    4. Increase currentTime by dt.
  7. If currentTime >= stop, consider playback to have ended.

rtoy commented 7 years ago

I have not yet read through everything, but the code looks really nice and is quite clear. Now for a few comments.

I think this is wrong:

this.bufferTime = offset + (currentTime - this.start) * computedPlaybackRate;

If computedPlaybackRate were, say, a zillion, and currentTime > this.start, the bufferTime would be huge so the first output would be very far along the actual buffer. I think instead of computedPlaybackRate we want just dt.

Having bufferSample keep track of looping seems a bit complicated. I did like your original idea of the buffer index being a time value. Then you can say I want the value of the buffer at time t, and that would implicitly take into account the sample rate of the buffer and allow the implementation to do some appropriate resampling and interpolation as needed.

So, I guess what I'm saying is let the main loop maintain the bufferIndex time value, including looping and computedPlaybackRate, and let bufferSample(t) produce the appropriate buffer value at time t.

joeberkovitz commented 7 years ago

Not sure we want to define this in terms of AudioWorklet. We haven't done that anywhere else and we don't yet have any practical experience with a working AudioWorklet implementation.

I had some misgivings about that too and perhaps the narrative description would be better (did you read it yet?) Or we could just keep the JS description and move it away from AudioWorklet.

I wonder if optimizeBuffer(computedPlaybackRate) is really needed. That seems to be an implementation optimization that isn't really needed for the algorithm.

Agreed, and this is a point that is perhaps clearer in the narrative version. I do think it's important to state that optimization (resampling, etc) at the UA's discretion is allowed.

I think this is wrong: this.bufferTime = offset + (currentTime - this.start) * computedPlaybackRate; If computedPlaybackRate were, say, a zillion, and currentTime > this.start, the bufferTime would be huge so the first output would be very far along the actual buffer. I think instead of computedPlaybackRate we want just dt.

I still believe this expression is correct -- if computedPlaybackRate is a crazy-large value and currentTime > start, then bufferTime could well be past the end of the buffer. Why is that wrong? The bufferSample() function (or equivalent language in the narrative version) takes care of checking the buffer limits and enforcing silent output in this case.

Having bufferSample keep track of looping seems a bit complicated. I did like your original idea of the buffer index being an time value. Then you can say I want the value of the buffer at time t, and that would implicitly take into account the sample rate of the buffer and allow the implementation to do some appropriate resampling and interpolation as needed.

I agree - this is the approach adopted in the narrative version of the algorithm. I think it is cleaner.

Perhaps you could go over the narrative piece and see how you find it. I think it makes the approach clearer than the code and it avoids unnecessary prescriptions about how things work.

rtoy commented 7 years ago


Not sure we want to define this in terms of AudioWorklet. We haven't done that anywhere else and we don't yet have any practical experience with a working AudioWorklet implementation.

I had some misgivings about that too and perhaps the narrative description would be better (did you read it yet?) Or we could just keep the JS description and move it away from AudioWorklet.

I've skimmed the narrative, but not studied it in detail.

I wonder if optimizeBuffer(computedPlaybackRate) is really needed. That seems to be an implementation optimization that isn't really needed for the algorithm.

Agreed, and this is a point that is perhaps clearer in the narrative version. I do think it's important to state that optimization (resampling, etc) at the UA's discretion is allowed.

I think this is wrong: this.bufferTime = offset + (currentTime - this.start) * computedPlaybackRate; If computedPlaybackRate were, say, a zillion, and currentTime > this.start, the bufferTime would be huge so the first output would be very far along the actual buffer. I think instead of computedPlaybackRate we want just dt.

I still believe this expression is correct -- if computedPlaybackRate is a crazy-large value and currentTime > start, then bufferTime could well be past the end of the buffer. Why is that wrong? The bufferSample() function (or equivalent language in the narrative version) takes care of checking the buffer limits and enforcing silent output in this case.

Let's say bufferTime ends up being near the end of the buffer (using your computation). Why should the very first output from the buffer come from the end of the buffer? I would think the very first output should be very near the beginning. The second output sample would then be near the end, which makes sense to me.

Having bufferSample keep track of looping seems a bit complicated. I did like your original idea of the buffer index being an time value. Then you can say I want the value of the buffer at time t, and that would implicitly take into account the sample rate of the buffer and allow the implementation to do some appropriate resampling and interpolation as needed.

I agree - this is the approach adopted in the narrative version of the algorithm. I think it is cleaner.

Perhaps you could go over the narrative piece and see how you find it. I think it makes the approach clearer than the code and it avoids unnecessarily prescriptions about how things work.

I'm going to read over that very soon.


joeberkovitz commented 7 years ago

Let's say bufferTime ends up being near the end of the buffer (using your computation). Why should the very first output from the buffer come from the end of the buffer? I would think the very first output should be very near the beginning. The second output sample would then be near the end, which makes sense to me.

The answer has to do with sample-accurate start times as required by #915. If start occurs at an exact sample frame, then currentTime will equal start on the first-rendered frame of output. Hence (currentTime - start) * computedPlaybackRate is going to be zero, as you expect.

rtoy commented 7 years ago


Let's say bufferTime ends up being near the end of the buffer (using your computation). Why should the very first output from the buffer come from the end of the buffer? I would think the very first output should be very near the beginning. The second output sample would then be near the end, which makes sense to me.

The answer has to do with sample-accurate start times as required by #915 https://github.com/WebAudio/web-audio-api/issues/915. If start occurs at an exact sample frame, then currentTime will equal start on the first-rendered frame of output. Hence (currentTime - start) * computedPlaybackRate is going to be zero, as you expect.

Sorry, I still don't understand. If the start time isn't on a sample boundary, currentTime - start is non-zero as expected. Why would I want the first output sample to come from very near the beginning of the buffer? If computedPlaybackRate were huge, the offset will also be huge and probably not produce a sample near the beginning of the buffer.


rtoy commented 7 years ago

Some initial comments. I like this narrative, but I think the code is easier to follow.

Let start correspond to the when parameter to start() if supplied, else 0.

I think it's better to say "is equal to" instead of "correspond to", here and below.

Let stop correspond to the sum of the when and duration parameters to start() if supplied, or the when parameter to stop(), otherwise Infinity.

This needs to be defined a bit better, but isn't really key to the overall algorithm. It's not clear which takes precedence (sum or stop), if there is any.

Let loop, loopStart and loopEnd be the loop-related attributes of the node, with the loop body clamped to the range [0, buffer.length].

As mentioned about the code, I really like your idea of just forgetting about buffer indices and treating everything in terms of the buffer time. I think the buffer time (playback head) is well-defined and is independent of the sample rates.

1. 2.

Define the term playhead position for buffer as an unquantized time offset in seconds, relative to the time coordinate of the first sample frame in the buffer. Playback rate and AudioContext sample rate are not relevant: this offset is expressed purely in terms of the buffer's contents and its own sample rate. 3.

Let framePosition(index) be a many-valued function yielding a set of one or more playhead positions for the sample frame within the buffer at index. These represent the idealized times at which the frame would be rendered, given a start time of 0, a playback rate of 1 and an infinite context sample rate. The function is as follows:

  1. Let bufferTime be index / buffer.sampleRate.

    1. If loop is false, the result is bufferTime.
    2. If loop is true,
      1. If bufferTime < loopStart, the result is bufferTime.
      2. If bufferTime >= loopStart and bufferTime < loopEnd, the sample index maps onto multiple results given by bufferTime + (count * (loopEnd - loopStart)), where count is any non-negative loop iteration count.
      3. If bufferTime >= loopEnd, the position is not defined.

    Let frameValue(index) correspond to the vector of actual signal values in buffer at the given index, one component per channel.

  2. Let the playback frame sequence for buffer be the set of all tuples [ framePosition(index), index, frameValue(index)], ordered by increasing values of framePosition(index).

For an unlooped buffer, this sequence is finite (frame values are omitted here, since they are not relevant): framePosition(index) index 0 0 1 / buffer.sampleRate 1 2 / buffer.sampleRate 2 ... ... (length - 1) / buffer.sampleRate (length - 1)

For a looped buffer, this sequence is infinite. Let loopStartFrame be ceil(loopStart

  • buffer.sampleRate) (the first exact frame index within the loop body), and loopEndFrame be ceil(loopEnd buffer.sampleRate - 1) (the last exact frame index within the loop body). The sequence is as follows: framePosition(index) index 0 0 1 / buffer.sampleRate 1 2 / buffer.sampleRate 2 ... ... loopStartFrame/buffer.sampleRate loopStartFrame ... ... loopEndFrame/buffer.sampleRate loopEndFrame loopStartFrame/buffer.sampleRate + (loopEnd - loopStart) loopStartFrame ... ... loopEndFrame/buffer.sampleRate + (loopEnd - loopStart) loopEndFrame loopStartFrame/buffer.sampleRate + 2(loopEnd - loopStart) loopStartFrame ... ... Interpolation and Buffer Optimization

1. Let the function interpolateFrame(pos) yield a vector which estimates the channel values at the given playhead position pos, which need not map onto an exact frame position in the playback sequence. The interpolation method MUST obey all of the following constraints:

  1. It relies ONLY on the relationship between playhead positions and channel values supplied by the playback frame sequence.
  2. For pos equal to framePosition(index), the result is exactly frameValue(index).
  3. For pos < 0, the result is a zero vector (silence).
  4. For pos >= buffer.length / buffer.sampleRate where loop is false, the result is a zero vector (silence).

2. Let the operation optimize the buffer be any operation that alters both the buffer contents and sample rate, while attempting to minimize changes to the value of interpolateFrame(pos). The nature of the operation is up to the UA. Examples of such operations might include upsampling, downsampling, applying a subsample offset, or loop unrolling.
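To make the looped mapping concrete, here is a minimal JavaScript sketch of framePosition as drafted above, parameterized by an explicit loop iteration count so that the many-valued result becomes a plain function; the signature and the null convention are illustrative assumptions, not part of the draft.

```js
// Sketch only: position (in seconds) of the frame at `index` on loop
// iteration `count` (count = 0 is the first pass). Returns null where
// the draft says the position is not defined.
function framePosition(index, count, sampleRate, loop, loopStart, loopEnd) {
  const bufferTime = index / sampleRate;
  if (!loop) return count === 0 ? bufferTime : null;
  if (bufferTime < loopStart) return count === 0 ? bufferTime : null;
  if (bufferTime < loopEnd) {
    // Inside the loop body: one position per non-negative iteration count.
    return bufferTime + count * (loopEnd - loopStart);
  }
  return null; // bufferTime >= loopEnd: not defined
}
```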

Initialization

1. Let bufferTime be the playhead position within buffer of the next output sample frame. Assign it the initial value -1 to indicate that the position has not yet been determined.

2. Optimize the buffer prior to rendering, if desired.

Rendering a Block of Audio

1. Let currentTime be the current time of the AudioContext.

2. Let dt be 1 / (context sample rate).

3. Let index be 0.

4. Let computedPlaybackRate be playbackRate * pow(2, detune / 1200).

5. Optimize the buffer during rendering, if desired.

6. While index is less than the length of the audio block to be rendered:

  1. If currentTime < start or currentTime >= stop, emit silence for the output frame at index.
  2. Else,
    1. If bufferTime < 0, set bufferTime to offset + (currentTime - start) * computedPlaybackRate.
    2. Emit the result of interpolateFrame(bufferTime) as the output frame at index.
    3. Increase bufferTime by dt * computedPlaybackRate.
  3. Increase index by 1.
  4. Increase currentTime by dt.

7. If currentTime < stop, consider playback to have ended.


-- Ray

joeberkovitz commented 7 years ago

If the start time isn't on a sample boundary, currentTime - start is non-zero as expected. Why would I want the first output sample to come from very near the beginning of the buffer? If computedPlaybackRate were huge, the offset will also be huge and probably not produce a sample near the beginning of the buffer.​

Note that if the playback rate is 1, then this means the first output sample is interpolated, to estimate the buffer value at delta * computedPlaybackRate which of course in this case is delta. That's your sample-accurate start behavior right there. If start = 0.99 and currentTime = 1, we want to estimate the buffer value at a time offset of 0.01.

Now assume that the playback rate is greater than 1 (and maybe a huge number). Given that at the first sample, currentTime is some quantity delta from the current buffer, don't we have to respect that delta and multiply it by the playback rate? If start = 0.99 and currentTime = 1 and computedPlaybackRate = 2, we need to estimate the buffer value at an offset of 0.02.
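For concreteness, the arithmetic in both of these examples is just the initialization step of the algorithm; a tiny sketch (the function name is made up, while its parameter names come from the algorithm text):

```js
// Playhead position of the first rendered frame, per the initialization
// step: offset plus the elapsed time since start, scaled by the rate.
function initialBufferTime(offset, currentTime, start, computedPlaybackRate) {
  return offset + (currentTime - start) * computedPlaybackRate;
}

console.log(initialBufferTime(0, 1, 0.99, 1)); // ~0.01
console.log(initialBufferTime(0, 1, 0.99, 2)); // ~0.02
```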

rtoy commented 7 years ago

Sorry for being so dense. You are, of course, exactly right.

So, back to the original discussion. I like the code version quite a bit, but I'm not sure I want it described in terms of an AudioWorklet. The text version is also quite nice. I think either would work.

I do think the main algorithm should keep track of the buffer playback head (as a float time value) for looping, and let a bufferSample method (mostly unspecified except for what it produces) produce the correct sample based on the playback head value.

joeberkovitz commented 7 years ago

Relying on the code version is OK with me - but I think we do need to say how interpolation at loop points works, and that requires a bit of narrative footwork inside the body of something like bufferSample(), presumably using comments. If we don't address what "given points" we are interpolating between, this opens the door to many possible interpretations of how subsample-accurate loop points work.

I'll think about how we could do this.

In the meantime, what do other implementors in the group think of these contrasting approaches?

joeberkovitz commented 7 years ago

The more I think about this, the more I think that the narrative version does a better job of describing the trickiest part of the deal, which is the way that sub-sample start, stop and loop points behave. If we moved to a code version, we'd have to somehow put this narrative into comments in order to constrain the way that UA's implement certain operations like interpolation. And if we are using code to embody a specification, it feels inappropriate to rely on comments for a key part of that specification.

joeberkovitz commented 7 years ago

Here is my latest revision to this algorithm based on the feedback I've received. It is the same in substance but I think it's more condensed and easier to understand.

Definitions

  1. Let start correspond to the when parameter to start() if supplied, else 0.

  2. Let offset correspond to the offset parameter to start() if supplied, else 0.

  3. Let stop correspond to the sum of the when and duration parameters to start() if supplied, or the when parameter to stop(), otherwise Infinity.

  4. Let buffer be the AudioBuffer employed by the node.

  5. Let loop, loopStart and loopEnd be the loop-related attributes of the node, with the loop body clamped to the range [0, buffer.length].

  6. Let loopStartFrame be ceil(loopStart * buffer.sampleRate) (the first exact frame index within the loop body), and loopEndFrame be ceil(loopEnd * buffer.sampleRate - 1) (the last exact frame index within the loop body).

  7. A playhead position for buffer is any quantity representing an unquantized time offset in seconds, relative to the time coordinate of the first sample frame in the buffer. Playback rate and AudioContext sample rate are not relevant: this offset is expressed purely in terms of the buffer's audio content and sample rate.

  8. Let the function playbackSignal(position) be the playback signal function for buffer, which is a function that maps from a playhead position to a set of output signal values, one for each output channel. This function is only specified at a set of discrete playhead positions which correspond exactly to specific sample frames in the buffer. At all other positions, its value is determined by a UA-supplied algorithm that performs interpolation based on these well-defined values. (See the code sketch after the tables below.)

For an unlooped buffer, the specified values of this function are as follows (note that channels are ignored for purposes of clarity):

| position | signal value |
| --- | --- |
| 0 | channelData[0] |
| 1 / sampleRate | channelData[1] |
| 2 / sampleRate | channelData[2] |
| ... | ... |
| (length - 1) / sampleRate | channelData[length - 1] |

For a looped buffer, this sequence is infinite:

| position | signal value |
| --- | --- |
| 0 | channelData[0] |
| 1 / sampleRate | channelData[1] |
| 2 / sampleRate | channelData[2] |
| ... | ... |
| loopStartFrame/sampleRate | channelData[loopStartFrame] |
| ... | ... |
| loopEndFrame/sampleRate | channelData[loopEndFrame] |
| loopStartFrame/sampleRate + (loopEnd - loopStart) | channelData[loopStartFrame] |
| ... | ... |
| loopEndFrame/sampleRate + (loopEnd - loopStart) | channelData[loopEndFrame] |
| loopStartFrame/sampleRate + (2 * (loopEnd - loopStart)) | channelData[loopStartFrame] |
| ... | ... |
| loopEndFrame/sampleRate + (2 * (loopEnd - loopStart)) | channelData[loopEndFrame] |
| loopStartFrame/sampleRate + (3 * (loopEnd - loopStart)) | channelData[loopStartFrame] |
| ... | ... |
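As a cross-check of these tables, here is a small JavaScript sketch that enumerates the specified (position, signal value) pairs for a looped single-channel buffer (this is the sketch referenced from definition 8); it mirrors the definitions but is illustrative only.

```js
// Enumerate the specified points of playbackSignal() for a looped,
// single-channel buffer, mirroring the looped table above.
function* specifiedPoints(channelData, sampleRate, loopStart, loopEnd) {
  const loopStartFrame = Math.ceil(loopStart * sampleRate);
  const loopEndFrame = Math.ceil(loopEnd * sampleRate - 1);
  // Prefix before the loop body.
  for (let i = 0; i < loopStartFrame; i++) {
    yield [i / sampleRate, channelData[i]];
  }
  // Loop body, repeated forever with period (loopEnd - loopStart).
  for (let count = 0; ; count++) {
    for (let i = loopStartFrame; i <= loopEndFrame; i++) {
      yield [i / sampleRate + count * (loopEnd - loopStart), channelData[i]];
    }
  }
}
```

With the 1 Hz, 10-sample example discussed further down (loopStart = 3.25, loopEnd = 7.5), this yields positions 0, 1, 2, 3, then 4 through 7, then 8.25 through 11.25, then 12.5, and so on.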

Buffer Optimization

  1. Let the operation optimize the buffer be any operation that alters both the buffer contents and sample rate in a way that increases the efficiency or quality of rendering, while minimizing changes to the playbackSignal() function. The nature of this operation is up to the UA. Examples of such operations might include upsampling, downsampling, applying a subsample offset, or loop unrolling.

Initialization

  1. Let bufferTime be the playhead position within buffer of the next output sample frame. Assign it the initial value -1 to indicate that the position has not yet been determined.

  2. Optimize the buffer prior to rendering, if desired.

Rendering a Block of Audio

  1. Let currentTime be the current time of the AudioContext.

  2. Let dt be 1 / (context sample rate).

  3. Let index be 0.

  4. Let computedPlaybackRate be playbackRate * pow(2, detune / 1200).

  5. Optimize the buffer during rendering, if desired.

  6. While index is less than the length of the audio block to be rendered:

    1. If currentTime < start or currentTime >= stop, emit silence for the output frame at index.
    2. Else,
      1. If bufferTime < 0, set bufferTime to offset + (currentTime - start) * computedPlaybackRate.
      2. Emit the result of playbackSignal(bufferTime) as the output frame at index.
      3. Increase bufferTime by dt * computedPlaybackRate.
    3. Increase index by 1.
    4. Increase currentTime by dt.
  7. If currentTime < stop, consider playback to have ended.
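For readers who prefer the code version, here is a minimal JavaScript sketch of one rendered block under this algorithm; state bundles the node's scheduling fields, and playbackSignal stands in for the UA-supplied interpolator, so both are assumptions rather than API.

```js
// Render one block per the steps above. `playbackSignal(t)` is the
// UA-supplied interpolating function; this sketch treats it as given.
function renderBlock(state, blockLength, contextSampleRate, playbackSignal) {
  const out = new Float32Array(blockLength);
  const dt = 1 / contextSampleRate;
  const computedPlaybackRate =
      state.playbackRate * Math.pow(2, state.detune / 1200);
  for (let index = 0; index < blockLength; index++) {
    if (state.currentTime < state.start || state.currentTime >= state.stop) {
      out[index] = 0; // silence outside [start, stop)
    } else {
      if (state.bufferTime < 0) { // first audible frame: initialize
        state.bufferTime = state.offset +
            (state.currentTime - state.start) * computedPlaybackRate;
      }
      out[index] = playbackSignal(state.bufferTime);
      state.bufferTime += dt * computedPlaybackRate;
    }
    state.currentTime += dt;
  }
  return out;
}
```

Step 7's end-of-playback test is deliberately left out of the sketch; as rtoy notes below, its comparison should read currentTime >= stop.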

rtoy commented 7 years ago

On Thu, Jan 5, 2017 at 5:56 AM, Joe Berkovitz notifications@github.com wrote:

Here is my latest revision to this algorithm based on the feedback I've received. It is the same in substance but I think it's more condensed and easier to understand.

Definitions

1. Let start correspond to the when parameter to start() if supplied, else 0.

2. Let offset correspond to the offset parameter to start() if supplied, else 0.

3. Let stop correspond to the sum of the when and duration parameters to start() if supplied, or the when parameter to stop(), otherwise Infinity.

4. Let buffer be the AudioBuffer employed by the node.

5. Let loop, loopStart and loopEnd be the loop-related attributes of the node, with the loop body clamped to the range [0, buffer.length].

6. Let loopStartFrame be ceil(loopStart * buffer.sampleRate) (the first exact frame index within the loop body), and loopEndFrame be ceil(loopEnd * buffer.sampleRate - 1) (the last exact frame index within the loop body).

​I think loopEndFrame should be floor(loopEnd*buffer.sampleRate)​

7. A playhead position for buffer is any quantity representing an unquantized time offset in seconds, relative to the time coordinate of the first sample frame in the buffer. Playback rate and AudioContext sample rate are not relevant: this offset is expressed purely in terms of the buffer's audio content and sample rate.

8. Let the function playbackSignal(position) be the playback signal function for buffer, which is a function that maps from a playhead position to a set of output signal values, one for each output channel. This function is only specified at a set of discrete playhead positions which correspond exactly to specific sample frames in the buffer. At all other positions, its value is determined by a UA-supplied algorithm that performs interpolation based on these well-defined values.

For an unlooped buffer, the specified values of this function are as follows (note that channels are ignored for purposes of clarity):

| position | signal value |
| --- | --- |
| 0 | channelData[0] |
| 1 / sampleRate | channelData[1] |
| 2 / sampleRate | channelData[2] |
| ... | ... |
| (length - 1) / sampleRate | channelData[length - 1] |

​If this is going to be part of the spec, you need to say this is only true if playbackRate = 1.​ Also for sub-sample start, I'm not quite sure this is correct either.

For a looped buffer, this sequence is infinite:

| position | signal value |
| --- | --- |
| 0 | channelData[0] |
| 1 / sampleRate | channelData[1] |
| 2 / sampleRate | channelData[2] |
| ... | ... |
| loopStartFrame/sampleRate | channelData[loopStartFrame] |
| ... | ... |
| loopEndFrame/sampleRate | channelData[loopEndFrame] |
| loopStartFrame/sampleRate + (loopEnd - loopStart) | channelData[loopStartFrame] |
| ... | ... |
| loopEndFrame/sampleRate + (loopEnd - loopStart) | channelData[loopEndFrame] |
| loopStartFrame/sampleRate + (2 * (loopEnd - loopStart)) | channelData[loopStartFrame] |
| ... | ... |
| loopEndFrame/sampleRate + (2 * (loopEnd - loopStart)) | channelData[loopEndFrame] |
| loopStartFrame/sampleRate + (3 * (loopEnd - loopStart)) | channelData[loopStartFrame] |
| ... | ... |

​I'm not sure this is correct for sub-sample looping. I need to think about this a bit more.

Buffer Optimization

  1. Let the operation optimize the buffer be any operation that alters both the buffer contents and sample rate in a way that increases the efficiency or quality of rendering, while minimizing changes to the playbackSignal() function. The nature of this operation is up to the UA. Examples of such operations might include upsampling, downsampling, applying a subsample offset, or loop unrolling.

Initialization

1. Let bufferTime be the playhead position within buffer of the next output sample frame. Assign it the initial value -1 to indicate that the position has not yet been determined.

2. Optimize the buffer prior to rendering, if desired.

Rendering a Block of Audio

1. Let currentTime be the current time of the AudioContext.

2. Let dt be 1 / (context sample rate).

3. Let index be 0.

4. Let computedPlaybackRate be playbackRate * pow(2, detune / 1200).

5. Optimize the buffer during rendering, if desired.

I don't think we need to say anything about optimizing. I think this is all implied by the playbackSignal function. It can do anything it wants so long as it provides the correct value. We probably do want to give some constraints on what playbackSignal does when the buffer rate and context rate are the same and the start time is on a sample boundary. In that case playbackSignal should produce exactly the samples in the buffer.

​bufferTime not initialized.​

6. While index is less than the length of the audio block to be rendered:

  1. If currentTime < start or currentTime >= stop, emit silence for the output frame at index.
  2. Else,
    1. If bufferTime < 0, set bufferTime to offset + (currentTime - start) * computedPlaybackRate.
    2. Emit the result of playbackSignal(bufferTime) as the output frame at index.
    3. Increase bufferTime by dt * computedPlaybackRate.
  3. Increase index by 1.
  4. Increase currentTime by dt.

7. If currentTime < stop, consider playback to have ended.

​I think you meant currentTime >= stop.​

This also seems to be missing the looping points. Assuming playbackRate > 0, we can say something like

if loop == true && bufferTime > loopEnd then bufferTime = loopStart

I think by doing this, we get sub-sample looping. We don't really need loopStartFrame or loopEndFrame.

I still need to write an example of sub-sample accurate start for ABSN like I did for Oscillators. Probably need one for ConstantSource when there are automations; it doesn't matter otherwise for a constant source.

-- Ray

joeberkovitz commented 7 years ago

​I think loopEndFrame should be floor(loopEnd*buffer.sampleRate)​

Hmmm... I still don't think so. Recall that loopEnd is exclusive of the loop body.

Let's pretend that sampleRate is 1000 Hz, for clarity, and that we have a 10-sample buffer (of duration 0.010 seconds). Take loopStart as 0 (although this is not relevant).

Take loopEnd as 0.010. The loop clearly includes the entire buffer (indices 0 through 9). floor(loopEnd*buffer.sampleRate) will be floor(10) or 10, which can't be the right answer: there is no frame with index 10. ceil(loopEnd*sampleRate - 1) will be ceil(9) which is 9.

Or take loopEnd as 0.0095. The loop still includes buffer indices 0 through 9, although its duration is less than 0.010. floor(loopEnd*buffer.sampleRate)​ will be floor(9.5) or 9. ceil(loopEnd*sampleRate - 1) will be ceil(8.5) which is also 9.
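The two candidate formulas can be compared mechanically; a quick check of both loopEnd values (a sketch only; real spec text would also need to worry about the floating-point representation of values like 0.010):

```js
// Compare floor(loopEnd * sampleRate) with ceil(loopEnd * sampleRate - 1)
// for the two loopEnd values in the example above (sampleRate = 1000 Hz).
const sampleRate = 1000;
for (const loopEnd of [0.010, 0.0095]) {
  console.log(Math.floor(loopEnd * sampleRate),     // 10, then 9
              Math.ceil(loopEnd * sampleRate - 1)); // 9, then 9
}
```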

​If this is going to be part of the spec, you need to say this is only true if playbackRate = 1.​ Also for sub-sample start, I'm not quite sure this is correct either.

I have already defined the concept of "playback position" as independent of playback rate. The accounting for computed playback rate occurs at the end of this mini-spec, and happens prior to interpolating this function.

​I don't think we need to say anything about optimizing. I think this is all implied by the playbackSignal function. It can do anything it wants so long as it provides the correct value.

The optimization does have an effect on where frames fall within the buffer, so I think it's important to say that it can happen at various prescribed points, because this can result in interpolation results differing during the course of playback.

We probably do want to give some constraints on what playbackSignal does when the buffer rate and context rate are the same and the the start time is on a sample boundary. In that case playbackSignal should produce exactly the samples in the buffer.

I agree that we need to say this, although I think it already follows from the definitions given.

If currentTime < stop, consider playback to have ended.

​I think you meant currentTime >= stop.​

Yes I did! Thanks.

This also seems to be missing the looping points. Assuming playbackRate > 0, we can say something like if loop == true && bufferTime > loopEnd then bufferTime = loopStart

I think by doing this, we get sub-sample looping. We don't really need loopStartFrame or loopEndFrame.

I disagree. This approach of "wrapping around" the buffer time that you suggested caused all kinds of spec problems in the past. It requires a lot of gymnastics for playback rates that can go to zero or negative, and recall that the first iteration of the loop is preceded by the "prefix" between bufferTime == 0 and loopStart, which makes running time backwards non-trivial if you just keep wrapping bufferTime around. The current approach takes care of all that, because bufferTime always increases or decreases continuously as per the computed playback rate, without jumps.

Also, we still very much need the definition of loop start/end frames in order to prescribe what data points are being interpolated by playbackSignal(). Without these definitions, it becomes very difficult to say what the buffer's contents actually mean to the interpolation algorithm.

At least, that's my current 2-cent opinion :-)

rtoy commented 7 years ago

Appreciate all of your comments and especially your effort on this.

I think I've confused myself many times, so can we start over with a simple example so we can agree on what should happen.

Let's assume a context and source with a sample rate of 1 Hz to keep things simple. Let the source have a buffer that has 10 samples, with values 0, 1, 2,...,9.

Let loopStart = 3.25, loopEnd = 7.5 (arbitrarily chosen).

​For this case, source.start(0) and currentTime = 0. Let out[t] be the output value at time t.

Then out[0]=0, out[1]=1,out[2]=2, and so on up to out[7] = 7. Since loopEnd is 7.5, out[8] can't be 8 because that would be past the end of the loop. To get the output for time 8, we need to go back to loopStart.

The question here is what the value of out[8] should be. Should out[8] = buffer[loopStart]? Since loopStart is not on a sampling boundary, do we interpolate (somehow) and say that out[8] = buffer[3.5] = 3.5? If so, the output would then be buffer[3.5], buffer[4.5], buffer[5.5], buffer[6.5], buffer[7.5] and loop back to 3.5 again. But do we actually output buffer[7.5], since loopEnd = 7.5?

Or should out[8] = buffer[4] = 4? Then we continue with buffer[5], buffer[6], buffer[7] and then go back to loopStart since 8 > loopEnd.

I think if we can answer these questions we'll have an appropriate algorithm that we should be able to extend easily to arbitrary source start time and playback rate. ​


-- Ray

joeberkovitz commented 7 years ago

Let's assume a context and source with a sample rate of 1 Hz to keep things simple. Let the source have a buffer that has 10 samples, with values 0, 1, 2,...,9.

Let loopStart = 3.25, loopEnd = 7.5 (arbitrarily chosen).

I so love examples!!! Please read on...

​For this case, source.start(0) and currentTime = 0. Let out[t] be the output value at time t.

Then out[0]=0, out[1]=1,out[2]=2, and so on up to out[7] = 7. Since loopEnd is 7.5, out[8] can't be 8 because that would be past the end of the loop. To get the output for time 8, we need to go back to loopStart.

The question here is what the value of out[8] should be. Should out[8] = buffer[loopStart]? Since loopStart is not on a sampling boundary, do we interpolate (somehow) and say that out[8] = buffer[3.5] = 3.5? If so, the output would then be buffer[3.5], buffer[4.5], buffer[5.5], buffer[6.5], buffer[7.5] and loop back to 3.5 again. But do we actually output buffer[7.5], since loopEnd = 7.5?

Or should out[8] = buffer[4] = 4? Then we continue with buffer[5], buffer[6], buffer[7] and then go back to loopStart since 8 > loopEnd.

I think if we can answer these questions we'll have an appropriate algorithm that we should be able to extend easily to arbitrary source start time and playback rate.

This is a great question and I believe that the above spec language actually does answer it. In fact this is the exact variety of question that led me to the proposed approach. Let me walk through the answer in detail.

First, let me use this example to fill out the values of the table I speced, that maps from playbackPosition to signalValue -- the second table, which handles the case of loops. I will leave out all of the sampleRate divisors since sampleRate is 1 in this world. Note that in this case the signal value is the frame index in our imaginary world (i.e. the signal value at channelData[N] is N).

Let loopStartFrame = ceil(loopStart) = ceil(3.25) = 4
Let loopEndFrame = ceil(loopEnd - 1) = ceil(7.5 - 1) = 7
Note: loopEnd - loopStart = 7.5 - 3.25 = 4.25

| position expression | actual position | signal value |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| loopStartFrame | 4 | 4 |
| ... | ... | ... |
| loopEndFrame | 7 | 7 |
| loopStartFrame + (loopEnd - loopStart) | 8.25 | 4 |
| ... | ... | ... |
| loopEndFrame + (loopEnd - loopStart) | 11.25 | 7 |
| loopStartFrame + (2 * (loopEnd - loopStart)) | 12.5 | 4 |
| ... | ... | ... |

The position value of 8.25 might seem surprising to you, but here's the rationale (which is encoded in the formula loopStartFrame + loopEnd - loopStart): the end of the loop is at pos = 7.5 (0.5 after the last loop sample at pos = 7), and it wraps back to the start of the loop at 3.25 (0.75 before the first loop sample at pos = 4). So after [7] we have a time interval of 0.5 to the loop wraparound point, and then another time interval of 0.75 to [4], the first sample in the loop body. That takes us to pos = 8.25.

So... what is the value of out[8]? (or, playbackSignal(8) in spec language)?

Let's assume that the UA is using linear interpolation. We have two adjacent data points in the neighborhood of 8 that give the signal value for exact sample frame positions: out[7] = 7, and out[8.25] = 4. The answer is therefore 7 + ((4 - 7) * (8 - 7) / (8.25 - 7)), which comes to 4.6. Makes sense: 8 is almost (but not quite) at the position which would yield exactly 4.

Doing the exact same interpolation at the positions of 7 and 8.25 gives the expected signal values of 7 and 4 respectively.
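To make the arithmetic reproducible, here is a small JavaScript sketch of linear interpolation between the two bracketing specified points; the numbers come from the table above, and the helper name is made up:

```js
// Linear interpolation between specified points (p0, v0) and (p1, v1).
function lerpAt(pos, p0, v0, p1, v1) {
  return v0 + (v1 - v0) * (pos - p0) / (p1 - p0);
}

// Bracketing points around pos = 8 are [7, 7] and [8.25, 4].
console.log(lerpAt(8, 7, 7, 8.25, 4));    // 4.6
console.log(lerpAt(7, 7, 7, 8.25, 4));    // 7
console.log(lerpAt(8.25, 7, 7, 8.25, 4)); // 4
```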

joeberkovitz commented 7 years ago

@rtoy By the way I noticed later that you used two different values in your example: it begins with loopStart = 3.25, but then later you use the value loopStart = 3.5.

Let's look at this other case of loopStart = 3.5, loopEnd = 7.5 because it's also instructive (and maybe it's what you meant all along). In this case, the loop spans an exact number of samples and so it yields a simpler result.

All of the formulas I gave before apply, but the results are different. The table looks like this:

Let loopStartFrame = ceil(loopStart) = ceil(3.5) = 4
Let loopEndFrame = ceil(loopEnd - 1) = ceil(7.5 - 1) = 7
Note: loopEnd - loopStart = 7.5 - 3.5 = 4

| position expression | actual position | signal value |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| loopStartFrame | 4 | 4 |
| ... | ... | ... |
| loopEndFrame | 7 | 7 |
| loopStartFrame + (loopEnd - loopStart) | 8 | 4 |
| ... | ... | ... |
| loopEndFrame + (loopEnd - loopStart) | 11 | 7 |
| loopStartFrame + (2 * (loopEnd - loopStart)) | 12 | 4 |
| ... | ... | ... |

So the value of out[8] is simply 4. No interpolation required.

In fact, this result is what you get whenever 3 < loopStart <= 4 and loopEnd = loopStart + 4. Which makes sense: when the loop encompasses an exact number of sample frames, it doesn't matter exactly where it starts and ends so long as the same sample frames are included.
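The exact-loop-length case can be checked the same way; with these values, position 8 lands exactly on a specified point, so no interpolation is needed (a sketch under the same 1 Hz assumptions):

```js
// loopStart = 3.5, loopEnd = 7.5, sampleRate = 1 Hz, buffer values 0..9.
const sampleRate = 1, loopStart = 3.5, loopEnd = 7.5;
const loopStartFrame = Math.ceil(loopStart * sampleRate); // 4
const loopEndFrame = Math.ceil(loopEnd * sampleRate - 1); // 7
// First specified position after the initial pass through the loop body:
const pos = loopStartFrame / sampleRate + (loopEnd - loopStart);
console.log(pos); // 8 -> playbackSignal(8) = channelData[4] = 4, exactly
```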