Handling of real-time, low latency signals without discontinuities

zmerp commented 1 year ago

My project currently uses ad-hoc code to gracefully handle playback of an inconsistent stream of frames. It can handle buffer overflows, underflows and discontinuities (caused by packet drops) https://github.com/alvr-org/ALVR/blob/master/alvr/audio/src/lib.rs Now I would like to add support for mixing different audio sources. Oddio seems suited for this task, and at the same time I would use it to refactor and simplify the existing code. The problem is that the Stream signal API seems lacking for my use-case. It seems it cannot gracefully handle interruptions caused by buffer underflows, instead it would stop abruptly causing a “pop”, and there is no way of resuming the stream or detecting when the stream buffer has emptied.

The most sensible solution for me (idea n.1) is to make Stream never return true for is_finished(). Integrate support for a ramp down when the buffer has fewer than N frames, then resume with a rump up when the buffer has been filled enough (halfway?). Optionally support interrupting a ramp down and resume it with a ramp up if frames become available soon enough. Other types of discontinuities such as buffer overflow can be detected by the current API and can be handled with the help of a Fader.

Another option (idea n.2) is to add support for polling when a Stream is going to run out of frames and let the user do a cross-fade with a 0 signal. But actually it would be better if this is handled with a callback. This might be more complex to implement and might not fit right with the current API.

Which option is better? I would be available to make a PR.

On another note, I think the method of resampling used inside Stream might distort the signal, especially high frequencies.

Ralith commented 1 year ago

Interesting problem, and one I expect to face eventually as well. My instinct is that this would be best solved by a separate, specialized Signal which automatically ramps volume to 0 at the end of the buffer if the buffer size drops below a critical threshold, and ramps back up if the buffer is refilled above the threshold. That should be simple, robust, and reusable enough that I'd be happy to maintain it as part of oddio proper. Perhaps this could be extended to handle discontinuities by allowing for explicit insertion of gaps.

Handling overflow is an interesting problem. A few approaches come to mind:

Vary playback rate continuously (within bounds, say ±10%) to try to maintain a target buffer size.
Skip a chunk of the buffer by crossfading when it fills past a threshold
Leave it to the higher level to create a new stream and crossfade to it (as you described)

Option 1 would give the best behavior when your sender is running at a slightly different sample rate, which is always true in principle, but I'm not sure the size of difference you'll tend to see in practice is large enough to justify the complexity. No other solution gracefully handles a sender that's producing samples slightly slower than expected. Resampling does impact quality some, but not much; oddio's 3D spatialization uses linear resampling pervasively to implement propagation delay and it hasn't caused any issues. I've considered offering pluggable resamplers of various quality vs. cost tradeoffs, but there hasn't been a practical need.

Options 2 and 3 could look identical to the user if 3 was implemented as an opaque newtype around a Fade<Stream> or similar. Doing it in-place inside a single buffer would be cute but probably not worth the effort.

We should not involve non-realtime threads in the loop for underruns as in your option 2, as there's no guarantee they'll get around to taking action before the underrun occurs.

zmerp commented 1 year ago

For buffer underruns there is also the option of repeating a chunk of frames (with a crossfade). It might be better than using a fade to 0 for some applications. Even better a dynamic approach could be used, repeat the last chunk for a small number of times and then do a fade to 0 in case the frame source is still stuck, to make it less annoying. There are many knobs to adjust and the more modular approach the better. There is the possibility of making this completely modular without modifying or creating any primitive. Each time a new batch of frames is available do in order: keep the old Stream, write as many samples minus the samples needed for a fade, push a new Cycle that contains a crossfade baked in (which is repeated N times), then use a Fader to fade to 0. For the next batch of frames the queue is discarded and a new one is created a few frames further. A better solution could be implemented if Stream was able to return the available number of frames, or if Fader was aware of the remaining time of the previous Signal so it can fade in response to the signal ending; this will also fix sampling errors at signal boundaries. This also requires your Cycle::with_crossfade() PR.

For buffer overflows I think the modular approach (with a Fader) is better. Your option (1) with changing the speed is the most noticeable as it would change the pitch, especially when music is playing, and so for me the least favorite. Also to note, most often, any sample production rate discrepancies are dwarfed by the rate of disruptions of late/dropped packets.

What do you think about discarding the queue of signals and recreating it every time there is a new batch (roughly every 10ms)?

At the end I can contribute an example where everything is put together.

Ralith commented 1 year ago

I'm leery of schemes that rely on juggling high-level Signals because, by design, those operations are not real-time. You can mitigate this (and perhaps this is your intention) by arranging for every possible future state in advance, but that involves extra work (computing crossfades, etc) that will in the typical case go unused, especially if built on top of independent primitives that don't share storage. I think a dedicated Signal impl addressing this specific case will be lighter and easier to reason about, and quite possibly simpler to implement.

Your option (1) with changing the speed is the most noticeable as it would change the pitch

The intention there is to address cases where e.g. audio is being streamed at 48001Hz, but played back at 47999Hz. This is practically guaranteed in real-time streaming, since no two audio devices' clocks will be perfectly synchronized. Resampling to mitigate this will not noticeably impact pitch. A different mechanism (fade-outs or repeating) is needed to handle lost data, but this should invisibly prevent all genuine real-time overruns. However, it's possible that desync in practice is so small that we can ignore the risk of overruns entirely. Research is needed to confirm, but my pessimistic assumption is that clocks will vary enough for this to matter.

Also to note, most often, any sample production rate discrepancies are dwarfed by the rate of disruptions of late/dropped packets.

Those manifest as underruns, which can be handled more simply, as discussed.

What do you think about discarding the queue of signals and recreating it every time there is a new batch (roughly every 10ms)?

We should be able to do better.

zmerp commented 1 year ago

I'm leery of schemes that rely on juggling high-level Signals because, by design, those operations are not real-time. You can mitigate this (and perhaps this is your intention) by arranging for every possible future state in advance, but that involves extra work (computing crossfades, etc) that will in the typical case go unused, especially if built on top of independent primitives that don't share storage. I think a dedicated Signal impl addressing this specific case will be lighter and easier to reason about, and quite possibly simpler to implement.

Yes, that was my intention. All the operations are deferred, it can work without acting in real-time. Only the underrun "coda" (cycle + fade out) is pre-queued, overrun crossfades can be applied without issue in the signal record thread at any time.

Also to note, most often, any sample production rate discrepancies are dwarfed by the rate of disruptions of late/dropped packets.

Those manifest as underruns, which can be handled more simply, as discussed.

Maybe I should have been more specific, I was mostly talking about late packets. in ALVR (my project), late packets are as likely as dropped packets; late packets usually manifest as underrun + overrun shortly after. The overrun is handled by crossfading with a more recent segment of the buffer and the middle frames are discarded.

So, I could create a new Signal that renders a cycle + fade out in case of underrun, but given its complexity and that the cycle idea was met with hesitation in the ALVR Discord server, I will make multiple tests before finalizing the PR.

Ralith commented 1 year ago

Only the underrun "coda" (cycle + fade out) is pre-queued

Right; I'm not convinced that constantly re-queueing underrun signals that we rarely actually reach is the way to go.

late packets usually manifest as underrun + overrun shortly after

I see what you mean. In the specific case of real-time streaming, I think it's an implementation error for this to overrun. The logical playback cursor should advance continuously whether or not there's data, because we know in advance that the sender is doing the same, regardless of communication issues. If a block of audio arrives late, we therefore must not play it from the start, since that would correspond to moving the playback cursor backwards in time. Instead, it might be played from halfway through, or even skipped entirely if it's old enough. This ensures that a later timely block won't overrun the buffer because we didn't fill up the buffer with obsolete data to begin with, and only requires fade-out/fade-in across missing data.

zmerp commented 1 year ago

That's a good point. Currently my application does not have any timestamp for audio data, there are just packet indices that are hidden inside the socket wrapper implementation; we can only get a notification of a packet drop (because it was truly dropped by the network or if a packet with a later index arrived first). I think the underrun + overrun happens when multiple packets get queued up in the network. So to implement your suggestion, we should send timestamps with each packet, make a history-averaged prediction of the time origin of the stream, and discard packets with timestamp outliers. Probably choosing a time origin at the beginning of the stream might be problematic in case latency fluctuates during streaming.

Ralith commented 1 year ago

Something like that, yeah. A good timestamp format is a u64 sample number. Maybe that's what the API should work in terms of.

Probably choosing a time origin at the beginning of the stream might be problematic in case latency fluctuates during streaming.

Yeah, ideally a change in latency should be handled gracefully. Perhaps by tracking the buffer size (normalized by sample rate) and crossfading or slightly adjusting playback rate if it passes a certain threshold above or below the target size, except at startup when playback should just pause until the desired size is reached.

Ralith commented 1 year ago

An easy way to adjust playback rate would be to skip or double every 1 in, say, 20 samples. Empirically this sounds okay, especially since we're only doing it in the very rare case where latency changes significantly.

zmerp commented 1 year ago

Thank you for all the feedback and suggestions. If you already did some tests, when you remove 1 sample every 20 does it sound generally higher pitch or does it create some high pitch noise, keeping roughly the same pitch for the rest of the frequency spectrum?

Ralith commented 1 year ago

It sounds slightly sped up, i.e. higher pitch, while doubling makes it lower. At 1 in 100 the difference is inaudible, but it's slightly noticeable still at 1 in 50. There's no clicking/popping.

Assuming we want this to be impossible to notice, at 1% speedup/slowdown by this mechanism a 10ms change in latency is corrected for within 1 second. That seems pretty good to me, but I'm not strongly opposed to crossfades either, which have the advantage of resolving in constant time at the cost of outright skipping some (probably inconsequential) amount of data. I feel like skipping/doubling samples is probably a little cheaper and easier to implement, but it's a slow path anyway.

zmerp commented 1 year ago

I might add an option for skipping/doubling samples to match the target average buffer usage, but crossfades cannot be removed entirely, since network disruptions happen in bursts and should be resolved rather quickly (<1s) for a good user experience.

Anyway, I'm putting this on hold since I'm working on something else at the moment (OpenXR) but I'll return to this right after that.

Ralith commented 1 year ago

I guess I'm thinking of fadeout/in as not quite the same thing as a crossfade, but I suppose the logic could be mostly shared. :+1:

Ralith / oddio

Handling of real-time, low latency signals without discontinuities #97