RustAudio / dsp-chain

A library for chaining together multiple audio dsp processors/generators, written in Rust!

Lock-free processing #141

Open · raphlinus opened this issue 7 years ago

raphlinus commented 7 years ago

For real, glitch-free, low-latency audio performance, the audio processing path needs to be lock-free. That, in turn, implies no allocations or deallocations. The music-synthesizer-for-android project (which was a testbed for high performance audio on Android) dealt with this by using fixed-size ring buffers and preallocating the state for all DSP. That meets the lock-free requirement, but is limiting.

Thinking about this more and starting to prototype, I believe the answer is lock-free queues such as Treiber stacks (though a number of other lock-free queue data structures would also work). That way, the main thread allocates nodes at will and sends them over a lock-free queue to the realtime audio thread, which from that point basically has &mut access to the nodes. For garbage collection, the realtime thread can send them back over a return channel. Control and status can be multiplexed over the same channels (which has the additional benefit of ordering guarantees between these and graph mutation).
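A minimal sketch of that flow, assuming crossbeam's bounded ArrayQueue stands in for a Treiber stack; the Node type, queue names, and capacities here are purely illustrative, not anything from dsp-chain:

```rust
use std::sync::Arc;

use crossbeam::queue::ArrayQueue;

// Hypothetical node type; its buffer is allocated on the main thread only.
struct Node {
    gain: f32,
    buffer: Vec<f32>,
}

fn main() {
    // Bounded lock-free queues, allocated up front: push/pop never allocate,
    // so the audio thread stays lock-free. In a real setup, clones of these
    // Arcs would be moved into the audio thread.
    let to_audio: Arc<ArrayQueue<Box<Node>>> = Arc::new(ArrayQueue::new(64));
    let to_main: Arc<ArrayQueue<Box<Node>>> = Arc::new(ArrayQueue::new(64));

    // Main thread: allocate a node and hand ownership to the audio thread.
    let node = Box::new(Node { gain: 0.5, buffer: vec![0.0; 256] });
    let _ = to_audio.push(node);

    // Audio thread (shown inline for brevity): it now effectively has &mut
    // access to the node, processes it, then returns it so that the *main*
    // thread performs the deallocation.
    if let Some(mut node) = to_audio.pop() {
        for sample in node.buffer.iter_mut() {
            *sample *= node.gain;
        }
        let _ = to_main.push(node);
    }

    // Main thread: dropping the returned node frees it off the audio path.
    drop(to_main.pop());
}
```

The important property is that all allocation and deallocation happens on the main thread; the audio thread only moves ownership through preallocated queue slots.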

This is something of an exploratory issue. Is this a good repo to develop such ideas? The architectural changes are likely to be far-reaching. It may be less work to just develop a new graph runner and modules from scratch than to retrofit the RustAudio stack. But I'd like to at least sound out working with existing projects.

mitchmindtree commented 7 years ago

Thanks a lot for opening this @raphlinus, performing allocations/deallocations on a separate thread is something I've had in mind for a while now but have not yet had the chance to explore. If you'd find it beneficial to experiment and improve upon dsp-chain I'd be more than happy to discuss and review changes.

> The architectural changes are likely to be far-reaching.

I think the kind of changes you are discussing will be beneficial enough to warrant how far reaching they will be! I'd be happy to see them implemented here if you are still interested. Whatever you decide, I'd love to be involved.

> I believe the answer is lock-free queues such as Treiber stacks (though a number of other lock-free queue data structures would also work)

I wonder if the std::sync::mpsc::channel could be a suitable, simple solution for this? I use multiple of these for communicating between the main, generator and real-time audio threads in a generative music project of mine and they've never come close to showing up as any sort of bottleneck in the profiling I've performed. That said, I have not gone out of my way to benchmark these against other options - there are probably others much better suited to optimal performance!

Rough Design Ideas

I can picture the graph constructor API looking something like this:

let (api, audio) = dsp::Graph::new();

where the api side would be kept on the main thread and the audio side would be moved to the audio thread, where something along the lines of audio.fill_slice(slice, sample_hz) would be called at the rate of the OS-provided audio callback.

The user could call methods on the api end which would send Messages to the audio end, where Message might be an enum along the lines of:

enum Message<N, F> {
    AddNode(NodeIndex, N),
    AddConnection {
        index: EdgeIndex,
        buffer: Vec<F>,
        source: NodeIndex,
        destination: NodeIndex,
    },
    RemoveNode(NodeIndex),
    // etc
}

where N is the user's node type and F is the audio frame type stored within the buffer. The audio end could receive these messages at the beginning of each audio callback.
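A rough, illustrative sketch of how that split might look, assuming the std::sync::mpsc channel mentioned above (GraphApi, GraphAudio, fill_slice, and the node storage are placeholder names rather than existing dsp-chain API):

```rust
use std::marker::PhantomData;
use std::sync::mpsc::{channel, Receiver, Sender};

type NodeIndex = usize;

enum Message<N> {
    AddNode(NodeIndex, N),
    RemoveNode(NodeIndex),
    // AddConnection etc. as in the enum above
}

pub struct GraphApi<N> {
    tx: Sender<Message<N>>,
}

pub struct GraphAudio<N, F> {
    rx: Receiver<Message<N>>,
    nodes: Vec<Option<N>>,
    _frame: PhantomData<F>,
}

impl<N> GraphApi<N> {
    pub fn add_node(&self, index: NodeIndex, node: N) {
        // Allocation happens here, on the main thread.
        let _ = self.tx.send(Message::AddNode(index, node));
    }
}

impl<N, F> GraphAudio<N, F> {
    /// Called from the OS-provided audio callback.
    pub fn fill_slice(&mut self, _slice: &mut [F], _sample_hz: f64) {
        // Apply any pending graph edits before rendering this buffer.
        while let Ok(msg) = self.rx.try_recv() {
            match msg {
                Message::AddNode(i, node) => {
                    if i >= self.nodes.len() {
                        self.nodes.resize_with(i + 1, || None);
                    }
                    self.nodes[i] = Some(node);
                }
                Message::RemoveNode(i) => {
                    // The real design would send the node back to the api end
                    // for deallocation instead of dropping it here.
                    self.nodes[i] = None;
                }
            }
        }
        // ... walk the graph and render each node's output into `_slice` ...
    }
}

pub fn new_graph<N, F>() -> (GraphApi<N>, GraphAudio<N, F>) {
    let (tx, rx) = channel();
    (GraphApi { tx }, GraphAudio { rx, nodes: Vec::new(), _frame: PhantomData })
}
```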

When RemoveNode messages are received by the audio end, it could send the node, along with any buffers that are no longer necessary, back to the api thread, which could call something like api.collect_audio_garbage(); toward the end of an update. Alternatively, we could spawn a thread upon the Graph's construction which waits on a std::sync::mpsc::Receiver and solely does garbage collection. Perhaps this thread could do the allocations requested by the api end too?

Questions that come to mind

Thanks again for the issue, I'll keep thinking on this.

raphlinus commented 7 years ago

Thanks for the response. I've started prototyping on my end, but I'm still not sure whether it's better to push that toward a release or to work here.

The problem with std::sync::mpsc is that it performs allocations. For realtime audio, you don't care about throughput benchmarks; the only thing that matters is the worst case. Allocators tend to have locks internally, so if a lower-priority thread is holding an allocation lock, it can block the realtime thread. I think it's worth exploring whether it's possible to implement this with no allocation and no locking whatsoever, and my prototyping is encouraging.

Other than that, I think what you're saying is reasonable, especially having garbage collection done by the api object. There are some details I would do differently; for example, I would atomically update a predecessor list rather than adding and removing edges one by one. I also have buffers on nodes rather than edges (and Sum is just another node, rather than having that be magical behavior in the graph runner). Each node has a fixed number (which can be >1) of output buffers, but a variable number of input edges, reconfigurable dynamically.
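To make that concrete, here is a hypothetical sketch (not the actual prototype code) of a node that owns its output buffers, carries its whole input-edge list as one replaceable value, and treats summing as an ordinary module:

```rust
// Each node owns a fixed set of output buffers; its input list is a single
// value that can be swapped as a whole (e.g. via one graph-mutation message)
// rather than edited edge by edge.
struct NodeState {
    outputs: Vec<Box<[f32]>>,
    // (source node index, which of its output buffers) per input edge.
    inputs: Vec<(usize, usize)>,
    module: Box<dyn Module>,
}

trait Module {
    fn process(&mut self, inputs: &[&[f32]], outputs: &mut [Box<[f32]>]);
}

// Summing is just another module, not special-cased in the graph runner.
struct Sum;

impl Module for Sum {
    fn process(&mut self, inputs: &[&[f32]], outputs: &mut [Box<[f32]>]) {
        let out = &mut outputs[0];
        for sample in out.iter_mut() {
            *sample = 0.0;
        }
        for input in inputs {
            for (o, i) in out.iter_mut().zip(input.iter()) {
                *o += *i;
            }
        }
    }
}
```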

I haven't gone too far into the api yet, but yes, cloning is important. I'm kinda thinking along the lines of a "factory" which basically replicates a source graph into the audio graph; the source graph can be in a more friendly format. The factory would also take care of maintaining invariants, such as building the graph in topological sort order so that no node has an edge from a nonexistent node.

Your UpdateNode idea is powerful, but Box<FnOnce(...)> would cause a deallocation of the box on invoking the function. Changing it to FnMut would solve that particular problem. The other problem is that the graph is holding Module trait objects, which your function would need to downcast to the concrete type of your processing module. I actually do think downcasting is feasible, the Module trait could have this method:

fn to_any(&mut self) -> &mut Any { self }

(Note: in playing with this, it's not quite so easy; the compiler complains about Sized bounds, but I think it can be made to work)
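For what it's worth, one common way to sidestep the Sized complaint is a small helper supertrait with a blanket impl, sketched here with illustrative names rather than dsp-chain's actual traits:

```rust
use std::any::Any;

// The blanket impl gives every concrete ('static) module an upcast to
// `&mut dyn Any` for free, and the method stays callable through
// `&mut dyn Module` because AsAny is a supertrait of Module.
trait AsAny {
    fn as_any_mut(&mut self) -> &mut dyn Any;
}

impl<T: Any> AsAny for T {
    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
}

trait Module: AsAny {
    fn process(&mut self, buf: &mut [f32]);
}

struct Sine {
    freq: f64,
}

impl Module for Sine {
    fn process(&mut self, _buf: &mut [f32]) { /* ... */ }
}

fn retune(module: &mut dyn Module) {
    // Downcast from the trait object back to the concrete module type.
    if let Some(sine) = module.as_any_mut().downcast_mut::<Sine>() {
        sine.freq = 880.0;
    }
}
```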

My original thought was just sending an enum of popular data types to an update method on the module trait. That's workable, I think, but less exciting.

raphlinus commented 7 years ago

I released my prototype: https://github.com/google/synthesizer-io. I'm not sure how much I'm going to continue developing it, or whether it's primarily useful as a source of ideas or as a real codebase. It does address a lot of the issues I outlined; it's fully lock-free, allows combinations other than sum, and has some other interesting properties.

mitchmindtree commented 7 years ago

Excellent, thanks for sharing! I'm looking forward to having a closer look.