Plan for implementing an ALSA duplex API

nyanpasu64 commented 2 years ago

This is a draft.

Overview and speculation

I'm told that jackd (the audio server) feeds JACK clients their input and output buffers synchronously, and that JACK's application-facing API hides buffer size management from the app; jackd handles routing and hardware IO/buffering instead. ALSA clients are not like that. You open independent input and output streams, and you have to align their block sizes and sampling rates, open them at the same time, read and write the same amount from both streams, etc. I hear that Apple's Core Audio exposes a JACK-like synchronous duplex API that communicates with physical hardware. (On Linux, you can have JACK's synchronous buffers, ALSA's direct hardware access, or neither when an ALSA app talks to pulseaudio-alsa or pipewire-alsa.)

On Linux, I get the impression that the only apps designed to be routable in a graph are JACK apps. PipeWire lets you route the inputs and outputs of Pulse/ALSA apps arbitrarily as well (in a patchbay app), but the apps were not written with this in mind. Worse, in ALSA's case, the application-facing API was written around timing being determined by hardware in real time, and the app managing data buffering itself. As a result, I think ALSA duplex can achieve the same round-trip latency on physical hardware (from hardware line in to speaker out) as a JACK client, but I'd be surprised if you can chain 3 ALSA duplex apps in a PipeWire graph and not get 1-2 periods of added latency per app, whereas 3 JACK duplex apps on pipewire-jack (see below) add zero latency compared to 1 app.

[Screenshot: Helvum showing three cpal duplex clients chained in a row in the PipeWire graph]
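To put rough numbers on the guess above about chained ALSA apps, here is a minimal arithmetic sketch. It assumes a hypothetical 256-frame period at 48 kHz (neither figure is a measurement from this thread) and counts every app in the chain; `added_latency_ms` is an illustrative helper, not part of any library.

```python
def added_latency_ms(apps: int, periods_per_app: int,
                     period_frames: int, sample_rate: int) -> float:
    """Latency added by a chain of apps that each buffer `periods_per_app` periods."""
    frames = apps * periods_per_app * period_frames
    return frames * 1000 / sample_rate


# Hypothetical settings: 256-frame periods at 48 kHz, three chained apps.
best_case = added_latency_ms(3, 1, 256, 48_000)   # each app adds 1 period
worst_case = added_latency_ms(3, 2, 256, 48_000)  # each app adds 2 periods
jack_chain = added_latency_ms(3, 0, 256, 48_000)  # no per-app buffering
print(best_case, worst_case, jack_chain)
```

Under these assumed settings the chain would add somewhere between the best-case and worst-case figures, while a chain that adds no per-app buffering stays at zero.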

jackd never changes buffer/period sizes. pipewire changes buffer/period sizes when you open and close apps. I'm not sure if/how it changes the period size of an ALSA device, but it seems buggy. Canberra notification sounds set the period to 8192 samples (absurdly high latency) after they start playing, there are/were audio glitches when periods get longer (https://gitlab.freedesktop.org/pipewire/pipewire/-/issues/1436 ?), and the round-trip latency of jack_iodelay fed through speaker output, a physical aux cable, and line-in input can change (when input/output period sizes diverge? upon xrun?).

jackd (JACK server) and rtaudio (an audio library for apps, with Linux/Windows/Mac backends) are ooold, and both use threads but predate C++11 atomics. jackd uses volatile variables, rtaudio just uses data races. RtAudio has real-world race conditions as well (which I could trigger if I wanted with a crafted test app), and incomprehensible data ownership/sharing that I'd have to rewrite to fix.

This was a learning experience. But I'm really not the most qualified person to talk about ALSA. Sadly I don't know who else understands ALSA vs. JACK well, and is willing to share their insights publicly.

jackd notes

This is a summary (given my current understanding) of how jack2 (I didn't look into jack1) handles ALSA duplex, and how I'd implement it in cpal #553, improve RtAudio's duplex, etc.:

Threads:

unsure how many there are, TODO I should find out someday...

setup (alsa_driver_new() -> alsa_driver_set_parameters()):

let the user pick the period size and count
- [ ] cpal doesn't currently, I wish cpal did
use mmap-based audio access
- I'd probably advise against mmap? RtAudio and cpal don't use it, and the safe ALSA bindings have very limited mmap support (you have to pass in a closure, and only interleaved access is supported, so it's doubly impossible to port jackd's code to mmap). I don't know what portaudio/miniaudio/cubeb/dolphin-emu do.
Use non-interleaved buffers
- Unsure. cpal uses interleaved buffers and snd_pcm_writei (the interleaved write call); RtAudio supports both layouts and (iirc) uses snd_pcm_write{i,n}.
call snd_pcm_hw_params_set_periods_integer, snd_pcm_hw_params_set_period_size (exact), snd_pcm_hw_params_set_periods_min (we can tolerate the hardware forcing more periods than requested by the user) and snd_pcm_hw_params_set_periods_near (abort if the hardware forces fewer periods than requested), then snd_pcm_hw_params_set_buffer_size
- I like this approach (especially snd_pcm_hw_params_set_periods_integer). Though in an application-level audio library like cpal (where users may not be professionals micromanaging their buffer sizes), it's not strictly necessary to get exactly the period size the user requested, so we could potentially make "buffer size/count doesn't match user request" not a hard error. What does RtAudio do?
call snd_pcm_sw_params_set_start_threshold(0)
- With or without this call, the output doesn't start when jackd calls snd_pcm_mmap_commit. In my sample app, by contrast, with or without this call (or if I set the threshold to the default of 1, or even to (snd_pcm_uframes_t)-1), ALSA outputs do start when the app calls snd_pcm_writei. To match jackd's behavior in my sample app, I have to set the start threshold greater than the total buffer size (e.g. the value of snd_pcm_sw_params_get_boundary(), or buffer size * 2, or buffer size + 1). Personally I'd use the boundary.
- IMO this is a latent bug in jackd, a behavior mismatch between writei and mmap, or a discrepancy between jackd and my sample app somewhere else I haven't located.
Attempt snd_pcm_link(). If it fails, note it down and keep going.
- Copy this I guess? RtAudio fails to open ALSA in duplex mode if snd_pcm_link() fails, so I can't use it to open a duplex connection to pulseaudio-alsa or pipewire-alsa, which is broken IMO.
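The period-count rule in the setup steps above (accept more periods than requested, abort on fewer) can be sketched as plain logic. This is a simplified model under stated assumptions, not jackd's code and not the ALSA API: `negotiate_periods`, `buffer_frames`, `hw_min` and `hw_max` are hypothetical names, and a real device exposes its limits through the hw_params calls rather than as two integers.

```python
def negotiate_periods(requested: int, hw_min: int, hw_max: int) -> int:
    """Pick a period count: never fewer than requested, otherwise as close as possible.

    hw_min and hw_max are stand-ins for whatever range the device supports.
    """
    # Like a "periods min" constraint: rule out anything below the request.
    lowest_allowed = max(requested, hw_min)
    if lowest_allowed > hw_max:
        # The hardware can only offer fewer periods than requested: give up.
        raise ValueError(
            f"device supports at most {hw_max} periods, {requested} requested")
    # Like a "periods near" request: the closest remaining value to the request.
    return lowest_allowed


def buffer_frames(period_frames: int, periods: int) -> int:
    """With an integer period count, the buffer is a whole number of periods."""
    return period_frames * periods


# Hypothetical device that insists on at least 3 periods; the user asked for 2.
granted = negotiate_periods(requested=2, hw_min=3, hw_max=8)
print(granted, buffer_frames(256, granted))
```

An application-level library could catch the error case and fall back to whatever the device offers instead of failing, which is the softer policy suggested above for cpal.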

beginning playback (alsa_driver_start()):

Fill the entire play buffer with silence. (fill the user-selected number of periods, ignoring extra periods if the hardware allocates them for you without being told to).
- Pretty sure we have to copy this. I'm worried about how jackd assumes that (in case of noninterleaved mmap) you can fill the entire buffer (all periods) of a single channel with a single memset, and that snd_pcm_mmap_begin() returns the entire buffer when asked to. (Is this the case in all modern hardware supporting mmap?) And if snd_pcm_mmap_begin() instead returns only 1 period (IDK if this can happen, it doesn't on my motherboard audio or USB audio UAC1 FiiO E10K), jackd will silently overwrite memory out of bounds, instead of erroring out.
Start the play stream. If snd_pcm_link() failed, start the capture stream too.
- The two streams should run in lockstep. When the capture stream has 1 period of data ready (written by the hardware), the playback stream should have around 1 period of room to write more data, and this is when a callback happens. If they desync, jackd doesn't attempt to resample data to accommodate desynced duplex streams; the slow stream will hold the fast stream back, and eventually the fast stream will xrun.
- [ ] TODO should writing silence upfront to the output be conditional, and only enabled when an input stream is active?
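As a rough illustration of the silence prefill and the lockstep it sets up, here is a minimal simulation. It assumes ideal hardware that plays and captures exactly one period per cycle; `prefill_frames` and `simulate_lockstep` are hypothetical helpers, and the 256-frame period is an example value only.

```python
from typing import List


def prefill_frames(period_frames: int, user_periods: int, hw_periods: int) -> int:
    """Silence to queue before starting: only the periods the user asked for."""
    if hw_periods < user_periods:
        raise ValueError("hardware granted fewer periods than requested")
    return period_frames * user_periods  # extra hardware periods stay empty


def simulate_lockstep(period_frames: int, user_periods: int, cycles: int) -> List[int]:
    """Frames queued for playback after each callback while both streams stay in sync."""
    queued = period_frames * user_periods  # the silence written up front
    history = []
    for _ in range(cycles):
        queued -= period_frames    # hardware plays one period...
        captured = period_frames   # ...while it records one period
        queued += captured         # the callback writes as much as it read
        history.append(queued)
    return history


# Hypothetical setup: 256-frame periods, 2 requested, 3 granted by the device.
print(prefill_frames(256, 2, 3), simulate_lockstep(256, 2, 3))
```

In this idealized model the playback queue returns to the prefill level after every callback; a real pair of unlinked streams can drift away from that, which is the desync case described above.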

in the main loop (JackAudioDriver::Process()):

in jack2 synchronous mode (JackAudioDriver::ProcessSync()), the main loop waits for both input/output to be ready, then reads input from hardware, computes output, and writes output to hardware.

I assume we should copy jack2 sync mode, not async (which introduces an extra period of latency to avoid bad clients producing noise), for cpal.

Details:

Repeatedly poll input and output in a loop (alsa_driver_wait()), until both are ready. If one falls behind (so by the time it's ready, the other device has already reached xrun), report an xrun etc. (see below)
- [ ] One alternative is to snd_pcm_wait() on both streams, then afterwards verify both aren't in xrun. It may be simpler than polling, and I find it less confusing than polling (maybe because I'm not experienced with it, though alsa exposes a safe wrapper for poll()). But it's less powerful; if you're blocked on a stream that never becomes ready, you can't pick up on xrun events from the other stream (which would abort the polling loop). You could instead use a 10ms timeout loop I guess? (Do fast timers impair hardware timer power management?)
- [ ] IIRC RtAudio doesn't poll or wait. I think RtAudio should, and not doing so adds an extra period of latency (a zero-length queue) when using RtAudio in the output-only case, and a bit of latency (if the capture stream is ahead of the playback stream) in the duplex case.
Presumably read and write audio? I didn't look into this, jackd uses mmap, we'd probably use snd_pcm_readi/writei?
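The wait-until-both-ready loop described above can be modelled without touching ALSA. In this sketch the `Poll` enum and `wait_for_both` are hypothetical names, and each item in `polls` stands for one wakeup reporting the state of the capture and playback streams; it is not a transcription of alsa_driver_wait().

```python
from enum import Enum
from typing import Iterable, Tuple


class Poll(Enum):
    PENDING = "pending"
    READY = "ready"
    XRUN = "xrun"


def wait_for_both(polls: Iterable[Tuple[Poll, Poll]], max_polls: int = 100) -> str:
    """Return "ready" once both streams have reported ready, "xrun" or "timeout" otherwise."""
    capture_ready = False
    playback_ready = False
    for count, (capture, playback) in enumerate(polls):
        if count >= max_polls:
            break
        # An xrun on either stream aborts the wait, even if the other is ready.
        if Poll.XRUN in (capture, playback):
            return "xrun"
        # Readiness is latched: a stream that was ready stays counted as ready.
        capture_ready = capture_ready or capture is Poll.READY
        playback_ready = playback_ready or playback is Poll.READY
        if capture_ready and playback_ready:
            return "ready"
    return "timeout"


# Capture becomes ready first; playback catches up one wakeup later.
print(wait_for_both([(Poll.PENDING, Poll.PENDING),
                     (Poll.READY, Poll.PENDING),
                     (Poll.READY, Poll.READY)]))
```

Because both streams are checked on every wakeup, an xrun on one stream is seen even while the other is still pending, which is the advantage over blocking on each stream in turn.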

upon xrun:

I took a quick glance at what happens during xrun, and I think this is what happens: Stop and start both capture and playback streams, regardless of which one hit xrun. Don't close or recreate streams or any other state, though.

In jack2, alsa_driver_wait() on the audio(?) thread calls alsa_driver_xrun_recovery(), which calls the global function Restart():
- [inlined] JackAlsaDriver::Stop()
  - alsa_driver_stop()
    - ClearOutput() (this is related to jackd's audio graph, not ALSA, so we probably don't care for cpal)
    - Calls snd_pcm_drop() on the playback stream. If snd_pcm_link() failed, call it on the capture stream too.
  - JackDriver::Stop()
    - Jack::JackDriver::StopSlaves(). I don't know what slaves are. I know that cpal doesn't have them.
- [inlined] JackAlsaDriver::Start()
  - JackDriver::Start()
  - alsa_driver_start() (aka goto "beginning playback")
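The stop-then-restart order from the call tree above can be sketched with stand-in stream objects that only record what was asked of them. `FakeStream` and `recover_from_xrun` are hypothetical names. The `prepare` step is not spelled out in the notes above; it is added here on the assumption that a dropped ALSA stream has to be prepared again before it can start, and it is assumed to follow the same linked/unlinked rule as drop and start.

```python
from typing import List


class FakeStream:
    """Stand-in for a PCM handle; it only records which operations were requested."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def do(self, operation: str) -> None:
        self.log.append(f"{self.name}.{operation}")


def recover_from_xrun(playback: FakeStream, capture: FakeStream, linked: bool) -> None:
    """Stop and restart both streams, whichever one hit the xrun."""
    # Stop: drop playback; a linked capture stream stops along with it.
    playback.do("drop")
    if not linked:
        capture.do("drop")
    # Restart, same as "beginning playback" (prepare is an added assumption).
    playback.do("prepare")
    if not linked:
        capture.do("prepare")
    playback.do("fill_silence")
    playback.do("start")
    if not linked:
        capture.do("start")


log: List[str] = []
recover_from_xrun(FakeStream("playback", log), FakeStream("capture", log), linked=True)
print(log)
```

No stream is closed or reopened here, matching the observation that jack2 keeps its handles and other state across an xrun.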

How does cpal handle xruns? Does it handle them at all? (TODO look into it)

nyanpasu64 commented 2 years ago

I'm probably not implementing alsa duplex until cpal can actually properly detect and open my hw devices. pipewire-alsa (and likely pulseaudio-alsa) are terrible apis for low latency output and duplex, because both the app and the audio server buffer audio (there are possible workarounds, which pipewire-alsa doesn't do, and handling duplex correctly is especially tricky and situational, and it's difficult to get a general solution).

Right now cpal doesn't detect hw out, and crashes trying to read from hw in (#630). Does cpal seek to target professional DAW use (jackd or alsa hw devices), or mainstream users without a spare audio interface (pulseaudio/pipewire servers, any protocol they support)? If the latter, I think adding a pulse backend is more important, at least until pipewire becomes mainstream (at which point cpal can use the jack backend or add a pipewire one).

bartkrak commented 1 year ago

What's the current state of cpal support for duplex streams? I'm working on an advanced audio app using cpal where I need to process audio from input and send it back to the same hardware device for output. I have input and output as separate streams with a ring buffer in between. It kind of works, but sometimes (totally at random) I get "backend specific error: broken pipe" and the input stream starts giving me silence; I have to restart my app to make it work again. Sometimes it works for a few hours, sometimes only a few seconds. Any ideas how to handle this problem?

RustAudio / cpal

Plan for implementing an ALSA duplex API #628
