HEnquist / camilladsp

A flexible cross-platform IIR and FIR engine for crossovers, room correction etc.
https://henquist.github.io/
GNU General Public License v3.0

Use of Performance cores on macOS Apple Silicon (Mx series CPUs) #357

Closed siraaris closed 4 weeks ago

siraaris commented 2 months ago

It would be good to have a mechanism for camilladsp on macOS to utilise Performance cores on the M series Apple Silicon CPUs.

Under some loads, e.g. high channel count, high sample rate and reasonably large FIR filters, the Efficiency cores may not be suitable. For example, I've observed glitches and high load reported by camilladsp when running on Efficiency cores, which seem to go away when the OS promotes the process to Performance cores (investigation is ongoing!).

I think thread priority must be set in code, as there's no option that I'm aware of for setting this in user land.

siraaris commented 2 months ago

To reduce load I created a set of unified FIR filters to get my config down to the most efficient possible, so now I use single FIR filters per channel incorporating XO, DRC and other DSP etc.

Glitches (restarts) are observed even when CamillaDSP is quiescent. The buffer levels are also stable.

It would seem to suggest that the OS, BlackHole or CoreAudio is doing something here.

Any suggestions on where to look? I'm running out of ideas (until you get time to look at thread prioritisation :) ). No drama, just sharing what I've noticed.

siraaris commented 2 months ago

Hmmm. Looks like Restarts/Buffer Levels occur hourly. Screams cronjob, will investigate, or could be when macOS maintains processes. In the graph below I've normalised buffers to 1, and restarts as counts in 5 minute bucket windows for a 24h period.

Screenshot 2024-09-04 at 12 55 55 AM

HEnquist commented 2 months ago

I have made some changes to improve the buffer level measurement. I'm not sure if it helps for this issue or not, but you could try the latest next30 to see if it helps. I also added a script for plotting the processing load and buffer level while running: https://github.com/HEnquist/camilladsp/blob/next30/testscripts/log_load_and_level.py I have let it run for a couple of hours on my M1 Air, hoping to get results similar to yours. But it behaves well, without buffer underruns or spikes in the level.

siraaris commented 1 month ago

That's a useful utility :)

I think there are a number of factors in my case. My audio devices are AVB based, which is probably uncommon for most users of CamillaDSP.

For example, the Playback device is presented to CoreAudio from a USB connected device (RME Digiface AVB) that bridges into network transport to connect to the DAC (RME M32 DA Pro, a 32 ch DAC), managed by RME's AVB Controller.

This should be transparent to CamillaDSP if everything else is well behaved.

Things I've noticed that result in CamillaDSP glitches:

  1. RME have moved the driver for the Digiface AVB from a Kernel Extension to a System Extension (DriverKit), which moves runtime from kernel space to user space. I don't think this is the direct cause of any of my observed issues, but it's worth noting.
  2. RME have also moved the ethernet driver for the Digiface AVB (netifc) into user space. Again, I don't think this is directly an issue when netifc is behaving well (see next point); it's just another part of the chain that has changed as Apple deprecates kernel extensions.
  3. The RME netifc process continually crashes with segmentation faults. I've filed a support request with RME. From (admittedly limited) data over the last few days or so, I think there is a correlation between observed (audible) disturbances in CamillaDSP and when netifc crashes and restarts.
  4. The Mx series Apple Silicon CPUs have a mixture of Efficiency and Performance cores. The M1 for example has 4 of each. Based on (once again, limited) data over the last week or so, I think there may be an effect on CamillaDSP when the running process is moved to/from different type cores - but I need to look further, and maybe what I'm observing is directly because of core swapping.

I still think it's worth implementing thread prioritisation / core affinity or similar for macOS - given the move by Apple to user space for areas that have traditionally been in kernel.

Because of the above, I've spent some time tonight getting an AVB configuration set up using only the RME DAC and Apple's built-in AVB Controller (avbutil), to see how that goes.

siraaris commented 1 month ago

Removing the RME devices as above, the behaviour overnight still suggests an hourly incident.

Screenshot 2024-09-05 at 3 18 28 PM

The three blips in the centre (around 20000) are at 7:27, 8:27, and 9:27, followed by a long stretch of stability, until I connected via remote screen sharing and resumed looking at this on the console.

At these times there are lots of:

2024-09-05 07:27:02.757878 INFO [src/coreaudiodevice.rs:446] Restarting playback after buffer underrun
2024-09-05 07:27:03.951268 WARN [src/coreaudiodevice.rs:455] Playback interrupted, no data available
...
2024-09-05 08:27:19.185577 WARN [src/coreaudiodevice.rs:455] Playback interrupted, no data available
2024-09-05 08:27:19.189584 INFO [src/coreaudiodevice.rs:446] Restarting playback after buffer underrun
...
2024-09-05 09:27:19.021137 INFO [src/coreaudiodevice.rs:446] Restarting playback after buffer underrun
2024-09-05 09:27:20.214530 WARN [src/coreaudiodevice.rs:455] Playback interrupted, no data available
...
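One way to check the hourly pattern is to bucket the restart messages by time, using the same 5-minute windows as the graphs. The following is a minimal Python sketch, not part of CamillaDSP. The name restarts_per_bucket and the regular expression are assumptions, based only on the log lines quoted above.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed pattern: matches only the "Restarting playback" lines quoted in this thread.
RESTART_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+INFO\s+.*"
    r"Restarting playback after buffer underrun"
)


def restarts_per_bucket(lines, bucket_minutes=5):
    """Count restart messages per time bucket (bucket_minutes should divide 60)."""
    counts = Counter()
    for line in lines:
        match = RESTART_RE.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        # Round the timestamp down to the start of its bucket.
        bucket = ts.replace(
            minute=ts.minute - ts.minute % bucket_minutes, second=0, microsecond=0
        )
        counts[bucket] += 1
    return counts


# Sample input copied from the log excerpt above.
sample = [
    "2024-09-05 07:27:02.757878 INFO [src/coreaudiodevice.rs:446] Restarting playback after buffer underrun",
    "2024-09-05 07:27:03.951268 WARN [src/coreaudiodevice.rs:455] Playback interrupted, no data available",
    "2024-09-05 08:27:19.189584 INFO [src/coreaudiodevice.rs:446] Restarting playback after buffer underrun",
    "2024-09-05 09:27:19.021137 INFO [src/coreaudiodevice.rs:446] Restarting playback after buffer underrun",
]
counts = restarts_per_bucket(sample)
for bucket, n in sorted(counts.items()):
    print(bucket, n)
```

With the sample lines, each restart lands in a separate bucket exactly one hour apart (07:25, 08:25 and 09:25), which is the kind of regularity that points at a scheduled background activity.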

siraaris commented 1 month ago

Oh my dear lord.

https://eclecticlight.co/2023/01/21/how-macos-schedules-background-activities/

siraaris commented 1 month ago

A bit of progress. I removed the Presonus AVB switch, leaving only the Mac mini connected directly to the DAC via ethernet.

Lo and behold, things seem to settle down, see screen grab.

Screenshot 2024-09-06 at 3 35 06 AM

And the RME networking (netifc) is not crashing.

Will let it run this weekend, and provide an update.

HEnquist commented 1 month ago

That looks very similar to how it looks on my M1 Air when using just Blackhole and the built in speakers.

HEnquist commented 1 month ago

It looks like macOS offers very limited control over thread priorities. There is a concept of quality of service, but I'm not sure if it's applicable here. Then there are audio workgroups, as described here: https://www.bluecataudio.com/Blog/announcements/realtime-audio-multicore-issues-for-apple-silicon-end-of-the-story/ That looks interesting, but much more difficult to use than just adjusting some thread priorities. Not sure how feasible it is to use it in camilladsp.

HEnquist commented 1 month ago

Found this that I will try: https://crates.io/crates/audio_thread_priority

siraaris commented 1 month ago

Screenshot 2024-09-06 at 9 19 45 AM

Behaviour overnight. Just FYI.

siraaris commented 1 month ago

> It looks like macOS offers very limited control over thread priorities. There is a concept of quality of service, but I'm not sure if it's applicable here. Then there are audio workgroups, as described here: https://www.bluecataudio.com/Blog/announcements/realtime-audio-multicore-issues-for-apple-silicon-end-of-the-story/ That looks interesting, but much more difficult to use than just adjusting some thread priorities. Not sure how feasible it is to use it in camilladsp.

It seems clear that using audio workgroups addresses the issue. It's interesting that Intel CPUs have introduced varying core performance too; maybe this issue shows up on non-macOS systems as well.

What's not clear is the detail of the "hack" that BlueCatAudio refer to. Maybe it's what the audio_thread_priority Rust crate uses?

HEnquist commented 1 month ago

> What's not clear is detail on the "hack" that BlueCatAudio refer to. Maybe it's what audio_thread_priority Rust crate utilises?

The audio_thread_priority crate doesn't seem to be using audio workgroups, so probably not. But we don't know what that hack is (only that it's supposedly obvious if you look somewhere in the Apple open source code 😒), so this is only a guess.

I added audio_thread_priority to the processing thread and the CoreAudio capture and playback threads in branch audio_thread_prio. Can you try it on your system?

siraaris commented 1 month ago

I am running the audio_thread_prio branch now - will report back in a few hours, which should be enough to see how it goes.

aris@pollen ~ % ~/Projects/camilladsp-audio-thread-prio/target/release/camilladsp --address 192.168.1.169 --port 1234 ~/Projects/keystone-bedrock-v5-Consolidated.yml --gain=-50.0
2024-09-07 21:52:21.271423 INFO [src/bin.rs:742] CamillaDSP version 3.0.0
2024-09-07 21:52:21.271442 INFO [src/bin.rs:743] Running on macos, aarch64
2024-09-07 21:52:21.374445 INFO [src/coreaudiodevice.rs:1246] The capture device supports pitch control
2024-09-07 21:52:21.480098 INFO [/Users/aris/.cargo/registry/src/index.crates.io-6f17d22bba15001f/audio_thread_priority-0.32.0/src/rt_mach.rs:158] thread 5635 bumped to real time priority.
2024-09-07 21:52:21.489947 INFO [/Users/aris/.cargo/registry/src/index.crates.io-6f17d22bba15001f/audio_thread_priority-0.32.0/src/rt_mach.rs:158] thread 8451 bumped to real time priority.
2024-09-07 21:52:21.679741 INFO [/Users/aris/.cargo/registry/src/index.crates.io-6f17d22bba15001f/audio_thread_priority-0.32.0/src/rt_mach.rs:158] thread 8195 bumped to real time priority.
2024-09-07 21:52:21.681181 WARN [src/coreaudiodevice.rs:459] Playback interrupted, no data available
2024-09-07 21:52:21.687903 INFO [src/coreaudiodevice.rs:450] Restarting playback after buffer underrun

siraaris commented 1 month ago

Initial observation - buffers are very stable. I've tried spiking the CPU with various actions, which previously would trigger buffers to rise and crash - so far looking good.

Screenshot 2024-09-07 at 10 10 04 PM

siraaris commented 1 month ago

Well, the thread priority change has resulted in super stability for camilladsp. No dropouts/restarts, stable buffers, really nice. There's a bit of lagginess on the UI when I remote in, but the Mac mini is dedicated for camilladsp so that's a small price to pay for audio stability and performance.

I'll keep it running, and report back. If there's anything you need me to look at specifically, just shout out.

Screenshot 2024-09-08 at 1 58 19 AM

HEnquist commented 1 month ago

Looks great so far! I don't have any specific things I want tested, just curious about how it behaves when kept running for a while.

siraaris commented 1 month ago

I think you can probably treat this as "done". I'm not really sure that Performance cores are actually used, but regardless, it makes no difference: the result is stable and performant behaviour.

Screenshot 2024-09-08 at 3 36 36 PM

HEnquist commented 1 month ago

Would you be interested in trying multithreaded processing? The branch "with_rayon" supports splitting filter tasks among several threads. This is enabled via a new optional boolean multithreaded in the devices section of the config (that defaults to off).

It hasn't gotten much testing, so please start with amplifiers powered off :)

The idea is that between mixers and processors, each channel can be filtered independently from the others. So it collects the filters to apply to each channel, and then uses the really smart rayon library to process the channels in parallel in a set of worker threads. It needs quite heavy filter tasks to actually help. With filters that are too "easy", the overhead of passing things back and forth between threads becomes larger than the actual processing time. I think your config could potentially benefit.
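The per-channel idea can be sketched as follows. This is only an illustration in Python, not the actual Rust/rayon implementation: fir and process_chunk are hypothetical names, and a standard-library thread pool stands in for rayon's worker threads. In CPython the GIL means this pure-Python version will not run faster than a serial loop; it only shows how the work is split per channel and collected again in channel order.

```python
from concurrent.futures import ThreadPoolExecutor


def fir(samples, taps):
    """Direct-form FIR: y[n] = sum_k taps[k] * x[n-k], treating x as zero before the start."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out.append(acc)
    return out


def process_chunk(channels, taps_per_channel, workers=2):
    """Filter every channel independently, one task per channel (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input channels.
        return list(pool.map(fir, channels, taps_per_channel))


# Two channels with different (tiny) filters: a 2-tap average and a first difference.
chunk = [[1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]]
taps = [[0.5, 0.5], [1.0, -1.0]]
filtered = process_chunk(chunk, taps)
print(filtered)
```

With filters this small, handing the work to the pool costs far more than the filtering itself, which is the overhead trade-off described above.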

siraaris commented 1 month ago

I tried the with_rayon branch.

Seems to run ok, but now get underruns - I think because the threads you create per channel need also to be real-time?

For example - with multithreaded: false, I can stop/start Safari (ie cause CPU spikes), and camilladsp is unaffected.

With multithreaded: true, stop/start Safari causes camilladsp to hiccup.

HEnquist commented 1 month ago

> now get underruns - I think because the threads you create per channel need also to be real-time?

Yes those also need to have their priority raised, just didn't get to that yet. But did you see any change to the processing load? Hopefully it should be lower.

siraaris commented 1 month ago

I'll run for a while with graphing and let you know.

siraaris commented 1 month ago

The load is lower yes.

Screenshot 2024-09-10 at 3 51 48 PM

siraaris commented 1 month ago

Longer timeframe. You can see the spike at the end, when I remote in to take a screen capture.

Screenshot 2024-09-10 at 4 22 17 PM

HEnquist commented 1 month ago

Ok! Thanks for testing. I would expect the threading to make it more sensitive to interference from other loads. Raising priorities should help, but there may still be delays when waking up the worker threads, and when they notify the main processing thread that they are finished.

siraaris commented 1 month ago

I'll happily test! I think it's worth pursuing and having the support there?

HEnquist commented 1 month ago

The with_rayon branch is updated. Now it raises the priority of the workers, and the number of workers can be set by the worker_threads parameter in devices. Leave it out or set it to 0 to let rayon decide, which results in one thread per hardware thread of the machine. On the Windows laptop I'm using at the moment (12-core Snapdragon X Elite CPU), anything above 4 threads gives the same processing load.
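The worker_threads rule described above (left out or 0 means one worker per hardware thread) can be sketched like this. It is an illustration only, not CamillaDSP's code: resolve_worker_threads is a hypothetical name, and os.cpu_count() is assumed here as a stand-in for the hardware thread count that rayon would detect.

```python
import os


def resolve_worker_threads(configured=None):
    """Return the number of workers to start; None or 0 means 'one per hardware thread'."""
    if not configured:
        # os.cpu_count() can return None on unusual platforms, so fall back to 1.
        return os.cpu_count() or 1
    if configured < 0:
        raise ValueError("worker_threads must be 0 or a positive number")
    return configured


print(resolve_worker_threads())   # default: depends on the machine
print(resolve_worker_threads(4))  # explicit setting is used as-is
```

The default scales with the machine, so on a CPU with many hardware threads it can be much larger than the point where extra workers stop helping (4 in the laptop example above).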

siraaris commented 1 month ago

Initial observation is that setting worker_threads manually is required, as the default number (when I set it to 0) may be too high, and detrimental.

On my Mac mini, I've set worker_threads to 4 and that seems to be ok.

Will leave it running for a while and report back.

siraaris commented 1 month ago

Final check-in: at 192 kHz, 32 ch, it's only stable with 2 threads.

But with that it's rock solid.

siraaris commented 4 weeks ago

Without creating a new issue/suggestion, I've been experimenting - on Linux - with pinning camilladsp to specific CPUs, as I'm still experiencing xrun issues.

Summary of steps I've taken:

  1. Disabled Hyperthreading in BIOS
  2. Installed RT kernel (6.10.11-rt-amd64)
  3. tuned-adm latency-performance
  4. I have an 8-core i7, so:
     - Boot kernel with isolcpus=0,1,2,3,4,5 (on Debian, edit /etc/default/grub and run update-grub)
     - Set CamillaDSP config: multithreaded: true, worker_threads: 4
     - Start camilladsp with taskset --cpu-list 0-5
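For reference, the pinning step can also be done from a small launcher script instead of taskset. The sketch below is a Python illustration under stated assumptions: parse_cpu_list and pin_current_process are hypothetical helpers, the parser handles only plain numbers and ranges such as "0-5" (a subset of what taskset accepts), and os.sched_setaffinity is available on Linux only.

```python
import os


def parse_cpu_list(spec):
    """Expand a CPU list such as '0-5' or '0,2,4-5' into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_current_process(spec):
    """Restrict the calling process to the listed CPUs where the OS supports it."""
    cpus = parse_cpu_list(spec)
    if hasattr(os, "sched_setaffinity"):  # Linux only; absent on macOS and Windows
        os.sched_setaffinity(0, cpus)     # pid 0 means "this process"
    return cpus


print(sorted(parse_cpu_list("0-5")))
```

Calling pin_current_process("0-5") before starting the audio process from the same script would mirror the taskset invocation above, since on Linux the affinity mask is inherited by child processes.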

HEnquist commented 4 weeks ago

Does pinning make any difference?

siraaris commented 4 weeks ago

I'm not sure yet, but for the above configuration there are still underruns reported:

aris@controller:~$ tail -f proj/log/camilladsp.log
2024-10-06 16:37:35.801437 INFO [src/bin.rs:781] CamillaDSP version 3.0.0
2024-10-06 16:37:35.801443 INFO [src/bin.rs:782] Running on linux, x86_64
2024-10-06 16:37:35.924265 INFO [src/alsadevice.rs:789] Capture device supports rate adjust
2024-10-06 16:37:36.036685 INFO [src/alsadevice.rs:117] PB: Starting playback from Prepared state
2024-10-06 17:21:36.039657 WARN [src/alsadevice.rs:113] PB: Prepare playback after buffer underrun
2024-10-06 17:21:51.833882 WARN [src/alsadevice.rs:113] PB: Prepare playback after buffer underrun
2024-10-06 18:31:57.670345 WARN [src/alsadevice.rs:113] PB: Prepare playback after buffer underrun
2024-10-06 18:45:02.588972 WARN [src/alsadevice.rs:113] PB: Prepare playback after buffer underrun
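To put numbers on how often the underruns occur, the time between consecutive warnings can be computed from the log. This is a minimal Python sketch, not part of CamillaDSP. The name underrun_gaps and the regular expression are assumptions, based only on the ALSA log lines shown above.

```python
import re
from datetime import datetime

# Assumed pattern: WARN lines that mention a buffer underrun, as in the log above.
UNDERRUN_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+WARN\s+.*buffer underrun"
)


def underrun_gaps(lines):
    """Return the seconds elapsed between consecutive underrun warnings."""
    times = []
    for line in lines:
        match = UNDERRUN_RE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


# Sample input copied from the log above.
log_lines = [
    "2024-10-06 16:37:36.036685 INFO [src/alsadevice.rs:117] PB: Starting playback from Prepared state",
    "2024-10-06 17:21:36.039657 WARN [src/alsadevice.rs:113] PB: Prepare playback after buffer underrun",
    "2024-10-06 17:21:51.833882 WARN [src/alsadevice.rs:113] PB: Prepare playback after buffer underrun",
    "2024-10-06 18:31:57.670345 WARN [src/alsadevice.rs:113] PB: Prepare playback after buffer underrun",
    "2024-10-06 18:45:02.588972 WARN [src/alsadevice.rs:113] PB: Prepare playback after buffer underrun",
]
gaps = underrun_gaps(log_lines)
print([round(g) for g in gaps])
```

For these four warnings the gaps are roughly 16 seconds, 70 minutes and 13 minutes, so unlike the hourly macOS incidents earlier in the thread, there is no obvious fixed period here.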

I don't think it's a system issue per se, as the same filters and pipeline setup on BruteFIR don't exhibit xruns (well, at least for 12 hours at a stretch). For the "same" configuration, CamillaDSP xruns every hour or two.

siraaris commented 4 weeks ago

Cognisant that this isn't macOS related, do you want me to create a new issue? We can probably close this one, as macOS on the M1 Mac with the changes you introduced was solid (provided that worker_threads wasn't too high).

HEnquist commented 4 weeks ago

Yes, that fits better in a new issue. Please attach the config file, and the output of aplay -l and arecord -l.