Optimize audio node processing for simd avoiding bytemuck

rhizoome commented 2 months ago

EDIT: Although I do not like how the code looks, it gives a 10% to 22% performance improvement. Maybe you could give it a thought. Also as far as I can tell, the results are still correct, although the tests do not like it.

I was confused by the difference in execution-profile of tick() and process() based AudioNodes . So I looked deeper into the profile: The largest identifiable chunk that was spent outside process() calculating things, was 13% in a crate called bytemuck. I found out that this happens, because you need some tricks to access a f32 in a avx- or neon-value directly. So I thought what happens if we just access the complete data directly in one go? It turns out in that case it is just a transmute(), which should be a no-op.

As usual with optimisation, I am not sure if the ugly code is actually worth it. So please feel free to reject or heavily modify what I did. It is just a proposal.

tick()-based

optimized » /usr/bin/time target/release/fundsp_player bank_simd dummy --seconds 4000
        2.42 real         2.41 user         0.00 sys

original » /usr/bin/time target/release/fundsp_player bank_simd dummy --seconds 4000
        2.67 real         2.65 user         0.01 sys

-> 10% (SIMD in resonator bank)

optimized » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 4000
        7.40 real         7.37 user         0.02 sys

original » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 4000
        9.57 real         9.53 user         0.03 sys

-> 22% (Normal resonator bank)

process()-based

optimized » /usr/bin/time target/release/fundsp_player harmonic_series dummy --seconds 4000
       10.53 real        10.45 user         0.06 sys

original » /usr/bin/time target/release/fundsp_player harmonic_series dummy --seconds 4000
       10.58 real        10.49 user         0.06 sys

-> 0.005% (as expected not much difference)

If you ignore the loop-unrolling part of the patch. Using uninitialised temp storage might still be interesting. It is nicer code, but only contributes about 1%. Different to the code for uninitialised BufferArray, I wanted the unsafe to be seen at the call site. Because if you use such a frame as input for calculations, the program will crash. This is true for the BufferArray case, too, so making it an unsafe function might be a good idea.

rhizoome commented 2 months ago

I broke something.

It cannot the be too bad though, my examples still produce the correct spectrums.

rhizoome commented 2 months ago

I think it might be better to optimize this at the leaf-node of the graph, since there we know most of the time how many channels there are. I will try that tomorrow.

SamiPerttu commented 2 months ago

The code is correct if you add a call to process_remainder at the end. I tested it on x86/64 and could not detect any gains, however - did you measure those on ARM?

rhizoome commented 2 months ago

The code is correct if you add a call to process_remainder at the end. I tested it on x86/64 and could not detect any gains, however - did you measure those on ARM?

Yes on arm. I‘ll test my example on x86, to be sure. I was surprised that at_f32() has an overhead, maybe that is an arm thing.

rhizoome commented 2 months ago

The code is correct if you add a call to process_remainder at the end. I tested it on x86/64 and could not detect any gains, however - did you measure those on ARM?

On x86 I have 20%.

This is the code I am running:

fn build_bank_current() -> impl AudioUnit {
    (noise()
        >> split()
        >> (resonator_hz(440.0, 50.0)
            | resonator_hz(440.0 * 2.0, 50.0)
            | resonator_hz(440.0 * 3.0, 50.0)
            | resonator_hz(440.0 * 4.0, 50.0)
            | resonator_hz(440.0 * 5.0, 50.0)
            | resonator_hz(440.0 * 6.0, 50.0)
            | resonator_hz(440.0 * 7.0, 50.0)
            | resonator_hz(440.0 * 8.0, 50.0))
        >> join())
        * 0.1
}

And this is how I am running it:

pub fn dummy(seconds: u32, build_name: &str) {
    let sink: [f32; 8] = [0.0; 8];
    let sink_ptr = sink.as_ptr() as *mut [f32; 8];
    let mut runner = Runner::new(build_name);
    let mut backend = runner.backend();
    let count: usize = SAMPLE_RATE as usize / MAX_BUFFER_SIZE * 2 * seconds as usize;
    for _ in 0..count {
        backend.process();
        // Avoid optimizations
        for wide in 0..BUFFER_LEN {
            for channel in 0..1 {
                let wide_buf = backend.buffer.at(channel, wide).to_array();
                unsafe {
                    ptr::write_volatile(sink_ptr, wide_buf);
                }
            }
        }
    }
}

Maybe I did something stupid. Here is the complete code: https://github.com/rhizoome/fundsp_player

SamiPerttu commented 2 months ago

I added this to 0.19.1.

rhizoome commented 2 months ago

Thanks for trying it out! As promised I looked into optimising at the leaf:

It is even more efficient than the root optimisation, but the code is only nice for the trivial-case of one INPUT/OUTPUT. While looking around I saw that even the fn process() in many leaf AudioUnits uses at_f32(). So I think this means:

First we probably have to understand what is going on. I still can't believe that at_f32() has actually an overhead. Do I just force the CPU to fill the cache correctly? On the other hand, the fact that the optimisation is working even better in the one INPUT/OUTPUT case at the leaf implicates that at_f32() really has an overhead. But maybe it is implemented badly in wide or bytemuck?
- Can we try to use unsafe in at_f32() and just force the f32 out of the register-value? (Just to compare performance, not as a solution) <---- [That is the next thing I try]
Maybe we can come up with a way to remove this impedance mismatch* (I currently can't) and I would help cleaning it up
Or I would just build my system and then optimise the things I use and try to upstream the optimisation on a case-by-case basis
- Which probably means: making generic helper that does conversion from frame "by" channel to channel "by" frame and back
* The fact that because of f32x8 the memory is not ordered correctly for processing.

optimized » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 2000
        3.48 real         3.47 user         0.00 sys

not » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 2000
        5.04 real         4.77 user         0.01 sys

-> 30%

diff --git a/src/biquad.rs b/src/biquad.rs
index 8b0d0b7a..1cc421b2 100644
--- a/src/biquad.rs
+++ b/src/biquad.rs
@@ -7,6 +7,7 @@

 use super::audionode::*;
 use super::math::*;
+use super::prelude::{BufferMut, BufferRef};
 use super::setting::*;
 use super::shape::*;
 use super::signal::*;
@@ -162,6 +163,19 @@ impl<F: Real> Biquad<F> {
     pub fn set_coefs(&mut self, coefs: BiquadCoefs<F>) {
         self.coefs = coefs;
     }
+
+    #[inline]
+    fn _tick(&mut self, x0: f32) -> f32 {
+        let x0 = convert(x0);
+        let y0 = self.coefs.b0 * x0 + self.coefs.b1 * self.x1 + self.coefs.b2 * self.x2
+            - self.coefs.a1 * self.y1
+            - self.coefs.a2 * self.y2;
+        self.x2 = self.x1;
+        self.x1 = x0;
+        self.y2 = self.y1;
+        self.y1 = y0;
+        F::to_f32(y0)
+    }
 }

 impl<F: Real> AudioNode for Biquad<F> {
@@ -183,16 +197,25 @@ impl<F: Real> AudioNode for Biquad<F> {
     #[inline]
     fn tick(&mut self, input: &Frame<f32, Self::Inputs>) -> Frame<f32, Self::Outputs> {
         let x0 = convert(input[0]);
-        let y0 = self.coefs.b0 * x0 + self.coefs.b1 * self.x1 + self.coefs.b2 * self.x2
-            - self.coefs.a1 * self.y1
-            - self.coefs.a2 * self.y2;
-        self.x2 = self.x1;
-        self.x1 = x0;
-        self.y2 = self.y1;
-        self.y1 = y0;
+        let y0 = self._tick(x0);
         [convert(y0)].into()
     }

+    fn process(&mut self, size: usize, input: &BufferRef, output: &mut BufferMut) {
+        for i in 0..full_simd_items(size) {
+            let simd_in = input.at(0, i);
+            let mut simd_out = output.at(0, i);
+            let array_in = simd_in.as_array_ref();
+            let array_out = simd_out.as_array_mut();
+            for x in 0..array_in.len() {
+                array_out[x] = self._tick(array_in[x]);
+            }
+        }
+
+        self.process_remainder(size, input, output);
+    }
+

rhizoome commented 2 months ago

I did the test of bypassing bytemuck there is no performance difference!!

-> So it is solving a cache issue!!

I do not like to solve cache issues in a multi-platform library like that. Because the advantage we have on desktop-computers might very well be a problem on embedded MCUs like cortex without a cache. I can test on cortex-M0, maybe llvm is intelligent enough to remove the copying, if not I am pretty sure the code in this PR is worse.

That is probably the reason why you did not see a difference, you had a more mixed load. I just run that one resonator-bank.

Solving impedance mismatch on the other hand would not harm MCUs. Let me know what you think! Do not feel obligated to do anything, I should build my complete system and then optimise. I was just doing this to get a feel for the performance of the resonator-bank, because that is definitely a core feature. I think I now know the numbers and whether this is optimisation is kept is not very important. I can always patch it on builds for desktop-computers. I am not afraid of patches.

SamiPerttu commented 2 months ago

Closing this. I decided for the time being to revert back to the earlier method. It was worth a shot, though.

SamiPerttu / fundsp

Optimize audio node processing for simd avoiding bytemuck #53

tick()-based

process()-based