Closed rhizoome closed 2 months ago
I broke something.
It cannot the be too bad though, my examples still produce the correct spectrums.
I think it might be better to optimize this at the leaf-node of the graph, since there we know most of the time how many channels there are. I will try that tomorrow.
The code is correct if you add a call to process_remainder
at the end. I tested it on x86/64 and could not detect any gains, however - did you measure those on ARM?
The code is correct if you add a call to
process_remainder
at the end. I tested it on x86/64 and could not detect any gains, however - did you measure those on ARM?
Yes on arm. I‘ll test my example on x86, to be sure. I was surprised that at_f32() has an overhead, maybe that is an arm thing.
The code is correct if you add a call to
process_remainder
at the end. I tested it on x86/64 and could not detect any gains, however - did you measure those on ARM?
On x86 I have 20%.
This is the code I am running:
fn build_bank_current() -> impl AudioUnit {
(noise()
>> split()
>> (resonator_hz(440.0, 50.0)
| resonator_hz(440.0 * 2.0, 50.0)
| resonator_hz(440.0 * 3.0, 50.0)
| resonator_hz(440.0 * 4.0, 50.0)
| resonator_hz(440.0 * 5.0, 50.0)
| resonator_hz(440.0 * 6.0, 50.0)
| resonator_hz(440.0 * 7.0, 50.0)
| resonator_hz(440.0 * 8.0, 50.0))
>> join())
* 0.1
}
And this is how I am running it:
pub fn dummy(seconds: u32, build_name: &str) {
let sink: [f32; 8] = [0.0; 8];
let sink_ptr = sink.as_ptr() as *mut [f32; 8];
let mut runner = Runner::new(build_name);
let mut backend = runner.backend();
let count: usize = SAMPLE_RATE as usize / MAX_BUFFER_SIZE * 2 * seconds as usize;
for _ in 0..count {
backend.process();
// Avoid optimizations
for wide in 0..BUFFER_LEN {
for channel in 0..1 {
let wide_buf = backend.buffer.at(channel, wide).to_array();
unsafe {
ptr::write_volatile(sink_ptr, wide_buf);
}
}
}
}
}
Maybe I did something stupid. Here is the complete code: https://github.com/rhizoome/fundsp_player
I added this to 0.19.1.
Thanks for trying it out! As promised I looked into optimising at the leaf:
It is even more efficient than the root optimisation, but the code is only nice for the trivial-case of one
INPUT/OUTPUT. While looking around I saw that even the fn process()
in many leaf AudioUnits
uses at_f32()
. So I think this means:
at_f32()
has actually an overhead. Do I just force the CPU to fill the cache correctly? On the other hand, the fact that the optimisation is working even better in the one
INPUT/OUTPUT case at the leaf implicates that at_f32()
really has an overhead. But maybe it is implemented badly in wide
or bytemuck
?
unsafe
in at_f32()
and just force the f32
out of the register-value? (Just to compare performance, not as a solution) <---- [That is the next thing I try]*
(I currently can't) and I would help cleaning it upOr I would just build my system and then optimise the things I use and try to upstream the optimisation on a case-by-case basis
helper
that does conversion from frame
"by" channel
to channel
"by" frame
and back *
The fact that because of f32x8
the memory is not ordered correctly for processing.
optimized » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 2000
3.48 real 3.47 user 0.00 sys
not » /usr/bin/time target/release/fundsp_player bank_current dummy --seconds 2000
5.04 real 4.77 user 0.01 sys
-> 30%
diff --git a/src/biquad.rs b/src/biquad.rs
index 8b0d0b7a..1cc421b2 100644
--- a/src/biquad.rs
+++ b/src/biquad.rs
@@ -7,6 +7,7 @@
use super::audionode::*;
use super::math::*;
+use super::prelude::{BufferMut, BufferRef};
use super::setting::*;
use super::shape::*;
use super::signal::*;
@@ -162,6 +163,19 @@ impl<F: Real> Biquad<F> {
pub fn set_coefs(&mut self, coefs: BiquadCoefs<F>) {
self.coefs = coefs;
}
+
+ #[inline]
+ fn _tick(&mut self, x0: f32) -> f32 {
+ let x0 = convert(x0);
+ let y0 = self.coefs.b0 * x0 + self.coefs.b1 * self.x1 + self.coefs.b2 * self.x2
+ - self.coefs.a1 * self.y1
+ - self.coefs.a2 * self.y2;
+ self.x2 = self.x1;
+ self.x1 = x0;
+ self.y2 = self.y1;
+ self.y1 = y0;
+ F::to_f32(y0)
+ }
}
impl<F: Real> AudioNode for Biquad<F> {
@@ -183,16 +197,25 @@ impl<F: Real> AudioNode for Biquad<F> {
#[inline]
fn tick(&mut self, input: &Frame<f32, Self::Inputs>) -> Frame<f32, Self::Outputs> {
let x0 = convert(input[0]);
- let y0 = self.coefs.b0 * x0 + self.coefs.b1 * self.x1 + self.coefs.b2 * self.x2
- - self.coefs.a1 * self.y1
- - self.coefs.a2 * self.y2;
- self.x2 = self.x1;
- self.x1 = x0;
- self.y2 = self.y1;
- self.y1 = y0;
+ let y0 = self._tick(x0);
[convert(y0)].into()
}
+ fn process(&mut self, size: usize, input: &BufferRef, output: &mut BufferMut) {
+ for i in 0..full_simd_items(size) {
+ let simd_in = input.at(0, i);
+ let mut simd_out = output.at(0, i);
+ let array_in = simd_in.as_array_ref();
+ let array_out = simd_out.as_array_mut();
+ for x in 0..array_in.len() {
+ array_out[x] = self._tick(array_in[x]);
+ }
+ }
+
+ self.process_remainder(size, input, output);
+ }
+
I did the test of bypassing bytemuck there is no performance difference!!
-> So it is solving a cache issue!!
I do not like to solve cache issues in a multi-platform library like that. Because the advantage we have on desktop-computers might very well be a problem on embedded MCUs like cortex without a cache. I can test on cortex-M0, maybe llvm is intelligent enough to remove the copying, if not I am pretty sure the code in this PR is worse.
That is probably the reason why you did not see a difference, you had a more mixed load. I just run that one resonator-bank.
Solving impedance mismatch on the other hand would not harm MCUs. Let me know what you think! Do not feel obligated to do anything, I should build my complete system and then optimise. I was just doing this to get a feel for the performance of the resonator-bank, because that is definitely a core feature. I think I now know the numbers and whether this is optimisation is kept is not very important. I can always patch it on builds for desktop-computers. I am not afraid of patches.
Closing this. I decided for the time being to revert back to the earlier method. It was worth a shot, though.
EDIT: Although I do not like how the code looks, it gives a 10% to 22% performance improvement. Maybe you could give it a thought. Also as far as I can tell, the results are still correct, although the tests do not like it.
I was confused by the difference in execution-profile of
tick()
andprocess()
basedAudioNodes
. So I looked deeper into the profile: The largest identifiable chunk that was spent outsideprocess()
calculating things, was 13% in a crate calledbytemuck
. I found out that this happens, because you need some tricks to access af32
in aavx-
orneon-value
directly. So I thought what happens if we just access the complete data directly in one go? It turns out in that case it is just atransmute()
, which should be a no-op.As usual with optimisation, I am not sure if the ugly code is actually worth it. So please feel free to reject or heavily modify what I did. It is just a proposal.
tick()-based
process()-based
If you ignore the loop-unrolling part of the patch. Using uninitialised temp storage might still be interesting. It is nicer code, but only contributes about 1%. Different to the code for uninitialised
BufferArray
, I wanted theunsafe
to be seen at the call site. Because if you use such a frame as input for calculations, the program will crash. This is true for theBufferArray
case, too, so making it an unsafe function might be a good idea.