Optimize float32 IQ code path using ARM NEON

sergeyvfx commented 1 year ago

The availability of NEON instruction is performed implicitly (from the building process point of view) based on __ARM_NEON preprocessor value. This should not be adding restrictions on the portability of the final library since the optimizer itself might be using NEON in the same configuration.

Benchmarking was done on Seeed reTerminal hardware which is based on Raspberry Pi CM 4. Both GCC and Clang toolchain was tested using the -O3 -march=armv8-a+crc -mtune=cortex-a72 flags.

The time of signal processing in the consumer thread prior to the callback invocation was measured.

           Base       NEON     Speedup
GCC-10     2.6488     2.5597   4%
Clang-13   2.9867     2.7528   8%

The speedup is not liner from the register width due to both GCC and Clang performing auto-vectorization when O3 optimization level is used.

The time measurement is not included into this patch.

Further speed improvement is possible to cover other sample types, but those can happen as a followup development. It should also be possible to close the gap between GCC and Clang, but this is also not related to this patch.

touil commented 1 year ago

I think we first need a robust test fixture that compares the candidate implementations with the "canonical" implementation. This could take form of a raw dump file from the hardware that passes through the both DSP chains, then we compare the resulting IQ data. Note that the compiler may rearrange the instructions and result in slightly different values for the samples. This is still perfectly fine as it only the bits way below the noise floor of the desired signal. This means the test must be smart enough to validate the time domain and spectral contents instead of doing a plain binary comparison. EDIT: The compiler options like -ffast-math and its equivalents can get you in the same situation where the binary comparison doesn't pass while the data is still valid.

sergeyvfx commented 1 year ago

That is a good point about an automated test of some sort. Maybe even some benchmarking fixture :)

Indeed need to be careful with the exact comparison. I think something like comparison of the actual IQ processor output with the ground-truth with some floating point epsilon will suffice.

Did you have some plans or ideas how to tackle the testing? As in, maybe there is already a testing framework you'd like to be used? Or maybe there is some work going on for getting the regression/unit tests integrated? Otherwise I can give it a go and try to get it going without pulling any external dependencies in.

dernasherbrezon commented 1 year ago

I spent considerable amount of time trying to optimize this algorithm and got mixed results.

process_fir_taps - is not used by default. Default filter is 47 taps which is processed by fir_interleaved_24
fir_interleaved_24 - is rather optimised because it takes into consideration that filter is symmetrical and can save ~12 multiplications. I tried manually unrolling it into 4 items at a time (https://github.com/dernasherbrezon/arm-tests/blob/main/dot_prod_12.c#L55) and it gave me ~23% improvement on Raspberry pi with "-O3 -mfloat-abi=hard -march=armv8-a -mfpu=neon -ftree-vectorize -ffast-math -funroll-loops" flags.
As I mentioned previously DC removal algorithm has a data dependency issue so cannot be optimized further for SIMD.
I tried combining all algorithms into single function (https://github.com/dernasherbrezon/airspyone_host/blob/neon_tuning/libairspy/src/iqconverter_float_tuned.c#L118). The idea was to eliminate memory copy here https://github.com/airspy/airspyone_host/blob/master/libairspy/src/iqconverter_float.c#L364 which happens every 47 * 32 samples. And also remove branching over here: https://github.com/airspy/airspyone_host/blob/master/libairspy/src/iqconverter_float.c#L436
I wrote a performance test which can compare old implementation with the new one: https://github.com/dernasherbrezon/airspyone_host/blob/neon_tuning/libairspy/test/test_iqconverter_float.c

On MacBook Air M1: 0.027605 (old) vs 0.016459 (tuned) improvement. ~40%. On Raspberry PI 3b+: 0.272147 (old) vs 0.242934 (tuned) improvement. ~11%

So even if the performance is slightly better I don't think it worth the hassle. There might be other areas of improvement:

Remove DC-removal algorithm which can be suitable for certain applications.
Use INT16_REAL types that should be much faster but suspect to quantisation noise.
Use lower sampling rates at the firmware. Which in turn slightly defeat the purpose of using airspy with oversampled signal.
Decimate as soon as data reached application

I will also check libusb integration. Currently iqconverter_float_process takes 200ms for RaspberryPI to process 6msps rate which should be 20% CPU. On practice I see airspy_rx into /dev/null takes 80% CPU ( airspy_rx -f 472.2 -a 6000000 -r /dev/null -t 0)

touil commented 1 year ago

I don't think it worth the hassle.

That's what she said. The code is already hard to read. By specializing it for every hardware you lose the readability completely.

Remove DC-removal algorithm which can be suitable for certain applications.

Could be fine for most apps. A spike will show up in the upper edge of the spectrum, but that's perfectly fine. Mind you, the DC removal can be a couple of adds, a sub and a delay in INT16. Still recursive tho.

Use INT16_REAL types that should be much faster but suspect to quantisation noise.

No problem with the quantization of 12bit data in a 16bit container. The RF noise overrides the quantization noise in a properly working system.

Use lower sampling rates at the firmware. Which in turn slightly defeat the purpose of using airspy with oversampled signal.

Decimate as soon as data reached application

Inject your own IQ conversion half-band filter that only preserves a small part of the spectrum then decimate. Much, much cheaper.

dernasherbrezon commented 1 year ago

No problem with the quantization of 12bit data in a 16bit container. The RF noise overrides the quantization noise in a properly working system.

I meant something different though:

12bit -> 16bit -> float -> (filtering) -> float
12bit -> 16bit -> (filtering) -> 16bit

The second one will loose some precision definitely. But will it be noticeable? Have a practical impact? Unlikely?

Inject your own IQ conversion half-band filter that only preserves a small part of the spectrum then decimate. Much, much cheaper.

Do you have some in mind? The code has fir_interleaved_4, fir_interleaved_8, fir_interleaved_12 that might suggest you tested some.

touil commented 1 year ago

12bit -> 16bit -> (filtering) -> 16bit

That's exactly what I commented.

airspy / airspyone_host

Optimize float32 IQ code path using ARM NEON #89