libairspy is slow when receiving float32 iq on Apple M1

dernasherbrezon commented 1 year ago

Hi,

I'm trying to add Airspy support to my project r2cloud. During performance testing I found that the libairspy code (consumer_threadproc) is very slow: it takes ~70% of the processing time, while the remaining ~30% is spent in the frequency xlating filter (https://github.com/dernasherbrezon/sdr-server/blob/airspy/src/xlating.c#L34).

My guess is that the DC spike removal and some of the filtering were not auto-vectorised on M1. So this issue is more of a question:

What is the purpose of the additional filtering applied after receiving the signal (translate_fs_4)?

bvernoux commented 1 year ago

This issue could be related to PR https://github.com/airspy/airspyone_host/pull/89, which has been on hold for a few years.

Note: on my side I cannot help with anything specific to Apple M1 or other Apple CPUs, as I do not own any Apple products and do not plan to own or use any.

dernasherbrezon commented 1 year ago

Hi,

I can tune performance myself by taking AIRSPY_SAMPLE_RAW and going from there. That's not an issue. I was more wondering about the translate_fs_4 design. What is the purpose of this filter?

bvernoux commented 1 year ago

For more details about translate_fs_4, see the document http://jmfriedt.free.fr/tutorial_jmfriedt_glmf_eng.pdf, which explains it in detail.

touil commented 1 year ago

Try converting 20MSPS real to 10MSPS IQ using conventional methods, then compare. This will give some insight into which part of the code must be optimized for your CPU. The current implementation reduces the problem to passing a 10MSPS real stream through 23 taps. If you can do better, let us know!
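For readers unfamiliar with the fs/4 trick mentioned here, the following is a minimal, generic sketch and not the libairspy code. It assumes a real input stream and shifts its spectrum by a quarter of the sample rate, which needs no multiplications because the mixing sequence is just 1, -j, -1, j. The `cf32` type and the `shift_fs4` name are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical complex float type, used only in this sketch. */
typedef struct {
    float re;
    float im;
} cf32;

/*
 * Shift a real signal down by fs/4: multiply sample i by exp(-j*pi*i/2),
 * which cycles through 1, -j, -1, +j. Only sign flips and moves are needed.
 */
static void shift_fs4(const float *in, cf32 *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float x = in[i];
        switch (i & 3) {
        case 0:  /* times  1 */
            out[i].re = x;    out[i].im = 0.0f;
            break;
        case 1:  /* times -j */
            out[i].re = 0.0f; out[i].im = -x;
            break;
        case 2:  /* times -1 */
            out[i].re = -x;   out[i].im = 0.0f;
            break;
        default: /* times +j */
            out[i].re = 0.0f; out[i].im = x;
            break;
        }
    }
}
```

Every other I sample and every other Q sample is zero after this step, which is why a following low-pass filter and decimation by 2 can be arranged to do far less work than a general complex mixer plus filter would.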

touil commented 1 year ago

By "better" I mean algorithmically. The same algorithm can be hand-tuned for any target CPU and will probably beat the compiler.

dernasherbrezon commented 1 year ago

No, I don't think "conventional methods" will be faster. From what I can see, the code is already highly optimised, both algorithmically and for the CPU. I guess I have several options:

1. Build my own version of libairspy and ensure "-O3" and "-mfpu=neon-xxx" are used for Raspberry Pi. Most likely the default Debian builds don't have them. Hand-written assembly gives 8% at most according to https://github.com/airspy/airspyone_host/pull/89, which is within the margin of error for me.
2. Skip the "Remove DC" logic. I can see it takes ~30% of the time. Do you know what the impact will be on the Hilbert transform (band-pass filter) + frequency xlating by f/4? I assume DC will be shifted to the very edge of the spectrum and aliased to both ends of it. My app natively does frequency xlating and low-pass filtering, so I could treat 5-10% of the bandwidth as reserved (unusable) and save some processing time.
3. Add support for a 2976000 sample rate to airspy_firmware so it can be integer-divided down to 48k without a fractional resampler. I haven't looked closely at whether such a rate can be achieved on the stick itself.
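The arithmetic behind the last option can be checked with a small sketch. The helper name `integer_decimation` is hypothetical, and the 3000000 rate used for comparison is an assumption chosen only to show a ratio that is not an integer.

```c
#include <stdint.h>

/*
 * Return the integer decimation factor from in_rate to out_rate,
 * or 0 if the ratio is not an integer (a fractional resampler
 * would be needed in that case).
 */
static uint32_t integer_decimation(uint32_t in_rate, uint32_t out_rate)
{
    if (out_rate == 0 || in_rate % out_rate != 0) {
        return 0;
    }
    return in_rate / out_rate;
}
```

With these definitions, `integer_decimation(2976000, 48000)` gives 62, while `integer_decimation(3000000, 48000)` gives 0 because 3000000 / 48000 = 62.5.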

