espressif / esp-dsp

DSP library for ESP-IDF
Apache License 2.0
442 stars 87 forks source link

SOLVED: A 1½cycle/tap fp32 FIR filter of arbitrary length and decimation (DSP-101) #68

Closed f4lc0n-asm closed 11 months ago

f4lc0n-asm commented 1 year ago

Hello,

this one was a tough nut to crack, but ultimately success! :)

Notable differences: • Coefficients and delay line must begin and end at 16-byte boundary. Use attribute((aligned (16))) and SIZE_ALIGN16() macro to achieve this. Both are required because of 128-bit reads from memory. • Coefficients must be reversed unless symmetrical (quite often). This is to achieve maximum speed by keeping the ASM code much simpler. • uint is used instead of int in fir_f32_s struct because these values are always non-negative (ASM routines use unsigned compare). • Since the non-decimating FIR filter doesn't produce any errors, the output is declared void.

I am attaching the sources with tests. Use 7-Zip to extract them from the 7-Zip archive. Feel free to incorporate them into this project. Let me know if you need me to tweak the ASM sources.

Cheers!

f4lc0n

FIR_fp32_1.5c_per_tap_v1.5.zip Improved: Further speed optimizations, which will be largely felt for short FIRs (dozens of taps).

dmitry1945 commented 1 year ago

HI @f4lc0n-asm I will look, and will include your part.

Thank you very much!

f4lc0n-asm commented 1 year ago

Hello Dmitry1945,

just looked at your attempt for 1½cycle/tap fp32 FIR filter: • Reducing the the non-zero offset loops to float quads is a great idea indeed! It considerably speeds up short FIR filters (e.g. for N=11 one gets 7.3cycles/tap vs. 8.3cycles/tap of the old version). Long FIR filter performance stays rougly the same (e.g. for N=2001). • Putting the first and second loops into one offset code segment is not a good idea because the fist loop has offset derived from pos and the second one from N-pos. These two offsets can differ (16 combinations in total) depending one the FIR processed (different Ns) and actual pos. • Increasing the iteration count of the loops by one (aka addi a14, a14, 3 in your code) is an interesting idea. However, after the first loops it is necessary to correct the FIR coeff pointer because it is just increased by an integer multiple of 16 in the first loops. Additionally, there are superfluous 128-bit memory accesses (both coeffs and delay line) even beyond the arrays sized with SIZE_ALIGN4() macro! Not to mention that the superfluous array items must be zeroed for the correct computation of the result! I did try this and achieved similar speed and smaller code size. However, coeff and delay line arrays must be sized with SIZE_ALIGN4()+4 and FIR init function, which clears the delay line prior to its use, must zero the first SIZE_ALIGN4()+3 elements! The same applies to the coeff array where the superfluous elements at the end of the array must also be zeroed.

The devil is in the detail…

Anyway, the more heads, the more headway! :)

Prepared a new faster and smaller v3.3 with dual-use quad float loops: • Coefficients and delay line must begin and end at 16-byte boundary. Use ALIGN16 and SIZE_ALIGN4() macros to achieve this. Both are required because of 128-bit reads from memory. • uint is used instead of int in fir_f32_s struct because these values are always non-negative (ASM routines use unsigned compare). • Since the non-decimating FIR filter doesn't emit any errors, the function output is declared void.

Also prepared a new smaller and slightly faster v4.4 (when compared to previous v4.x): • Coefficients and delay line must begin and end at 16-byte boundary. Use ALIGN16 and SIZE_ALIGN4()+4 macros to achieve this. Both are required because of 128-bit reads from memory and superfluous 128-bit reads from memory. • FIR init function, which clears the delay line prior to its use, must zero the first SIZE_ALIGN4()+3 elements! The same applies to the coeff array where the superfluous elements at the end of the array must also be zeroed. • uint is used instead of int in fir_f32_s struct because these values are always non-negative (ASM routines use unsigned compare). • Since the non-decimating FIR filter doesn't emit any errors, the function output is declared void.

I am attaching the sources with tests. Use 7-Zip to extract them from the 7-Zip archive. Feel free to incorporate them into this project. Let me know if you need me to tweak the ASM sources.

Cheers!

f4lc0n

FIR_fp32_1.5c_per_tap_v3.5.zip Improved: Faster non-decimating FIR execution for short FIRs (dozens of taps) when compared to v3.2. Changed: A0 save/restore - the stack not used anymore.

FIR_fp32_1.5c_per_tap_v4.5.zip Improved: Faster non-decimating FIR execution for short FIRs (dozens of taps) when compared to v4.3 Changed: A0 save/restore - the stack not used anymore.

dmitry1945 commented 12 months ago

Hi @f4lc0n-asm the FIR filters optimyzed for esp32s3 now in the esp-dsp. The implementation is different that you have placed, but the interface and requirements for aligment are the same. I think you will be able to use it in your project without modifications.

Please take a look.

That you, and best regards, Dmitry

f4lc0n-asm commented 12 months ago

Hello Dmitry1945,

just looking at esp-dsp/blob/master/modules/fir/float/dsps_fir_init_f32.c line 44: if (fir->N%4 != 0) - does it mean that FIR length N must be an integer multiple of 4 in your ESP32-S3 FIR solution? I devised solution just for any FIR length, even prime numbers for mathematics afficionados :)

f4lc0n

dmitry1945 commented 12 months ago

Hi @f4lc0n-asm The parameter len, yes, we have a trick. It should be divided by 4, but, you can use any length, just add zero the the end of the impulse response.

Dmitry

f4lc0n-asm commented 12 months ago

Just looking at your FIR ASM sources. Some optimization tips (line numbers are from dsps_fir_f32_aes3.S, but it the same spirit apply to dsps_fird_f32_aes3.S):

Well, 3 solid solutions appeared! The important thing is that people have a choice! As for my choice - I only trust my ASM code so much that I even create custom RTOS in ASM for my applications. So I guess you already know the answer! ;)

Cheers!

f4lc0n