SOLVED: A 1½cycle/tap fp32 FIR filter of arbitrary length and decimation (DSP-101)

f4lc0n-asm commented 1 year ago

Hello,

this one was a tough nut to crack, but ultimately success! :)

Notable differences: • Coefficients and delay line must begin and end at 16-byte boundary. Use attribute((aligned (16))) and SIZE_ALIGN16() macro to achieve this. Both are required because of 128-bit reads from memory. • Coefficients must be reversed unless symmetrical (quite often). This is to achieve maximum speed by keeping the ASM code much simpler. • uint is used instead of int in fir_f32_s struct because these values are always non-negative (ASM routines use unsigned compare). • Since the non-decimating FIR filter doesn't produce any errors, the output is declared void.

I am attaching the sources with tests. Use 7-Zip to extract them from the 7-Zip archive. Feel free to incorporate them into this project. Let me know if you need me to tweak the ASM sources.

Cheers!

f4lc0n

FIR_fp32_1.5c_per_tap_v1.5.zip Improved: Further speed optimizations, which will be largely felt for short FIRs (dozens of taps).

dmitry1945 commented 1 year ago

HI @f4lc0n-asm I will look, and will include your part.

Thank you very much!

f4lc0n-asm commented 1 year ago

Hello Dmitry1945,

just looked at your attempt for 1½cycle/tap fp32 FIR filter: • Reducing the the non-zero offset loops to float quads is a great idea indeed! It considerably speeds up short FIR filters (e.g. for N=11 one gets 7.3cycles/tap vs. 8.3cycles/tap of the old version). Long FIR filter performance stays rougly the same (e.g. for N=2001). • Putting the first and second loops into one offset code segment is not a good idea because the fist loop has offset derived from pos and the second one from N-pos. These two offsets can differ (16 combinations in total) depending one the FIR processed (different Ns) and actual pos. • Increasing the iteration count of the loops by one (aka addi a14, a14, 3 in your code) is an interesting idea. However, after the first loops it is necessary to correct the FIR coeff pointer because it is just increased by an integer multiple of 16 in the first loops. Additionally, there are superfluous 128-bit memory accesses (both coeffs and delay line) even beyond the arrays sized with SIZE_ALIGN4() macro! Not to mention that the superfluous array items must be zeroed for the correct computation of the result! I did try this and achieved similar speed and smaller code size. However, coeff and delay line arrays must be sized with SIZE_ALIGN4()+4 and FIR init function, which clears the delay line prior to its use, must zero the first SIZE_ALIGN4()+3 elements! The same applies to the coeff array where the superfluous elements at the end of the array must also be zeroed.

The devil is in the detail…

Anyway, the more heads, the more headway! :)

Prepared a new faster and smaller v3.3 with dual-use quad float loops: • Coefficients and delay line must begin and end at 16-byte boundary. Use ALIGN16 and SIZE_ALIGN4() macros to achieve this. Both are required because of 128-bit reads from memory. • uint is used instead of int in fir_f32_s struct because these values are always non-negative (ASM routines use unsigned compare). • Since the non-decimating FIR filter doesn't emit any errors, the function output is declared void.

Also prepared a new smaller and slightly faster v4.4 (when compared to previous v4.x): • Coefficients and delay line must begin and end at 16-byte boundary. Use ALIGN16 and SIZE_ALIGN4()+4 macros to achieve this. Both are required because of 128-bit reads from memory and superfluous 128-bit reads from memory. • FIR init function, which clears the delay line prior to its use, must zero the first SIZE_ALIGN4()+3 elements! The same applies to the coeff array where the superfluous elements at the end of the array must also be zeroed. • uint is used instead of int in fir_f32_s struct because these values are always non-negative (ASM routines use unsigned compare). • Since the non-decimating FIR filter doesn't emit any errors, the function output is declared void.

I am attaching the sources with tests. Use 7-Zip to extract them from the 7-Zip archive. Feel free to incorporate them into this project. Let me know if you need me to tweak the ASM sources.

Cheers!

f4lc0n

FIR_fp32_1.5c_per_tap_v3.5.zip Improved: Faster non-decimating FIR execution for short FIRs (dozens of taps) when compared to v3.2. Changed: A0 save/restore - the stack not used anymore.

FIR_fp32_1.5c_per_tap_v4.5.zip Improved: Faster non-decimating FIR execution for short FIRs (dozens of taps) when compared to v4.3 Changed: A0 save/restore - the stack not used anymore.

dmitry1945 commented 12 months ago

Hi @f4lc0n-asm the FIR filters optimyzed for esp32s3 now in the esp-dsp. The implementation is different that you have placed, but the interface and requirements for aligment are the same. I think you will be able to use it in your project without modifications.

Please take a look.

That you, and best regards, Dmitry

f4lc0n-asm commented 12 months ago

Hello Dmitry1945,

just looking at esp-dsp/blob/master/modules/fir/float/dsps_fir_init_f32.c line 44: if (fir->N%4 != 0) - does it mean that FIR length N must be an integer multiple of 4 in your ESP32-S3 FIR solution? I devised solution just for any FIR length, even prime numbers for mathematics afficionados :)

f4lc0n

dmitry1945 commented 12 months ago

Hi @f4lc0n-asm The parameter len, yes, we have a trick. It should be divided by 4, but, you can use any length, just add zero the the end of the impulse response.

Dmitry

f4lc0n-asm commented 12 months ago

Just looking at your FIR ASM sources. Some optimization tips (line numbers are from dsps_fir_f32_aes3.S, but it the same spirit apply to dsps_fird_f32_aes3.S):

During the FIR computation, the delay line is read from the oldest sample to the newest - the coefficients must reflect that, i.e. first padding with float zeros, if any, then the FIR coefficients in reverse order - you can easily avoid that by storing the samples into the delay line with decrementing pos like I do in my FIR routines.
AND instruction at line 76 is superfluous because EE.LDF.128 instruction automatically and internally zeroes out 4 address LSBs before reading the 4 floats - A12 and A15 are no longer needed.
Second loops for offsets 1, 2, and 3 can be optimized. E.g. for offset 1 delete the instruction at line 138 and move the instruction from line 146 inbtwn the lines 140 and 141 (it becomes the 1st instruction in the loop instead of the last one).
AND instruction at line 86 can be replaced with EXTUI a8, a11, 2, 2 - A13 is no longer needed. You will have bits 3 and 2 shifted into bits 1 and 0. Test them for zero after decrementing with ADDI.N a8, a8, -1.

Well, 3 solid solutions appeared! The important thing is that people have a choice! As for my choice - I only trust my ASM code so much that I even create custom RTOS in ASM for my applications. So I guess you already know the answer! ;)

Cheers!

f4lc0n

espressif / esp-dsp

SOLVED: A 1½cycle/tap fp32 FIR filter of arbitrary length and decimation (DSP-101) #68