Details of solution:
A hardware loop of just 3 float instructions, loading 2 floats from memory and performing
a multiply-add on them, is highly suboptimal because madd.s's input operands
depend directly on the memory loads (at least 1 other instruction is needed between the loads
and madd.s). The solution is to load another pair of values and perform 2 madd.s's
into different result registers, avoiding the mutual dependency, and only add the accumulators together at the end.
The drawback: the FIR length can no longer be arbitrary, only an integer multiple of the number of
madd.s's used, i.e. 2, but that's acceptable. The decimation factor must also be an integer multiple
of the number of madd.s's used, i.e. 2. Processing takes just 2 cycles per pair in the 2-madd.s version
vs. 4 cycles per pair in the 1-madd.s version → 2× faster :) For even faster execution per pair, the
128-bit read instructions can be used. Processing then takes just 1½ cycles per pair → 4 ÷ 1½ = 2⅔× faster :]
If a solution that allows arbitrary decimation is found, I will post it here.
    const.s f4, 0                  // zero out accumulators
    const.s f5, 0
    loopnez a13, loop1             // executes in just 2 cycles per coeff-data pair
    ee.ldf.64.xp f2, f0, a9, a10   // preload 2 coeffs
    ee.ldf.64.xp f3, f1, a11, a12  // preload 2 data samples from delay line
    madd.s f4, f2, f1              // accumulation of the 1st coeff-data pair
    madd.s f5, f0, f3              // accumulation of the 2nd coeff-data pair
loop1:
    loopnez a14, loop2             // executes in just 2 cycles per coeff-data pair
    ee.ldf.64.xp f2, f0, a9, a10   // preload 2 coeffs
    ee.ldf.64.xp f3, f1, a11, a12  // preload 2 data samples from delay line
    madd.s f4, f2, f1              // accumulation of the 1st coeff-data pair
    madd.s f5, f0, f3              // accumulation of the 2nd coeff-data pair
loop2:
    // some 3 useful instructions here to fill in the 3-cycle output latency of madd.s f5 …
    add.s f4, f4, f5               // add the 2 accumulators together
========================================================================================================
    const.s f8, 0                          // zero out accumulators
    const.s f9, 0
    const.s f10, 0
    const.s f11, 0
    loopnez a13, loop1                     // executes in just 1½ cycles per coeff-data pair
    ee.ldf.128.xp f6, f4, f2, f0, a9, a10  // preload 4 coeffs
    ee.ldf.128.xp f7, f5, f3, f1, a11, a12 // preload 4 data samples from delay line
    madd.s f8, f6, f1                      // accumulation of the 1st coeff-data pair
    madd.s f9, f4, f3                      // accumulation of the 2nd coeff-data pair
    madd.s f10, f2, f5                     // accumulation of the 3rd coeff-data pair
    madd.s f11, f0, f7                     // accumulation of the 4th coeff-data pair
loop1:
    loopnez a14, loop2                     // executes in just 1½ cycles per coeff-data pair
    ee.ldf.128.xp f6, f4, f2, f0, a9, a10  // preload 4 coeffs
    ee.ldf.128.xp f7, f5, f3, f1, a11, a12 // preload 4 data samples from delay line
    madd.s f8, f6, f1                      // accumulation of the 1st coeff-data pair
    madd.s f9, f4, f3                      // accumulation of the 2nd coeff-data pair
    madd.s f10, f2, f5                     // accumulation of the 3rd coeff-data pair
    madd.s f11, f0, f7                     // accumulation of the 4th coeff-data pair
loop2:
    // some 3 useful instructions here to fill in the 3-cycle output latency of madd.s f11 …
    add.s f8, f8, f9                       // add the 4 accumulators together
    add.s f9, f10, f11
    add.s f8, f8, f9
A public domain suggestion for notable speed improvement of floating point FIR calculation on ESP32-S3
Original algorithm: https://github.com/espressif/esp-dsp/blob/master/modules/fir/float/dsps_fir_f32_ae32.S
The new official & detailed (702 pages) Xtensa® Instruction Set Architecture (ISA) Summary: https://www.cadence.com/content/dam/cadence-www/global/en_US/documents/tools/ip/tensilica-ip/isa-summary.pdf