Details of solution:
A hardware loop of just 3 float instructions, loading 2 floats from memory and performing
a multiply-add on them, is highly suboptimal because madd.s's input operands
depend directly on the memory loads (at least 1 other instruction is needed between the loads
and madd.s). The solution is to load another pair of values and perform 2 madd.s's
into different result registers, avoiding the mutual dependency, and only add the accumulators together at the end.
The drawback: the FIR length can no longer be arbitrary, only an integer multiple of the number of
madd.s's used, i.e. 2, but that's acceptable. The decimation factor must also be an integer multiple
of the number of madd.s's used, i.e. 2. Processing takes just 2 cycles per pair in the 2-madd.s version
vs. 4 cycles per pair in the 1-madd.s version → 2× faster :) For even faster execution per pair, the
128-bit read instructions can be used. Processing then takes just 1½ cycles per pair → 4 ÷ 1½ = 2⅔× faster :]
If a solution that allows arbitrary decimation is found, I will post it here.
    const.s f4, 0                  // zero out accumulators
    const.s f5, 0
    loopnez a13, loop1             // executes in just 2 cycles per coeff-data pair
    ee.ldf.64.xp f2, f0, a9, a10   // preload 2 coeffs
    ee.ldf.64.xp f3, f1, a11, a12  // preload 2 data samples from delay line
    madd.s f4, f2, f1              // accumulation of the 1st coeff-data pair
    madd.s f5, f0, f3              // accumulation of the 2nd coeff-data pair
loop1:
    loopnez a14, loop2             // executes in just 2 cycles per coeff-data pair
    ee.ldf.64.xp f2, f0, a9, a10   // preload 2 coeffs
    ee.ldf.64.xp f3, f1, a11, a12  // preload 2 data samples from delay line
    madd.s f4, f2, f1              // accumulation of the 1st coeff-data pair
    madd.s f5, f0, f3              // accumulation of the 2nd coeff-data pair
loop2:
    // some 3 useful instructions here to fill in the 3-cycle output latency of madd.s f5 …
    add.s f4, f4, f5               // add the 2 accumulators together
========================================================================================================
    const.s f8, 0                          // zero out accumulators
    const.s f9, 0
    const.s f10, 0
    const.s f11, 0
    loopnez a13, loop1                     // executes in just 1½ cycles per coeff-data pair
    ee.ldf.128.xp f6, f4, f2, f0, a9, a10  // preload 4 coeffs
    ee.ldf.128.xp f7, f5, f3, f1, a11, a12 // preload 4 data samples from delay line
    madd.s f8, f6, f1                      // accumulation of the 1st coeff-data pair
    madd.s f9, f4, f3                      // accumulation of the 2nd coeff-data pair
    madd.s f10, f2, f5                     // accumulation of the 3rd coeff-data pair
    madd.s f11, f0, f7                     // accumulation of the 4th coeff-data pair
loop1:
    loopnez a14, loop2                     // executes in just 1½ cycles per coeff-data pair
    ee.ldf.128.xp f6, f4, f2, f0, a9, a10  // preload 4 coeffs
    ee.ldf.128.xp f7, f5, f3, f1, a11, a12 // preload 4 data samples from delay line
    madd.s f8, f6, f1                      // accumulation of the 1st coeff-data pair
    madd.s f9, f4, f3                      // accumulation of the 2nd coeff-data pair
    madd.s f10, f2, f5                     // accumulation of the 3rd coeff-data pair
    madd.s f11, f0, f7                     // accumulation of the 4th coeff-data pair
loop2:
    // some 3 useful instructions here to fill in the 3-cycle output latency of madd.s f11 …
    add.s f8, f8, f9                       // add the 4 accumulators together
    add.s f9, f10, f11
    add.s f8, f8, f9
A public domain suggestion for notable speed improvement of floating point FIR calculation on ESP32-S3
Original algorithm: https://github.com/espressif/esp-dsp/blob/master/modules/fir/float/dsps_fir_f32_ae32.S
The new official & detailed (702 pages) Xtensa® Instruction Set Architecture (ISA) Summary: https://www.cadence.com/content/dam/cadence-www/global/en_US/documents/tools/ip/tensilica-ip/isa-summary.pdf