SOLVED: An ESP32(-S3) fast fp32 division using a reciprocal inline assembly with arg/result passsed in fp32 regs for even more speed! (AUD-5741)

f4lc0n-asm commented 1 month ago

ESP32(-S3) fp32 division is notoriously slow. It can be made faster several times by using a reciprocal asm sequence, which is accurate to 1 ULP - precise enough for most cases. ESP32(-S3) ABI specifies passing both func's input args and output value in general-purpose regs (A2-A15) - even for floats, but for inline assembly in C that may not be the case - tested various scenarios and both input and output are passed in fp32 regs (F0-F15) where possible, which surely speeds things up :) This code was inspired by https://blog.llandsmeer.com/tech/2021/04/08/esp32-s2-fpu.html, which I significantly enhanced:

2 asm instructions less (no wfr/rfr)
no fixed fp32 regs (gcc can freely choose which ones - allows more optimizations)
no static keyword for recipsf2() (is visible outside its source file)

Here it is with Public Domain License:

__attribute__((always_inline)) inline
float recipsf2(float input) {
    float result, temp;
    asm(
        "recip0.s %0, %2\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "madd.s %0, %0, %1\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "maddn.s %0, %0, %1\n"
        :"=&f"(result),"=&f"(temp):"f"(input)
    );
    return result;
}

#define DIV(a, b) (a)*recipsf2(b)

Cheers,

f4lc0n

Fixed: Added & for the temp var so that it is mapped to a unique fp32 reg (in some recipsf2() usage cases it wasn't). Changed: Removed volatile after asm. Changed: The 1st maddn.s to madd.s so that it corresponds to the canonical reciprocal sequence in Xtensa ISA Summary on p. 113.

f4lc0n-asm commented 1 month ago

Ideally, this could be incorporated into Xtensa GCC. Activating the fast float division using a reciprocal with something like #pragma fast_float_div.

f4lc0n-asm commented 1 month ago

To test DIV(a,b), I replaced q=(b1-b2)/(a1-a2); with q=DIV(b1-b2,a1-a2); and q=b1+(d-a1)*((d-a2)*((b2-b3)/(a2-a3)-q)/(a3-a1)+q); with q=b1+(d-a1)*((d-a2)*DIV(DIV(b2-b3,a2-a3)-q,a3-a1)+q); - the resulting assembly was MUCH more optimized, NO automatic vars on the stack used and func execution cycles dropped from 397 to 277, i.e. 1.43× faster! See float plinc(float d) in plinc.c in https://github.com/espressif/esp-dsp/files/14334535/dcomp_v1.2.zip at https://github.com/espressif/esp-dsp/issues/76. Also measured just the 2 expressions: 202 vs. 80 cycles and 144 vs. 37 instructions - i.e. 2½× faster and 3.9× smaller! Quite a surprise! :) If you know the Xtensa GCC folks, please, let them know! Thank you!

espressif / esp-adf

SOLVED: An ESP32(-S3) fast fp32 division using a reciprocal inline assembly with arg/result passsed in fp32 regs for even more speed! (AUD-5741) #1284