Open f4lc0n-asm opened 1 month ago
Ideally, this could be incorporated into Xtensa GCC, with the fast reciprocal-based float division activated by something like `#pragma fast_float_div`.
To test `DIV(a,b)`, I replaced `q=(b1-b2)/(a1-a2);` with `q=DIV(b1-b2,a1-a2);` and `q=b1+(d-a1)*((d-a2)*((b2-b3)/(a2-a3)-q)/(a3-a1)+q);` with `q=b1+(d-a1)*((d-a2)*DIV(DIV(b2-b3,a2-a3)-q,a3-a1)+q);`. The resulting assembly was MUCH more optimized: NO automatic vars were placed on the stack, and the function's execution time dropped from 397 to 277 cycles, i.e. 1.43× faster! See `float plinc(float d)` in plinc.c in https://github.com/espressif/esp-dsp/files/14334535/dcomp_v1.2.zip at https://github.com/espressif/esp-dsp/issues/76. I also measured just the 2 expressions: 202 vs. 80 cycles and 144 vs. 37 instructions, i.e. 2½× faster and 3.9× smaller! Quite a surprise! :) If you know the Xtensa GCC folks, please let them know! Thank you!
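For anyone who wants to try the same substitution on a desktop first, here is a minimal sketch of the two variants side by side. The function names, the argument list, and `recip_placeholder()` are hypothetical; the placeholder simply calls the ordinary division so that the file builds anywhere, and on an ESP32 it would presumably be swapped for the reciprocal asm routine. The shape of `DIV` (multiply by the reciprocal) is an assumption as well. The two expressions are the Newton divided-difference form of a quadratic through three points.

```c
#include <assert.h>

/* Hypothetical stand-in so the sketch builds on any host; on ESP32 this
 * would be replaced by the reciprocal asm routine. */
static inline float recip_placeholder(float x) { return 1.0f / x; }

/* Assumed shape of the macro: multiply by the reciprocal instead of dividing. */
#define DIV(a, b) ((a) * recip_placeholder(b))

/* Quadratic interpolation through (a1,b1), (a2,b2), (a3,b3), using '/'. */
static float interp_plain(float d, float a1, float b1,
                          float a2, float b2, float a3, float b3)
{
    assert(a1 != a2 && a2 != a3 && a3 != a1); /* abscissae must be distinct */
    float q = (b1 - b2) / (a1 - a2);
    q = b1 + (d - a1) * ((d - a2) * ((b2 - b3) / (a2 - a3) - q) / (a3 - a1) + q);
    return q;
}

/* The same expressions with every '/' rewritten as DIV(). */
static float interp_div(float d, float a1, float b1,
                        float a2, float b2, float a3, float b3)
{
    assert(a1 != a2 && a2 != a3 && a3 != a1); /* abscissae must be distinct */
    float q = DIV(b1 - b2, a1 - a2);
    q = b1 + (d - a1) * ((d - a2) * DIV(DIV(b2 - b3, a2 - a3) - q, a3 - a1) + q);
    return q;
}
```

With the points (0,0), (1,1), (2,4) taken from y = x², both variants evaluate to 2.25 at d = 1.5. The cycle counts quoted above come from the ESP32 build, not from this host-side sketch.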
ESP32(-S3) fp32 division is notoriously slow. It can be made several times faster by using a reciprocal asm sequence, which is accurate to 1 ULP, precise enough for most cases. The ESP32(-S3) ABI specifies passing both a function's input args and its output value in general-purpose regs (A2-A15), even for floats, but for inline assembly in C that may not be the case: I tested various scenarios, and both input and output are passed in fp32 regs (F0-F15) where possible, which surely speeds things up :) This code was inspired by https://blog.llandsmeer.com/tech/2021/04/08/esp32-s2-fpu.html, which I significantly enhanced:
- … `wfr`/`rfr`)
- … `static` keyword for `recipsf2()` (is visible outside its source file)

Here it is with Public Domain License:
Cheers,
f4lc0n
Fixed: Added `&` for the `temp` var so that it is mapped to a unique fp32 reg (in some `recipsf2()` usage cases it wasn't).

Changed: Removed `volatile` after `asm`.

Changed: The 1st `maddn.s` to `madd.s` so that it corresponds to the canonical reciprocal sequence in the Xtensa ISA Summary on p. 113.
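For readers without the hardware at hand, the arithmetic behind a reciprocal sequence of this kind can be modeled in portable C. This is only a sketch under stated assumptions, not the routine from this post: the function names are hypothetical, the seed is faked by truncating an exact reciprocal to about 8 mantissa bits (standing in for the hardware seed instruction), and the refinement uses separate multiplies and adds, so its last-bit rounding may differ from the real instructions. It shows why two Newton-Raphson steps are enough: each step roughly squares the relative error, so about 2⁻⁸ becomes about 2⁻¹⁶ and then about 2⁻³², which is below fp32 precision.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: models the hardware seed by truncating the exact reciprocal
 * to its top 8 mantissa bits; real hardware produces its seed differently. */
static float recip_seed_model(float b)
{
    float r = 1.0f / b;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF8000u;          /* keep sign, exponent, top 8 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Two Newton-Raphson refinement steps: x <- x + x * (1 - b * x). */
static float recip_model(float b)
{
    assert(b != 0.0f);            /* the model does not handle division by zero */
    float x = recip_seed_model(b);
    for (int i = 0; i < 2; ++i) {
        float e = 1.0f - b * x;   /* residual of the current estimate */
        x = x + x * e;            /* correct the estimate by the residual */
    }
    return x;
}

/* Division expressed as multiplication by the refined reciprocal. */
static float div_model(float a, float b)
{
    return a * recip_model(b);
}
```

For example, `div_model(8.0f, 2.0f)` gives exactly 4.0f, and `div_model(1.0f, 3.0f)` lands within rounding distance of `1.0f / 3.0f`. The model ignores zero, infinities, NaNs, and denormals, which a production routine would need to consider.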