Open f4lc0n-asm opened 1 month ago
Ideally, this could be incorporated into Xtensa GCC, with the fast float division via a reciprocal activated by something like `#pragma fast_float_div`.

To test `DIV(a,b)`, I replaced `q=(b1-b2)/(a1-a2);` with `q=DIV(b1-b2,a1-a2);` and `q=b1+(d-a1)*((d-a2)*((b2-b3)/(a2-a3)-q)/(a3-a1)+q);` with `q=b1+(d-a1)*((d-a2)*DIV(DIV(b2-b3,a2-a3)-q,a3-a1)+q);`. The resulting assembly was MUCH more optimized: NO automatic vars were placed on the stack, and the function's execution time dropped from 397 to 277 cycles, i.e. 1.43× faster! See `float plinc(float d)` in `plinc.c` in https://github.com/espressif/esp-dsp/files/14334535/dcomp_v1.2.zip at https://github.com/espressif/esp-dsp/issues/76.

I also measured just the 2 expressions: 202 vs. 80 cycles and 144 vs. 37 instructions, i.e. 2½× faster and 3.9× smaller! Quite a surprise! :) If you know the Xtensa GCC folks, please let them know! Thank you!
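For readers who want to try the `DIV(a,b)` substitution on a host machine first, here is a minimal portable C sketch of the same idea: divide by multiplying with a refined reciprocal. It is not the Xtensa inline-assembly code from this issue. The names `recip_seed_sketch()`, `recip_sketch()` and `DIV_SKETCH()` are hypothetical, the bit-trick seed is a software stand-in for a hardware reciprocal estimate, and the refinement uses plain multiplies and adds rather than fused multiply-add instructions. The sketch does not claim the 1 ULP accuracy of the asm sequence and ignores zero, infinity, NaN and denormal inputs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware reciprocal seed: a bit-level first
   guess at 1/b, within roughly 5% for normal, nonzero b (assumption:
   IEEE-754 binary32 floats). */
static inline float recip_seed_sketch(float b)
{
    uint32_t bits;
    memcpy(&bits, &b, sizeof bits);
    bits = 0x7EF311C7u - bits;      /* unsigned wraparound preserves the sign of b */
    float r;
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Hypothetical portable reciprocal: seed plus three Newton-Raphson steps. */
static inline float recip_sketch(float b)
{
    float r = recip_seed_sketch(b);
    for (int i = 0; i < 3; ++i) {
        float e = 1.0f - b * r;     /* residual of the current estimate */
        r = r + r * e;              /* relative error roughly squares each step */
    }
    return r;
}

/* Plays the role of DIV(a,b) in the test above: one multiply by a reciprocal. */
#define DIV_SKETCH(a, b) ((a) * recip_sketch(b))
```

With this in place, an expression such as `q=(b1-b2)/(a1-a2);` becomes `q=DIV_SKETCH(b1-b2,a1-a2);`. Any speed-up on real hardware depends on the target's own reciprocal and multiply-add instructions, so cycle counts from this sketch say nothing about the ESP32(-S3) figures above.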
Is your feature request related to a problem?
ESP32(-S3) fp32 division is notoriously slow. It can be made several times faster by using a reciprocal asm sequence, which is accurate to 1 ULP - precise enough for most cases. The ESP32(-S3) ABI specifies passing both a function's input args and its return value in general-purpose regs (A2-A15), even for floats, but for inline assembly in C that may not be the case: in the various scenarios I tested, both input and output were passed in fp32 regs (F0-F15) where possible, which surely speeds things up :) This code was inspired by https://blog.llandsmeer.com/tech/2021/04/08/esp32-s2-fpu.html, which I significantly enhanced:
- … (`wfr`/`rfr`)
- … `static` keyword for `recipsf2()` (is visible outside its source file)

Here it is with Public Domain License:
Cheers,
f4lc0n
Fixed: Added `&` for the `temp` var so that it is mapped to a unique fp32 reg (in some `recipsf2()` usage cases it wasn't).
Changed: Removed `volatile` after `asm`.
Changed: The 1st `maddn.s` to `madd.s` so that it corresponds to the canonical reciprocal sequence in Xtensa ISA Summary on p. 113.
Describe the solution you'd like.
This solution can be added to your "math" section.
Describe alternatives you've considered.
Moving to ESP32-P4, which has just a 3-cycle `fdiv.s` instruction, is not always possible…
Additional context.
No response