earlephilhower / esp-quick-toolchain

GCC toolchain for esp8266/arduino on MacOS, Linux, ARM64, Raspberry Pi, and Windows

gcc: xtensa: make trying to replace 'l32r' with 'movi' + 'slli' regardless of optimizing for size or not, because 'l32r' is much slower than the latter on ESP8266 #33

Closed jjsuwa-sys3175 closed 2 years ago

jjsuwa-sys3175 commented 2 years ago
** constant loading benchmark test **

** adjacent 3 loading, 100000 times **
MOVI instruction  : 400180 cycles (4.00 cycles/loop)
constant synthesis: 700000 cycles (7.00 cycles/loop)
L32R instruction  : 2000000 cycles (20.00 cycles/loop)

** adjacent 4 loading, 100000 times **
MOVI instruction  : 500179 cycles (5.00 cycles/loop)
constant synthesis: 900181 cycles (9.00 cycles/loop)
L32R instruction  : 2700180 cycles (27.00 cycles/loop)

** adjacent 5 loading, 100000 times **
MOVI instruction  : 600181 cycles (6.00 cycles/loop)
constant synthesis: 1100180 cycles (11.00 cycles/loop)
L32R instruction  : 3300000 cycles (33.00 cycles/loop)

** adjacent 6 loading, 100000 times **
MOVI instruction  : 700000 cycles (7.00 cycles/loop)
constant synthesis: 1300179 cycles (13.00 cycles/loop)
L32R instruction  : 4100180 cycles (41.00 cycles/loop)

(Arduino sketch is here)
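To make the trend in these numbers easier to see, the cost of each additional load can be estimated from the figures above. This is only a sketch: the cycles-per-loop values are copied from the benchmark output, and `marginal_cost` is a hypothetical helper that assumes the fixed loop overhead is the same in every run.

```python
# Cycles per loop iteration, copied from the benchmark output above.
# Keys are the number of adjacent constant loads in the loop body.
results = {
    "MOVI instruction": {3: 4.00, 4: 5.00, 5: 6.00, 6: 7.00},
    "constant synthesis": {3: 7.00, 4: 9.00, 5: 11.00, 6: 13.00},
    "L32R instruction": {3: 20.00, 4: 27.00, 5: 33.00, 6: 41.00},
}


def marginal_cost(per_loop):
    """Average extra cycles per additional load, between the smallest
    and largest measured load counts (assumes constant loop overhead)."""
    counts = sorted(per_loop)
    first, last = counts[0], counts[-1]
    return (per_loop[last] - per_loop[first]) / (last - first)


for name, per_loop in results.items():
    print(f"{name}: {marginal_cost(per_loop):.2f} cycles per additional load")
```

By this estimate, each extra load costs about 1 cycle with MOVI, 2 cycles with constant synthesis, and 7 cycles with L32R.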

it concludes:

on ESP8266.

the refman says this behavior is implementation-specific:

This functionality (IRAM/IROM as data) is provided for initialization and test purposes, for which performance is not critical, so these operations may be significantly slower on some Xtensa implementations.

Xtensa(R) Instruction Set Reference Manual, "4.5.8 General RAM/ROM Option Features"

earlephilhower commented 2 years ago

Can you compare the generated binary sizes, please, for a non-trivial example? Maybe one of the webserver ones?

I'm worried it may grow somewhat by replacing a single instruction and constant (which might be shared now, saving more space) with multiple instructions.
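The size concern can be put into rough numbers. The sketch below assumes the standard Xtensa encodings (24-bit L32R, MOVI and SLLI, 16-bit MOVI.N, 32-bit literal) and ignores alignment padding; the helper names are hypothetical, and whether literals are actually shared depends on the compiler and linker.

```python
L32R_BYTES = 3     # 24-bit instruction
LITERAL_BYTES = 4  # 32-bit literal pool entry
MOVI_BYTES = 3     # 24-bit instruction
MOVI_N_BYTES = 2   # 16-bit narrow instruction (Code Density option)
SLLI_BYTES = 3     # 24-bit instruction


def l32r_size(uses, shared=True):
    """Bytes for `uses` loads of one constant via L32R.
    With shared=True a single literal serves every use (an assumption)."""
    literals = 1 if shared else uses
    return uses * L32R_BYTES + literals * LITERAL_BYTES


def synth_size(uses, narrow=True):
    """Bytes for `uses` loads of one constant via MOVI(.N) + SLLI."""
    movi = MOVI_N_BYTES if narrow else MOVI_BYTES
    return uses * (movi + SLLI_BYTES)


for uses in (1, 2, 3, 4):
    print(uses, l32r_size(uses), synth_size(uses), synth_size(uses, narrow=False))
```

Under these assumptions a single use is smaller with synthesis (5 or 6 bytes versus 7), two uses with MOVI.N break even at 10 bytes, and from three uses onward a shared literal is smaller.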

jjsuwa-sys3175 commented 2 years ago

> I'm worried it may grow somewhat by replacing a single instruction and constant (which might be shared now, saving more space) with multiple instructions.

Until now, the replacement happened only when optimizing for size (-Os, the default setting for the Arduino core), on the assumption that the reciprocal throughput of L32R can reach 1 cycle (see https://github.com/earlephilhower/esp-quick-toolchain/pull/20#issuecomment-745783611). On ESP8266, however, that assumption does not hold.

Again, -Os is already specified in platform.txt, so L32R (+ 4-byte literal) was always being replaced with MOVI.n + SLLI unless the option was changed to -O2.
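For readers unfamiliar with the MOVI + SLLI replacement, the following is a minimal sketch of which 32-bit constants it can cover. It is not GCC's actual code: `find_movi_slli` is a hypothetical helper, and it assumes the usual immediate ranges (-2048..2047 for MOVI, -32..95 for MOVI.N) and SLLI shift amounts of 1 to 31. The compiler's real criteria may be stricter.

```python
def to_signed32(value):
    """Interpret the low 32 bits of `value` as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def find_movi_slli(value, narrow=False):
    """Return (imm, shift) such that MOVI(.N) imm followed by SLLI shift
    yields `value`, or None if no such pair exists.
    shift == 0 means a single MOVI(.N) is already enough."""
    lo, hi = (-32, 95) if narrow else (-2048, 2047)
    signed = to_signed32(value)
    if lo <= signed <= hi:
        return (signed, 0)
    for shift in range(1, 32):
        if signed & ((1 << shift) - 1):
            break  # low bits are nonzero, so no larger shift can work either
        imm = signed >> shift  # arithmetic shift keeps the sign
        if lo <= imm <= hi:
            return (imm, shift)
    return None


print(find_movi_slli(0x10000))               # representable as imm << shift
print(find_movi_slli(0x10000, narrow=True))  # same constant, MOVI.N range
print(find_movi_slli(0x12345678))            # would still need L32R
```

Constants with many trailing zero bits and a small remaining value fit this pattern; arbitrary values such as 0x12345678 do not, and would still be loaded from the literal pool.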