arduino / ArduinoCore-mbed

327 stars 191 forks source link

rp2040 core does not use rom floating point functions #614

Open WestfW opened 1 year ago

WestfW commented 1 year ago

The rp2040 includes optimized floating point functions in ROM, using linker magic to replace the standard gcc floating point functions when building applications from the pico SDK or Philhower core.

But the Arduino mbed core for rp2040 does not do the linker magic, causing applications that use floating point to be much slower (and larger) than on other rp2040 platforms.

the linker magic is included here: https://github.com/earlephilhower/arduino-pico/blob/master/lib/platform_wrap.txt

Arduino's (much smaller) link options: https://github.com/arduino/ArduinoCore-mbed/blob/master/variants/RASPBERRY_PI_PICO/ldflags.txt

facchinm commented 1 year ago

Hi @WestfW , at the time we first ported RP2040 on mbed-os I benchmarked the two solutions, and soft fp were not THAT bad (sometimes even faster than the ROM version) while I agree that the flash occupation increases a lot (which was not considered a problem due to the large flash space available). Anyway, we would gladly accept a PR that implements all the wraps (while taking care not to break the other mcus supported by this core) :slightly_smiling_face:

WestfW commented 1 year ago

Here's a quick benchmark of tangent (which was one of the more computationally intensive of the library functions.) Nearly 10x improvement using the ROM functions!

Philhower core, variant rpipico (using ROM float functions) tanf Elapsed time: 983 tan (double) Elapsed time: 1982 tanf Elapsed time: 876 tan (double) Elapsed time: 1962

Arduinocore (using gcc float functions.) tanf Elapsed time: 8023 tan (double) Elapsed time: 12048 tanf Elapsed time: 7360 tan (double) Elapsed time: 11701

void setup() {
  Serial.begin(115200);
  while (!Serial) ;
#ifdef ARDUINO_VARIANT
  Serial.println("\nPhilhower core, variant " ARDUINO_VARIANT);
#else
  Serial.println("\nArduinocore");
#endif
}

uint32_t startmicros, endmicros;
void loop() {
  startmicros = micros();
  for (float f = 0.0; f < 2 * (float)PI; f += PI / (float)60.0) {
    digitalWrite(5, ((int)tanf(f)) & 1);
  }
  endmicros = micros();
  Serial.print("tanf Elapsed time: ");
  Serial.println(endmicros - startmicros);
  delay(5000);
  startmicros = micros();
  for (double f = 0.0; f < 2 * PI; f += PI / 60.0) {
    digitalWrite(5, ((int)tan(f)) & 1);
  }
  endmicros = micros();
  Serial.print("tan (double) Elapsed time: ");
  Serial.println(endmicros - startmicros);
  delay(5000);
}
daveythacher commented 1 year ago

What happens if this is RAM only?

Edit: Does Arduino MBED recommend precomputation as a general practice for RP2040?

WestfW commented 1 year ago

What? RAM-only executables can still call ROM. What does precomputation have to do with anything? The Pico SDK does floats by "wrapping" calls to the standard FP functions in short bits of code that call the ROM functions, but the way they've done it requires a bunch of -Wl,--wrap=__aeabi_fadd options in the link step.

daveythacher commented 1 year ago

What? RAM-only executables can still call ROM.

RAM only is to see if the access time plays a role in the performance or not. RAM and ROM should be about the same.

What does precomputation have to do with anything?

Nothing, it was just a guess for why.

The Pico SDK does floats by "wrapping" calls to the standard FP functions in short bits of code that call the ROM functions, but the way they've done it requires a bunch of -Wl,--wrap=__aeabi_fadd options in the link step.

Can you degrade performance using SDK to recreate the numbers you see with MBED? If so, you can test RAM only with copy to RAM to prove if flash is a problem.

Flash is likely a problem no matter what. Unless there is a defect in the ROM version, I would guess the ROM version will always be better.

Edit: My guess being this is your point?

rivimey commented 7 months ago

Can't comment on the linker opts, but I would remind you that the rom functions are faster partly because they are intentionally less accurate (and in particular the rom code is not IEEE754 conformant). I would therefore argue that giving users a choice between smaller-faster-but-less-accurate and larger-slower-but-more-accurate was useful, rather than forcing one selection on all users.

WestfW commented 7 months ago

Not true, at least in principle...

Qfplib-M0-full is a library of IEEE 754 single- and double-precision floating-point arithmetic routines for microcontrollers based on the ARM Cortex-M0 core (ARMv6-M architecture).

It provides correctly rounded (to nearest, even-on-tie) addition, subtraction, multiplication, division and square root operations, and sine, cosine, tangent, arctangent, logarithm and exponential functions that give a high degree of accuracy.

Limitations and deviations from the IEEE 754 standard On input and output NaNs are converted to infinities and denormals are flushed to zero.