arduino / ArduinoCore-mbed


Portenta's M7 performance hobbled by compiler flag #111

Open PaulStoffregen opened 3 years ago

PaulStoffregen commented 3 years ago

I ran the CoreMark benchmark on Portenta. It's running significantly slower than an M7 at 480 MHz should. Most of the poor performance is due to line 14 in variants/PORTENTA_H7_M7/cflags.txt:

-Os

Here are the results from 3 runs on Portenta, 2 of them after editing line 14 in cflags.txt.

Optimization   CoreMark   Code Size
-Os            1127       138640
-O2            1484       139680
-O3            1542       142912
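
To put the table in relative terms, here is a small sketch that computes the speed gain and code-size growth of each level against -Os. The numbers are copied from the table; the helper name `relative_change` is only illustrative.

```python
# CoreMark score and code size per optimization level, copied from the table above.
results = {
    "-Os": (1127, 138640),
    "-O2": (1484, 139680),
    "-O3": (1542, 142912),
}

base_score, base_size = results["-Os"]


def relative_change(flag):
    """Return (speed gain %, size growth %) of `flag` relative to -Os."""
    score, size = results[flag]
    speed_gain = (score / base_score - 1) * 100
    size_growth = (size / base_size - 1) * 100
    return speed_gain, size_growth


for flag in ("-O2", "-O3"):
    speed, size = relative_change(flag)
    print(f"{flag}: {speed:.1f}% faster, {size:.2f}% larger")
```

By this arithmetic, -O2 scores roughly 32% higher for under 1% more code, and -O3 roughly 37% higher for about 3% more code.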

On AVR and SAMD, optimizing for size (-Os) works well. But on M4 & M7 cores, it costs quite a lot of performance. If you want to give Portenta users a substantial speed boost, just edit line 14 to use better compiler optimizations.
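
The one-line edit could also be scripted. This is a minimal sketch, assuming cflags.txt lists one flag per line with the optimization flag on a line of its own, as quoted above. `set_opt_level` is a hypothetical helper, not part of any Arduino tooling, and the sample text is illustrative rather than the real file contents.

```python
import re


def set_opt_level(cflags_text, new_flag="-O2"):
    """Replace a standalone -O<level> line with new_flag.

    Assumes one compiler flag per line (an assumption about cflags.txt).
    Returns a tuple: (patched text, number of lines changed).
    """
    pattern = re.compile(r"^-O(?:[0-3sgz]|fast)?[ \t]*$", re.MULTILINE)
    return pattern.subn(new_flag, cflags_text)


# Illustrative content only; the real cflags.txt holds many more flags.
sample = "-c\n-std=gnu11\n-Os\n-g\n"
patched, changed = set_opt_level(sample, "-O2")
print(patched)
```

To apply it for real, one would read variants/PORTENTA_H7_M7/cflags.txt from the installed core, pass its text through the helper, and write the result back, keeping a backup of the original file.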

hpssjellis commented 2 years ago

This is super interesting.

The -O3 flag is faster, but the sketch uses more program storage. See this data from an edgeimpulse.com machine learning program. Note: the program built with -O3 would not compile with the even core split, but it was 20 ms faster at classifying vision objects: 121 ms down to 101 ms. That is a tremendous speed improvement!


Using the -Os flag with the 1.0 M7 and 1.0 M4 core split:

Sketch uses 776368 bytes (98%) of program storage space. Maximum is 786432 bytes.
Global variables use 89808 bytes (17%) of dynamic memory, leaving 433816 bytes for local variables. Maximum is 523624 bytes.

run_classifier returned: 0
Predictions (DSP: 1 ms., Classification: 121 ms., Anomaly: 0 ms.): 
[0.94531, 0.05078, 0.00391, 0.00000]

Using the -O3 flag with the 1.5 M7 and 0.5 M4 core split:

Sketch uses 806184 bytes (55%) of program storage space. Maximum is 1441792 bytes.
Global variables use 89808 bytes (17%) of dynamic memory, leaving 433816 bytes for local variables. Maximum is 523624 bytes.

run_classifier returned: 0
Predictions (DSP: 1 ms., Classification: 101 ms., Anomaly: 0 ms.): 
[0.99609, 0.00000, 0.00000, 0.00000]
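
A quick check of the two logs above, as a sketch using only the figures printed there: it works out the time saved per classification, how much larger the -O3 sketch is, and how far it overshoots the smaller flash limit reported for the even split.

```python
# Figures copied from the two build logs above.
builds = {
    "-Os": {"sketch_bytes": 776368, "flash_max": 786432, "classify_ms": 121},
    "-O3": {"sketch_bytes": 806184, "flash_max": 1441792, "classify_ms": 101},
}

os_build, o3_build = builds["-Os"], builds["-O3"]

# Time saved per classification, as a percentage of the -Os time.
time_saved_pct = (1 - o3_build["classify_ms"] / os_build["classify_ms"]) * 100

# Extra program storage needed by the -O3 build.
growth_bytes = o3_build["sketch_bytes"] - os_build["sketch_bytes"]

# How far the -O3 build overshoots the even-split flash limit.
overshoot_bytes = o3_build["sketch_bytes"] - os_build["flash_max"]

print(f"classification time reduced by {time_saved_pct:.1f}%")
print(f"-O3 sketch is {growth_bytes} bytes larger")
print(f"-O3 sketch exceeds the even-split limit by {overshoot_bytes} bytes")
```

The -O3 sketch comes out 19752 bytes over the 786432-byte maximum, which is consistent with it not building on the even split while fitting comfortably in the 1.5 / 0.5 split.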