Compiler is not emitting vsqrt.f32 on ARM Cortex M4F targets

pavel-kirienko commented 7 years ago

See this for background: https://github.com/PX4/Firmware/pull/6829#issuecomment-287359454

LorenzMeier commented 7 years ago

@davids5 Can you have a look?

davids5 commented 7 years ago

@LorenzMeier yes on it now

pavel-kirienko commented 7 years ago

@davids5 my build log: build_log.tar.gz

davids5 commented 7 years ago

@pavel-kirienko - thank you!

pavel-kirienko commented 7 years ago

PX4ESC build log: log.tar.gz

NaterGator commented 7 years ago

I think this is an upstream bug in glibc: https://sourceware.org/bugzilla/show_bug.cgi?id=20660

The compiler won't substitute with the FPU call unless we specify -fno-math-errno, correct? The typically non-intrusive float speed flags are -fno-math-errno, -ffinite-math-only, and -fno-signed-zeros.

pavel-kirienko commented 7 years ago

Even if the call can't be substituted with vsqrt, one would still expect the compiler to emit vsqrt inside __ieee754_sqrtf().

NaterGator commented 7 years ago

Actually the ARM GCC toolchain uses newlib, so I stand corrected (though the newlib default implementations are even worse). Most likely we're linking the wrong libc/libm versions.

NaterGator commented 7 years ago

@pavel-kirienko that's not the case because vsqrt.f32 wouldn't allow the lib to adhere to the IEEE754 spec. Upstream discussion here: https://answers.launchpad.net/gcc-arm-embedded/+question/246145

I can confirm just passing -fno-math-errno substitutes a great deal of calls to the IEEE compliant sqrtf call. Objdump on the FPU-enabled libm confirms that vsqrt instruction is not used.

davids5 commented 7 years ago

@NaterGator Thank you for the background. I think we need to evaluate what we are linking to. I thought we have are own math.h and use libm. Normally nuttx provides it's libc but some other foreign header/lib may have snuck in.

Do I understand correctly that the "IEEE compliant sqrtf" and use of " vsqrt instruction" are mutually exclusive?

pavel-kirienko commented 7 years ago

@davids5 @NaterGator CM4's FPU is IEEE754 compatible: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/Chdehebg.html. Some additional support code might be still needed for handling errno, but I still see no reason why the compiler would choose to avoid the vsqrt instruction, since it seems to be compliant.

NaterGator commented 7 years ago

@davids5 the TRM for Cortex M4 says the FPU implementation is IEEE754 compliant; the difference is effectively that errno will not be updated by the use of the FPU instruction. I frankly don't understand why the newlib implementation doesn't utilize VSQRT.F32, which is unfortunate because the libm methods will still use the slow ieee754_sqrtf even with -fno-math-errno flag passed during compile time. (acos/asin, etc)

I did check and we are linking to the appropriate FPU-enabled libm and libc. (With respect to NuttX I haven't carefully checked that we're using the ARM GCC supplied libc).

edit: @pavel-kirienko beat me to it.

NaterGator commented 7 years ago

For what it is worth, here is a board sitting idle on the desk currently compiled from master:

And here is the same board in the same state with -fno-math-errno supplied to the compiler flags:

Edit: to temper enthusiasm here a bit, look at fw_att_control... I believe the sensors were considered too far out of spec to boot fully on the 2nd instance.

davids5 commented 7 years ago

On the Nuttx Build we should be only using the NuttX built libc or there can binary incompatibility disasters i.e the FILE in gcc ARM != the FILE NuttX.

davids5 commented 7 years ago

@NaterGator are you on skype? I am david_s5

davids5 commented 7 years ago

Looping in @priseborough

priseborough commented 7 years ago

The ekf2 should not be doing any double precision operations

pavel-kirienko commented 7 years ago

Hi Paul, the map reprojection routine, among others, does some double precision computations. Is that not intentional?

On Mar 18, 2017 1:32 AM, "Paul Riseborough" notifications@github.com wrote:

The ekf2 should not be doing any double precision operations

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/PX4/Firmware/issues/6838#issuecomment-287487641, or mute the thread https://github.com/notifications/unsubscribe-auth/ADJUZKBlHyoRFLDVwy8okPo0VwnMg8Fmks5rmwoZgaJpZM4MgrRn .

priseborough commented 7 years ago

The EKF operations should all be single precision, but yes, the conversion from GPS lat/long to local position used by the EKF and vice-versa will require doubles.

davids5 commented 7 years ago

@NaterGator

I have confirmed we are not using NuttX math routines just the math.h from nuttx-patches and the include files from nuttx. libc is definitely linked with the NuttX one.

davids5 commented 7 years ago

@LorenzMeier , @NaterGator @pavel-kirienko @dagar @jgoppert

I did some experiments with adding -ffast-math FYI:-funsafe-math-optimizations has always been in the build

-ffast-math
Sets -fno-math-errno, -funsafe-math-optimizations, 
-fno-trapping-math, -ffinite-math-only, -fno-rounding-math and -fno-signaling-nans.
This option causes the preprocessor macro __FAST_MATH__ to be defined.

This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions. 

-fno-math-errno
Do not set ERRNO after calling math functions that are executed with a single instruction, e.g., sqrt. A program that relies on IEEE exceptions for math error handling may want to use this flag for speed while maintaining IEEE arithmetic compatibility.
This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

The default is -fmath-errno. 

-funsafe-math-optimizations
Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards. When used at link-time, it may include libraries or startup files that change the default FPU control word or other similar optimizations.
This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

The default is -fno-unsafe-math-optimizations. 

-ffinite-math-only
Allow optimizations for floating-point arithmetic that assume that arguments and results are not NaNs or +-Infs.
This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications.

The default is -fno-finite-math-only. 

-fno-trapping-math
Compile code assuming that floating-point operations cannot generate user-visible traps. These traps include division by zero, overflow, underflow, inexact result and invalid operation. This option implies -fno-signaling-nans. Setting this option may allow faster code if one relies on “non-stop” IEEE arithmetic, for example.
This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

The default is -ftrapping-math. 

-frounding-math
Disable transformations and optimizations that assume default floating point rounding behavior. This is round-to-zero for all floating point to integer conversions, and round-to-nearest for all other arithmetic truncations. This option should be specified for programs that change the FP rounding mode dynamically, or that may be executed with a non-default rounding mode. This option disables constant folding of floating point expressions at compile-time (which may be affected by rounding mode) and arithmetic transformations that are unsafe in the presence of sign-dependent rounding modes.
The default is -fno-rounding-math.

This option is experimental and does not currently guarantee to disable all GCC optimizations that are affected by rounding mode. Future versions of GCC may provide finer control of this setting using C99's FENV_ACCESS pragma. This command line option will be used to specify the default state for FENV_ACCESS. 

-fsignaling-nans
Compile code assuming that IEEE signaling NaNs may generate user-visible traps during floating-point operations. Setting this option disables optimizations that may change the number of exceptions visible with signaling NaNs. This option implies -ftrapping-math.
This option causes the preprocessor macro __SUPPORT_SNAN__ to be defined.

The default is -fno-signaling-nans.

This option is experimental and does not currently guarantee to disable all GCC optimizations that affect signaling NaN behavior.

Master:f2164b135d5c3327cbb2352ba86d2389016911e6

Here we note what @pavel-kirienko had commented on: the sqrtf functions is called and therefore vsqrt.f32 is not used.

    text    data     bss     dec     hex filename
1440572    6140   20724 1467436  16642c build_px4fmu-v3_default/src/firmware/nuttx/firmware_nuttx

09:37:33 (master $=) ~/Desktop/dev/PX4/repos/mainline/Firmware$ arm-none-eabi-objdump -d -C build_px4fmu-v3_default/src/firmware/nuttx/firmware_nuttx | grep '<sqrt'
 80086ba:       f0f6 fb45       bl      80fed48 <sqrtf>
 800acde:       f0f4 f833       bl      80fed48 <sqrtf>
 800b2f4:       f0f3 fd28       bl      80fed48 <sqrtf>
 800b470:       f0f3 fc6a       bl      80fed48 <sqrtf>
 800d198:       f0f1 fdd6       bl      80fed48 <sqrtf>
 8030a30:       f0ce f98a       bl      80fed48 <sqrtf>
 8032d18:       f0cc f816       bl      80fed48 <sqrtf>
 8032dd2:       f0cb ffb9       bl      80fed48 <sqrtf>
 80425b0:       f0bc fbca       bl      80fed48 <sqrtf>
 804332a:       f0bb f8f9       bl      80fe520 <sqrt>
 8048034:       f0b6 be88       b.w     80fed48 <sqrtf>
 8048108:       f0b6 fe1e       bl      80fed48 <sqrtf>
 8048292:       f0b6 fd59       bl      80fed48 <sqrtf>
 8048a6c:       f0b6 b96c       b.w     80fed48 <sqrtf>
 8048adc:       f0b6 f934       bl      80fed48 <sqrtf>
 8048b42:       f0b6 f901       bl      80fed48 <sqrtf>
 8048ba6:       f0b6 f8cf       bl      80fed48 <sqrtf>
 8048bf4:       f0b6 f8a8       bl      80fed48 <sqrtf>
 804970e:       f0b5 fb1b       bl      80fed48 <sqrtf>
 8049f48:       f0b4 fefe       bl      80fed48 <sqrtf>
 804a6fc:       f0b4 fb24       bl      80fed48 <sqrtf>
 804a824:       f0b4 fa90       bl      80fed48 <sqrtf>
 804f35a:       f0af fcf5       bl      80fed48 <sqrtf>
 804f3e4:       f0af fcb0       bl      80fed48 <sqrtf>
 804fe04:       f0ae ffa0       bl      80fed48 <sqrtf>
 80526e8:       f0ac bb2e       b.w     80fed48 <sqrtf>
 8052d4c:       f0ab bffc       b.w     80fed48 <sqrtf>
 8052ed0:       f0ab ff3a       bl      80fed48 <sqrtf>
 8053b08:       f0ab b91e       b.w     80fed48 <sqrtf>
 80543b8:       f0aa fcc6       bl      80fed48 <sqrtf>
 805442e:       f0aa fc8b       bl      80fed48 <sqrtf>
 8054842:       f0aa fa81       bl      80fed48 <sqrtf>
 805499a:       f0aa f9d5       bl      80fed48 <sqrtf>
 8056668:       f0a7 ff5a       bl      80fe520 <sqrt>
 8058522:       f0a6 fc11       bl      80fed48 <sqrtf>
 805860a:       f0a6 fb9d       bl      80fed48 <sqrtf>
 805871a:       f0a6 fb15       bl      80fed48 <sqrtf>
 806005e:       f09e fe73       bl      80fed48 <sqrtf>
 80600a6:       f09e fe4f       bl      80fed48 <sqrtf>
 8061760:       f09d faf2       bl      80fed48 <sqrtf>
 8061998:       f09d f9d6       bl      80fed48 <sqrtf>
 8061a18:       f09d f996       bl      80fed48 <sqrtf>
 8061a30:       f09d f98a       bl      80fed48 <sqrtf>
 8061a3c:       f09d f984       bl      80fed48 <sqrtf>
 8061bfc:       f09d f8a4       bl      80fed48 <sqrtf>
 8061ed6:       f09c ff37       bl      80fed48 <sqrtf>
 8061f56:       f09c fef7       bl      80fed48 <sqrtf>
 8061f70:       f09c feea       bl      80fed48 <sqrtf>
 8061f7c:       f09c fee4       bl      80fed48 <sqrtf>
 8065428:       f099 fc8e       bl      80fed48 <sqrtf>
 806573c:       f099 fb04       bl      80fed48 <sqrtf>
 806598c:       f099 f9dc       bl      80fed48 <sqrtf>
 806fa9a:       f08f f955       bl      80fed48 <sqrtf>
 806fc24:       f08f f890       bl      80fed48 <sqrtf>
 807227c:       f08c fd64       bl      80fed48 <sqrtf>
 8072288:       f08c fd5e       bl      80fed48 <sqrtf>
 8073504:       f08b fc20       bl      80fed48 <sqrtf>
 807c6ea:       f082 fb2d       bl      80fed48 <sqrtf>

Processes: 22 total, 2 running, 20 sleeping
CPU usage: 43.63% tasks, 1.97% sched, 54.40% idle
DMA Memory: 5120 total, 1536 used 1536 peak
Uptime: 249.398s total, 139.580s idle

 PID COMMAND                   CPU(ms) CPU(%)  USED/STACK PRIO(BASE) STATE
   0 Idle Task                  139580 54.400   652/  748   0 (  0)  READY
   1 hpwork                       8161  3.338   888/ 1780 192 (192)  w:sig
   2 lpwork                        396  0.151   644/ 1780  50 ( 50)  w:sig
   3 init                        45923  0.000  1664/ 2484 100 (100)  w:sem
 310 top                          5492  2.959  1248/ 1684 100 (100)  RUN
 107 gps                           733  0.075   936/ 1372 220 (220)  w:sem
 110 dataman                       146  0.000   704/ 1180  90 ( 90)  w:sem
 151 sensors                     10265  4.324  1344/ 1980 250 (250)  w:sem
 153 commander                    6201  2.427  2856/ 3548 140 (140)  w:sig
 164 px4io                        7393  3.034   968/ 1380 240 (240)  w:sem
 173 mavlink_if0                  4455  1.669  1736/ 2436 100 (100)  w:sig
 175 mavlink_rcv_if0                17  0.000   952/ 2100 175 (175)  w:sem
 186 mavlink_if1                  3108  1.289  1632/ 2388 100 (100)  w:sig
 187 mavlink_rcv_if1                16  0.000   952/ 2100 175 (175)  w:sem
 307 commander_low_prio              8  0.000   592/ 2996  50 ( 50)  w:sem
 223 mavlink_if2                 10603  4.248  1632/ 2388 100 (100)  w:sig
 229 mavlink_rcv_if2                18  0.000   952/ 2100 175 (175)  w:sem
 234 sdlog2                        123  0.000  1696/ 3316 177 (177)  w:sig
 277 ekf2                        29594 12.518  5072/ 5780 250 (250)  w:sem
 284 mc_att_control              10766  4.704  1280/ 1676 250 (250)  w:sem
 286 mc_pos_control               6464  2.883   616/ 1876 250 (250)  w:sem
 296 navigator                      26  0.000   952/ 1572 105 (105)  w:sem

             total       used       free    largest
Mem:        235024     189072      45952      42912

Master: with -ffast-math https://github.com/PX4/Firmware/pull/6863 Here we note that there are only 2 calls to the sqrt function. The reset have been replaced with vsqrt.f32

with -ffast-math

   text    data     bss     dec     hex filename
1435836    6140   20724 1462700  1651ac build_px4fmu-v3_default/src/firmware/nuttx/firmware_nuttx

09:33:05 (master_hw_fpu $) ~/Desktop/dev/PX4/repos/mainline/Firmware$ arm-none-eabi-objdump -d -C build_px4fmu-v3_default/src/firmware/nuttx/firmware_nuttx | grep '<sqrt>'
 8043202:       f0ba fa19       bl      80fd638 <sqrt>
 805601c:       f0a7 fb0c       bl      80fd638 <sqrt>

Processes: 22 total, 2 running, 20 sleeping
CPU usage: 42.24% tasks, 1.90% sched, 55.86% idle
DMA Memory: 5120 total, 1536 used 1536 peak
Uptime: 249.073s total, 147.554s idle

 PID COMMAND                   CPU(ms) CPU(%)  USED/STACK PRIO(BASE) STATE
   0 Idle Task                  147553 55.859   652/  748   0 (  0)  READY
   1 hpwork                       8484  3.500   960/ 1780 192 (192)  w:sem
   2 lpwork                        403  0.152   644/ 1780  50 ( 50)  w:sig
   3 init                        37662  0.000  1660/ 2484 100 (100)  w:sem
 303 top                          1099  2.968  1248/ 1684 100 (100)  RUN
 100 gps                           781  0.532   936/ 1372 220 (220)  w:sem
 103 dataman                       145  0.000   704/ 1180  90 ( 90)  w:sem
 144 sensors                      8668  3.424  1328/ 1980 250 (250)  w:sem
 146 commander                    6500  2.968  2848/ 3548 140 (140)  w:sig
 153 px4io                        7102  2.891  1056/ 1380 240 (240)  w:sem
 166 mavlink_if0                  4582  1.826  1736/ 2436 100 (100)  w:sig
 168 mavlink_rcv_if0                16  0.000   952/ 2100 175 (175)  w:sem
 179 mavlink_if1                  3190  1.217  1632/ 2388 100 (100)  w:sig
 180 mavlink_rcv_if1                16  0.000   952/ 2100 175 (175)  w:sem
 300 commander_low_prio              8  0.000   592/ 2996  50 ( 50)  w:sem
 216 mavlink_if2                 11284  4.642  1632/ 2388 100 (100)  w:sig
 218 mavlink_rcv_if2                17  0.000   952/ 2100 175 (175)  w:sem
 223 sdlog2                        123  0.000  1696/ 3316 177 (177)  w:sig
 270 ekf2                        27796 11.339  5064/ 5780 250 (250)  w:sem
 277 mc_att_control              10659  4.261  1280/ 1676 250 (250)  w:sem
 279 mc_pos_control               6208  2.511   664/ 1876 250 (250)  w:sem
 285 navigator                      28  0.000   952/ 1572 105 (105)  w:sem

nsh> free
             total       used       free    largest
Mem:        235024     187904      47120      44256
nsh>

davids5 commented 7 years ago

@NaterGator

I think my top comparisons are not as striking as yours because of the configuration settings.

Would you be able to replicate the test you did above with https://github.com/PX4/Firmware/pull/6863 ?

NaterGator commented 7 years ago

@davids5 see my edit at the bottom of that post; I was flashing to an autopilot that has been subjected to some ... hard ... landings and the MEMS sensors are a bit out of spec, so the controller failed to start. I see ~2-3% improvement with -ffast-math.

I'm going to write a quick benchmark to test the VSQRT instruction vs. the FPU-optimized libcall. We are likely splitting hairs over an unimportant optimization (the more concerning issue, IMO, is anywhere we have unintended use of doubles)

NaterGator commented 7 years ago

Well, the proof is in the pudding. VSQRT.F32 is performing more slowly than the libcall.

Source for benchmarking:

Disassembly:

davids5 commented 7 years ago

@NaterGator

Good work!

So it bigger and slower - Not Good. Do you think this explains why the lib is not using the instruction?

(edited) This post was an improper conclusion on my part. 1) The above benchmark was invalid based on the use of the hrt with interrupts off. 2) code size goes down with the -ffast-math flags option set.

NaterGator commented 7 years ago

@davids5 it is a bit of a contrived benchmark with some side effects (adjacent FPU instructions introduce delays, the tight loop means OOO execution is not realized, etc) but there is a case to be made that the libm implementation is better.

The processor docs suggest the sqrt instruction should execute in ~14 cycles, so I find it quite surprising that the lib implementation can finish the jump + all branching in that many cycles. I'm going to be studying this a bit more to try and suss out any additional side effects.

LorenzMeier commented 7 years ago

Our matrices are sparse. A ton of instructions go against 0's and 1's. So we might need a wrapper that does skip simple cases and only solves the real computations. Interested to prototype? == 0 and == 1 checks might go a long way.

NaterGator commented 7 years ago

@davids5 I reimplemented with cycle-accurate accounting using DWT_CYCCNT, the results are more in line with what I would expect (which makes me a bit suspicious of hrt_absolute_time) :

This suggests that on-average the instruction is about 20x faster.

pavel-kirienko commented 7 years ago

@NaterGator could you please upload your assembly from above as text? Thanks in advance.

davids5 commented 7 years ago

@NaterGator

The hrt expects to be called once per ms to detect the wrapping.

I would think you could add it in the loop and it would be a constant in both measurements.

If were are looking to recover CPU cycles then the % cpu delta is in question here in an A / B comparison of with and without -ffast-math.

pavel-kirienko commented 7 years ago

I agree that the first result above that shows sqrt() faster than VSQRT is really hard to believe. I would like to have a closer look, having the assembly as text rather than a picture would help. ;)

On the subject of special cases proposed by Lorenz, I haven't tried that but I foresee that the performance impact would be likely negative due to branch prediction issues. The flash prefetcher takes 5~6 cycles to fetch a line from flash, plus additional several cycles needed to perform comparison against 0/1 make the computational load comparable to the 14+ cycles needed to execute a VSQRT. Mispredicted branching may have a serious impact on performance even on these relatively simple cores. There is a different project I am involved in where we managed to improve performance of the critical hard real time routines by ~15% simply by way of adding branch hinting constructs (see __builtin_expect()) throughout the hot paths of the code.

NaterGator commented 7 years ago

@pavel-kirienko sorry, I took off before reading your comment. I'll post the assembly tomorrow morning.

The errant behavior is explained by hrt_absolute_time expecting to be called before the timer overflows. IMO that is really undesirable behavior and we should use timer chaining to work around it.

davids5 commented 7 years ago

@NaterGator - the HRT is used for real time scheduling. It has an ISR that uses the call hrt_absolute_time() regularly so this is not normally an issue; Because the interrupts were off the use case exceeded its abilities.

jgoppert commented 7 years ago

Regarding sparse matrix math, I would like to add sparse matrix math to matrix eventually. Probably wouldn't be too hard.

On Mar 20, 2017 7:33 PM, "David Sidrane" notifications@github.com wrote:

@NaterGator https://github.com/NaterGator - the HRT is used for real time scheduling. It has an ISR that uses the call hrt_absolute_time() regularly so this is not normally an issue; Because the interrupts were off the use case exceeded its abilities.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/PX4/Firmware/issues/6838#issuecomment-287958674, or mute the thread https://github.com/notifications/unsubscribe-auth/AAc6rNpKkU_qy6QTm2dBVv2uH9izffIHks5rnzcEgaJpZM4MgrRn .

NaterGator commented 7 years ago

@davids5 exactly; I'm just worried about the case where something in the system is misbehaving and the hrt_absolute_time reference no longer represents a stable clock. Obviously we're not falling out the sky because of any such potential issue... just trying to wrap my head around how it would actually manifest and if / how best to mitigate it.

Upstream has apparently noticed (or somebody filed a bugreport with the newlib folks): https://sourceware.org/git/gitweb.cgi?p=newlib-cygwin.git;a=commitdiff;h=baf32fb85fd6ef5e3e5975a357a40de72dc92e15

@pavel-kirienko assembly attached: firmware_nuttx_perfsqrt.zip

NaterGator commented 7 years ago

Ok, so I used some trickery to replace __ieee754_sqrtf with the upstream version, the results are more promising:

Processes: 23 total, 3 running, 20 sleeping
CPU usage: 42.96% tasks, 1.90% sched, 55.14% idle
DMA Memory: 5120 total, 0 used 0 peak
Uptime: 129.188s total, 72.898s idle

 PID COMMAND                   CPU(ms) CPU(%)  USED/STACK PRIO(BASE) STATE 
   0 Idle Task                   72901 55.144   644/  748   0 (  0)  READY 
   1 hpwork                       3622  2.797   816/ 1780 192 (192)  w:sig 
   2 lpwork                        580  0.361   640/ 1780  50 ( 50)  w:sig 
   3 init                         3473  0.000  1664/ 2484 100 (100)  w:sem 
 238 commander_low_prio              4  0.000   592/ 2996  50 ( 50)  w:sem 
  38 gps                           371  0.541   904/ 1372 220 (220)  w:sem 
  73 sensors                      4415  3.429  1344/ 1980 250 (250)  w:sem 
  75 commander                    3020  2.346  2864/ 3548 140 (140)  w:sig 
  82 px4io                        3563  2.978  1072/ 1380 240 (240)  w:sem 
  90 mavlink_if0                  2166  1.714  1632/ 2436 100 (100)  READY 
  91 mavlink_rcv_if0                 8  0.000   952/ 2100 175 (175)  w:sem 
 108 mavlink_if1                  1615  1.173  1632/ 2388 100 (100)  w:sig 
 109 mavlink_rcv_if1                 8  0.000   952/ 2100 175 (175)  w:sem 
 144 mavlink_if2                 12104 10.108  1640/ 2388 100 (100)  w:sig 
 145 mavlink_rcv_if2                76  0.000  1456/ 2100 175 (175)  w:sem 
 156 logger                        570  0.361  1152/ 3532 250 (250)  w:sem 
 157 log_writer_file                 0  0.000   344/ 1060  60 ( 60)  w:sem 
 196 ekf2                        14223 11.732  5072/ 5780 250 (250)  w:sem 
 199 fw_att_control               4785  4.061   952/ 1476 250 (250)  w:sem 
 202 fw_pos_ctrl_l1                 50  0.000   664/ 1676 250 (250)  w:sem 
 211 navigator                      13  0.000   796/ 1572 105 (105)  w:sem 
 254 top                           840  1.353  1248/ 1684 100 (100)  RUN   
 245 mavlink_shell                6714  0.000   856/ 2020 100 (100)  w:sem 
nsh> 
nsh> perf sqrt 100000
100000 iterations:
    Libcall elapsed: 5700029 cyc, VSQRT.F32 elapsed: 1699998 cyc
nsh>

The disassembly of __ieee754_sqrtf is now:

080aa8d4 <__ieee754_sqrtf>:
 80aa8d4:   eeb1 0ac0   vsqrt.f32   s0, s0
 80aa8d8:   4770        bx  lr
 80aa8da:   Address 0x080aa8da is out of bounds.

Effectively I query gcc to resolve the libm that gcc will link and use objcopy to weaken the __ieee754_sqrtf symbol in the copy of libm, define our replacement __ieee754_sqrtf from upstream in a stm32 specific file (in this case board_identity.c), and link against the patched libm. This shows up as a ~6x improvement when invoking patched libm sqrtf versus unpatched.

Nonetheless: idle time is not dramatically increased. It is slightly better, 2-3%. Is 2-3% worth implementing a cmake step to patch libm for STM32F4 targets?

If yes, I'll submit a PR.

davids5 commented 7 years ago

@NaterGator

Is this the best of both worlds? No Size hit of +4K and a faster sqrtf. - Nice

Can you use -Wl,-wrap,__ieee754_sqrtf and then just link the replacement?

davids5 commented 7 years ago

Can the replacement live in the src/drivers/stm32/math/hw_sqrtf.c or the like?

davids5 commented 7 years ago

I take that back - that is incorrect it is an arm-7m with FPU replacement. So this needs a new home that can be used by M4F and possibly M7F.

davids5 commented 7 years ago

@pavel-kirienko, @NaterGator @LorenzMeier @jgoppert

I am rounding up the conclusions here, because I think this thread has some bad conclusions on my part along with the some of the original benchmarks that were skewed.

1) As @pavel-kirienko noted and @NaterGator demonstrated with the DWT_CYCCNT benchmark the FPU version is ~20X faster than the SW version.

2) The -ffast-math reduces the code size by 4736 bytes. (master:1440572 master_hw_fpu:1435836 - I had this backwards above and will edit the posts)

3) -ffast-math replaces all but 2 calls to __ieee754_sqrtf.

4) There are still calls to __ieee754_sqrtf - Here is one path that is still is calling __ieee754_sqrtf

So I think the remaining questions are: 1) Do we deem -ffast-math to be safe in our application? 2) Should we port in @NaterGator's replacements from upstream in addition to using the -ffast-math ?

PX4 / PX4-Autopilot

Compiler is not emitting vsqrt.f32 on ARM Cortex M4F targets #6838