Closed pavel-kirienko closed 7 years ago
@davids5 Can you have a look?
@LorenzMeier yes on it now
@davids5 my build log: build_log.tar.gz
@pavel-kirienko - thank you!
PX4ESC build log: log.tar.gz
I think this is an upstream bug in glibc: https://sourceware.org/bugzilla/show_bug.cgi?id=20660
The compiler won't substitute with the FPU call unless we specify -fno-math-errno, correct? The typically non-intrusive float speed flags are -fno-math-errno, -ffinite-math-only, and -fno-signed-zeros.
Even if the call can't be substituted with vsqrt, one would still expect the compiler to emit vsqrt inside __ieee754_sqrtf().
Actually the ARM GCC toolchain uses newlib, so I stand corrected (though the newlib default implementations are even worse). Most likely we're linking the wrong libc/libm versions.
@pavel-kirienko that's not the case because vsqrt.f32 wouldn't allow the lib to adhere to the IEEE754 spec. Upstream discussion here: https://answers.launchpad.net/gcc-arm-embedded/+question/246145
I can confirm that just passing -fno-math-errno substitutes a great deal of calls to the IEEE-compliant sqrtf call. Objdump on the FPU-enabled libm confirms that the vsqrt instruction is not used.
@NaterGator Thank you for the background. I think we need to evaluate what we are linking to. I thought we had our own math.h and used libm. Normally NuttX provides its libc, but some other foreign header/lib may have snuck in.
Do I understand correctly that the "IEEE compliant sqrtf" and use of the "vsqrt instruction" are mutually exclusive?
@davids5 @NaterGator The CM4's FPU is IEEE754 compatible: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/Chdehebg.html. Some additional support code might still be needed for handling errno, but I still see no reason why the compiler would choose to avoid the vsqrt instruction, since it seems to be compliant.
@davids5 the TRM for Cortex M4 says the FPU implementation is IEEE754 compliant; the difference is effectively that errno will not be updated by the use of the FPU instruction. I frankly don't understand why the newlib implementation doesn't utilize VSQRT.F32, which is unfortunate because the libm methods will still use the slow ieee754_sqrtf even with -fno-math-errno flag passed during compile time. (acos/asin, etc)
I did check and we are linking to the appropriate FPU-enabled libm and libc. (With respect to NuttX I haven't carefully checked that we're using the ARM GCC supplied libc).
edit: @pavel-kirienko beat me to it.
For what it is worth, here is a board sitting idle on the desk currently compiled from master:
And here is the same board in the same state with -fno-math-errno supplied to the compiler flags:
Edit: to temper enthusiasm here a bit, look at fw_att_control... I believe the sensors were considered too far out of spec to boot fully on the 2nd instance.
On the NuttX build we should only be using the NuttX-built libc, or there can be binary incompatibility disasters, i.e. the FILE type in GCC ARM != the FILE type in NuttX.
@NaterGator are you on skype? I am david_s5
Looping in @priseborough
The ekf2 should not be doing any double precision operations
Hi Paul, the map reprojection routine, among others, does some double precision computations. Is that not intentional?
The EKF operations should all be single precision, but yes, the conversion from GPS lat/long to local position used by the EKF and vice-versa will require doubles.
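To illustrate why the lat/long conversion needs doubles, here is a small host-side sketch (illustrative only, not firmware code; the function name is hypothetical): single precision simply cannot hold a global latitude to useful accuracy.

```c
#include <math.h>

/* Illustrative sketch (not PX4 code): a float carries ~7 significant
 * decimal digits, so storing a latitude like 47.3977419 degrees in
 * single precision loses on the order of microdegrees - already tens
 * of centimetres on the ground before any reprojection math runs. */
double float_lat_error_deg(double lat_deg)
{
    float truncated = (float)lat_deg;   /* what single precision keeps */
    return fabs((double)truncated - lat_deg);
}
```

At mid latitudes the round-trip error can reach roughly 2e-6 degrees, i.e. around 0.2 m of position, which is why the GPS-to-local conversion stays in double precision.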
@NaterGator
I have confirmed we are not using the NuttX math routines, just the math.h from nuttx-patches and the include files from NuttX. libc is definitely linked with the NuttX one.
@LorenzMeier , @NaterGator @pavel-kirienko @dagar @jgoppert
I did some experiments with adding -ffast-math.
FYI: -funsafe-math-optimizations has always been in the build.
-ffast-math
Sets -fno-math-errno, -funsafe-math-optimizations,
-fno-trapping-math, -ffinite-math-only, -fno-rounding-math and -fno-signaling-nans.
This option causes the preprocessor macro __FAST_MATH__ to be defined.
This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.
-fno-math-errno
Do not set ERRNO after calling math functions that are executed with a single instruction, e.g., sqrt. A program that relies on IEEE exceptions for math error handling may want to use this flag for speed while maintaining IEEE arithmetic compatibility.
This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.
The default is -fmath-errno.
-funsafe-math-optimizations
Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards. When used at link-time, it may include libraries or startup files that change the default FPU control word or other similar optimizations.
This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.
The default is -fno-unsafe-math-optimizations.
-ffinite-math-only
Allow optimizations for floating-point arithmetic that assume that arguments and results are not NaNs or +-Infs.
This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications.
The default is -fno-finite-math-only.
-fno-trapping-math
Compile code assuming that floating-point operations cannot generate user-visible traps. These traps include division by zero, overflow, underflow, inexact result and invalid operation. This option implies -fno-signaling-nans. Setting this option may allow faster code if one relies on “non-stop” IEEE arithmetic, for example.
This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.
The default is -ftrapping-math.
-frounding-math
Disable transformations and optimizations that assume default floating point rounding behavior. This is round-to-zero for all floating point to integer conversions, and round-to-nearest for all other arithmetic truncations. This option should be specified for programs that change the FP rounding mode dynamically, or that may be executed with a non-default rounding mode. This option disables constant folding of floating point expressions at compile-time (which may be affected by rounding mode) and arithmetic transformations that are unsafe in the presence of sign-dependent rounding modes.
The default is -fno-rounding-math.
This option is experimental and does not currently guarantee to disable all GCC optimizations that are affected by rounding mode. Future versions of GCC may provide finer control of this setting using C99's FENV_ACCESS pragma. This command line option will be used to specify the default state for FENV_ACCESS.
-fsignaling-nans
Compile code assuming that IEEE signaling NaNs may generate user-visible traps during floating-point operations. Setting this option disables optimizations that may change the number of exceptions visible with signaling NaNs. This option implies -ftrapping-math.
This option causes the preprocessor macro __SUPPORT_SNAN__ to be defined.
The default is -fno-signaling-nans.
This option is experimental and does not currently guarantee to disable all GCC optimizations that affect signaling NaN behavior.
Master:f2164b135d5c3327cbb2352ba86d2389016911e6
Here we note what @pavel-kirienko had commented on: the sqrtf function is called and therefore vsqrt.f32 is not used.
text data bss dec hex filename
1440572 6140 20724 1467436 16642c build_px4fmu-v3_default/src/firmware/nuttx/firmware_nuttx
09:37:33 (master $=) ~/Desktop/dev/PX4/repos/mainline/Firmware$ arm-none-eabi-objdump -d -C build_px4fmu-v3_default/src/firmware/nuttx/firmware_nuttx | grep '<sqrt'
80086ba: f0f6 fb45 bl 80fed48 <sqrtf>
800acde: f0f4 f833 bl 80fed48 <sqrtf>
800b2f4: f0f3 fd28 bl 80fed48 <sqrtf>
800b470: f0f3 fc6a bl 80fed48 <sqrtf>
800d198: f0f1 fdd6 bl 80fed48 <sqrtf>
8030a30: f0ce f98a bl 80fed48 <sqrtf>
8032d18: f0cc f816 bl 80fed48 <sqrtf>
8032dd2: f0cb ffb9 bl 80fed48 <sqrtf>
80425b0: f0bc fbca bl 80fed48 <sqrtf>
804332a: f0bb f8f9 bl 80fe520 <sqrt>
8048034: f0b6 be88 b.w 80fed48 <sqrtf>
8048108: f0b6 fe1e bl 80fed48 <sqrtf>
8048292: f0b6 fd59 bl 80fed48 <sqrtf>
8048a6c: f0b6 b96c b.w 80fed48 <sqrtf>
8048adc: f0b6 f934 bl 80fed48 <sqrtf>
8048b42: f0b6 f901 bl 80fed48 <sqrtf>
8048ba6: f0b6 f8cf bl 80fed48 <sqrtf>
8048bf4: f0b6 f8a8 bl 80fed48 <sqrtf>
804970e: f0b5 fb1b bl 80fed48 <sqrtf>
8049f48: f0b4 fefe bl 80fed48 <sqrtf>
804a6fc: f0b4 fb24 bl 80fed48 <sqrtf>
804a824: f0b4 fa90 bl 80fed48 <sqrtf>
804f35a: f0af fcf5 bl 80fed48 <sqrtf>
804f3e4: f0af fcb0 bl 80fed48 <sqrtf>
804fe04: f0ae ffa0 bl 80fed48 <sqrtf>
80526e8: f0ac bb2e b.w 80fed48 <sqrtf>
8052d4c: f0ab bffc b.w 80fed48 <sqrtf>
8052ed0: f0ab ff3a bl 80fed48 <sqrtf>
8053b08: f0ab b91e b.w 80fed48 <sqrtf>
80543b8: f0aa fcc6 bl 80fed48 <sqrtf>
805442e: f0aa fc8b bl 80fed48 <sqrtf>
8054842: f0aa fa81 bl 80fed48 <sqrtf>
805499a: f0aa f9d5 bl 80fed48 <sqrtf>
8056668: f0a7 ff5a bl 80fe520 <sqrt>
8058522: f0a6 fc11 bl 80fed48 <sqrtf>
805860a: f0a6 fb9d bl 80fed48 <sqrtf>
805871a: f0a6 fb15 bl 80fed48 <sqrtf>
806005e: f09e fe73 bl 80fed48 <sqrtf>
80600a6: f09e fe4f bl 80fed48 <sqrtf>
8061760: f09d faf2 bl 80fed48 <sqrtf>
8061998: f09d f9d6 bl 80fed48 <sqrtf>
8061a18: f09d f996 bl 80fed48 <sqrtf>
8061a30: f09d f98a bl 80fed48 <sqrtf>
8061a3c: f09d f984 bl 80fed48 <sqrtf>
8061bfc: f09d f8a4 bl 80fed48 <sqrtf>
8061ed6: f09c ff37 bl 80fed48 <sqrtf>
8061f56: f09c fef7 bl 80fed48 <sqrtf>
8061f70: f09c feea bl 80fed48 <sqrtf>
8061f7c: f09c fee4 bl 80fed48 <sqrtf>
8065428: f099 fc8e bl 80fed48 <sqrtf>
806573c: f099 fb04 bl 80fed48 <sqrtf>
806598c: f099 f9dc bl 80fed48 <sqrtf>
806fa9a: f08f f955 bl 80fed48 <sqrtf>
806fc24: f08f f890 bl 80fed48 <sqrtf>
807227c: f08c fd64 bl 80fed48 <sqrtf>
8072288: f08c fd5e bl 80fed48 <sqrtf>
8073504: f08b fc20 bl 80fed48 <sqrtf>
807c6ea: f082 fb2d bl 80fed48 <sqrtf>
Processes: 22 total, 2 running, 20 sleeping
CPU usage: 43.63% tasks, 1.97% sched, 54.40% idle
DMA Memory: 5120 total, 1536 used 1536 peak
Uptime: 249.398s total, 139.580s idle
PID COMMAND CPU(ms) CPU(%) USED/STACK PRIO(BASE) STATE
0 Idle Task 139580 54.400 652/ 748 0 ( 0) READY
1 hpwork 8161 3.338 888/ 1780 192 (192) w:sig
2 lpwork 396 0.151 644/ 1780 50 ( 50) w:sig
3 init 45923 0.000 1664/ 2484 100 (100) w:sem
310 top 5492 2.959 1248/ 1684 100 (100) RUN
107 gps 733 0.075 936/ 1372 220 (220) w:sem
110 dataman 146 0.000 704/ 1180 90 ( 90) w:sem
151 sensors 10265 4.324 1344/ 1980 250 (250) w:sem
153 commander 6201 2.427 2856/ 3548 140 (140) w:sig
164 px4io 7393 3.034 968/ 1380 240 (240) w:sem
173 mavlink_if0 4455 1.669 1736/ 2436 100 (100) w:sig
175 mavlink_rcv_if0 17 0.000 952/ 2100 175 (175) w:sem
186 mavlink_if1 3108 1.289 1632/ 2388 100 (100) w:sig
187 mavlink_rcv_if1 16 0.000 952/ 2100 175 (175) w:sem
307 commander_low_prio 8 0.000 592/ 2996 50 ( 50) w:sem
223 mavlink_if2 10603 4.248 1632/ 2388 100 (100) w:sig
229 mavlink_rcv_if2 18 0.000 952/ 2100 175 (175) w:sem
234 sdlog2 123 0.000 1696/ 3316 177 (177) w:sig
277 ekf2 29594 12.518 5072/ 5780 250 (250) w:sem
284 mc_att_control 10766 4.704 1280/ 1676 250 (250) w:sem
286 mc_pos_control 6464 2.883 616/ 1876 250 (250) w:sem
296 navigator 26 0.000 952/ 1572 105 (105) w:sem
total used free largest
Mem: 235024 189072 45952 42912
Master: with -ffast-math https://github.com/PX4/Firmware/pull/6863
Here we note that with -ffast-math there are only 2 calls to the sqrt function; the rest have been replaced with vsqrt.f32.
text data bss dec hex filename
1435836 6140 20724 1462700 1651ac build_px4fmu-v3_default/src/firmware/nuttx/firmware_nuttx
09:33:05 (master_hw_fpu $) ~/Desktop/dev/PX4/repos/mainline/Firmware$ arm-none-eabi-objdump -d -C build_px4fmu-v3_default/src/firmware/nuttx/firmware_nuttx | grep '<sqrt>'
8043202: f0ba fa19 bl 80fd638 <sqrt>
805601c: f0a7 fb0c bl 80fd638 <sqrt>
Processes: 22 total, 2 running, 20 sleeping
CPU usage: 42.24% tasks, 1.90% sched, 55.86% idle
DMA Memory: 5120 total, 1536 used 1536 peak
Uptime: 249.073s total, 147.554s idle
PID COMMAND CPU(ms) CPU(%) USED/STACK PRIO(BASE) STATE
0 Idle Task 147553 55.859 652/ 748 0 ( 0) READY
1 hpwork 8484 3.500 960/ 1780 192 (192) w:sem
2 lpwork 403 0.152 644/ 1780 50 ( 50) w:sig
3 init 37662 0.000 1660/ 2484 100 (100) w:sem
303 top 1099 2.968 1248/ 1684 100 (100) RUN
100 gps 781 0.532 936/ 1372 220 (220) w:sem
103 dataman 145 0.000 704/ 1180 90 ( 90) w:sem
144 sensors 8668 3.424 1328/ 1980 250 (250) w:sem
146 commander 6500 2.968 2848/ 3548 140 (140) w:sig
153 px4io 7102 2.891 1056/ 1380 240 (240) w:sem
166 mavlink_if0 4582 1.826 1736/ 2436 100 (100) w:sig
168 mavlink_rcv_if0 16 0.000 952/ 2100 175 (175) w:sem
179 mavlink_if1 3190 1.217 1632/ 2388 100 (100) w:sig
180 mavlink_rcv_if1 16 0.000 952/ 2100 175 (175) w:sem
300 commander_low_prio 8 0.000 592/ 2996 50 ( 50) w:sem
216 mavlink_if2 11284 4.642 1632/ 2388 100 (100) w:sig
218 mavlink_rcv_if2 17 0.000 952/ 2100 175 (175) w:sem
223 sdlog2 123 0.000 1696/ 3316 177 (177) w:sig
270 ekf2 27796 11.339 5064/ 5780 250 (250) w:sem
277 mc_att_control 10659 4.261 1280/ 1676 250 (250) w:sem
279 mc_pos_control 6208 2.511 664/ 1876 250 (250) w:sem
285 navigator 28 0.000 952/ 1572 105 (105) w:sem
nsh> free
total used free largest
Mem: 235024 187904 47120 44256
nsh>
@NaterGator
I think my top comparisons are not as striking as yours because of the configuration settings.
Would you be able to replicate the test you did above with https://github.com/PX4/Firmware/pull/6863 ?
@davids5 see my edit at the bottom of that post; I was flashing to an autopilot that has been subjected to some ... hard ... landings and the MEMS sensors are a bit out of spec, so the controller failed to start. I see ~2-3% improvement with -ffast-math.
I'm going to write a quick benchmark to test the VSQRT instruction vs. the FPU-optimized libcall. We are likely splitting hairs over an unimportant optimization (the more concerning issue, IMO, is anywhere we have unintended use of doubles)
Well, the proof is in the pudding. VSQRT.F32 is performing more slowly than the libcall.
Source for benchmarking:
Disassembly:
@NaterGator
Good work!
So it's bigger and slower - not good. Do you think this explains why the lib is not using the instruction?
(edited) This post was an improper conclusion on my part.
1) The above benchmark was invalid based on the use of the hrt with interrupts off.
2) Code size goes down with the -ffast-math option set.
@davids5 it is a bit of a contrived benchmark with some side effects (adjacent FPU instructions introduce delays, the tight loop means OOO execution is not realized, etc) but there is a case to be made that the libm implementation is better.
The processor docs suggest the sqrt instruction should execute in ~14 cycles, so I find it quite surprising that the lib implementation can finish the jump + all branching in that many cycles. I'm going to be studying this a bit more to try and suss out any additional side effects.
Our matrices are sparse. A ton of instructions go against 0's and 1's. So we might need a wrapper that does skip simple cases and only solves the real computations. Interested to prototype? == 0 and == 1 checks might go a long way.
@davids5 I reimplemented with cycle-accurate accounting using DWT_CYCCNT; the results are more in line with what I would expect (which makes me a bit suspicious of hrt_absolute_time):
This suggests that on-average the instruction is about 20x faster.
@NaterGator could you please upload your assembly from above as text? Thanks in advance.
@NaterGator
The hrt expects to be called once per ms to detect the wrapping.
I would think you could add it in the loop and it would be a constant in both measurements.
If we are looking to recover CPU cycles, then the % CPU delta is what is in question here, in an A/B comparison with and without -ffast-math.
I agree that the first result above, which shows sqrt() faster than VSQRT, is really hard to believe. I would like to have a closer look; having the assembly as text rather than a picture would help. ;)
On the subject of the special cases proposed by Lorenz, I haven't tried that, but I foresee that the performance impact would likely be negative due to branch prediction issues. The flash prefetcher takes 5~6 cycles to fetch a line from flash; that, plus the several additional cycles needed to perform the comparisons against 0/1, makes the computational load comparable to the 14+ cycles needed to execute a VSQRT. Mispredicted branching may have a serious impact on performance even on these relatively simple cores. There is a different project I am involved in where we managed to improve performance of the critical hard real time routines by ~15% simply by adding branch hinting constructs (see __builtin_expect()) throughout the hot paths of the code.
@pavel-kirienko sorry, I took off before reading your comment. I'll post the assembly tomorrow morning.
The errant behavior is explained by hrt_absolute_time expecting to be called before the timer overflows. IMO that is really undesirable behavior and we should use timer chaining to work around it.
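The wrap issue itself is easy to sketch on the host (illustrative only, not the PX4 hrt implementation): extending a short free-running counter to a monotonic clock only works if it is sampled at least once per wrap period.

```c
#include <stdint.h>

/* Illustrative sketch, not the PX4 hrt: extend a 16-bit free-running
 * timer to 64 bits. The implicit contract is that this is called at
 * least once per 65536-tick wrap; with interrupts off for longer than
 * that, whole wraps are silently lost - exactly the failure mode above. */
static uint64_t wrap_base;   /* accumulated full wraps, in ticks */
static uint16_t last_raw;    /* previous raw counter sample */

uint64_t absolute_ticks(uint16_t raw)
{
    if (raw < last_raw) {    /* counter wrapped since the last sample */
        wrap_base += 0x10000u;
    }
    last_raw = raw;
    return wrap_base + raw;
}
```

Timer chaining (cascading a second hardware timer as the high word) removes the sampling contract entirely, which is the mitigation suggested above.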
@NaterGator - the HRT is used for real time scheduling. It has an ISR that calls hrt_absolute_time() regularly, so this is not normally an issue; because the interrupts were off, the use case exceeded its abilities.
Regarding sparse matrix math, I would like to add sparse matrix math to matrix eventually. Probably wouldn't be too hard.
@davids5 exactly; I'm just worried about the case where something in the system is misbehaving and the hrt_absolute_time reference no longer represents a stable clock. Obviously we're not falling out of the sky because of any such potential issue... just trying to wrap my head around how it would actually manifest and if / how best to mitigate it.
Upstream has apparently noticed (or somebody filed a bugreport with the newlib folks): https://sourceware.org/git/gitweb.cgi?p=newlib-cygwin.git;a=commitdiff;h=baf32fb85fd6ef5e3e5975a357a40de72dc92e15
@pavel-kirienko assembly attached: firmware_nuttx_perfsqrt.zip
Ok, so I used some trickery to replace __ieee754_sqrtf with the upstream version, the results are more promising:
Processes: 23 total, 3 running, 20 sleeping
CPU usage: 42.96% tasks, 1.90% sched, 55.14% idle
DMA Memory: 5120 total, 0 used 0 peak
Uptime: 129.188s total, 72.898s idle
PID COMMAND CPU(ms) CPU(%) USED/STACK PRIO(BASE) STATE
0 Idle Task 72901 55.144 644/ 748 0 ( 0) READY
1 hpwork 3622 2.797 816/ 1780 192 (192) w:sig
2 lpwork 580 0.361 640/ 1780 50 ( 50) w:sig
3 init 3473 0.000 1664/ 2484 100 (100) w:sem
238 commander_low_prio 4 0.000 592/ 2996 50 ( 50) w:sem
38 gps 371 0.541 904/ 1372 220 (220) w:sem
73 sensors 4415 3.429 1344/ 1980 250 (250) w:sem
75 commander 3020 2.346 2864/ 3548 140 (140) w:sig
82 px4io 3563 2.978 1072/ 1380 240 (240) w:sem
90 mavlink_if0 2166 1.714 1632/ 2436 100 (100) READY
91 mavlink_rcv_if0 8 0.000 952/ 2100 175 (175) w:sem
108 mavlink_if1 1615 1.173 1632/ 2388 100 (100) w:sig
109 mavlink_rcv_if1 8 0.000 952/ 2100 175 (175) w:sem
144 mavlink_if2 12104 10.108 1640/ 2388 100 (100) w:sig
145 mavlink_rcv_if2 76 0.000 1456/ 2100 175 (175) w:sem
156 logger 570 0.361 1152/ 3532 250 (250) w:sem
157 log_writer_file 0 0.000 344/ 1060 60 ( 60) w:sem
196 ekf2 14223 11.732 5072/ 5780 250 (250) w:sem
199 fw_att_control 4785 4.061 952/ 1476 250 (250) w:sem
202 fw_pos_ctrl_l1 50 0.000 664/ 1676 250 (250) w:sem
211 navigator 13 0.000 796/ 1572 105 (105) w:sem
254 top 840 1.353 1248/ 1684 100 (100) RUN
245 mavlink_shell 6714 0.000 856/ 2020 100 (100) w:sem
nsh>
nsh> perf sqrt 100000
100000 iterations:
Libcall elapsed: 5700029 cyc, VSQRT.F32 elapsed: 1699998 cyc
nsh>
The disassembly of __ieee754_sqrtf is now:
080aa8d4 <__ieee754_sqrtf>:
80aa8d4: eeb1 0ac0 vsqrt.f32 s0, s0
80aa8d8: 4770 bx lr
80aa8da: Address 0x080aa8da is out of bounds.
Effectively I query gcc to resolve the libm that gcc will link, use objcopy to weaken the __ieee754_sqrtf symbol in a copy of libm, define our replacement __ieee754_sqrtf from upstream in an stm32-specific file (in this case board_identity.c), and link against the patched libm. This shows up as a ~6x improvement when invoking the patched libm sqrtf versus the unpatched one.
Nonetheless: idle time is not dramatically increased. It is slightly better, 2-3%. Is 2-3% worth implementing a cmake step to patch libm for STM32F4 targets?
If yes, I'll submit a PR.
@NaterGator
Is this the best of both worlds? No size hit of +4K, and a faster sqrtf - nice.
Can you use -Wl,-wrap,__ieee754_sqrtf and then just link the replacement?
Can the replacement live in src/drivers/stm32/math/hw_sqrtf.c or the like?
I take that back - that is incorrect; it is an ARMv7-M-with-FPU replacement. So this needs a new home that can be used by the M4F and possibly the M7F.
@pavel-kirienko, @NaterGator @LorenzMeier @jgoppert
I am rounding up the conclusions here, because I think this thread has some bad conclusions on my part, along with some original benchmarks that were skewed.
1) As @pavel-kirienko noted and @NaterGator demonstrated with the DWT_CYCCNT benchmark, the FPU version is ~20X faster than the SW version.
2) -ffast-math reduces the code size by 4736 bytes (master: 1440572, master_hw_fpu: 1435836 - I had this backwards above and will edit the posts).
3) -ffast-math replaces all but 2 calls to __ieee754_sqrtf.
4) There are still calls to __ieee754_sqrtf - here is one path that still calls __ieee754_sqrtf:
So I think the remaining questions are:
1) Do we deem -ffast-math to be safe in our application?
2) Should we port in @NaterGator's replacements from upstream in addition to using -ffast-math?
See this for background: https://github.com/PX4/Firmware/pull/6829#issuecomment-287359454