UBC-Thunderbots / Software

Robot Soccer Playing AI
http://www.ubcthunderbots.ca
GNU Lesser General Public License v3.0
53 stars 99 forks source link

Fix Floating Point Exception Errors #2088

Closed jonathanlew closed 3 years ago

jonathanlew commented 3 years ago

Description of the task

There are floating point exception errors that we ignore, e.g. sqrt of a negative number, overflow, etc. The most visible way this fails is when it causes the robot state to contain nans (not a number), which causes robot costs to be nans, which causes the munkres algorithm, and consequently AI, to stall.

Floating point errors can be detected automatically by adding feenableexcept(FE_INVALID | FE_OVERFLOW); to the main function or where it has to be called, e.g. in the simulated test fixture constructor. More info:

Currently, every simulated test suffers from at least one floating point exception error, typically in physics or primitives functions.

Examples: float overflow

Thread 1 "shoot_or_pass_p" received signal SIGFPE, Arithmetic exception.
0x00007ffff65c12a6 in app_trajectory_planner_impl_getMaximumSpeedProfile (path=..., num_elements=10, t_start=0, t_end=1, max_allowable_acceleration=3, speed_cap=2, 
    max_allowable_speed_profile=0x7ffffffe17f0) at firmware/app/control/trajectory_planner_impl.c:25
25          const float max_speed = sqrtf(max_allowable_acceleration * radius_of_curvature);
(gdb) p radius_of_curvature
$1 = 3.40282347e+38

sqrt of negative value:

#2  0x00007ffff57ecf39 in shared_physics_getFinalSpeed (initial_speed=0, displacement=-0.0872662738, acceleration=10) at firmware/shared/physics.c:170
170     return sqrtf(powf(initial_speed, 2) + 2 * displacement * acceleration);

Acceptance criteria

Blocked By

MathewMacDougall commented 3 years ago

Ooooh floating point is always really tricky.

For the first error, it's possible the radius_of_curvature is very large because the path being navigated is a straight line (where the radius of curvature is technically infinite). IIRC this was thought of and there was some limit to it, but maybe it's not applied everywhere or there's some other way for it to slip through.

Edit: There is a check in firmware/shared/math/polynomial_2d.c where we check for division by zero and set the result to FLT_MAX. I suspect this line is being hit and FLT_MAX is being returned, since the value of FLT_MAX is very close to what's in the trace above (1e37 vs 3e38).

https://www.tutorialspoint.com/c_standard_library/float_h.htm

I'm not as sure what the best fix is here (@garethellis0 for ideas), but I think checking the value of radius_of_curvature against some threshold value and then either calculating max_speed as-is (assuming you're on a curve and are limited in your max speed), otherwise just assuming the max speed is the previous max speed + robot max acceleration (not limited by curvature). Do not just set the value directly to ROBOT_MAX_SPEED because the values need to be "continuous" in the array in terms of what is physically achievable. I'd probably suggest a threshold a few orders of magnitude less than FLT_MAX (but beware the value will be different when compiled for different systems) so that nearly all the time the existing calculation will be used, but should prevent overflow. (this assumed that max_allowable_acceleration stays around it's current order of magnitude. If it significantly increases, the threshold should decrease).

For the second error given how (relatively) large the displacement value is I'm guessing that's a programming error somewhere where we aren't taking the absolute value of something. -0.08 seems like much too large a value to occur "naturally" though accumulation of floating-point error. I took a quick look and it seems like displacement comes from the segment_lengths. There are both linear and angular segment lengths. The calculations for linear seem fine (impossible to be negative). However for angular in firmware/app/control/trajectory_planner_impl.c:53 I suspect this is where the negative values are coming from. I'd bet adding fabs on this line would fix the issue. If you imagine a regular "1D" polynomial, I think this error occurs at the "top" of the curve where the first derivative flips from positive to negative (eg. when x=0 for y=-(x^2) + 3). @garethellis0 please sanity check me here.

garethellis0 commented 3 years ago

I'll double check @MathewMacDougall's comment later, but just to confirm, you're not going to enable floating point exceptions for the actual binaries right? I'd recommend against it because it's unlikely that we'd catch all issues in simulated tests and segfaults during an actual game should be avoided at all costs.

garethellis0 commented 3 years ago

Another idea here is that you could make a 'tbots_gtest_main' target that enables the exceptions for you and then just have tests depend on that instead of the gtest_main target.

jonathanlew commented 3 years ago

I'll double check @MathewMacDougall's comment later, but just to confirm, you're not going to enable floating point exceptions for the actual binaries right? I'd recommend against it because it's unlikely that we'd catch all issues in simulated tests and segfaults during an actual game should be avoided at all costs.

Yes, that's right, we'd only enable the exceptions in tests