ArduPilot / ardupilot

ArduPlane, ArduCopter, ArduRover, ArduSub source
http://ardupilot.org/
GNU General Public License v3.0
10.84k stars 17.29k forks source link

AirSim + SITL Rover - EKF floating point exception #14469

Open Eraz1997 opened 4 years ago

Eraz1997 commented 4 years ago

Bug report

Issue details

When running SITL with rover simulation linked to AirSim (-f airsim-rover), it often happens that EKF (both 2 and 3) gets a Floating point exception. The problem is in function EKFGSF_yaw::updateRotMat(...) located in libraries/EKFGSF_yaw.cpp. The values of P matrix in EKF get huge apparently at random, then a floating point exception is thrown.

To reproduce the bug you have to simply connect AirSim with ardurover simulation in SITL and control the rover (I used dronekit library in Python).

Version master - updated to May 23, 2020

Platform [ ] All [ ] AntennaTracker [ ] Copter [ ] Plane [X] Rover [ ] Submarine

Airframe type airsim-rover

Hardware type SITL (ardurover)

Logs Logs of the exception dumped with dumpstack.sh

[New Thread 6028.0x26bc]
[New Thread 6028.0x2f2c]
[New Thread 6028.0x2f6c]
[New Thread 6028.0x9d4]
#0  0x00007ffbde9dfaa1 in ntdll!DbgBreakPoint () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#1  0x00007ffbdea0d49b in ntdll!DbgUiRemoteBreakin () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#2  0x00007ffbdca37bd4 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/Windows/System32/KERNEL32.DLL
No symbol table info available.
#3  0x00007ffbde9ace51 in ntdll!RtlUserThreadStart () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#4  0x0000000000000000 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 5 (Thread 6028.0x9d4):
#0  0x00007ffbde9dfaa1 in ntdll!DbgBreakPoint () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#1  0x00007ffbdea0d49b in ntdll!DbgUiRemoteBreakin () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#2  0x00007ffbdca37bd4 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/Windows/System32/KERNEL32.DLL
No symbol table info available.
#3  0x00007ffbde9ace51 in ntdll!RtlUserThreadStart () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#4  0x0000000000000000 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 4 (Thread 6028.0x2f6c):
#0  0x00007ffbde9dfa04 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#1  0x00007ffbde973fe0 in ntdll!RtlInitializeResource () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#2  0x0000000180048353 in _cygtls::call2(unsigned int (*)(void*, void*), void*, void*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#3  0x0000000180048404 in _cygtls::call(unsigned int (*)(void*, void*), void*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#4  0x00007ffbdca37bd4 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/Windows/System32/KERNEL32.DLL
No symbol table info available.
#5  0x00007ffbde9ace51 in ntdll!RtlUserThreadStart () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#6  0x0000000000000000 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 3 (Thread 6028.0x2f2c):
#0  0x00007ffbde9dc134 in ntdll!ZwReadFile () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#1  0x00007ffbdc655187 in ReadFile () from /cygdrive/c/Windows/System32/KERNELBASE.dll
No symbol table info available.
#2  0x0000000180105f59 in proc_waiter(void*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#3  0x0000000180047363 in cygthread::callfunc(bool) () from /usr/bin/cygwin1.dll
No symbol table info available.
#4  0x0000000180047926 in cygthread::stub(void*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#5  0x0000000180048353 in _cygtls::call2(unsigned int (*)(void*, void*), void*, void*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#6  0x0000000180048404 in _cygtls::call(unsigned int (*)(void*, void*), void*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#7  0x00007ffbdca37bd4 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/Windows/System32/KERNEL32.DLL
No symbol table info available.
#8  0x00007ffbde9ace51 in ntdll!RtlUserThreadStart () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#9  0x0000000000000000 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 2 (Thread 6028.0x26bc):
#0  0x00007ffbde9dc134 in ntdll!ZwReadFile () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#1  0x00007ffbdc655187 in ReadFile () from /cygdrive/c/Windows/System32/KERNELBASE.dll
No symbol table info available.
#2  0x000000018013e441 in wait_sig(void*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#3  0x0000000180047363 in cygthread::callfunc(bool) () from /usr/bin/cygwin1.dll
No symbol table info available.
#4  0x0000000180047926 in cygthread::stub(void*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#5  0x0000000180048353 in _cygtls::call2(unsigned int (*)(void*, void*), void*, void*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#6  0x0000000180048404 in _cygtls::call(unsigned int (*)(void*, void*), void*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#7  0x00007ffbdca37bd4 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/Windows/System32/KERNEL32.DLL
No symbol table info available.
#8  0x00007ffbde9ace51 in ntdll!RtlUserThreadStart () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#9  0x0000000000000000 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1 (Thread 6028.0x102c):
#0  0x00007ffbde9dcbc4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#1  0x00007ffbdc688037 in WaitForMultipleObjectsEx () from /cygdrive/c/Windows/System32/KERNELBASE.dll
No symbol table info available.
#2  0x00007ffbdc687f1e in WaitForMultipleObjects () from /cygdrive/c/Windows/System32/KERNELBASE.dll
No symbol table info available.
#3  0x00000001800485f5 in cygwait(void*, _LARGE_INTEGER*, unsigned int) () from /usr/bin/cygwin1.dll
No symbol table info available.
#4  0x0000000180173757 in wait4 () from /usr/bin/cygwin1.dll
No symbol table info available.
#5  0x00000001801439d1 in child_info_spawn::worker(char const*, char const* const*, char const* const*, int, int, int) () from /usr/bin/cygwin1.dll
No symbol table info available.
#6  0x0000000180144dcd in spawnve () from /usr/bin/cygwin1.dll
No symbol table info available.
#7  0x00000001801452e2 in spawnvp () from /usr/bin/cygwin1.dll
No symbol table info available.
#8  0x0000000180153a47 in system () from /usr/bin/cygwin1.dll
No symbol table info available.
#9  0x00000001801367ab in _sigfe () from /usr/bin/cygwin1.dll
No symbol table info available.
#10 0x0000000100555fdb in AP_HAL::dump_stack_trace() ()
No symbol table info available.
#11 0x0000000100414689 in _sig_fpe(int) ()
No symbol table info available.
#12 0x0000000180063acd in _cygtls::call_signal_handler() () from /usr/bin/cygwin1.dll
No symbol table info available.
#13 0x000000018013f027 in sig_send(_pinfo*, siginfo_t&, _cygtls*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#14 0x0000000180061ccd in exception::handle(_EXCEPTION_RECORD*, void*, _CONTEXT*, _DISPATCHER_CONTEXT*) () from /usr/bin/cygwin1.dll
No symbol table info available.
#15 0x00007ffbde9e11cf in ntdll!.chkstk () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#16 0x00007ffbde9aa209 in ntdll!RtlRaiseException () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#17 0x00007ffbde9dfe3e in ntdll!KiUserExceptionDispatcher () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
No symbol table info available.
#18 0x00000001004286d4 in EKFGSF_yaw::updateRotMat(Matrix3<float> const&, Vector3<float> const&) ()
No symbol table info available.
#19 0x0000000100428411 in EKFGSF_yaw::predictAHRS(unsigned char) ()
No symbol table info available.
#20 0x00000001004295ea in EKFGSF_yaw::predict(unsigned char) ()
No symbol table info available.
#21 0x000000010042898d in EKFGSF_yaw::update(Vector3<float> const&, Vector3<float> const&, float, float, bool, float) ()
No symbol table info available.
#22 0x0000000100480c1e in NavEKF2_core::runYawEstimatorPrediction() ()
No symbol table info available.
#23 0x000000010046a218 in NavEKF2_core::UpdateFilter(bool) ()
No symbol table info available.
#24 0x0000000100578c7c in NavEKF2::UpdateFilter() ()
No symbol table info available.
#25 0x00000001004c71be in AP_AHRS_NavEKF::update_EKF2() ()
No symbol table info available.
#26 0x00000001004c937d in AP_AHRS_NavEKF::update(bool) ()
No symbol table info available.
#27 0x000000010054cde4 in Rover::ahrs_update() ()
No symbol table info available.
#28 0x000000010045c42a in AP_Scheduler::run(unsigned int) ()
No symbol table info available.
#29 0x000000010045c930 in AP_Scheduler::loop() ()
No symbol table info available.
#30 0x0000000100427581 in AP_Vehicle::loop() ()
No symbol table info available.
#31 0x00000001005c239d in HAL_SITL::run(int, char* const*, AP_HAL::HAL::Callbacks*) const ()
No symbol table info available.
#32 0x00000001005f3ecb in main ()
No symbol table info available.
A debugging session is active.

    Inferior 1 [process 6028] will be detached.

Quit anyway? (y or n) [answered Y; input not from terminal]
[Inferior 1 (process 6028) detached]
rmackay9 commented 4 years ago

I wonder if you could try it out on Copter or Plane (in AirSim)? This is a new feature that we are planning to release with Plane-4.0.6 and Copter-4.0.4 so I'm just a little worried that we have a bug that could get out.

rajat2004 commented 4 years ago

@Eraz1997 I'm unable to reproduce this with the latest AirSim release as well as master. My testing was mostly just moving in manual, a bit in guided and then RTL. Could you post the script or some more details on how to reproduce this? Logs would also be helpful, you could comment out the #if 0 and #endif around this code, that should provide more info about the data Airsim is sending

There hasn't been any such change in master which would affect Rover, so I think using a release should also work in trying to reproduce. Releases can be downloaded from here - https://github.com/microsoft/AirSim/releases This doesn't require any Unreal Engine setup, building, etc. Just unzip and run the env.

Side note, Airsim rover needs some serious tuning, will try to give it a shot over the weekend, but haven't done that before, so any suggestions, etc. would be great!

Eraz1997 commented 4 years ago

Thank you for the support guys! @rmackay9 No way to reproduce the bug with Copter, it works fine! @rajat2004 I'm using latest AirSim realease and this is the code I used to control the Rover:

from __future__ import print_function
import time
from dronekit import connect, VehicleMode, LocationGlobalRelative

# Parse connection string
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--connect')
args = parser.parse_args()

connectionString = args.connect

if not connectionString:
    quit();

# Connect to vehicle
print('Connecting to vehicle on: %s' % connectionString)
vehicle = connect(connectionString, wait_ready=False)
vehicle.initialize(8, 30)
vehicle.wait_ready('autopilot_version')
vehicle.parameters['avoid_enable'] = 0
vehicle.parameters['PRX_TYPE'] = 0

# Arm
print("Basic pre-arm checks")
while not vehicle.is_armable:
    print(" Waiting for vehicle to initialise...")
    time.sleep(1)

print("Arming motors")
vehicle.armed = True

while not vehicle.armed:
    print(" Waiting for arming...")
    time.sleep(1)

print("Armed")

angle = 1500
throttle = 1500

pitch = 1500 # (only for copter)
yaw = 1500   # (only for copter)

@vehicle.on_message('RC_CHANNELS')
def chin_listener(self, name, message):
    angle = message.chan1_raw
    throttle = message.chan3_raw
    pitch = message.chan2_raw
    yaw = message.chan4_raw

# Manual control through RC channels
vehicle.channels.overrides['1'] = 1500 # angle
vehicle.channels.overrides['3'] = 1500 # throttle

vehicle.channels.overrides['2'] = 1500 # pitch (only for copter)
vehicle.channels.overrides['4'] = 1500 # yaw (only for copter)

import keyboard  # using module keyboard
while True:  # making a loop
    if keyboard.is_pressed('q'):  # if key 'q' is pressed
        print('Exit!')
        break  # finishing the loop
    if keyboard.is_pressed('w'):
        throttle += 50
        if(throttle > 2000):
            throttle = 2000
        vehicle.channels.overrides['3'] = throttle
    if keyboard.is_pressed('s'):
        throttle -= 50
        if(throttle < 1000):
            throttle = 1000
        vehicle.channels.overrides['3'] = throttle
    if keyboard.is_pressed('d'):
        angle += 50
        if(angle > 2000):
            angle = 2000
        vehicle.channels.overrides['1'] = angle
    if keyboard.is_pressed('a'):
        angle -= 50
        if(angle < 1000):
            angle = 1000
        vehicle.channels.overrides['1'] = angle

    # only for Copter
    if keyboard.is_pressed('i'):
        pitch += 50
        if(pitch > 2000):
            pitch = 2000
        vehicle.channels.overrides['2'] = pitch
    if keyboard.is_pressed('k'):
        pitch -= 50
        if(pitch < 1000):
            pitch = 1000
        vehicle.channels.overrides['2'] = pitch
    if keyboard.is_pressed('l'):
        yaw += 50
        if(yaw > 2000):
            yaw = 2000
        vehicle.channels.overrides['4'] = yaw
    if keyboard.is_pressed('j'):
        yaw -= 50
        if(yaw < 1000):
            yaw = 1000
        vehicle.channels.overrides['4'] = yaw

    time.sleep(0.2)
    print("Angle: ", angle, " - Throttle: ", throttle, " | Pitch: ", pitch, " - Yaw: ", yaw, end="\r")

# Exit
print("Close vehicle object")
vehicle.close()

Here there is my log file, I removed the #if 0 and #endif in the code as you suggested

The settings.json configuration file of AirSim is the same described in the official documentation of Ardupilot

rajat2004 commented 4 years ago

@Eraz1997 Thanks for the script! After having a brief look at the log you provided to see the RC values, became quite clear as to what to do to trigger the exception. The main thing required was extreme RC input, so my method was to use manual mode with rc 3 1700 and rc 1 1300 first to turn the vehicle to the left so that there's a straight free path ahead. Then increase Throttle to 2000, wait for a few seconds so that it picks up speed and then apply slight Roll of say 1300 and see what happens! I hadn't tested such high values before so wasn't aware of this.

I was unable to reproduce this with AP built with debugging enabled, with both being 2000 also. But with the default build happened fairly quickly

See Update at the end, has the backtrace

airsim_rover_exception.zip

There are 2 log files, 5 is the one with debugging enabled, as you can see, pushed it fairly hard but no exception. 8 is without debugging and has the exception.

Maybe a GCC optimisation problem? Do let me know if I can provide more info or testing. Ping @rmackay9 @tridge

BTW, I reproduced this with the latest AirSim release, so there shouldn't be any need to build AirSim from source, for those trying to reproduce this

Update: I'm actually able to reproduce with AP's default SITL also following the method I described above to increase Throttle to 2000 and then add Roll In SITL, it's happening with debugging also, maybe I didn't push it enough when trying with AirSim. Backtrace -

Program received signal SIGFPE, Arithmetic exception.
0x000055555576b1ca in EKFGSF_yaw::correct (this=0x555555a96420, 
    mdl_idx=1 '\001') at ../../libraries/AP_NavEKF/EKFGSF_yaw.cpp:458
458         K[1][0] = -P10*P11*t7+P10*t7*t8;
(gdb) bt
#0  0x000055555576b1ca in EKFGSF_yaw::correct (this=0x555555a96420, 
    mdl_idx=1 '\001') at ../../libraries/AP_NavEKF/EKFGSF_yaw.cpp:458
#1  0x000055555576a09f in EKFGSF_yaw::update (this=0x555555a96420, delAng=..., 
    delVel=..., delAngDT=-1.35348309e+17, delVelDT=1.35348309e+17, runEKF=247, 
    TAS=-14798104) at ../../libraries/AP_NavEKF/EKFGSF_yaw.cpp:120
#2  0x00005555557099ab in NavEKF2_core::UpdateFilter (this=0x555555a9675c, 
    predict=true) at ../../libraries/AP_NavEKF2/AP_NavEKF2_core.cpp:594
#3  0x00005555555e52ab in NavEKF2::UpdateFilter (
    this=0x555555a1ab48 <rover+8744>)
    at ../../libraries/AP_NavEKF2/AP_NavEKF2.cpp:782
#4  0x000055555559f4d7 in AP_AHRS_NavEKF::update_EKF2 (
    this=0x555555a1a830 <rover+7952>)
    at ../../libraries/AP_AHRS/AP_AHRS_NavEKF.cpp:204
#5  0x000055555559f24a in AP_AHRS_NavEKF::update (
    this=0x555555a1a830 <rover+7952>, skip_ins_update=false)
    at ../../libraries/AP_AHRS/AP_AHRS_NavEKF.cpp:135
#6  0x000055555558cbf4 in Rover::ahrs_update (this=0x555555a18920 <rover>)
    at ../../Rover/Rover.cpp:195
#7  0x000055555558d507 in Functor<void>::method_wrapper<Rover, &Rover::ahrs_update> (obj=0x555555a18920 <rover>) at ../../libraries/AP_HAL/utility/functor.h:88
#8  0x00005555555e342a in Functor<void>::operator() (
    this=0x555555a01660 <Rover::scheduler_tasks+32>)
    at ../../libraries/AP_HAL/utility/functor.h:54
---Type <return> to continue, or q <return> to quit---
#9  0x00005555555fc909 in AP_Scheduler::run (this=0x555555a18998 <rover+120>, time_available=20000) at ../../libraries/AP_Scheduler/AP_Scheduler.cpp:206
#10 0x00005555555fcdd1 in AP_Scheduler::loop (this=0x555555a18998 <rover+120>) at ../../libraries/AP_Scheduler/AP_Scheduler.cpp:327
#11 0x00005555555ffae6 in AP_Vehicle::loop (this=0x555555a18920 <rover>) at ../../libraries/AP_Vehicle/AP_Vehicle.cpp:112
#12 0x0000555555679877 in HAL_SITL::run (this=0x555555a285e0 <AP_HAL::get_HAL()::hal>, argc=9, argv=0x7fffffffdcd8, callbacks=0x555555a18920 <rover>)
    at ../../libraries/AP_HAL_SITL/HAL_SITL_Class.cpp:219
#13 0x000055555558d23f in main (argc=9, argv=0x7fffffffdcd8) at ../../Rover/Rover.cpp:372

Log - sitl_exception.zip

rajat2004 commented 4 years ago

I've opened https://github.com/ArduPilot/ardupilot/pull/14476 which adds an autotest to reproduce this

PR has been merged, use ./Tools/autotest/autotest.py --gdb --debug build.Rover test.Rover.DriveMaxRCIN to run the test since it's currently disabled. Also. I think the title can be changed since it's not specific to Airsim

IamPete1 commented 4 years ago

its probably worth noting that to get the error your rotating at 450 deg/second. The SITL copter will only do about 200 max, so I suspect this could be reproduced on any vehicle. I think its fair to say that this is outside the expected operating range, so the fix is probably just to check the yaw rate is less than some number before doing the EKFGSF_yaw::updateRotMat()

cthoma20-arc commented 4 years ago

Hi!

I was running into this issue as well, and not just when doing high speed turns. I looked into it and I think I found the problem. For me, it was crashing or behaving strangely when turning 180 degrees around. I think the issue is that AirSim calculates the angular velocity from the delta of euler angles, but it was not accounting for the case where the angle wraps from 180 degrees to -180 degrees, so when this happened it causes a very large delta, making the angular velocity very high, causing the floating point exception. I created a pull request for AirSim here: https://github.com/microsoft/AirSim/pull/2956

This still doesn't help with the MaxRCInput test, but it is a related problem, so I thought I would mention it. I haven't run into the EKF floating point exception "in the wild" since making this change.

Hopefully that will be helpful to someone. Thanks!