ArduPilot / ardupilot

ArduPlane, ArduCopter, ArduRover, ArduSub source
http://ardupilot.org/
GNU General Public License v3.0
10.46k stars 17.11k forks source link

SITL: crash on Apple M1 when the simulator is compiled for arm64 and --speedup is not specified at startup #19588

Open ntamas opened 2 years ago

ntamas commented 2 years ago

Bug report

The SITL simulator crashes on macOS after connecting to it if --speedup is not set to an explicit value (e.g,. 1) from the command line.

Steps to reproduce:

  1. Get a Mac with an Apple Silicon CPU.
  2. Compile the SITL on macOS for arm64 and run it with ./build/sitl/bin/arducopter --model copter
  3. Connect to its TCP port
  4. Notice that the SITL crashes with "illegal instruction"

Version

ArduCopter 4.1.2, after cherry-picking the commit that fixes the floating-point exceptions on Apple Silicon.

Platform [X] All [ ] AntennaTracker [ ] Copter [ ] Plane [ ] Rover [ ] Submarine

Airframe type N/A.

Hardware type SITL

Logs

❯ build/sitl/bin/arducopter --model copter
Suggested EK3_BCOEF_* = 16.288, EK3_MCOEF = 0.209
Starting sketch 'ArduCopter'
Starting SITL input
Using Irlock at port : 9005
bind port 5760 for 0
Serial port 0 on TCP port 5760
Waiting for connection ....
Connection on serial port 5760
bind port 5762 for 2
Serial port 2 on TCP port 5762
bind port 5763 for 3
Serial port 3 on TCP port 5763
MKFIFO failed with File exists
MKFIFO failed with File exists
Home: -35.363262 149.165237 alt=584.000000m hdg=353.000000
Smoothing reset at 0.001
Closed connection on serial port 0
Waiting for connection ....
Connection on serial port 5760
bus.bus=0 address=0x55
bus.bus=1 address=0x55
bus.bus=0 address=0x38
bus.bus=0 address=0x39
bus.bus=1 address=0x38
bus.bus=1 address=0x39
bus.bus=2 address=0x38
bus.bus=2 address=0x39
bus.bus=3 address=0x38
bus.bus=3 address=0x39
bus.bus=0 address=0x38
bus.bus=0 address=0x39
validate_structures:490: Validating structures
zsh: illegal hardware instruction  build/sitl/bin/arducopter --model copter

Additional info

lldb pins down the crash to this part of the code:

Process 29663 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e390008)
    frame #0: 0x0000000100166ad8 arducopter`AP_Logger_File::io_thread_alive(this=0x0000000100a04d00) const at AP_Logger_File.cpp:1026:20
   1023     // SITL speedup options, so we allow for it here.
   1024     SITL::SITL *sitl = AP::sitl();
   1025     if (sitl != nullptr) {
-> 1026         timeout_ms *= sitl->speedup;
   1027     }
   1028 #endif
   1029     return (AP_HAL::millis() - _io_timer_heartbeat) < timeout_ms;
Target 0: (arducopter) stopped.

Printing the value of sitl->speedup reveals that it's -1, so we end up with a negative timeout:

(lldb) print sitl->speedup
(AP_Float) $0 = (_value = -1)
IamPete1 commented 2 years ago

Maybe you have a bad value in eeprom, try launching with -w to wipe.

ntamas commented 2 years ago

I forgot to mention that the same thing happens even with a clean EEPROM. I guess the problem is that the default value of the speedup parameter in libraries/SITL/SITL.cpp is -1, so if it is not overridden from the command line or from a default param file, it probably remains -1.

Edit: it seems like SIM_Aircraft.cpp sets its own speedup parameter to 1.0 by default and it is responsible for updating the SITL speedup parameter in fill_fdm(), but the code looks like this:

if (is_equal(last_speedup, -1.0f) && !is_equal(get_speedup(), 1.0f)) {
    sitl->speedup = get_speedup();
}

which basically means that the SITL speedup gets updated only if the new value is not 1.0.

peterbarker commented 2 years ago

Key question: why is it different on MacOSX?

khancyr commented 2 years ago

What it build with clang ?

peterbarker commented 2 years ago

What it build with clang ?

I've tried to replicate on Linux with both g++ and clang++ without luck.

ntamas commented 2 years ago

I managed to confirm that the issue arises only when compiling for Apple M1 (i.e. arm64). Forcing the SITL to be compiled for x86_64 produces an executable that does not crash.

Posting the Clang version number here for sake of posterity:

Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: arm64-apple-darwin21.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
ntamas commented 2 years ago

FWIW, stil->speedup in AP_Logger_File.cpp in the io_thread_alive() function is still -1 even when compiling for x86_64, so it still yields a negative timeout value in timeout_ms. I believe that a negative timeout at this part of the code is incorrect; the intention was probably to use 1 if no timeout was specified in the command line.

hendjoshsr71 commented 2 years ago

@ntamas Would you like to put a PR in? This seems quite reasonable that at this line we could get some oddities here?

ntamas commented 2 years ago

I've filed a PR but this is more like a stopgap measure than an actual fix. I believe that the real problem is somewhere in SIM_Aircraft::fill_fdm(), which explicitly excludes updating stil->speedup when the speedup is equal to 1 (the default), and there must be a reason for that.

timtuxworth commented 2 years ago

Do you have MavProxy working on your M1 Mac? I haven't been able to get that to work so can't get to this point.

ntamas commented 2 years ago

No, I did not bother setting that up. mavproxy depends on wxPython, which does not have Apple SIlicon wheels yet so I would have had to complie that from source. I just connect straight to the main TCP port of the simulated instance from Skybrush and use that to interact with the simulated instance.


Edit: I tried to compile wxPython manually but there are tons of additional dependencies that I needed to install from Homebrew and I was still getting errors so I gave up for the time being. I think that the pure command-line parts of MAVProxy would probably work on Apple M1 but it would need to declare wxPython and related components as optional dependencies to facilitate using MAVProxy without a GUI.


Edit 2: forgot to mention; to reproduce this error you don't need MAVProxy, just start the SITL directly and then connect to TCP port 5760 with telnet or socat; this is enough to get to the point that causes the crash.

ntamas commented 2 years ago

I have tried this again with ArduCopter 4.2.0-rc2. sitl->speedup is still -1 in AP_Logger_File, and it still doesn't cause a crash when compiled for x86_64. I cannot test this on arm64, though, because the SITL does not compile cleanly for arm64. I need to remove all instances of PACKED from LogStructure.h to make it compile, otherwise I get tons of these warnings and ld bails out:

ld: warning: pointer not aligned at address 0x10036218A (__ZN6Copter13log_structureE + 2 from ArduCopter/Log.cpp.44.o)
ld: unaligned pointer(s) for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Removing PACKED makes the code compile, but then the SITL shows warnings about the log structure being invalid, and the logging never starts so it never hits the problematic part in AP_Logger_File.

ntamas commented 2 years ago

Okay, so I figured out that it is enough to remove PACKED from struct LogStructure in LogStructure.h (everything else can remain PACKED), and then the code compiles on Apple M1 as native arm64, and the crash is still there.

timtuxworth commented 2 years ago

I'm getting this same crash on my M1Pro even with --speedup 1 - but on a different line:


thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e390008)
    frame #0: 0x00000001001fd0e0 arduplane AP_ESC_Telem_SITL::update(this=0x0000600000010000) at AP_ESC_Telem_SITL.cpp:45:23
   42   
   43   #if HAL_WITH_ESC_TELEM
   44       for (uint8_t i = 0; i < sitl->state.num_motors; i++) {
-> 45           update_rpm(i, sitl->state.rpm[sitl->state.vtol_motor_start+i]);
   46       }
   47   #endif
timtuxworth commented 2 years ago

I have discovered that the cause of the problem in AP_ESC_Telem_SITL.cpp is that update_rpm() was taking a uint16_t as a parameter whereas sitl->state.rpm[] contained floats.

@ntamas I notice that sitl->speedup is an AP_Float, while timeout_ms is a uint32_t. It seems to me that it might help to rewrite the calculation to make it clearer when the conversion from float to int should be happening: e.g. timeout_ms *= sitl->speedup; perhaps could be written as: timeout_ms = (uint32_t)(sitl->speedup * (float)timeout_ms)

At least - I think that's what is supposed to be happening.

timtuxworth commented 2 years ago

Key question: why is it different on MacOSX?

From the research I've done, it looks like the compiler on MacOSX is building in stricter type checking on parameter passing on implicit integer/floating point conversion which is generating a floating point exception. I found this quote:

It seems that on the M1, MacOS does not send a SIGFPE, but rather a SIGILL, when floating point exceptions are unblocked.

I think SIGILL = EXC_BAD_INSTRUCTION - so this is actually a floating point exception. I think it's possibly caused by a float being passed when the method requires a 2 byte uint16_t. If the float is passed as a float it might be 4 bytes, but I think c compilers sometimes pass floats as doubles, so it might even be 8 bytes on the stack.

ntamas commented 2 years ago

I have discovered that the cause of the problem in AP_ESC_Telem_SITL.cpp is that update_rpm() was taking a uint16_t as a parameter whereas sitl->state.rpm[] contained floats

Well, but the compiler should know about that and implicitly convert the float to an uint16_t when calling update_rpm(), right?

I notice that sitl->speedup is an AP_Float, while timeout_ms is a uint32_t. It seems to me that it might help to rewrite the calculation to make it clearer when the conversion from float to int should be happening:

That would be better, but still, a negative speedup does not make sense at that point and I don't know what an implicit conversion of a negative float to an uint32_t would do at that point of the code, so it still has to be figured out why sitl->speedup can remain negative if I don't specify the speedup explicitly at startup.

xlla commented 2 years ago

While compile build/sitl/bin/arducopter, I saw these warnings

[723/882] Compiling libraries/AP_AHRS/AP_AHRS_DCM.cpp
../../libraries/AP_PiccoloCAN/piccolo_protocol/scaledencode.c:71:23: warning: implicit conversion from 'unsigned long' to 'float' changes value from 4294967295 to 4294967296 [-Wimplicit-const-int-float-conversion]
    if(scaledvalue >= 4294967295ul)
                   ~~ ^~~~~~~~~~~~
../../libraries/AP_PiccoloCAN/piccolo_protocol/scaledencode.c:99:23: warning: implicit conversion from 'unsigned long' to 'float' changes value from 4294967295 to 4294967296 [-Wimplicit-const-int-float-conversion]
    if(scaledvalue >= 4294967295ul)
                   ~~ ^~~~~~~~~~~~
../../libraries/AP_PiccoloCAN/piccolo_protocol/scaledencode.c:128:27: warning: implicit conversion from 'long' to 'float' changes value from 2147483647 to 2147483648 [-Wimplicit-const-int-float-conversion]
        if(scaledvalue >= 2147483647l)
                       ~~ ^~~~~~~~~~~
../../libraries/AP_PiccoloCAN/piccolo_protocol/scaledencode.c:163:27: warning: implicit conversion from 'long' to 'float' changes value from 2147483647 to 2147483648 [-Wimplicit-const-int-float-conversion]
        if(scaledvalue >= 2147483647l)
                       ~~ ^~~~~~~~~~~
4 warnings generated.

libraries/AP_PiccoloCAN/piccolo_protocol/scaledencode.c:71

    // Make sure number fits in the range
    if(scaledvalue >= 4294967295ul)
        number = 4294967295ul;
    else if(scaledvalue <= 0)
        number = 0;
    else
        number = (uint32_t)(scaledvalue + 0.5f); // account for fractional truncation

should it be

if(scaledvalue >= 4294967295.0)

or

if((scaledvalue - 4294967295.0) > 0.0000001)

I am encounter "Floating point exception" while try takeoff in SITL, but I can't find exactly reason, please see #20869 , I am build on macOS using clang 13, while build Rover for physical board like F4BY, no warning at all.