PX4 / PX4-Autopilot

PX4 Autopilot Software
https://px4.io
BSD 3-Clause "New" or "Revised" License

macOS SITL broken after daemon change #10839

Closed julianoes closed 5 years ago

julianoes commented 5 years ago

SITL doesn't work anymore on macOS.

git bisect points to #10766.

I'll look into it.

julianoes commented 5 years ago

px4 starting.

INFO  [px4] Calling startup script: /bin/sh etc/init.d-posix/rcS 0
  CAL_ACC1_ID: curr: 0Process 65429 stopped
* thread #2, name = 'lpwork', stop reason = signal SIGCONT
    frame #0: 0x00007fff5b089d82 libsystem_kernel.dylib`__semwait_signal + 10
libsystem_kernel.dylib`__semwait_signal:
->  0x7fff5b089d82 <+10>: jae    0x7fff5b089d8c            ; <+20>
    0x7fff5b089d84 <+12>: movq   %rax, %rdi
    0x7fff5b089d87 <+15>: jmp    0x7fff5b080b0e            ; cerror
    0x7fff5b089d8c <+20>: retq
  thread #7, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010001d5ee px4`uORB::Manager::get_device_master(this=0x0000000000000000) at uORBManager.cpp:83
   80
   81   uORB::DeviceMaster *uORB::Manager::get_device_master()
   82   {
-> 83       if (!_device_master) {
   84           _device_master = new DeviceMaster();
   85
   86           if (_device_master == nullptr) {
Target 0: (px4) stopped.

(lldb) bt
* thread #2, name = 'lpwork', stop reason = signal SIGCONT
  * frame #0: 0x00007fff5b089d82 libsystem_kernel.dylib`__semwait_signal + 10
    frame #1: 0x00007fff5b004724 libsystem_c.dylib`nanosleep + 199
    frame #2: 0x00007fff5b004618 libsystem_c.dylib`usleep + 53
    frame #3: 0x00000001002bbcfe px4`work_process(wqueue=0x0000000100375f88, lock_id=1) at work_thread.c:185
    frame #4: 0x00000001002bbb35 px4`work_lpthread(argc=0, argv=0x0000000101100070) at work_thread.c:296
    frame #5: 0x00000001002b9446 px4`entry_adapter(ptr=0x0000000101100050) at px4_posix_tasks.cpp:105
    frame #6: 0x00007fff5b251661 libsystem_pthread.dylib`_pthread_body + 340
    frame #7: 0x00007fff5b25150d libsystem_pthread.dylib`_pthread_start + 377
    frame #8: 0x00007fff5b250bf9 libsystem_pthread.dylib`thread_start + 13

(lldb) c
Process 65429 resuming
Process 65429 stopped
* thread #7, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010001d5ee px4`uORB::Manager::get_device_master(this=0x0000000000000000) at uORBManager.cpp:83
   80
   81   uORB::DeviceMaster *uORB::Manager::get_device_master()
   82   {
-> 83       if (!_device_master) {
   84           _device_master = new DeviceMaster();
   85
   86           if (_device_master == nullptr) {
Target 0: (px4) stopped.

(lldb) bt
* thread #7, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010001d5ee px4`uORB::Manager::get_device_master(this=0x0000000000000000) at uORBManager.cpp:83
    frame #1: 0x000000010001de60 px4`uORB::Manager::node_advertise(this=0x0000000000000000, meta=0x000000010036bda8, instance=0x0000000000000000, priority=75) at uORBManager.cpp:313
  * frame #2: 0x000000010001da9c px4`uORB::Manager::node_open(this=0x0000000000000000, meta=0x000000010036bda8, advertiser=true, instance=0x0000000000000000, priority=75) at uORBManager.cpp:364
    frame #3: 0x000000010001d807 px4`uORB::Manager::orb_advertise_multi(this=0x0000000000000000, meta=0x000000010036bda8, data=0x0000700007d29f18, instance=0x0000000000000000, priority=75, queue_size=1) at uORBManager.cpp:175
    frame #4: 0x0000000100016080 px4`uORB::Manager::orb_advertise(this=0x0000000000000000, meta=0x000000010036bda8, data=0x0000700007d29f18, queue_size=1) at uORBManager.hpp:121
    frame #5: 0x000000010001602a px4`::orb_advertise(meta=0x000000010036bda8, data=0x0000700007d29f18) at uORB.cpp:45
    frame #6: 0x00000001002ce0de px4`_param_notify_changes() at parameters.cpp:305
    frame #7: 0x00000001002cf266 px4`param_set_internal(param=40, val=0x0000700007d29ff4, mark_saved=false, notify_changes=true) at parameters.cpp:772
    frame #8: 0x00000001002ced32 px4`::param_set(param=40, val=0x0000700007d29ff4) at parameters.cpp:793
    frame #9: 0x0000000100065f6f px4`do_set(name="CAL_ACC1_ID", val="1310728", fail_on_not_found=false) at param.cpp:649
    frame #10: 0x00000001000655d8 px4`::param_main(argc=4, argv=0x0000700007d2a060) at param.cpp:245
    frame #11: 0x00000001002bcc2e px4`px4_daemon::Pxh::process_line(line="param set CAL_ACC1_ID 1310728", silently_fail=true) at pxh.cpp:102
    frame #12: 0x00000001002c9bbe px4`px4_daemon::Server::_handle_client(arg=0x0000000000000005) at server.cpp:256
    frame #13: 0x00007fff5b251661 libsystem_pthread.dylib`_pthread_body + 340
    frame #14: 0x00007fff5b25150d libsystem_pthread.dylib`_pthread_start + 377
    frame #15: 0x00007fff5b250bf9 libsystem_pthread.dylib`thread_start + 13

(lldb) t 7
* thread #7, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010001d5ee px4`uORB::Manager::get_device_master(this=0x0000000000000000) at uORBManager.cpp:83
   80
   81   uORB::DeviceMaster *uORB::Manager::get_device_master()
   82   {
-> 83       if (!_device_master) {
   84           _device_master = new DeviceMaster();
   85
   86           if (_device_master == nullptr) {
(lldb) frame variable
(uORB::Manager *) this = 0x0000000000000000

julianoes commented 5 years ago

Turns out the daemon client returns immediately, which is probably what's causing the segfault as well as the other issues I'm seeing.

julianoes commented 5 years ago

It looks like the issue is that poll() reports POLLHUP too early on macOS. poll() does not behave the same everywhere; what it reports depends on the OS: https://www.greenend.org.uk/rjk/tech/poll.html
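
To check what poll() actually reports on a given OS, a small probe along these lines may help. It is only a sketch, not PX4 code: the function name `revents_after_peer_half_close` and the use of a local socketpair() are assumptions made for illustration. One end half-closes with shutdown(SHUT_WR), the other end is polled for POLLIN, and the bits that come back in revents are the part that can differ between systems.

```cpp
#include <cassert>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical probe (not PX4 code): returns the revents that poll() reports
// on one end of a socket pair after the other end half-closes its write side.
// Returns -1 if setup fails or poll() does not report the descriptor.
int revents_after_peer_half_close(int timeout_ms)
{
    // A negative timeout would make poll() block forever; reject it.
    assert(timeout_ms >= 0);

    int fds[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }

    // The "client" end signals that it has nothing more to send.
    if (shutdown(fds[1], SHUT_WR) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    // The "server" end only asks for POLLIN; POLLHUP is reported
    // by poll() whether or not it is requested.
    struct pollfd pfd = {};
    pfd.fd = fds[0];
    pfd.events = POLLIN;

    const int ready = poll(&pfd, 1, timeout_ms);

    close(fds[0]);
    close(fds[1]);

    return ready == 1 ? pfd.revents : -1;
}
```

Printing the returned value on Linux and on macOS and comparing the POLLIN and POLLHUP bits shows whether the two systems disagree for this case.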

m-ou-se commented 5 years ago

What's the minimal init script or commands you run to reproduce this?

m-ou-se commented 5 years ago

I'd be very surprised if POLLHUP came too early. It coming too late (well, not at all) could make sense if the shutdown() doesn't trigger it on macOS. But in that case you shouldn't experience many problems, only that the thread doesn't get killed when the client disconnects.

m-ou-se commented 5 years ago

I was looking at the wrong shutdown. It wasn't the shutdown of the server thread that caused the problem, but the shutdown(WR) of the client. But that one is not really needed. Removing it should solve the problem: https://github.com/PX4/Firmware/pull/10846
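
As a rough illustration of a client that gets by without the half-close, here is a minimal sketch. It is not the code from the linked pull request or the PX4 daemon client: `send_command` is a hypothetical helper, and the newline-terminated framing is an assumption chosen so that the server can tell where the command ends without the client calling shutdown(SHUT_WR). The socket stays fully open until the reply has been read.

```cpp
#include <cassert>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical client helper (not the PX4 daemon client): sends one
// newline-terminated command and reads one newline-terminated reply.
// It deliberately never calls shutdown(fd, SHUT_WR); the newline marks
// the end of the command instead of a half-close.
// Returns true and fills `reply` (without the newline) on success.
bool send_command(int fd, const std::string &command, std::string &reply)
{
    // The framing assumed here cannot carry embedded newlines.
    assert(command.find('\n') == std::string::npos);

    const std::string line = command + "\n";
    size_t sent = 0;

    // write() may accept fewer bytes than requested, so loop until done.
    while (sent < line.size()) {
        const ssize_t n = write(fd, line.data() + sent, line.size() - sent);

        if (n <= 0) {
            return false;
        }

        sent += static_cast<size_t>(n);
    }

    reply.clear();
    char c = 0;

    // Read byte by byte until the terminating newline arrives.
    while (true) {
        const ssize_t n = read(fd, &c, 1);

        if (n <= 0) {
            return false; // EOF or error before the reply was complete
        }

        if (c == '\n') {
            return true;
        }

        reply.push_back(c);
    }
}
```

The sketch ignores EINTR and reads one byte at a time for brevity; a real client would likely buffer reads and retry interrupted calls.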