PX4 / PX4-Autopilot

PX4 Autopilot Software
https://px4.io
BSD 3-Clause "New" or "Revised" License
8.2k stars 13.37k forks source link

[Bug] PX4 hardfaults with bad data from motors in UAVCAN #22031

Open alistair-blueflite opened 1 year ago

alistair-blueflite commented 1 year ago

Describe the bug

System hardfaulted when connecting a bad CAN motor (the motor was faulty, in that it was not sending correct CAN data).

From observing the hardfault log, it was found that the system hard-faulted in 'uc_stm32h7_can.cpp', in either 'can1_irq()' or 'can2_irq()' by calling PANIC().

Below is the hardfault log.

fault_2023_08_30_21_00_51.log

To Reproduce

Expected behavior

Screenshot / Media

No response

Flight Log

n/a

Software Version

custom build based on 1.14-beta2 (the uavcan drivers are unchanged in the custom build)

Flight controller

Holybro 6X

Vehicle type

Hybrid VTOL

How are the different components wired up (including port information)

No response

Additional context

No response

tstastny commented 1 year ago

@niklaut FYI

niklaut commented 1 year ago

Could you share your ELF file, otherwise its hard for me to find the correct location considering its a custom build?

alistair-blueflite commented 1 year ago

px4_fmu-v6x_default.zip

julianoes commented 1 year ago

I can see why it is a PANIC() which is basically an assert(). That's because that path "should never happen". The interrupt handlers seem to be configured correctly, so there are two interrupts (Rx, Tx) for CAN1, and two interrupts (Rx, Tx) for CAN2, and nothing else should call can1_irq or can2_irq.

https://github.com/PX4/PX4-Autopilot/blob/d2819ca7542c7fc1c3b122371d9dd35cb7005270/src/drivers/uavcan/uavcan_drivers/stm32h7/driver/src/uc_stm32h7_can.cpp#L1147-L1162

@alistair-blueflite do you have the case still reproducing? If so, could you try and either add a:

printf("irq: %d\n", irq);
fflush(stdout);

Inside of can1_irq() before PANIC, or attach gdb and set a breakpoint in the interrupt and catch it. I would be interested in 1. what's irq when the function is called and 2. who calls it, so the backtrace (presumably it's an interrupt but that's to confirm).

@davids5 do you have any other guess how we might get into this situation?

davids5 commented 1 year ago

@julianoes I would like to see the console output when this fault happens. It would also be best to get a debugger on it to see the HW state.

alistair-blueflite commented 1 year ago

@julianoes currently don't have the bad motors on me but will be able get them back on Thursday to test the fault with the debug statements. Will also try to get the debugger on it then.

alistair-blueflite commented 1 year ago

Hi @julianoes, when calling a custom 'actuator_test' command from the command line, the system hardfaulted once again. After clearing the SD card and power cycling, calling the same command, everything worked as normal. We did the power cycle one more time, and no issues. Seems the problem is not very repeatable.

I've attached the .elf and the hardfault log below. From here it seems the issue might be coming from 'isRxBufferEmpty' or 'makeSelectMasks' in uc_stm32h7_can.cpp? Also, I suspect when I attached the hardfault the last time, buffer might have also been the issue, because we were testing multiple versions of the firmware at the time, and may not have correctly matched up the .elf with the hardfault. This time, I can confirm that the .elf and hardfault match up.

px4_fmu-v6x_default.zip fault_2023_09_05_18_34_39.log

julianoes commented 1 year ago

@davids5 there is a logfile linked here: https://github.com/PX4/PX4-Autopilot/issues/22031#issue-1876261389

And an additional one here: https://github.com/PX4/PX4-Autopilot/issues/22031#issuecomment-1707200975

Is that what you mean?

davids5 commented 1 year ago

@julianoes No. I meant what is on the debug console a the time of the fault, before the reboot

niklaut commented 12 months ago

I just merged better hardfault support into emdbg: python3.8 -m emdbg.debug.gdb --elf ~/Downloads/px4_fmu-v6x_default.elf -py crashdebug --dump ~/Downloads/fault_2023_09_05_18_34_39.log

Issue shows that one of the ifaces[i] entries is a nullptr, which then tries to access the CanIface::isRxBufferEmpty() function with an offset of 31, to access the rx_queue_.len_ member variable, which matches up perfectly with the bus fault at address 0x1f (=31). Sadly the hardfault log does not contain the ifaces memory so I couldn't really get anything more useful out of it. I also don't know much about the internals of UAVCAN, so I recommend instrumenting the code that accesses ifaces to catch the circumstances of why it's a nullptr.

(gdb) arm scb /h
CFSR                             = 00008200                   // Configurable Fault Status Register
    BFSR                           ....82.. - 82              // BusFault Status Register
    BFARVALID                      ....8... - 1               // Indicates if BFAR has valid contents.
    PRECISERR                      .....2.. - 1               // Indicates if a precise data access error has occurred, and the processor has written the faulting address to the BFAR.
BFAR                             = 0000001f                   // BusFault Address Register

(gdb) bt
#0  0x0813975c in uavcan_stm32h7::CanIface::isRxBufferEmpty (this=this@entry=0x0) at src/drivers/uavcan/uavcan_drivers/stm32h7/driver/src/uc_stm32h7_can.cpp:905
#1  0x08139796 in uavcan_stm32h7::CanDriver::makeSelectMasks (this=this@entry=0x30027ab0, pending_tx=...) at src/drivers/uavcan/uavcan_drivers/stm32h7/driver/src/uc_stm32h7_can.cpp:936
#2  0x0813984c in uavcan_stm32h7::CanDriver::select (this=0x30027ab0, inout_masks=..., pending_tx=..., blocking_deadline=...) at src/drivers/uavcan/uavcan_drivers/stm32h7/driver/src/uc_stm32h7_can.cpp:978
#3  0x0813b54c in uavcan::CanIOManager::callSelect (this=this@entry=0x300283f8, inout_masks=..., pending_tx=..., blocking_deadline=...) at src/drivers/uavcan/libuavcan/libuavcan/src/transport/uc_can_io.cpp:274
#4  0x0813b7b6 in uavcan::CanIOManager::send (this=this@entry=0x300283f8, frame=..., tx_deadline=..., blocking_deadline=..., iface_mask=3 '\003', iface_mask@entry=255 '\377', qos=qos@entry=uavcan::CanTxQueue::Volatile, flags=flags@entry=0) at src/drivers/uavcan/libuavcan/libuavcan/src/transport/uc_can_io.cpp:378
#5  0x0813bc9a in uavcan::Dispatcher::send (this=0x300283f0, frame=..., tx_deadline=..., blocking_deadline=..., qos=uavcan::CanTxQueue::Volatile, flags=flags@entry=0, iface_mask=255 '\377') at src/drivers/uavcan/libuavcan/libuavcan/src/transport/uc_dispatcher.cpp:303
#6  0x0813d0be in uavcan::TransferSender::send (this=this@entry=0x30028548, payload=payload@entry=0x3002eba4 "", payload_len=payload_len@entry=7, tx_deadline=..., blocking_deadline=..., transfer_type=transfer_type@entry=uavcan::TransferTypeMessageBroadcast, dst_node_id=dst_node_id@entry=..., tid=...) at src/drivers/uavcan/libuavcan/libuavcan/src/transport/uc_transfer_sender.cpp:74
#7  0x0813d26c in uavcan::TransferSender::send (this=this@entry=0x30028548, payload=payload@entry=0x3002eba4 "", payload_len=payload_len@entry=7, tx_deadline=..., blocking_deadline=..., transfer_type=transfer_type@entry=uavcan::TransferTypeMessageBroadcast, dst_node_id=dst_node_id@entry=...) at src/drivers/uavcan/libuavcan/libuavcan/src/transport/uc_transfer_sender.cpp:169
#8  0x0813a63c in uavcan::GenericPublisherBase::genericPublish (this=this@entry=0x30028548, buffer=..., transfer_type=transfer_type@entry=uavcan::TransferTypeMessageBroadcast, dst_node_id=dst_node_id@entry=..., tid=tid@entry=0x0, blocking_deadline=...) at src/drivers/uavcan/libuavcan/libuavcan/src/node/uc_generic_publisher.cpp:51
#9  0x0813aff4 in uavcan::GenericPublisher<uavcan::protocol::NodeStatus_<0>, uavcan::protocol::NodeStatus_<0> >::genericPublish (blocking_deadline=..., tid=0x0, dst_node_id=..., transfer_type=uavcan::TransferTypeMessageBroadcast, message=..., this=0x30028548) at src/drivers/uavcan/libuavcan/./libuavcan/include/uavcan/node/generic_publisher.hpp:196
#10 uavcan::GenericPublisher<uavcan::protocol::NodeStatus_<0>, uavcan::protocol::NodeStatus_<0> >::publish (blocking_deadline=..., dst_node_id=..., transfer_type=uavcan::TransferTypeMessageBroadcast, message=..., this=0x30028548) at src/drivers/uavcan/libuavcan/./libuavcan/include/uavcan/node/generic_publisher.hpp:141
#11 uavcan::Publisher<uavcan::protocol::NodeStatus_<0> >::broadcast (message=..., this=0x30028548) at src/drivers/uavcan/libuavcan/./libuavcan/include/uavcan/node/publisher.hpp:54
#12 uavcan::NodeStatusProvider::publish (this=this@entry=0x30028520) at src/drivers/uavcan/libuavcan/libuavcan/src/protocol/uc_node_status_provider.cpp:26
#13 0x0813b094 in uavcan::NodeStatusProvider::startAndPublish (this=0x30028520, priority=...) at src/drivers/uavcan/libuavcan/libuavcan/src/protocol/uc_node_status_provider.cpp:72
#14 0xfeedc0de in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) frame 0
#0  0x0813975c in uavcan_stm32h7::CanIface::isRxBufferEmpty (this=this@entry=0x0) at src/drivers/uavcan/uavcan_drivers/stm32h7/driver/src/uc_stm32h7_can.cpp:905
905     return rx_queue_.getLength() == 0;
=> 0x0813975c <_ZNK14uavcan_stm32h78CanIface15isRxBufferEmptyEv+12>:    e4 7f   ldrb    r4, [r4, #31]
(gdb) list
910     CriticalSectionLocker lock;
911     return error_cnt_ + rx_queue_.getOverflowCount();
912 }
913
914 unsigned CanIface::getRxQueueLength() const
915 {
916     CriticalSectionLocker lock;
917     return rx_queue_.getLength();
918 }
919

(gdb) frame 1
#1  0x08139796 in uavcan_stm32h7::CanDriver::makeSelectMasks (this=this@entry=0x30027ab0, pending_tx=...) at src/drivers/uavcan/uavcan_drivers/stm32h7/driver/src/uc_stm32h7_can.cpp:936
936         msk.read |= (ifaces[i]->isRxBufferEmpty() ? 0 : 1) << i;
   0x08139790 <_ZNK14uavcan_stm32h79CanDriver15makeSelectMasksERA3_PKN6uavcan8CanFrameE+32>:    50 46   mov r0, r10
   0x08139792 <_ZNK14uavcan_stm32h79CanDriver15makeSelectMasksERA3_PKN6uavcan8CanFrameE+34>:    ff f7 dd ff bl  0x8139750 <_ZNK14uavcan_stm32h78CanIface15isRxBufferEmptyEv>
(gdb) list
931 uavcan::CanSelectMasks CanDriver::makeSelectMasks(const uavcan::CanFrame * (& pending_tx)[uavcan::MaxCanIfaces]) const
932 {
933     uavcan::CanSelectMasks msk;
934
935     for (uint8_t i = 0; i < num_ifaces_; i++) {
936         msk.read |= (ifaces[i]->isRxBufferEmpty() ? 0 : 1) << i;
937
938         if (pending_tx[i] != UAVCAN_NULLPTR) {
939             msk.write |= (ifaces[i]->canAcceptNewTxFrame(*pending_tx[i]) ? 1 : 0) << i;
940         }
davids5 commented 12 months ago

ifaces[i] should not be null. IIRC It is part of the static initialization in uavcan

julianoes commented 10 months ago

@alistair-blueflite you got to the bottom of this, right?