PX4 / PX4-Autopilot

PX4 Autopilot Software
https://px4.io
BSD 3-Clause "New" or "Revised" License
8.31k stars 13.43k forks source link

Unpredictable compilation behavior #20105

Open Etruscian opened 2 years ago

Etruscian commented 2 years ago

Describe the bug

We're using a build server that uses the official docker containers to build the firmware and does a clean pull every time a build is started. Lately, we've noticed unpredictable behavior with builds. We're building v1.11.3 with some additions for our own drone that runs a cube orange. We've also tried different machines but they all give the same problems.

Two firmwares where the only difference is the values of some tuning parameters in the start up script behave very differently. The version before the change works fine, no issues. The version after the change gives a lot of UAVCAN errors.

We've noticed that the binaries aren't the same size (the second one is 8 bytes larger). If we take the working version and perform the same shift at the same location in the flash before uploading (with modifying the address pointers so the binaries are identical apart from git hash and build timestamps), the exact same problem occurs.

The difference of 8 bytes is determined to be due to the timestamp of compilation of certain c files being set in the binary. If this is the same timestamp (second accuracy), they all point to 1 timestamp in the binary. if these are different, 9 bytes get added to store the second timestamp. This timestamp was added at 0x139872 (relative to 0x08020000) and because of alignment, a single padding byte is removed at 0x166b8b. These offsets might be case specific, but might also indicate a problem somewhere.

Another way we notice this is that a later build with also minimal changes seems to freeze at touch down. We've made ram dumps and verified with a debugger that at least most of the PX4 code is still running (i.e. interrupts and at least some threads are still running). There is just no communication with the outside world. UAVCAN stops sending messages, uarts stop sending messages and the logger stops as well.

Using the ram dumps, we were able to verify that the incoming messages are still put in the rx buffers of the uarts (which indicates that the peripherals and DMA's are still running when the outward communication stopped). The tx buffer of the telemetry radio for instance contains only all the messages up to the last one we received over telemetry. We've also checked the BSRAM to see if any hard faults are logged but it was empty.

I personally also have a PX4 drone that runs 1.13 using a holybro kakute f7 with some modules turned off to enable the ekf. Minor changes in these builds also result in unpredictable behavior where sometimes the gnss driver refuses to start or the entire nuttx shell stops working. Trying to communicate to the nuttx shell using qgroundcontrol then results in a hard crash of the flight controller where the usb connection is stopped by the flight controller. These firmwares are built on a completely different machine.

At first we were thinking of it being due to our code changes, but since it is also happening for my personal drone with different hardware and we've heard other people in slack noting the same problems, we believe that something is fundamentally going wrong in the build process.

How can we best proceed to further investigate the cause of this issue?

dagar commented 2 years ago

Until we have a better understanding I would try to collect enough information to debug these as separate issues.

For H7 there was an alignment fix that went into v1.13. https://github.com/PX4/PX4-Autopilot/pull/19724

Etruscian commented 2 years ago

Thanks for the reply.

We're going to try to apply the fix for the H7 on a firmware that is known bad to see if it fixes the issue. Reproducing the uavcan issue on 1.13 should be doable, we'll try. The freeze issue is harder since we need to fly to reproduce it and we would need to port our changes to 1.13 before we can fly safely.

For the kakute f7, I have an st link and ftdi adapter, so I can debug using either method. I will do that this weekend hopefully

Etruscian commented 2 years ago

Here is the serial console log of the boot of the kakute f7 (where it crashes when you try to use the shell.

NuttShell (NSH) NuttX-10.2.0 nsh> ERROR [PX4_MTD] failed to locate spi bus ERROR [PX4_MTD] mtd failure: -6 bus 2 address 0 class 1 nsh: ver: command not found nsh: hardfault_log: command not found nsh: mft: command not found nsh: mft: command not found INFO [param] selected parameter backup file /fs/microsd/parameters_backup.bson nsh: ver: command not found Board architecture defaults: /etc/init.d/rc.board_arch_defaults Board defaults: /etc/init.d/rc.board_defaults ERROR [param] Parameter ATT_ACC_COMP not found. ERROR [param] Parameter ATT_W_ACC not found. ERROR [param] Parameter ATT_W_GYRO_BIAS not found. Loading airframe: /etc/init.d/airframes/4001_quad_x nsh: tone_alarm: command not found sercon: Registering CDC/ACM serial driver sercon: Successfully registered the CDC/ACM serial driver INFO [dataman] data manager file '/fs/microsd/dataman' size is 62560 bytes nsh: send_event: command not found nsh: rgbled: command not found nsh: rgbled_ncp5623c: command not found Board sensors: /etc/init.d/rc.board_sensors mpu6000 #0 on SPI bus 4 rotation 6 bmp280 #0 on I2C bus 1 (external) address 0x76 nsh: icm20948_i2c_passthrough: command not found nsh: hmc5883: command not found nsh: ist8308: command not found nsh: ist8310: command not found nsh: lis2mdl: command not found nsh: lis3mdl: command not found qmc5883l #0 on I2C bus 1 (external) address 0xD nsh: rm3100: command not found nsh: ak09916: command not found nsh: pwm_out: command not found ERROR [parameters] get: param 65535 invalid ERROR [parameters] get: param 65535 invalid ERROR [parameters] get: param 65535 invalid ERROR [parameters] get: param 65535 invalid ERROR [mixer] can't open /dev/pwm_output0

ERROR [mixer] failed to load mixer ERROR [init] Failed loading mixer: /etc/mixers/quad_x.main.mix nsh: tune_control: command not found ekf2 [361:237] nsh: mc_hover_thrust_estimator: command not found INFO [mavlink] mode: Normal, data rate: 1200 B/s on /dev/ttyS0 @ 115200B Board extras: /etc/init.d/rc.board_extras INFO [logger] logger started (mode=all) INFO [logger] Start file log (type: full)

NuttShell (NSH) NuttX-10.2.0 nsh> INFO [logger] [logger] /fs/microsd/log/sess142/log100.ulg INFO [logger] Opened full log file: /fs/microsd/log/sess142/log100.ulg

Etruscian commented 2 years ago

The nuttx shell works perfectly fine over the serial console btw

Etruscian commented 2 years ago

If I try to use the shell over mavlink, I get a hard fault over the serial console (below).

According to the PC(R15), this happens in nsh_session where the greeting is put in the output stream. The calling method is nsh_consolemain according to R14.

INFOarm_hardfault: Hard Fault: arm_hardfault: IRQ: 3 regs: 0x2004696c arm_hardfault: BASEPRI: 000000f0 PRIMASK: 00000000 IPSR: 00000003 CONTROL: 00000000 arm_hardfault: CFAULTS: 00008200 HFAULTS: 40000000 DFAULTS: 00000000 BFAULTADDR: 00000294 AFAULTS: 00000000 arm_hardfault: PANIC!!! Hard fault: 40000000 up_assert: Assertion failed at file:armv7-m/arm_hardfault.c line: 135 task: mavlink_shell up_registerdump: R0: 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 up_registerdump: R8: 00000000 00000000 00000000 00000000 00000000 20046a40 0801d513 0801dd82 up_registerdump: xPSR: 21000000 BASEPRI: 000000f0 CONTROL: 00000000 up_registerdump: EXC_RETURN: ffffffe9 up_dumpstate: sp: 20020b30 up_dumpstate: IRQ stack: up_dumpstate: base: 200209c0 up_dumpstate: size: 00000200 up_dumpstate: used: 00000188 up_stackdump: 20020b20: 20020b30 20020bc0 080d3d91 08021e1f 000000f0 00000000 00000000 00000000 up_stackdump: 20020b40: 20046a40 0801d513 0801dd82 0000002d e000ed2c e000ed2c 00000294 00000000 up_stackdump: 20020b60: 00000000 00000000 00000000 00000000 00000000 08019af7 0801a6e1 080195a3 up_stackdump: 20020b80: 40000000 00000000 00000294 00000000 20020bc0 00000003 00000003 0801ad7f up_stackdump: 20020ba0: 000000f0 08018329 000000f0 2004696c 00000000 00000000 00000000 08018217 up_dumpstate: sp: 20046a40 up_dumpstate: User stack: up_dumpstate: base: 200462b0 up_dumpstate: size: 000007e8 up_dumpstate: used: 00000134 up_stackdump: 20046a40: 00000000 00000001 00000000 00000000 08075aad 00000000 00000000 0801d513 up_stackdump: 20046a60: 00000000 00000000 20024ae0 08075abf 00000000 080244d5 08075aad 00000001 up_stackdump: 20046a80: 00000000 0801bebb 00000000 00000000 00000000 00000000 00000260 00000820 up_taskdump: Idle Task: PID=0 Stack Used=352 of 726 up_taskdump: hpwork: PID=1 Stack Used=332 of 1264 up_taskdump: lpwork: PID=2 Stack Used=892 of 1616 up_taskdump: init: PID=3 Stack Used=2168 of 2928 up_taskdump: wq:manager: PID=4 Stack Used=420 of 1256 up_taskdump: dataman: PID=133 Stack Used=820 of 1208 up_taskdump: wq:lp_default: PID=135 Stack Used=756 of 1920 up_taskdump: wq:hp_default: PID=168 Stack Used=1156 of 1904 up_taskdump: gps: PID=553 Stack Used=1228 of 1680 up_taskdump: wq:nav_and_controllers: PID=266 Stack Used=1188 of 2240 up_taskdump: wq:rate_ctrl: PID=267 Stack Used=2268 of 3152 up_taskdump: wq:INS0: PID=268 Stack Used=4372 of 6000 up_taskdump: wq:SPI4: PID=173 Stack Used=1472 of 2336 up_taskdump: navigator: PID=782 Stack Used=1068 of 1920 up_taskdump: wq:I2C1: PID=175 Stack Used=876 of 2336 up_taskdump: mavlink_if0: PID=752 Stack Used=1932 of 2728 up_taskdump: mavlink_rcv_if0: PID=754 Stack Used=1476 of 4576 up_taskdump: mavlink_if1: PID=851 Stack Used=1860 of 2824 up_taskdump: mavlink_rcv_if1: PID=852 Stack Used=1756 of 4576 up_taskdump: commander: PID=278 Stack Used=1308 of 3224 up_taskdump: mavlink_shell: PID=855 Stack Used=308 of 2024 up_taskdump: logger: PID=824 Stack Used=2668 of 3648 up_taskdump: log_writer_file: PID=825 Stack Used=596 of 1176 up_taskdump: wq:ttyS4: PID=764 Stack Used=772 of 1632

Etruscian commented 2 years ago

Removing the mixer module and instead using the control allocator module makes the shell over mavlink work without a problem. It still gave a hard fault when clicking through the tabs of qgroundcontrol (see log below).

According to the PC (R15) in the dump and the obj dump of the elf file, it crashed inside memset. The calling method according to R14 is uORB::DeviceNode::Write.

arm_hardfault: Hard Fault: arm_hardfault: IRQ: 3 regs: 0x20049eb4 arm_hardfault: BASEPRI: 000000f0 PRIMASK: 00000000 IPSR: 00000003 CONTROL: 00000000 arm_hardfault: CFAULTS: 00000400 HFAULTS: 40000000 DFAULTS: 00000000 BFAULTADDR: 00000000 AFAULTS: 00000000 arm_hardfault: PANIC!!! Hard fault: 40000000 up_assert: Assertion failed at file:armv7-m/arm_hardfault.c line: 135 task: mavlink_rcv_if1 up_registerdump: R0: 00000000 00000000 000001b8 00000000 000001c0 00000000 00000000 00000000 up_registerdump: R8: 2004a110 00000001 00000001 00000001 000001c0 20049f88 080ca675 08025bf0 up_registerdump: xPSR: 21000200 BASEPRI: 000000f0 CONTROL: 00000000 up_registerdump: EXC_RETURN: ffffffe9 up_dumpstate: sp: 20020b30 up_dumpstate: IRQ stack: up_dumpstate: base: 200209c0 up_dumpstate: size: 00000200 up_dumpstate: used: 00000188 up_stackdump: 20020b20: 20020b30 20020bc0 080d3b39 08021e1f 000000f0 00000000 00000001 000001c0 up_stackdump: 20020b40: 20049f88 080ca675 08025bf0 0000002d e000ed2c e000ed2c 00000000 00000000 up_stackdump: 20020b60: 00000000 2004a110 00000001 00000001 00000001 08019af7 0801a6e1 080195a3 up_stackdump: 20020b80: 40000000 00000000 00000000 00000000 20020bc0 00000003 00000003 0801ad7f up_stackdump: 20020ba0: 000000f0 08018329 000000f0 20049eb4 00000000 00000000 00000000 08018217 up_dumpstate: sp: 20049f88 up_dumpstate: User stack: up_dumpstate: base: 200491e8 up_dumpstate: size: 000011e0 up_dumpstate: used: 000006d4 up_stackdump: 20049f80: 80000010 2004a110 00000038 2002c420 2004a110 00000038 00000000 080ca675 up_stackdump: 20049fa0: 2002c420 000001c0 2004a110 080ca62d 080fd804 00000000 000000f1 080ca78f up_stackdump: 20049fc0: 080f6a38 20048414 2004a110 08073b55 20047eb8 2004a0ec 2004a264 08085ee9 up_stackdump: 20049fe0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 up_stackdump: 2004a000: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 up_stackdump: 2004a020: 00000000 00000000 00000000 00000000 00000000 2004a108 000000f0 00000000 up_stackdump: 2004a040: 20049040 000000f0 20049074 2004a130 00000001 0000000a 00000000 ffffffe9 up_stackdump: 2004a060: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 up_stackdump: 2004a080: 00000000 2004a158 000000f0 20048600 20047eb8 20048600 00000000 00000000 up_stackdump: 2004a0a0: 00000000 0000002c 00000000 ffffffe9 2004a264 20048724 00000000 2004a264 up_stackdump: 2004a0c0: 20047eb8 000000f1 00000000 00000036 00000000 0000002c 00000000 080861a7 up_stackdump: 2004a0e0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 up_stackdump: 2004a100: 00000000 00000000 010100f1 01000000 03aca4db 00000000 00000000 00000000 up_stackdump: 2004a120: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000000f1 up_stackdump: 2004a140: beff0101 00000100 2004a264 20047eb8 20048600 080889ed 00000000 20024fe8 up_stackdump: 2004a160: 20047eb8 08089157 0000002b 00000001 deadbeef deadbeef 000028ce 00000000 up_stackdump: 2004a180: 03ac8b66 00000000 deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef up_stackdump: 2004a1a0: deadbeef 00000003 00000101 00000000 00000000 00000000 000020fd 4cbeffe7 up_stackdump: 2004a1c0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00f10000 up_stackdump: 2004a1e0: 36d10101 00000000 00000000 00000000 00000000 0101ff00 03ac8b66 00000000 up_stackdump: 2004a200: 000000b5 00000005 0000009c 000000c1 00000000 009c001e 0000001a 00000000 up_stackdump: 2004a220: deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef up_stackdump: 2004a240: deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef up_stackdump: 2004a260: deadbeef 20fd36d1 ffe70000 00004cbe 00000000 00000000 00000000 00000000 up_stackdump: 2004a280: 00000000 00000000 00000000 010100f1 00000000 00000000 00000000 00000000 up_stackdump: 2004a2a0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 01000000 up_stackdump: 2004a2c0: 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 up_stackdump: 2004a2e0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 up_stackdump: 2004a300: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 up_stackdump: 2004a320: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 up_stackdump: 2004a340: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 up_stackdump: 2004a360: 00000000 00000000 00000000 00000000 00000000 00000000 000036d1 00000000 up_stackdump: 2004a380: 00000000 20000000 200242d0 00000000 00000000 00000000 00000000 00000000 up_stackdump: 2004a3a0: 00000000 00000000 00000000 00000000 00000000 08089343 0808933d 08024311 up_stackdump: 2004a3c0: 08024309 00000000 00000040 80001200 080f5b5c 20000001 00000000 20046b80 up_taskdump: Idle Task: PID=0 Stack Used=352 of 726 up_taskdump: hpwork: PID=1 Stack Used=332 of 1264 up_taskdump: lpwork: PID=2 Stack Used=908 of 1616 up_taskdump: init: PID=3 Stack Used=2168 of 2928 up_taskdump: wq:manager: PID=4 Stack Used=452 of 1256 up_taskdump: wq:I2C1: PID=165 Stack Used=840 of 2336 up_taskdump: mavlink_rcv_if0: PID=742 Stack Used=1572 of 4576 up_taskdump: commander: PID=263 Stack Used=1308 of 3224 up_taskdump: logger: PID=813 Stack Used=2596 of 3648 up_taskdump: log_writer_file: PID=814 Stack Used=652 of 1176 up_taskdump: gps: PID=527 Stack Used=1228 of 1680 up_taskdump: wq:ttyS4: PID=754 Stack Used=764 of 1632 up_taskdump: mavlink_if1: PID=820 Stack Used=1844 of 2824 up_taskdump: mavlink_rcv_if1: PID=821 Stack Used=1748 of 4576 up_taskdump: dataman: PID=118 Stack Used=820 of 1208 up_taskdump: mavlink_shell: PID=823 Stack Used=1028 of 2024 up_taskdump: wq:hp_default: PID=152 Stack Used=1024 of 1904 up_taskdump: navigator: PID=761 Stack Used=1092 of 1920 up_taskdump: wq:nav_and_controllers: PID=250 Stack Used=1276 of 2240 up_taskdump: wq:rate_ctrl: PID=251 Stack Used=2268 of 3152 up_taskdump: wq:INS0: PID=252 Stack Used=4372 of 6000 up_taskdump: wq:lp_default: PID=125 Stack Used=868 of 1920 up_taskdump: wq:SPI4: PID=158 Stack Used=1500 of 2336 up_taskdump: mavlink_if0: PID=735 Stack Used=1844 of 2728

Etruscian commented 2 years ago

@dagar Any idea what could cause different threads to go out of memory by changing a (supposedly) completely unrelated part of the code?