Closed deksprime closed 4 years ago
I saw the same hard fault in FW
If it is any help getting to the bottom of this, here is a link to a fixed wing log with this same hard fault "type". Happened during normal "mission mode" flight as the vehicle was coming in for landing. No idea what happened, but we are very interested in finding out why: https://logs.px4.io/plot_app?log=fcb05575-9317-49d7-932a-31aaf9dccd7d
Here is another fixed wing hard fault log, though with a different hard fault "type". It may or may not be related to running into some trees... Chicken-or-egg? Not sure what happened first, the hard fault or the tree: https://review.px4.io/plot_app?log=5a04b85e-dac6-4404-885f-2aa7cb529482
Two of these, (@M-Skelton and one from @M-Skelton) are in the fmu module when running as a task.
@M-Skelton which board are you using?
@dagar I was using an fmuV4 board.
@deksprime did you flash the version via QGC? If not, can you provide the binary that you flashed?
The only log that points to a valid upstream commit comes from @deksprime. Nevertheless, collecting different incidents does of course help, but we need to make sure we're not chasing different things.
Here's how to debug hardfaults: https://dev.px4.io/en/debug/gdb_debugging.html#debugging-hard-faults-in-nuttx
@bkueng The version was flashed through the QGC choosing the Stable firmware option
Thanks. Can you provide the following:
Let's collect what we know:
Looking only at the initial report (which runs stable w/o additional changes, but the other logs show the same hardfault):
semaphore/sem_post.c: ASSERT(sem->semcount < SEM_VALUE_MAX);, with SEM_VALUE_MAX = 0x7FFF
(gdb) info line *0x08005939
Line 356 of "armv7-m/up_assert.c" starts at address 0x8005938 <up_assert+420>
and ends at 0x8005946 <up_assert+434>.
(gdb) info line *0x080b7fa0
No line number information available for address 0x80b7fa0
(gdb) info line *0x08011188
Line 94 of "armv7-m/gnu/up_switchcontext.S"
starts at address 0x8011188 <up_switchcontext+10> and ends at 0x801118a.
(gdb) info line *0x080074db
Line 119 of "semaphore/sem_post.c" starts at address 0x80074da <sem_post+38>
and ends at 0x80074de <sem_post+42>.
(gdb) info line *0x08009e21
Line 925 of "../../src/drivers/stm32/drv_hrt.c"
starts at address 0x8009e20 <hrt_tim_isr+132>
and ends at 0x8009e2a <hrt_tim_isr+142>.
(gdb) info line *0x08009d9d
Line 609 of "../../src/drivers/stm32/drv_hrt.c"
starts at address 0x8009d9c <hrt_tim_isr> and ends at 0x8009da0 <hrt_tim_isr+4>
The assertion failure can be due to 2 causes:
The only place matching this stacktrace is the logger, scheduling its main loop (there's a similar stack trace pattern in uORB, but there's a condition that excludes it as a candidate). The semaphore in the logger can indeed overflow if the hrt keeps firing while the logger's main loop is blocked for some reason (e.g. a higher-prio task running busy, or a problem in the logger itself). For the counter to overflow, the logger must be blocked for almost 2 minutes.
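To make the "almost 2 minutes" figure concrete, here is a minimal sketch of the arithmetic. It assumes the semaphore starts at 0, is posted once per timer callback, and is never taken while the logger is blocked. The function name is made up for illustration, and the 3500 us post interval used in the note below is an assumption about the logger's scheduling interval, not something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* SEM_VALUE_MAX as quoted in this thread for the affected build. */
#define SEM_VALUE_MAX_QUOTED 0x7FFF

/*
 * Hypothetical helper: microseconds until a semaphore that starts at 0 and
 * is posted once every post_interval_us (with nobody taking it) reaches
 * SEM_VALUE_MAX. The next post after that point trips the ASSERT in
 * sem_post().
 */
static uint64_t usec_until_sem_overflow(uint32_t post_interval_us)
{
	assert(post_interval_us > 0);
	return (uint64_t)SEM_VALUE_MAX_QUOTED * post_interval_us;
}
```

With an assumed interval of 3500 us, `usec_until_sem_overflow(3500)` gives 114684500 us, i.e. roughly 114.7 seconds, which is consistent with the estimate above.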
https://github.com/PX4/Firmware/pull/8979 fixes the semaphore counter overflow, but not the root cause here.
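For illustration only, one common way to keep such a counter from overflowing is to skip the post while a wake-up is still pending. The sketch below uses plain POSIX semaphores and a made-up helper name; it is not taken from the pull request and may differ from what was actually merged.

```c
#include <assert.h>
#include <semaphore.h>

/*
 * Hypothetical helper: post the semaphore only if no wake-up is pending.
 * Returns 1 if a post was issued, 0 if it was skipped, -1 on error.
 * The check and the post are not atomic as a pair, but with a single
 * producer the count can never grow beyond 1.
 */
static int post_if_not_pending(sem_t *sem)
{
	int value = 0;

	if (sem_getvalue(sem, &value) != 0) {
		return -1;
	}

	if (value > 0) {
		/* Consumer has not caught up yet; do not pile up posts. */
		return 0;
	}

	return sem_post(sem) == 0 ? 1 : -1;
}
```

As noted above, a guard like this only hides the symptom: if the consumer stays blocked, the assertion no longer fires, but the reason for the stall still has to be found.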
@dagar can we have the .elf file available on S3 right next to the .px4 file? This will help with debugging such cases, as we can be sure the binary matches the release (I had to download the same toolchain as CI uses for this now).
Did any of you notice if the QGC (mavlink) connection got lost about 2 minutes before the hardfault happened?
I'm not the original poster, but interestingly, in our flight the QGC MAVLink telemetry .TLOG ran for some seconds longer than the SD card log. Screenshots below and .TLOG attached in the ZIP folder.
2018-01-25 15-26-01.zip .ULOG:
Unfortunately, I wasn't able to track what was happening on the QGC. I haven't been able to find the SD card of the copter in which this hardfault has happened so I didn't share a flight log. I will share a flight log right away if I find the SD card which I hope might eliminate some other possible causes.
Thanks for the info. Based on this, it's more likely a memory corruption. And to narrow it down further, I will need to know which drivers & modules are/were running.
Can you provide me with the following:
With regards to this log: https://logs.px4.io/plot_app?log=fcb05575-9317-49d7-932a-31aaf9dccd7d
Hardware:
We have custom init scripts to set parameters, load drivers, etc., I've attached a zip folder with our custom scripts that should show all that.
PX4-init-scripts-master.zip
There might be some small differences in parameters between this zip folder and what was flown in the log, but I think nothing that would affect sensors, hardware, or logging.
To add another datapoint to this investigation, we've been going through old SDcard logs and came across another semaphore/sem_post.c Hardfault on a different vehicle. This must have happened during ground handling, not a flight. Identical hardware setup to our other fixed wing hard fault logs. Should be the same init scripts as the ZIP folder I attached previously.
Log: https://review.px4.io/plot_app?log=21d2c119-3aa8-4881-96f3-d11d8b01e380
Interesting, this is quite a while back (Jul 14 2017 20:37:28), and it shows the same pattern.
I did not see anything special from the startup scripts & params, the setup and configuration does not look out of the ordinary.
Here is another semaphore hardfault we came across while going through old SD cards. Same configuration as the other logs I've posted previously. This was from a while back. Just adding it to the issue here in case someone finds something in the future.
Log: https://review.px4.io/plot_app?log=3d54eac4-996d-4590-89ad-708ab64bd2eb
@bkueng Could you have a look? @thomasgubler FYI
It looks like the same issue. A noticeable difference is that there are no param changes in the latest report, whereas in most previous logs there were. And there is an airspeed sensor error: [cal] Airspeed sensor is reporting errors (260).
I still see the following potential causes:
I was running a capture test.
fmu mode_pwm4cap1
fmu test
With a 100 kHz square wave on FMU_CH5
Under high load:
I would say the priorities are not correct. The MPU9250 ISR is posting the semaphore at a higher rate than the poll thread reads it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing as stale.
@davids5 does this still need following up?
@julianoes - yes someone should repeat the test with capture and see if the problem still exists.
@dinomani is this something you could try at some point?
We were not able to reproduce this issue in 2 years, closing.
I have come across a hardfault as shared in this fault log. Before this log, I had been extensively tuning the roll rate PIDs. I was changing the MC_ROLLRATE_P parameter while the copter was in flight, but I got this hard fault after the flight was done. I suspect that the issue might be related to changing parameters around 10 times or so during the flight, but I don't have any direct proof of this. I'm using Pixhawk the Cube as the flight controller and testing on 1.7.3 stable.