ArduPilot / ardupilot

ArduPlane, ArduCopter, ArduRover, ArduSub source
http://ardupilot.org/
GNU General Public License v3.0
10.9k stars 17.4k forks source link

Pixhawk / Copter: segfault on connect via USB #1161

Closed sgofferj closed 8 years ago

sgofferj commented 10 years ago

I just tried to connect and this happened: Connect USB cable: Normal startup. Start APM Planner, select /dev/ttyACM0, click connect ----> Pixhawk freezes (really freezes - LED stops blinking).

Debug console says:

Assertion failed at file:armv7-m/up_hardfault.c line: 184 task: ArduCopter sp: 2000ffe4 IRQ stack: base: 20010050 size: 00000ffc 2000ffe0: 2000ff44 2000ffe4 1000f418 00001ffc 0809613f 000000b8 080cc424 2000ff44 20010000: 080a41fd 2000ff44 08096215 00000010 08096419 080963fd 080a7355 2000d360 20010020: 00000003 08096375 080a4229 00000000 200102d8 adf92276 0000048f 080a3d8f 20010040: 1000f22c 00000000 08096375 2000ff44 10005130 00000001 00000000 00000000 sp: 080cc3cf User stack: base: 1000f418 size: 00001ffc R0: 2000ffe8 08096165 080902ad 080cc3c3 00000004 00000000 080902cd 00000000 R8: 08096215 1000f22c 000000b0 08090299 080902ad 00000000 080cc3c3 00000004 xPSR: 00000000 BASEPRI: 00000000 CONTROL: 00000000 EXC_RETURN: 000000b0

Drones-discuss: https://groups.google.com/forum/#!topic/drones-discuss/CaQcQzD-tIM

geeksville commented 10 years ago

From the mail thread - steps to repro with just mavproxy:

Hi ya'll,

I think I found a new easier steps to reproduce for this (at least it causes the big yellow led to stop blinking / stops talking to mavproxy). Unfortunately, I really need to do one more thing before dronecon so I can't help with this till Sun. But posting here in case it is useful (mostly for Tridge):

$ mavproxy.py --master=/dev/ttyACM0,115200 Logging to mav.tlog MAV> ter calibration completeAPM: GROUND START Init Gyro**STABILIZE> Mode STABILIZE fence breach

INS

G_off: -0.00, 0.02, 0.01 A_off: 0.00, 0.00, 0.00 A_scale: 1.00, 1.00, 1.00

Ready to FLY APM: PreArm: RC not calibrated APM: PX4v2 00270030 31334705 39343031 Received 377 parameters

STABILIZE> STABILIZE> STABILIZE> APM: PreArm: RC not calibrated

mavproxy.py --master=/dev/ttyACM0,115200 Logging to mav.tlog Running script /home/kevinh/.mavinit.scr MAV> fence breach STABILIZE> Mode STABILIZE APM: PreArm: RC not calibrated

STABILIZE>


100% repeatable the 5 times I tried - I'm really sorry I don't have time today to hook up a console cable and do some debugging. If no one else gets to this by Sunday I can look at it.

geeksville commented 10 years ago

Ugh. The steps to repro 'every time' are now no longer working. The exact same build that I had before AVC with the same autopilot is no longer showing the problem. @rmackay9 @tridge @Craig3DRobotics. I am also unable to reproduce with the linux build of APM planner. Randy, what were your steps to repro with apm planner?

sgofferj commented 10 years ago

I'm still standing by for instructions on how to do the tests @tridge requested on DD.

tridge commented 10 years ago

Hi Stefan, if you can still reproduce the problem then what I think we need to do next is a bisection search over the builds on firmware.diydrones.com. @Stefan, On firmware.diydrones.com you will find that every build of master is kept by date. If you start with a build that you know is broken, then work backwards until you find a build that is not broken. In log(N) time you should be able to narrow the bug down to a small range of commits. If you can tell us which date introduced the bug then we can examine those changes carefully to try to find the bug. This assumes it is a software bug, or at least a bug that was triggered by a software change.

geeksville commented 10 years ago

I just encountered this bug again. I had a serial console hooked up, but alas no JTAG. At least in this 'crash' it seems like ardupilot is still running, just USB was hosed.

Interesting things:

nsh> cp rc.APM /dev/ttyACM0 nsh: cp: open failed: Transport endpoint is not connected

geeksville commented 10 years ago

and this time I got it to happen again. yellow led stuck on, no panic crashdump on serial port and the following threads still running.

Processes: 11 total, 5 running, 6 sleeping CPU usage: 28.23% tasks, 0.27% sched, 71.50% idle Uptime: 188.032s total, 120.969s idle

PID COMMAND CPU(ms) CPU(%) USED/STACK PRIO(BASE) STATE 0 Idle Task 120968 71.504 0/ 0 0 ( 0) READY 1 hpwork 3328 1.869 612/ 3992 192 (192) w:sig 2 lpwork 981 0.445 396/ 3992 50 ( 50) READY 3 init 1290 0.000 1364/ 4088 100 (100) w:sem 29 4303 2.404 708/ 2040 59 ( 59) w:sem 10 px4io 2284 1.157 908/ 2040 240 (240) w:sem 13 fmuservo 893 0.445 860/ 2040 100 (100) w:sem 26 ArduCopter 26403 3.561 1132/ 8184 10 ( 10) READY 27 8587 4.541 580/ 2040 181 (181) w:sem 28 17254 10.774 676/ 2040 60 ( 60) READY 31 top 196 3.027 1300/ 2992 100 (100) RUN

sh> ps PID PRI SCHD TYPE NP STATE NAME 0 0 FIFO TASK READY Idle Task() 1 192 FIFO KTHREAD WAITSIG hpwork() 2 50 FIFO KTHREAD WAITSIG lpwork() 3 100 FIFO TASK RUNNING init() 10 240 FIFO TASK WAITSEM px4io() 13 100 FIFO TASK WAITSEM fmuservo() 26 10 FIFO TASK READY ArduCopter() 27 181 FIFO PTHREAD WAITSEM (20007df8) 28 60 FIFO PTHREAD WAITSEM (20007df8) 29 59 FIFO PTHREAD WAITSEM (20007df8) nsh>

usb is dead.

geeksville commented 10 years ago

Other datapoints (I'll keep reusing this comment to keep my notes):

geeksville commented 10 years ago

snapshot of a joint debug session with tridge (I'm now trying to find a way to get to ram on this board from the shell):

[2:02:25 PM] Kevin Hester: hi tridge - got a minute? [2:05:13 PM] Andrew Tridgell: hi kevin, yes, but can't do voice [2:07:40 PM] Kevin Hester: ok [2:07:51 PM] Kevin Hester: so i've lucked into this problem occurring [2:08:29 PM] Kevin Hester: if the yellow light stops flashing am i right to presume the most likely cause is that for some reason the apm scheduler is not calling notify_update? [2:08:57 PM] Kevin Hester: the funny thing is that the apm thread isn't totally blocked, because i see some small (3%) cpu usage from it [2:09:47 PM] Kevin Hester: also - the usb link seems dead from the px4 side: nsh> cp rc.APM /dev/ttyACM0 nsh: cp: open failed: Transport endpoint is not connected [2:09:51 PM] Andrew Tridgell: have you got jtag attached? [2:10:05 PM] Kevin Hester: no - didn't have the connector soldered on this particular board [2:10:12 PM] Andrew Tridgell: darn! [2:10:25 PM] Kevin Hester: i can go back to my other board but was working on other stuff when i stumbled on this [2:10:32 PM] Andrew Tridgell: does cp work for tty on NuttX? (I use echo >) [2:10:50 PM] Kevin Hester: to ttyACM0? [2:11:09 PM] Andrew Tridgell: yes [2:11:19 PM] Andrew Tridgell: what GCS do you have connected? [2:12:26 PM] Kevin Hester: i was using cp because echo didn't work [2:12:37 PM] Kevin Hester: nsh> echo fish >/dev/ttyACM0 nsh: echo: open failed: Invalid argument nsh> echo >/dev/ttyACM0 fish

/dev/ttyACM0 fish nsh> [2:12:56 PM] Andrew Tridgell: its "echo hello > /dev/ttyACM0" [2:13:06 PM] Andrew Tridgell: invalid argument is strange for that [2:13:09 PM] Kevin Hester: oh - the space matters [2:13:11 PM] Kevin Hester: ? [2:13:14 PM] Andrew Tridgell: as it shouldn't be doing an ioctl [2:13:15 PM] Kevin Hester: i'll try that [2:13:24 PM] Andrew Tridgell: yeah, nsh is fussy [2:13:39 PM] Andrew Tridgell: its not bash :) [2:13:39 PM] Kevin Hester: nsh> echo fish > /dev/ttyACM0 nsh: echo: open failed: Invalid argument [2:13:53 PM] Andrew Tridgell: does ls /dev show ttyACM0 ? [2:13:56 PM] Kevin Hester: nsh> cp rc.APM /dev/ttyACM0 nsh: cp: open failed: Transport endpoint is not connected [2:14:00 PM] Kevin Hester: yes it does [2:14:19 PM] Andrew Tridgell: i wonder if its out of file handles? [2:14:23 PM] Kevin Hester: of note: I found two ways to crash it [2:15:04 PM] Kevin Hester: mavproxy (as previously mentioned), also just connecting with minicom and sending nothing from the linux pc I see a few mavlink heartbeats from the device then it dies [2:15:07 PM] Andrew Tridgell: try: mpu6000 test [2:15:29 PM] Kevin Hester: nsh> mpu6000 test mpu6000: /dev/mpu6000accel open failed (try 'mpu6000 start' if the driver is not running): No such file or directory [2:15:45 PM] Kevin Hester: oh [2:15:49 PM] Kevin Hester: it didn't find it at boot [2:15:57 PM] Kevin Hester: btw - this particular board is a 2.2 [2:15:58 PM] Kevin Hester: crap [2:16:02 PM] Kevin Hester: just noticed [2:16:03 PM] Andrew Tridgell: ok, try: hmc5883 test [2:16:16 PM] Andrew Tridgell: i want to see if it can open another fd [2:16:38 PM] Kevin Hester: nsh> hmc5883 test hmc5883: /dev/hmc5883 open failed (try 'hmc5883 start' if the driver is not running: No such file or directory nsh> ls /dev/hmc* /dev/hmc: nsh: ls: no such directory: /dev/hmc_ nsh> [2:17:01 PM] Kevin Hester: doing a cat of APM.rc to console worked [2:17:07 PM] Kevin Hester: so presumably that got a fd [2:17:12 PM] Andrew Tridgell: try: lsm303d test [2:17:24 PM] Kevin Hester: passed [2:17:27 PM] Andrew Tridgell: hmm, ok, so open() does work [2:17:45 PM] Kevin Hester: also interesting: [2:18:27 PM] Kevin Hester: sometimes when I lose USB the yellow led keeps blinking (and arducopter is using the expected 30% of cpu) - but sometimes the yellow light stops blinking - nsh still works but arducopter only using 3% of cpu [2:18:45 PM] Andrew Tridgell: might be two separate bugs [2:18:47 PM] Kevin Hester: thus my interest in why would the led not be blinking/scheduler [2:19:14 PM] Andrew Tridgell: i'm not sure if there is a state in notify with no blink [2:19:16 PM] Kevin Hester: i can switch back to my known good 2.4 board. i only just noticed this was a 2.2 [2:19:23 PM] Andrew Tridgell: can you paste output of "perf" ? [2:19:52 PM] Kevin Hester: from looking at notify it looks like if notify::update() gets called it would always blink in this mode [2:19:53 PM] Kevin Hester: sure [2:20:11 PM] Kevin Hester: oooh cool - didn't know about that one [2:20:14 PM] Kevin Hester: nsh> perf APM_overrun: 0 events APM_loop: 29999 events, 58509554us elapsed, 1950 avg, min 452us max 3962us APM_rcout: 59978 events, 221433us elapsed, 3 avg, min 3us max 450us APM_rcin: 2849701 events, 16414241us elapsed, 5 avg, min 4us max 530us lsm303d_accel_resched: 36373 events lsm303d_extremes: 0 events lsm303d_reg7_resets: 0 events lsm303d_reg1_resets: 0 events lsm303d_mag_read: 303333 events, 17153343us elapsed, 56 avg, min 54us max 57us lsm303d_accel_read: 2423236 events, 142816935us elapsed, 58 avg, min 57us max 60us l3gd20_errors: 0 events l3gd20_reschedules: 31381 events l3gd20_read: 2304065 events, 149714063us elapsed, 64 avg, min 63us max 70us ADC samples: 3032570 events, 9329006us elapsed, 3 avg, min 2us max 12us ms5611_buffer_overflows: 182680 events ms5611_comms_errors: 0 events ms5611_measure: 249942 events, 5806671us elapsed, 23 avg, min 17us max 372us ms5611_read: 249942 events, 29405056us elapsed, 117 avg, min 36us max 552us io rc #: 0 events io write: 0 events, 0us elapsed, 0 avg, min 0us max 0us io update: 136337 events, 145147074us elapsed, 1064 avg, min 996us max 2087us io_badidle : 0 events io_idle : 272709 events io_uarterrs : 0 events io_protoerrs: 0 events io_dmaerrs : 0 events io_crcerrs : 0 events io_timeouts : 0 events io_retries : 0 events io_dmasetup : 272713 events, 4956100us elapsed, 18 avg, min 14us max 305us io_txns : 272713 events, 137498481us elapsed, 504 avg, min 158us max 885us DMA allocations: 2 events APM_uartE: 2597458 events, 16640972us elapsed, 6 avg, min 4us max 7788us APM_uartD: 5188820 events, 99136581us elapsed, 19 avg, min 4us max 8148us APM_uartC: 2597963 events, 15142472us elapsed, 5 avg, min 4us max 7947us APM_uartB: 2597551 events, 13683780us elapsed, 5 avg, min 4us max 7982us APM_uartA: 2597589 events, 49502234us elapsed, 19 avg, min 4us max 8193us APM_storage_errors: 1 events APM_storage: 1 events, 4538us elapsed, 4538 avg, min 4538us max 4538us APM_delay: 2595754 events, 3000917488us elapsed, 1156 avg, min 6us max 101033us APM_IO_timers: 1491669 events, 20559781us elapsed, 13 avg, min 1us max 2699us APM_timers: 2849961 events, 25146722us elapsed, 8 avg, min 3us max 558us EKF_FuseSideslip: 0 events, 0us elapsed, 0 avg, min 0us max 0us EKF_FuseAirspeed: 0 events, 0us elapsed, 0 avg, min 0us max 0us EKF_FuseMagnetometer: 0 events, 0us elapsed, 0 avg, min 0us max 0us EKF_FuseVelPosNED: 0 events, 0us elapsed, 0 avg, min 0us max 0us EKF_CovariancePrediction: 0 events, 0us elapsed, 0 avg, min 0us max 0us EKF_UpdateFilter: 0 events, 0us elapsed, 0 avg, min 0us max 0us DF_errors: 0 events DF_fsync: 0 events, 0us elapsed, 0 avg, min 0us max 0us DF_write: 0 events, 0us elapsed, 0 avg, min 0us max 0us nsh> [2:21:33 PM] Andrew Tridgell: all looks normal to me [2:21:53 PM] Andrew Tridgell: what does "ps" give? [2:22:12 PM] Kevin Hester: nsh> ps PID PRI SCHD TYPE NP STATE NAME 0 0 FIFO TASK READY Idle Task() 1 192 FIFO KTHREAD WAITSIG hpwork() 2 50 FIFO KTHREAD WAITSIG lpwork() 3 100 FIFO TASK RUNNING init() 10 240 FIFO TASK WAITSEM px4io() 13 100 FIFO TASK WAITSEM fmuservo() 26 10 FIFO TASK READY ArduCopter() 27 181 FIFO PTHREAD WAITSEM (20007df8) 28 60 FIFO PTHREAD READY (20007df8) 29 59 FIFO PTHREAD WAITSEM (20007df8) nsh> [2:22:19 PM] Kevin Hester: looked the same as in a 'good' case [2:23:21 PM] Andrew Tridgell: its not normal [2:23:28 PM] Andrew Tridgell: prio of AduCopter is 10 [2:23:36 PM] Andrew Tridgell: thats the "overtime" priority [2:23:47 PM] Andrew Tridgell: it means we have something stuck, good! [2:23:56 PM] Kevin Hester: oh - interesting [2:24:11 PM] Andrew Tridgell: run perf again please and paste again [2:24:18 PM] Kevin Hester: so ac was spinning and nuttx has code to bump down priority to keep things working? [2:24:20 PM] Andrew Tridgell: i want to see if loop counter is going up [2:24:29 PM] Andrew Tridgell: apm has code to do that [2:24:48 PM] Andrew Tridgell: (I put it in to cope with devs assuming they could spin readoing from serial etc) [2:25:06 PM] Kevin Hester: good idea ;-) [2:25:11 PM] Kevin Hester: loop is stuck [2:25:21 PM] Andrew Tridgell: APM_loop is still 29999 ? [2:25:25 PM] Kevin Hester: yep [2:25:38 PM] Andrew Tridgell: ok, great, so we now know its a sw bug, excellent! [2:25:52 PM] Andrew Tridgell: now we want a backtrace ;) [2:25:58 PM] Kevin Hester: is there a command in nuttx to dump the backtrace? [2:26:03 PM] Andrew Tridgell: it means we have a loop somewhere that is not exiting [2:26:07 PM] Kevin Hester: or at least look at raw memory? [2:26:28 PM] Andrew Tridgell: not as far as I know [2:26:31 PM] Andrew Tridgell: jtag ..... [2:27:03 PM] Andrew Tridgell: try "task" [2:27:28 PM] Kevin Hester: ok - it seems fairly reproducable now (though if i move things perhaps not). I don't have a jtag connector on this board, so I'm going to add a option on ps to dump tcbs [2:27:31 PM] System: Available commands: /me [text] /add [skypename+] /alertson [text] /alertsoff /help For more help please see http://www.skype.com/go/help.chathelp [2:27:35 PM] Kevin Hester: and stack [2:27:46 PM] Andrew Tridgell: leave this board running if you can [2:27:54 PM] Andrew Tridgell: i may think of more debug [2:28:01 PM] Andrew Tridgell: does "task" show anything useful? [2:28:12 PM] Kevin Hester: task said command not found [2:28:20 PM] Andrew Tridgell: damn [2:29:12 PM] Kevin Hester: what is hexdump? [2:29:17 PM] Andrew Tridgell: dunno [2:29:27 PM] Kevin Hester: i'll grep [2:30:30 PM] Kevin Hester: oh nice [2:30:39 PM] Kevin Hester: it does a hex dump of a file with offset control [2:30:44 PM] Kevin Hester: and /dev/ram0 exists [2:30:51 PM] Andrew Tridgell: have fun! [2:31:05 PM] Kevin Hester: thx - i'll get a stack trace for ardupilot [2:31:06 PM] Kevin Hester: but [2:31:30 PM] Kevin Hester: at least whereever it is spinning [2:32:04 PM] Kevin Hester: hmm [2:32:23 PM] Kevin Hester: what thread bumps the priority down to 10. does the scheduler do that itself? [2:32:29 PM] Kevin Hester: to itself? [2:32:57 PM] Andrew Tridgell: can you check if any other counters are stuck? [2:33:02 PM] Andrew Tridgell: APM_rcout looks low [2:33:16 PM] Kevin Hester: nsh> perf APM_overrun: 0 events APM_loop: 29999 events, 58509554us elapsed, 1950 avg, min 452us max 3962us APM_rcout: 75609 events, 277681us elapsed, 3 avg, min 3us max 450us APM_rcin: 3593298 events, 20695629us elapsed, 5 avg, min 4us max 530us lsm303d_accel_resched: 45900 events lsm303d_extremes: 0 events lsm303d_reg7_resets: 0 events lsm303d_reg1_resets: 0 events lsm303d_mag_read: 382384 events, 21630453us elapsed, 56 avg, min 54us max 57us lsm303d_accel_read: 3054743 events, 180062412us elapsed, 58 avg, min 57us max 60us l3gd20_errors: 0 events l3gd20_reschedules: 39568 events l3gd20_read: 2904519 events, 188757208us elapsed, 64 avg, min 63us max 70us ADC samples: 3823080 events, 11747286us elapsed, 3 avg, min 2us max 12us ms5611_buffer_overflows: 231525 events ms5611_comms_errors: 0 events ms5611_measure: 315069 events, 7306311us elapsed, 23 avg, min 17us max 372us ms5611_read: 315069 events, 36987204us elapsed, 117 avg, min 36us max 552us io rc #: 0 events io write: 0 events, 0us elapsed, 0 avg, min 0us max 0us io update: 171862 events, 182905248us elapsed, 1064 avg, min 996us max 2087us io_badidle : 0 events io_idle : 343757 events io_uarterrs : 0 events io_protoerrs: 0 events io_dmaerrs : 0 events io_crcerrs : 0 events io_timeouts : 0 events io_retries : 0 events io_dmasetup : 343761 events, 6224454us elapsed, 18 avg, min 14us max 305us io_txns : 343761 events, 173279788us elapsed, 504 avg, min 158us max 885us DMA allocations: 2 events APM_uartE: 3283890 events, 20887694us elapsed, 6 avg, min 4us max 7788us APM_uartD: 6561686 events, 126092241us elapsed, 19 avg, min 4us max 8148us APM_uartC: 3284395 events, 18415830us elapsed, 5 avg, min 4us max 7947us APM_uartB: 3283983 events, 16952339us elapsed, 5 avg, min 4us max 7982us APM_uartA: 3284021 events, 61995781us elapsed, 18 avg, min 4us max 8193us APM_storage_errors: 1 events APM_storage: 1 events, 4538us elapsed, 4538 avg, min 4538us max 4538us APM_delay: 3282183 events, 3788641963us elapsed, 1154 avg, min 6us max 101033us APM_IO_timers: 1882421 events, 25536609us elapsed, 13 avg, min 1us max 2699us APM_timers: 3593559 events, 31606126us elapsed, 8 avg, min 3us max 558us EKF_FuseSideslip: 0 events, 0us elapsed, 0 avg, min 0us max 0us EKF_FuseAirspeed: 0 events, 0us elapsed, 0 avg, min 0us max 0us EKF_FuseMagnetometer: 0 events, 0us elapsed, 0 avg, min 0us max 0us EKF_FuseVelPosNED: 0 events, 0us elapsed, 0 avg, min 0us max 0us EKF_CovariancePrediction: 0 events, 0us elapsed, 0 avg, min 0us max 0us EKF_UpdateFilter: 0 events, 0us elapsed, 0 avg, min 0us max 0us DF_errors: 0 events DF_fsync: 0 events, 0us elapsed, 0 avg, min 0us max 0us DF_write: 0 events, 0us elapsed, 0 avg, min 0us max 0us nsh> [2:34:11 PM] Andrew Tridgell: i can't see any others stuck, but a "comm" may help [2:34:24 PM] Kevin Hester: what is comm? [2:34:45 PM] Andrew Tridgell: comm is unix cmd to look for common lines between files [2:35:11 PM] Andrew Tridgell: so we now know the bug is in the mainline code, and its a loop [2:35:17 PM] Andrew Tridgell: some loop that never exits [2:35:18 PM] Kevin Hester: oh i see - i thought you were talking nuttx [2:35:40 PM] Kevin Hester: hmm [2:35:43 PM] Andrew Tridgell: it may be copter specific [2:35:53 PM] Kevin Hester: if some of the counts are going up [2:35:57 PM] Kevin Hester: (as they are) [2:36:06 PM] Andrew Tridgell: most counters are in the other threads [2:36:10 PM] Kevin Hester: ah! [2:36:14 PM] Kevin Hester: ok [2:36:29 PM] Andrew Tridgell: we mostly want timing of IO etc, not main tasks [2:36:46 PM] Kevin Hester: is loop the count of the # of times the scheduler has iterated through its array of tasks? [2:36:47 PM] Andrew Tridgell: thats why we add perf counters there [2:37:02 PM] Andrew Tridgell: its the number of times the loop() function has been called [2:37:25 PM] Andrew Tridgell: (ie. arduino sketch holdover) [2:37:27 PM] Kevin Hester: and loop is the the top scheduler function? (i've never looked into the apm scheduler) [2:37:44 PM] Andrew Tridgell: arduino sketches have setup() and loop() [2:38:05 PM] Andrew Tridgell: main() in arduino is effectively: main() { setup(); while (1) loop(); } [2:38:39 PM] Kevin Hester: so which thread lowers ardupilot to 10? [2:39:01 PM] Kevin Hester: because presumably arducopter thread is spinning somewhere now [2:39:14 PM] Andrew Tridgell: we have a timer that is reset at the start of each loop(). If the timer goes off then we know loop() has gone over time, and we lower main thread prio to 10 [2:39:39 PM] Kevin Hester: ah i see - and that timer callback is from an isr or something? [2:39:53 PM] Andrew Tridgell: look for loop_overtime() in libraries/AP_HAL_PX4/HAL_PX4_Class.cpp [2:39:58 PM] Kevin Hester: ok - thx [2:40:12 PM] Andrew Tridgell: its an hrt call (yes, basically an interrupt) [2:40:24 PM] Kevin Hester: that makes sense [2:40:41 PM] Andrew Tridgell: its set to 100k usec, so 0.1s [2:41:11 PM] Andrew Tridgell: see above file, line 164 [2:41:33 PM] Andrew Tridgell: / this ensures a tight loop waiting on a lower priority driver will eventually give up some time for the driver to run. It will only ever be called if a loop() call runs for more than 0.1 second / hrt_call_after(&loop_overtime_call, 100000, (hrt_callout)loop_overtime, NULL); [2:41:57 PM] Kevin Hester: oooh! i think i found a great way to take a dump [2:42:03 PM] Andrew Tridgell: the line before that increments the perf counter [2:42:04 PM] Andrew Tridgell: oh? [2:42:06 PM] Kevin Hester: cp ram0 /sdcard/ramcopy [2:42:17 PM] Andrew Tridgell: ok, worth a try! [2:43:46 PM] Andrew Tridgell: i'll be back in 20 mins or so

geeksville commented 10 years ago

[2:55:36 PM] Kevin Hester: yay! [2:55:41 PM] Kevin Hester: there is a memory dump [2:55:42 PM] Kevin Hester: http://nuttx.org/Documentation/NuttShell.html#cmdxd [3:15:43 PM] Kevin Hester: ok - walking through task list by hand with xd (ugh) - I'm definitely going to add a dump tcb command to nuttx [3:15:47 PM] Kevin Hester: this sucks ;-) [3:45:50 PM] Kevin Hester: ugh - my map file isn't exactly the same as whatever is on the target. I'm going to need to rebuild. I'll also change ps to include the tcb addr, stack base and stack pointer

tridge commented 10 years ago

A bit more explanation of what Kevin and I found.

Kevin and I noticed that when the freeze bug happens that the APM_loop counter stops incrementing and the priority of the main thread drops to

  1. That indicates that loop() is not returning, and the overtime loop protection code is kicking in:

    https://github.com/diydrones/ardupilot/blob/master/libraries/AP_HAL_PX4/HAL_PX4_Class.cpp#L164

that code temporarily lowers the priority of the main thread to 10 if it takes more than 0.1 seconds to run loop().

The most likely cause is some code in the main thread that is in an infinite loop. I have now added some extra debug code to print the AP_Scheduler task number that is being run when this error happens. It will be printed on both the HAL console (USB) and the debug serial console.

https://github.com/diydrones/ardupilot/commit/849c4905fb9723b0e7dfb0c24ca355fe97b9b33d so anyone who can reproduce the issue should look for a "Overtime" message on the console and tell us the task number.

geeksville commented 10 years ago

Ok - I found the place where the spinning is occurring. the apm task is 14, which is the one_hz_loop. My local build had modifications so I could dump the stackpointer and TCB for the ArduCopter thread. I didn't bother decoding the whole stack (though I have it in case we want to dig deeper). The punchline is:

In every case (both the loss of comms and the LED stuck on phenotypes) the USB stack dies first. This eventually leads to the arducopter thread spinning forever in:

size_t PX4UARTDriver::write(uint8_t c) { ...

while (BUF_SPACE(_writebuf) == 0) {
    if (_nonblocking_writes) {
        return 0;
    }
    hal.scheduler->delay(1);
}

... }

This is getting called from inside of GCS_MAVlink::send_text(), which as mentioned is callled somewhere from inside of the one_hz_loop. I didn't bother decoding further because it was not useful. It is trying to write to the (messed up) USB ttyACM device and that device will never pull from writebuf.

So. We have two problems: 1) P4UARTDriver:::write should be fixed to not spin forever (after some longish timeout we should bail no matter what or in the case of mavlink streams we should always set nonblocking writes). This should be fixed unrelated to problem 2. 2) There is (probably) a bug in the PX4 uart serial driver which is causing it to wedge in some cases.

LorenzMeier commented 10 years ago

I'm tinkering with a NuttX update - not sure it'll help, but it might be worth trying at some point.

tridge commented 10 years ago

that explains why it affects copter more than plane. Copter leaves the serial connections blocking until you arm. When you arm they become non-blocking. Plane marks all ports non-blocking at end of init. So it shouldn't cause a crash in flight.

geeksville commented 10 years ago

@rmackay9 @tridge - why does copter use blocking writes for mavlink streams? I can't think of a good usecase. Is it just that they want 'old' mavlink messages hopefully sent to the gcs when it is first connected?

tridge commented 10 years ago

copter has historically had a lot of console->printf() on startup, combined with a smaller default hal.uartA transmit buffer size to save memory on APM2. In the past that has caused loss of bytes in startup messages when the port is non-blocking. Plane has larger transmit buffer size and writes less messages to the console on startup. Additionally, plane defaults to armed on startup, so must go non-blocking straight away. Copter always has a user controlled arming step, so starting blocking to get messages out was considered safe, then goes non-blocking on arming. We could change some of the factors behind this, but really what we need to do is fix the USB stack.

geeksville commented 10 years ago

re: fixing the usb stack - totally agree. i'll poke at it soon (got to do a couple of things I promised others first)

geeksville commented 10 years ago

One thing that helps is that when the usb port goes bad any io to the old file descriptor returns an invalid error code, so we can start with where the USB driver decides to change that FD.

LorenzMeier commented 10 years ago

Greg Nutt has been very proactive in helping - if we can create a small app in the upstream NuttX repo reproducing this I'm sure he'll spend some time on it.

geeksville commented 10 years ago

My openocd broke with my latest ubuntu, so I'm going to switch to the 'well travelled px4' path of a blackmagic. It should be here on monday. I'll spend some time debugging the USB problem in a regular px4 load, because I worry how hard it will be to reproduce if we change much. For instance:

(note: removed note about GPS, because installing gps didn't make the failure go away). However, it seems that even non 'failing' boots will always eventually fail if left with a USB serial link open long enough. Sort of 'ERB roulette' for each packet exchanged?

sgofferj commented 10 years ago

I finally got to some testing and I could reproduce the segfault on terminal connect with AP2 with a lot of builds, including the 3.1.5 release build. As I could reproduce it also with the 3.1.5, I didn't check every build in between - just a handfull in various time intervals. One thing I noted, however, is that with newer builds, it seems easier to reproduce. With 3.1.5, I had to do quite a few attempts to get it to crash. I also updated my AP2 to the latest code yesterday night (UTC+3) and with that I couldn't yet create a segfault on MAVLink connect and also on terminal connect, it doesn't crash the PH every time now.

rmackay9 commented 10 years ago

I think this is largely resolved with this check-in which turns off blocking writes at the end of copter initialisation. https://github.com/diydrones/ardupilot/commit/7f9cd203771d5bd71a08567acbfe703ef355e749

I guess our next step is to either consider this issue closed or do the right thing and get Greg Nutt to resolve the underlying issue as Lorenz suggests above.

rmackay9 commented 9 years ago

Lorenz has reported there was a NuttX bug to do with memcpy that could cause memory corruption and all the horrible things that can result from that. We know it affected USB communication so perhaps this was the cause of this issue.

The NuttX fix will go out with AC3.2.1 and the next release of Plane.

If anyone is up to it, it might be good to test that we can't reproduce this anymore.

sgofferj commented 9 years ago

I can do that as soon as I seem to have been most successful in reproducing it... Please write a comment in the issue when a testable release is out so I remember.

geeksville commented 9 years ago

Alas. Yesterday I tried a build with this great fix. Unfortunately, USB comms still die when connecting from my linux desktop to the PX4.

On Thu Jan 29 2015 at 11:12:11 PM Stefan Gofferje notifications@github.com wrote:

I can do that as soon as I seem to have been most successful in reproducing it... Please write a comment in the issue when a testable release is out so I remember.

— Reply to this email directly or view it on GitHub https://github.com/diydrones/ardupilot/issues/1161#issuecomment-72158773 .

hsteinhaus commented 9 years ago

Had an assertion fault just a minute ago while USB was connected and vehicle was supplied via battery. Multiple resets via FMU reset button resulted in the same crash. After dis/reconnecting battery and USB, I was no longer able to reproduce the fault.

Software used: lastest PX4/Firmware with NuttX form the submodule (not the diydrones forks) and APM master, running UAVCAN ESCs and GNSS.

LorenzMeier commented 9 years ago

@geeksville @hsteinhaus Could you please provide the exact GIT hashes for PX4/Firmware and PX4/Nuttx and confirm you ran distclean / make archives? I just want to be sure we're not chasing ghosts.

Also please try to avoid to just provide it doesn't work feedback. There is very little to go on and its not helping much to actually resolve the issue. I would appreciate if you could provide stack traces (and the associated ELF file) and a little more context. E.g. I don't even know if you had this issue before, if its specific to a particular computer, if you can reproduce 100% or not..

@geeksville Could you also please try these two: nuttx_bringup branch in PX4/Firmware. And this NuttX branch: https://github.com/PX4/NuttX/tree/usbupdate

geeksville commented 9 years ago

I finally figured out (sort of) why arducopter USB crashes every time with my lenovo laptop (and how to reproduce it reliably). My laptop has two USB ports, one is USB3 the other is only USB2. The USB2 port works 100% of the time, the USB3 port crashes 100% of the time (with current master). I have two USB3 hubs downstream of my normal USB3 port. If I plug in pixhawk into one of them it will die (stop talking serial - investigated no further but can in a few weeks) but the other hub (also USB3) pixhawk works fine with.

(this behavior was also true with the nuttx_bringup branch. But currently I'm using: APM: ArduCopter V3.3-dev (af7765c5) APM: PX4: 3f083e02 NuttX: a164ab1b

On Sat, Jan 31, 2015 at 7:09 AM Lorenz Meier notifications@github.com wrote:

@geeksville https://github.com/geeksville @hsteinhaus https://github.com/hsteinhaus Could you please provide the exact GIT hashes for PX4/Firmware and PX4/Nuttx and confirm you ran distclean / make archives? I just want to be sure we're not chasing ghosts.

Also please try to avoid to just provide it doesn't work feedback. There is very little to go on and its not helping much to actually resolve the issue. I would appreciate if you could provide stack traces (and the associated ELF file) and a little more context. E.g. I don't even know if you had this issue before, if its specific to a particular computer, if you can reproduce 100% or not..

@geeksville https://github.com/geeksville Could you also please try these two: nuttx_bringup branch in PX4/Firmware. And this NuttX branch: https://github.com/PX4/NuttX/tree/usbupdate

— Reply to this email directly or view it on GitHub https://github.com/diydrones/ardupilot/issues/1161#issuecomment-72321248 .

rmackay9 commented 9 years ago

pushing to AC3.4.

rmackay9 commented 8 years ago

This is very old so I think I'm going to close it. There have been some windows related USB improvements that have recently gone into master that may have improved the situation.

I can re-open this issue if people complain and/or the issues reappear.