ArduPilot / ardupilot

ArduPlane, ArduCopter, ArduRover, ArduSub source
http://ardupilot.org/
GNU General Public License v3.0

watchdog reset problems #11232

Open Jaaaky opened 5 years ago

Jaaaky commented 5 years ago

Bug report

Issue details

Testing the master code, I've found some problems caused by, or related to, watchdog resets:

1- The accelcalsimple MAVLink command instantly triggers a watchdog reset; I've sent pull request #11231 with a fix.
2- Setting Q_PLANE = 1 and SCHED_LOOP_RATE to 400 on a Pixhawk triggers IOMCU_RESET frequently. I thought it might be related to the SD card, but removing the card completely brought little improvement.
3- After an IOMCU reset, "prearm: barometer not healthy" is triggered, preventing safe arming.
4- Last and most annoying: the Pixhawk connection dies after accelcal (the usual one, not the simple one). The IOMCU is reset during calibration, usually multiple times, causing the USB interface to fail, and I had to power cycle the board to reconnect. This happens almost every time accelcal is tried.

Also on master: a few months ago I noticed that the servo outputs are usually reset and halted during accel calibration until it finishes (tell me if this needs a separate bug report). I think it's related: the IOMCU is somehow halted during calibration, which causes an IOMCU watchdog reset. As I understand it, expect_delay_ms is not passed to the IOMCU, and the IOMCU currently can't make use of it anyway because no timer thread runs there. Correct me if I'm wrong.
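To illustrate the reasoning about expect_delay_ms: on the FMU, the scheduler's expect_delay_ms() lets code that knows it is about to block (such as a calibration step) announce the delay so that it is not treated as a stall. The following is a hypothetical, self-contained model of that idea as a simple software stall monitor; the soft_wdg_* names and fields are invented for illustration and are not ArduPilot's scheduler code. A processor that never receives such an announcement, as described above for the IOMCU, would only see the stall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software stall monitor with an "expected delay" window.
 * Illustrative only; not ArduPilot's scheduler implementation. */
typedef struct {
    uint32_t last_pat_ms;     /* last time the main loop reported progress */
    uint32_t timeout_ms;      /* stall limit outside an announced delay */
    uint32_t expect_start_ms; /* when the expected delay was announced */
    uint32_t expect_len_ms;   /* length of the announced delay, 0 = none */
} soft_wdg_t;

/* Called by the main loop each time it makes progress. */
void soft_wdg_pat(soft_wdg_t *w, uint32_t now_ms)
{
    assert(w != NULL);
    w->last_pat_ms = now_ms;
}

/* Called before a known long operation (e.g. a calibration step);
 * passing len_ms = 0 cancels the window. */
void soft_wdg_expect_delay(soft_wdg_t *w, uint32_t now_ms, uint32_t len_ms)
{
    assert(w != NULL);
    w->expect_start_ms = now_ms;
    w->expect_len_ms = len_ms;
}

/* True if the loop has stalled longer than allowed. Unsigned subtraction
 * keeps elapsed times correct across uint32_t millisecond wrap-around. */
bool soft_wdg_expired(const soft_wdg_t *w, uint32_t now_ms)
{
    assert(w != NULL);
    if (w->expect_len_ms != 0 &&
        (uint32_t)(now_ms - w->expect_start_ms) < w->expect_len_ms) {
        return false; /* inside an announced delay: not a fault */
    }
    return (uint32_t)(now_ms - w->last_pat_ms) > w->timeout_ms;
}
```

In this model, if the loop is still stalled when the announced window ends, the monitor reports a fault immediately; only code that knows about the window can avoid the reset.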

Version: master, commit 4a237af09307 (ArduPlane)

Platform
[ ] All
[ ] AntennaTracker
[ ] Copter
[x] Plane
[ ] Rover
[ ] Submarine
Only tested on ArduPlane.

Hardware type: Pixhawk

Logs: Are any needed?

By the way, I was trying to trace the accelcal problem by modifying the IOMCU code, building iofirmware and flashing it with arduplane.apj. After many trials the IOMCU stopped working completely, on any firmware, even old NuttX ones. I'm not sure whether it's merely a hardware problem; I'm afraid it was caused by too many IOMCU watchdog resets or by frequent firmware flashing. I'm just recording this note in case someone faces the same problem on master.

tridge commented 5 years ago

Thanks! Some of these issues are now fixed, but I'd like to run through them all with more testing.

Jaaaky commented 5 years ago

Follow-up:

Do we really need hal.scheduler->delay(2000); before it? On a Pixhawk1 I tried skipping it on watchdog_reset: the reset was amazingly fast, almost instant, and I didn't get any scheduler panic. But I don't know what could happen on slower boards. What if the IOMCU was also stuck at that critical moment, waiting for its 2 second watchdog reset timeout? Would an FMU scheduler panic occur? Would another watchdog reset happen? Could a shorter, ~500 ms IOMCU watchdog avoid this?

Would you like me to add an IOMCU reset counter like this one to the printf output to help debugging? https://github.com/ArduPilot/ardupilot/compare/master...Jaaaky:iomcu_reset_counter?expand=1
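A reset counter of this kind can be sketched on the FMU side, assuming (hypothetically) that the IOMCU reports its own uptime in every status reply, so that a drop in that uptime means the IOMCU rebooted in between. The io_reset_tracker_* names are invented for illustration; this is not necessarily how the linked branch or AP_IOMCU detects resets.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical FMU-side IOMCU reset tracker. Assumes the IOMCU reports
 * its own uptime in milliseconds with every status reply; if that value
 * goes backwards, the IOMCU must have rebooted in between. */
typedef struct {
    uint32_t last_io_uptime_ms; /* uptime from the previous reply */
    uint32_t reset_count;       /* resets detected since FMU boot */
    int have_sample;            /* 0 until the first reply is seen */
} io_reset_tracker_t;

/* Feed one uptime sample; returns 1 if a reset was detected. */
int io_reset_tracker_update(io_reset_tracker_t *t, uint32_t io_uptime_ms)
{
    assert(t != NULL);
    int reset = 0;
    /* A 32-bit ms uptime also wraps after ~49.7 days, which this simple
     * check would miscount as one reset. */
    if (t->have_sample && io_uptime_ms < t->last_io_uptime_ms) {
        t->reset_count++;
        reset = 1;
        printf("IOMCU reset #%lu (uptime was %lu ms)\n",
               (unsigned long)t->reset_count,
               (unsigned long)t->last_io_uptime_ms);
    }
    t->last_io_uptime_ms = io_uptime_ms;
    t->have_sample = 1;
    return reset;
}
```

Keeping the count on the FMU means it survives IOMCU resets without any persistent storage on the IOMCU itself.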

Jaaaky commented 5 years ago

@tridge I think there is definitely a problem in the IOMCU. It resets too often, usually every 13 to 15 minutes, without any apparent reason. I've tested this on multiple different Pixhawk1 boards: just powered over USB with LOG_DISARMED enabled and left for several hours, with no cables, external sensors or servos connected, both with and without an SD card. I found that the watchdog timeout doesn't make a real difference to the reset rate, so I've permanently changed it to "IWDGD.RLR = 0x1FF" in my builds and did hours of flight testing without a problem. It still resets, but with no noticeable problems so far.
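For reference, the STM32 independent watchdog (IWDG) timeout follows from the LSI clock, the prescaler divider (4 << PR) and the 12-bit reload value: timeout = (4 << PR) * (RLR + 1) / f_LSI. The sketch below assumes a /16 prescaler (PR = 2), which is consistent with the ~2 s timeout mentioned above for RLR = 0xFFF at a 32 kHz LSI; the function names are hypothetical. The LSI is an RC oscillator with wide tolerance (nominally 32 kHz on STM32F4 and 40 kHz on STM32F1 parts), so treat the results as approximate.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate IWDG timeout in ms for prescaler field PR (0..6, divider
 * 4 << PR) and reload value RLR (12-bit), given an assumed LSI frequency. */
double iwdg_timeout_ms(uint32_t pr, uint32_t rlr, double lsi_hz)
{
    assert(pr <= 6);
    assert(rlr <= 0xFFF); /* RLR is a 12-bit field */
    double divider = (double)(4u << pr);
    return 1000.0 * divider * (double)(rlr + 1u) / lsi_hz;
}

/* Inverse: reload value for a target timeout, clamped to the 12-bit range. */
uint32_t iwdg_rlr_for_ms(uint32_t pr, double timeout_ms, double lsi_hz)
{
    assert(pr <= 6);
    double ticks = timeout_ms * lsi_hz / (1000.0 * (double)(4u << pr));
    if (ticks < 1.0) {
        return 0;
    }
    if (ticks > 4096.0) {
        return 0xFFF;
    }
    return (uint32_t)ticks - 1u;
}
```

Under these assumptions, RLR = 0xFFF gives 2048 ms and RLR = 0x1FF gives 256 ms at 32 kHz (about 205 ms at 40 kHz), while a ~500 ms timeout at 32 kHz corresponds to RLR = 999.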

But I think we should find a way to know why it resets. I saw you've added some useful log info for the FMU watchdog that helps diagnose the cause and the stuck task. Isn't it possible for the IOMCU to send such info to the FMU when it resets, too?
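One low-cost source of such information is the MCU's own reset-cause flags: on STM32 parts, RCC_CSR records whether the last reset came from the independent watchdog, window watchdog, low-power management, software, power-on, or the NRST pin. The sketch below decodes those bits and packs the result into a status word; the enum, the packing layout and the idea of forwarding it to the FMU are hypothetical, and the bit positions should be checked against the reference manual for the specific MCU.

```c
#include <assert.h>
#include <stdint.h>

/* RCC_CSR reset flag bits as documented for STM32F1/F4 parts. */
#define CSR_PINRSTF  (1u << 26)
#define CSR_PORRSTF  (1u << 27)
#define CSR_SFTRSTF  (1u << 28)
#define CSR_IWDGRSTF (1u << 29)
#define CSR_WWDGRSTF (1u << 30)
#define CSR_LPWRRSTF (1u << 31)

/* Hypothetical reset-cause codes the IOMCU could report to the FMU. */
typedef enum {
    IO_RESET_UNKNOWN = 0,
    IO_RESET_POWER_ON,
    IO_RESET_PIN,
    IO_RESET_SOFTWARE,
    IO_RESET_IWDG,
    IO_RESET_WWDG,
    IO_RESET_LOW_POWER,
} io_reset_cause_t;

/* Internal resets typically also drive NRST, so PINRSTF is often set
 * alongside the real cause; check the more specific flags first. */
io_reset_cause_t decode_reset_cause(uint32_t csr)
{
    if (csr & CSR_IWDGRSTF) return IO_RESET_IWDG;
    if (csr & CSR_WWDGRSTF) return IO_RESET_WWDG;
    if (csr & CSR_LPWRRSTF) return IO_RESET_LOW_POWER;
    if (csr & CSR_SFTRSTF)  return IO_RESET_SOFTWARE;
    if (csr & CSR_PORRSTF)  return IO_RESET_POWER_ON;
    if (csr & CSR_PINRSTF)  return IO_RESET_PIN;
    return IO_RESET_UNKNOWN;
}

/* Hypothetical 16-bit status word for the FMU: cause in bits 0-3,
 * saturating reset count in bits 4-15. */
uint16_t pack_reset_status(io_reset_cause_t cause, uint32_t reset_count)
{
    assert((uint32_t)cause <= 0xFu);
    if (reset_count > 0xFFFu) {
        reset_count = 0xFFFu;
    }
    return (uint16_t)((reset_count << 4) | (uint32_t)cause);
}
```

On hardware, RCC_CSR would be read once early at boot and then cleared by setting RMVF (bit 24), so that the next reset reports fresh flags.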

Jaaaky commented 5 years ago

Hi @tridge, I recently had a case where the IOMCU didn't recover after freezing. It reset once or twice, then stayed frozen until I power cycled the board. Is there any way to get IO logs to debug such issues? I can try to work on this if you guide me.
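One possible shape for IO-side logging is a small ring buffer of event records in IOMCU RAM that the FMU could read out over the existing link. This is a hypothetical sketch (the io_log_* names and event codes are invented). To survive an IOMCU watchdog reset, the buffer would also need to live in a RAM section that startup code does not zero, guarded by a magic value to reject garbage after power-on; that part is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IOMCU event log: a fixed-size ring buffer the FMU could
 * read out. Sizes and event codes are illustrative only. */
#define IO_LOG_LEN 16u
_Static_assert((IO_LOG_LEN & (IO_LOG_LEN - 1u)) == 0,
               "IO_LOG_LEN must be a power of two");

typedef struct {
    uint32_t time_ms; /* IOMCU uptime when the event was recorded */
    uint16_t event;   /* hypothetical event code */
    uint16_t arg;     /* event-specific argument */
} io_log_entry_t;

typedef struct {
    io_log_entry_t entries[IO_LOG_LEN];
    uint32_t head; /* total entries ever written; slot = head & mask */
} io_log_t;

/* Append one entry, overwriting the oldest once the buffer is full. */
void io_log_add(io_log_t *buf, uint32_t time_ms, uint16_t event, uint16_t arg)
{
    assert(buf != NULL);
    io_log_entry_t *e = &buf->entries[buf->head & (IO_LOG_LEN - 1u)];
    e->time_ms = time_ms;
    e->event = event;
    e->arg = arg;
    buf->head++;
}

/* Copy the newest n entries, oldest first; returns the number copied. */
uint32_t io_log_read_latest(const io_log_t *buf, io_log_entry_t *out, uint32_t n)
{
    assert(buf != NULL && (out != NULL || n == 0));
    uint32_t avail = buf->head < IO_LOG_LEN ? buf->head : IO_LOG_LEN;
    if (n > avail) {
        n = avail;
    }
    uint32_t start = buf->head - n;
    for (uint32_t i = 0; i < n; i++) {
        out[i] = buf->entries[(start + i) & (IO_LOG_LEN - 1u)];
    }
    return n;
}
```

A power-of-two length lets the index wrap with a mask instead of a modulo, which is cheap on a small MCU like the IOMCU.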

IamPete1 commented 3 years ago

Any updates? Have these issues been resolved?

IamPete1 commented 2 years ago

@Jaaaky any update?

Jaaaky commented 2 years ago

@IamPete1 Not sure if you can consider it resolved. The initially reported issues should be resolved, but the cases where the IOMCU didn't recover after freezing are still there. They can be reproduced using an ICE engine with a "bad" throttle servo mounting.

peterbarker commented 2 months ago

I have a feeling this has to have been electrical.

I haven't heard of any similar cases.

@Jaaaky did you ever get to the bottom of this?

Jaaaky commented 2 months ago

Yes, it's due to EMI pulses from bad spark plugs when the throttle servo is mounted on top of the engine. The IOMCU has to be hard reset.