ArduPilot / ardupilot

ArduPlane, ArduCopter, ArduRover, ArduSub source
http://ardupilot.org/
GNU General Public License v3.0
10.75k stars 17.2k forks

Copter: WatchDog reset in-flight #14582

Open VDLJu opened 4 years ago

VDLJu commented 4 years ago

Bug report

Issue details The autopilot was reset by a watchdog in flight while executing a mission. The reset happened after executing the last mission item, which was a LOITER_WAIT command for 10 seconds. It occurred a few seconds after execution of LOITER_WAIT.

The following watchdog error line was logged (screenshot):

Task: -2 (the fast loop had started)
FL: Fault Line 100, the source code line number where the fault occurred
FT: Fault Type 3 (3 = Hard Fault, the most common)
FA: Fault Address (in memory) 404947019
FP: Thread Priority 183
ICSR: Interrupt Control and State Register 4196355
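The FA and ICSR values above are logged in decimal, which makes them hard to read. As a rough sketch, they can be converted to hex and the ICSR split into its bit fields. The bit positions below assume the ARMv7-M ICSR layout (an assumption, not something stated in the report), and `decode_icsr` is a hypothetical helper name, not an ArduPilot function.

```python
# Sketch: decode the decimal FA and ICSR values from the watchdog log line.
# Bit positions assume the ARMv7-M ICSR layout; check them against the
# reference manual for the actual core before relying on them.

def decode_icsr(icsr: int) -> dict:
    """Split an ICSR value into a few of its documented fields."""
    return {
        "VECTACTIVE": icsr & 0x1FF,           # bits [8:0], active exception number
        "RETTOBASE": (icsr >> 11) & 1,        # bit 11
        "VECTPENDING": (icsr >> 12) & 0x1FF,  # bits [20:12], pending exception number
        "ISRPENDING": (icsr >> 22) & 1,       # bit 22
    }

fault_address = 404947019  # FA from the log line
icsr = 4196355             # ICSR from the log line

fields = decode_icsr(icsr)
print(f"FA   = 0x{fault_address:08X}")
print(f"ICSR = 0x{icsr:08X}")
for name, value in fields.items():
    print(f"{name} = {value}")
```

Under that assumed layout, VECTACTIVE decodes to 3, which is the HardFault exception number and agrees with FT 3 in the log line.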

Logs bin log pre WD reset, bin log after WD reset, telemetry log

Version ArduCopter 4.0.3

Platform [ ] All [ ] AntennaTracker [X] Copter [ ] Plane [ ] Rover [ ] Submarine

Airframe type X4 copter

Hardware type Cube black

Previous discussion in the forum, link to it https://discuss.ardupilot.org/t/crash-ac-4-0-3-watchdog-reset-in-flight-while-executing-a-mission/57652

rmackay9 commented 4 years ago

Thanks for the very detailed report including logs.

tridge commented 4 years ago

@VDLJu I have looked at this quite carefully and unfortunately I don't yet have a clue as to the cause. I notice a high baudrate of 921600 on telem1. Was there a companion computer attached? If so, do you have a record of the mavlink stream to/from the companion computer?

tridge commented 4 years ago

The closest thing I have to a clue on this one so far is the thread priority of 183 in the WDOG message. A priority of 183 means the monitor thread was running at the time of the fault. The monitor thread uses very little CPU (it sleeps almost all the time), so the fact it was running could be significant. I did wonder if the stack size of the monitor thread, which is 512, is enough when there is a delay that triggers the MON logging. I set up a test to reproduce that and found it does have enough stack (about 192 bytes free when logging a MON msg). Right now the only guess I have is a nested interrupt happening during a MON message write causing stack corruption, but I can't prove that at all, and I can't reproduce it.

VDLJu commented 4 years ago

> @VDLJu I have looked at this quite carefully and unfortunately I don't yet have a clue as to the cause. I notice a high baudrate of 921600 on telem1. Was there a companion computer attached? If so, do you have a record of the mavlink stream to/from the companion computer?

Unfortunately, the companion computer doesn't log autopilot state. It mostly saves some external events, like image capture, and performs some realtime functions, like translating mavlink to RC telemetry format. The high baudrate is just there to minimize latency.

Let me know if I can help you somehow.

VDLJu commented 4 years ago

I thought it would be better to share some previous flights, in case these can give some insight into this mystery.

Flight -1 prior to the crash

Flight -2 prior to the crash

mmk0102 commented 4 years ago

FL: Fault Line 100, the source code line number where the fault occurred - can it give us some information?

rmackay9 commented 4 years ago

@mmk0102, yes, I think Peter and Tridge did use that information and narrowed down which line was last executed before the watchdog was triggered, but it wasn't clear how this could possibly cause the problem.

peterbarker commented 3 years ago

@tridge was this the DMA-teardown/setup race condition bug?

kumariitian121 commented 2 years ago

anyone working on this bug?

rmackay9 commented 2 years ago

@kumariitian121,

I suspect this particular watchdog has been fixed and this was on a pretty old version of AP (4.0.x). Have you encountered a watchdog reset with 4.1.x or 4.2.0?

amefabris commented 1 year ago

I did, using FW 4.2.0 on a mroControlZeroF7: it gets triggered every time I switch to LAND mode. I did a lot of flights with the same FW on a mroControlZeroH7 and never encountered this issue...