ArduPilot / ardupilot

ArduPlane, ArduCopter, ArduRover, ArduSub source
http://ardupilot.org/
GNU General Public License v3.0

Copter locks up in flight, motors stop, crashes vehicle w/ Pixhawk 2 #11642

Closed Pedals2Paddles closed 5 years ago

Pedals2Paddles commented 5 years ago

Bug report

At least three different people, with four different Solos running 3.7-dev, have now had ArduCopter crash and stop in flight.

Thus far, it has not happened to anyone using the Green Cube, and to my knowledge it hasn't happened to anyone running any other hardware. It appears to be isolated to the old 3DR-manufactured Pixhawk 2 for the Solo.

I've put the latest master from yesterday up for users to install, since I think it has some better logging. I'm hoping some users will be willing to test it, knowing their Solo could still crash at any time.

This appears to be what happened back in this issue, which at the time seemed like a watchdog issue. As it turns out, it was the watchdog doing its job. (https://github.com/ArduPilot/ardupilot/issues/11296)

This is a video from a tablet that was screen recording at the time of the incident. It shows that the video stream, wifi link, and companion computer were still powered up and running, despite the autopilot failing. https://www.youtube.com/watch?v=a96sfrEvaWY&feature=youtu.be&t=141. This corresponds to log #4 in the zip file from before the enhanced logging.

Version ArduCopter master 3.7-dev

Platform [ ] All [ ] AntennaTracker [X] Copter [ ] Plane [ ] Rover [ ] Submarine

Airframe type 3DR Solo

Hardware type Pixhawk 2 cube that is OEM on the Solo, although this is really no different from any other Pixhawk 2 hardware on any other vehicle. As far as we know, all vehicles with this failure were equipped with the Solo gimbal.

Logs This first zip is 4 different logs from before the new enhanced logging. One of these incidents happened with Watchdog enabled. The others were either before the watchdog or with it disabled. Solo.Shutdown.Logs.Before.New.Logging.zip

Once we get logs using the new logging, I'll put them below.

AndKe commented 5 years ago

I am on vacation for 5 more weeks before I can test. Did you update the firmware source that Solex uses?

Pedals2Paddles commented 5 years ago

Discussed at length on this evening's dev call. It could be one or more of several possible issues. The biggest thing we need now is logs of the failures using the most current master, which includes Tridge's enhanced failure logging. I've posted this on our Facebook group and hope users will test and provide logs.

pkocmoud commented 5 years ago

@Pedals2Paddles I updated my stock Solo with gimbal to 3.7 using SidePilot. I had a really nice flight, great work on this. Do you have an estimate of how much flight time the failed aircraft had before the issue occurred? It would help establish an estimated MTBF.

Pedals2Paddles commented 5 years ago

There doesn't seem to be a consistent time I'm aware of.

PhantomShuttle commented 5 years ago

I also encountered this situation. After repeated tests, the restart occurs when the external magnetometer is connected via the I2C line. The restart error shows 0x800 and the time until the restart varies. When the external I2C magnetometer is removed, there is no restart. I suspect that the STM32 I2C hardware is faulty.

tridge commented 5 years ago

@PhantomShuttle do you have logs of the issue happening? I'd really like to see a log with this happening on master as we have extra logging that could help us track it down.

Pedals2Paddles commented 5 years ago

Lots of people have been flying master without incident. I believe there was some thought it could have just been corrupt builds? @peterbarker @tridge @davidbuzz

peterbarker commented 5 years ago

Poke @PhantomShuttle - logs would be useful!

Pedals2Paddles commented 5 years ago

Testing so far:

Pedals2Paddles commented 5 years ago

On today's dev call, we concluded the probable cause is a flood of interrupts on I2C crashing the flight controller. The interrupt storm is likely caused by noise or flaky connections on any of the I2C devices or their wiring. @tridge is going to work on some defensive code to detect and prevent this from happening.

This is something that could happen to any vehicle. The Solo is more vulnerable due to its many I2C devices on many small, vulnerable wires. This includes the leg compass and its wires, the smart battery and its SMBus wires, and the motor pod LEDs and their wires. Given that multiple instances of this crash were on modified Solos, the probability of one of these things being compromised is even higher.

This could also explain why it is so difficult for other people to reproduce, yet easy to reproduce over and over on the same vehicle. If there isn't a hardware or wiring anomaly, it's never going to happen. If there is, it will probably keep happening.

So I think we're on final approach to a solution here. More to follow after testing from Tridge.
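To make the "detect and prevent" idea concrete, here is a minimal host-side sketch of an interrupt-rate guard. It is not the fix that eventually landed and it is not ArduPilot or ChibiOS code: the class `IrqStormGuard`, the helper `simulate_irqs`, and the window and threshold values are all hypothetical names and numbers chosen for illustration. The assumption is that a handler can consult a monotonic microsecond clock and, once the guard trips, mask the offending interrupt source (the hardware-specific masking step is left out).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical rate guard for a single interrupt source. Names are
// illustrative only; they are not ArduPilot or ChibiOS APIs.
class IrqStormGuard {
public:
    IrqStormGuard(uint32_t window_us, uint32_t max_irqs_per_window)
        : window_us_(window_us), max_irqs_(max_irqs_per_window) {
        assert(window_us > 0 && max_irqs_per_window > 0);
    }

    // Call once per interrupt with a monotonic microsecond timestamp.
    // Returns true while the interrupt should still be serviced, and
    // false once the rate limit is exceeded (the caller would then mask
    // the interrupt source and flag the bus as faulty).
    bool on_interrupt(uint32_t now_us) {
        if (tripped_) {
            return false;
        }
        // Unsigned subtraction keeps this correct across timer wraparound.
        if (now_us - window_start_us_ >= window_us_) {
            window_start_us_ = now_us;
            count_ = 0;
        }
        if (++count_ > max_irqs_) {
            tripped_ = true;
            return false;
        }
        return true;
    }

    bool tripped() const { return tripped_; }

    // Re-arm the guard, for example after the bus has been reset.
    void rearm(uint32_t now_us) {
        window_start_us_ = now_us;
        count_ = 0;
        tripped_ = false;
    }

private:
    uint32_t window_us_;
    uint32_t max_irqs_;
    uint32_t window_start_us_ = 0;
    uint32_t count_ = 0;
    bool tripped_ = false;
};

// Host-side simulation: deliver n interrupts spaced period_us apart and
// return how many of them the guard allowed to be serviced.
uint32_t simulate_irqs(IrqStormGuard &guard, uint32_t start_us,
                       uint32_t period_us, uint32_t n) {
    uint32_t serviced = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (guard.on_interrupt(start_us + i * period_us)) {
            serviced++;
        }
    }
    return serviced;
}
```

With a 1000 microsecond window and a limit of 50, simulated interrupts arriving every 100 microseconds never trip the guard, while interrupts arriving every microsecond trip it on the 51st. A real handler would also need to decide how and when to re-enable the source, which this sketch only hints at with `rearm`.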

proficnc commented 5 years ago

Do we have scope traces showing evidence of this?

That system is one of the most proven and tested systems in the market. Not that I want to defend I2C, it’s terrible, and needs to go, but why now? This issue didn’t exist before... the cables to each location are well designed and sized for their use. Most runs on the solo are shielded, and all the cables are short... again, I know this is no shining star of engineering perfection, but there is no excuse in the world for any i2c issues to bring down the flight controller.

Pedals2Paddles commented 5 years ago

Nope, no scope traces. It's just where process of elimination is leading, especially since the users having this issue had recently been opening up their Solos and moving things around, including Peter's test Solo that reproduced it. The SMBus wires for the battery are actually very vulnerable, and the compass connection to the board is also vulnerable, both generally when someone is trying to pry the main board up and out. The ESC wires could be as well when someone is yanking on them while prying the main board up and out. I don't think this has actually happened to an unmodified Solo. So I would agree that if it isn't being roughed up, it is probably perfectly safe.

As for it never having happened before, it may actually have happened; we just couldn't have known. Absent the new watchdog, it would have been diagnosed as a power failure or battery failure, and there have been many of those over the years. Most of them probably were power or battery failures, but some could certainly have been this as well. Now the watchdog is making it known.

Or we could all be wrong since it's speculation, intermittent, and difficult to reproduce. Tridge's test code is going to induce an i2c interrupt storm to see if the results match. Hoping they do so we can be somewhat confident in our best guess!

tridge commented 5 years ago

I think we now fully know what was happening. See the fix in #12134; detailed discussion here: http://www.chibios.com/forum/viewtopic.php?f=35&t=5198&p=36118#p36118