ArduPilot / ardupilot

ArduPlane, ArduCopter, ArduRover, ArduSub source
http://ardupilot.org/
GNU General Public License v3.0
10.75k stars 17.19k forks source link

UAVCAN message corruption on ArduPlane #8166

Closed JamesStewy closed 6 years ago

JamesStewy commented 6 years ago

Issue details

We are experiencing an issue where UAVCAN messages sent from ArduPilot have a very high corruption rate as seen on a Zubax Babel. Exactly why this starts happening is a little unclear however when it starts working incorrectly it does so with a small percentage of messages sent from ArduPilot corrupting and over a period of about 1-10 seconds this rate of corruption increases to nearly every message. Once the message corruption starts we have not found a way to get it to stop apart from a cold reboot. This corruption appears to affect all sent messages from ArduPilot including uavcan.equipment.actuator.ArrayCommand, uavcan.equipment.esc.RawCommand, uavcan.protocol.NodeStatus and uavcan.protocol.GetNodeInfo responses. Corrupted messages appear in the UAVCAN GUI Tool Bus Monitor as messages whose value is not present and replaced with Transfer could not be decoded along with errors such as CRC mismatch: expected xxxx, got xxxx for payload byte, Toggle bit value x incorrect on frame x, or EOT not found.

We have found that this issue does not occur as described above in Manual mode, but testing in Auto, FBWA and RTL has all resulted in corruption occurring. We have also found that the SCHED_LOOP_RATE parameter has a big effect. With SCHED_LOOP_RATE=300 (recommend value for quadplanes), the safety switch in and with constant RC inputs, ArduPilot puts out around 1800 CAN frames per second and will fail normally within a minute (although we have had it take up to a few minutes to fail). The same test but with SCHED_LOOP_RATE=50 results in ArduPilot only outputting around 300 CAN frames per second and it does not fail (we ran multiple tests including one over 40 minutes and everything continued to work). In these tests constant RC inputs are important as the esc/actuator message output rate from ArduPilot appears to be proportional to the amount of ‘activity’ that is occurring.

All our testing to date has been on multiple Pixhawk 2.1s running ArduPlane 3.8.4. To confirm this isn’t an issue with the CAN network we have tested on a network with other UAVCAN devices all of which continue to work even when ArduPilot starts sending corrupted messages. We have also tested ArduPilot alone with just a Babel connected and the corruption still occurs.

Steps to replicate
  1. Enable UAVCAN actuator and/or ESC outputs in the parameters.
  2. Set SCHED_LOOP_RATE to 300.
  3. Set mode to FBWA.
  4. Press in safety switch.
  5. Provide constant RC inputs until UAVCAN messages start corrupting.
Another Issue

This issue is potentially unrelated but, in our testing, we also found that with UAVCAN enabled it was possible to force ArduPlane into io failsafe due to fmu shutdown by pressing the safety switch too soon after booting. We found that waiting 10 seconds after all the starting noises had completed was enough to mitigate the issue.

Version

ArduPlane 3.8.4

Platform

[ ] All [ ] AntennaTracker [ ] Copter [x] Plane [ ] Rover [ ] Submarine

Airframe type

Quadplane

Hardware type

Pixhawk 2.1

tridge commented 6 years ago

I also found my USBTin adapter, so now I can see the error messages in the GUI as well

tridge commented 6 years ago

after discussions with @pavel-kirienko I've changed the timeout to 1ms

JamesStewy commented 6 years ago

@tridge I tried #7882 using px4-v3, the port labeled as CAN 1 and a 300Hz loop rate and it definitely is a big improvement however after a few minutes the message rate still slows to almost nothing due to message corruption and the gui tool still eventually crashes.

Out of intrest, where was the memory leak?

magicrub commented 6 years ago

@JamesStewy it was a tricky thing where resends would get queued unnecessarily

tridge commented 6 years ago

I've made a fair bit more progress today. I ended up writing new two UAVCAN sniffers. One is here: http://uav.tridgell.net/UAVCAN/uavcan_listen.py that works on Linux and displays message counts. It seemingly showed a very high rate of message loss. I was a bit suspicious, so I wrote a new sniffer that uses a Pixhawk1. I've checked that into libraries/AP_UAVCAN/examples/UAVCAN_sniffer in the #7882 The PH1 based sniffer is a much better way of looking at this issue, as it bypasses any problems with queue length on Linux can devices. It gives results like this:

uavcan.equipment.air_data.StaticPressure: 29 uavcan.equipment.air_data.StaticTemperature: 29 uavcan.equipment.ahrs.MagneticFieldStrength: 20 uavcan.protocol.NodeStatus: 6 uavcan.equipment.gnss.Fix: 10 uavcan.equipment.gnss.Auxiliary: 1 uavcan.equipment.actuator.ArrayCommand: 45 uavcan.equipment.esc.RawCommand: 368

that is printed at 1Hz, showing the count of message types.

I have pushed some improvements to the PR #7882 that prevent the issues that you have been seeing. It now works nicely on HAL_ChibiOS and HAL_PX4 (its better on ChibiOS, but HAL_PX4 isn't too bad). One key change I made is to add a parameter CAN_D1_UC_SRV_RT which is the maximum rate in Hz that servo msgs are sent. It defaults to 50. That leaves more bandwidth for ESCs.

I've run with the new PR for quite a long time under both HAL_PX4 and HAL_ChibiOS with both servo and ESC bitmasks at 255. It seems reliable for me.

tridge commented 6 years ago

I'm going to close this now, as I believe all the required fixes are merged in ArduPilot master. To summaries what the problem was: 1) the Linux based CAN sniffers could not keep up with the packet rates involved in ESC and servo output 2) we were sending ESC and servo msgs very inefficiently in AP_UAVCAN (eg. holding semaphores while doing bus operations) 3) the spin times and delays were much too long 4) the ChibiOS wait() method in libuavcan was rounding down to nearest ms, so when it asked for 200usec (for example) it actually did a poll, which meant the thread sucked up all the available cpu time 5) we were inappropriately sending servo msgs at 400Hz. We now have a parameter for max servo rate (default 50Hz)

With all the changes I can run a copter with HAL_ChibiOS (fmuv3) with SCHED_LOOP_RATE=800 and get 800Hz CAN ESC msgs out reliably with both CAN ports enabled, and still have plenty of spare CPU left. I've left it to run for over an hour with no issues. I can do the same with HAL_PX4, although it does get a fair number of scheduling misses at 800Hz. The CAN ESC msgs do get out though (at around 790Hz).

It would be nice to work out how to make the Linux sniffers handle high packet rates, but that isn't an ArduPilot bug. Meanwhile we can use the UAVCAN_sniffer firmware I put in AP_UAVCAN/examples to sniff at high rates without msg loss.

@JamesStewy many thanks for reporting this, it was a complex problem and led to some really significant improvements in our uavcan support

Evan1010 commented 6 years ago

Perhaps slightly off-topic, but has this been tested with live UAVCAN servos/actuators? I've lately hunted for UAVCAN servos/actuators, but to no avail. Are the implementations you use all external, DIY control boards for regular PWM servos? Or just SITL nodes?

The closest I've found is the CBS-15 produced by Currawong, which was originally going to include support for UAVCAN, but after a lack of customer interest, UAVCAN was dropped (I am currently querying the potential for a continuation, however).

Just let me know if this question is better suited to the discussion forum.

pavel-kirienko commented 6 years ago

@Evan1010 I heard that these support UAVCAN https://www.uavcomp.com/electronics/micro-uartcan-servos/

proficnc commented 6 years ago

Hitech make UAVCAN servos

Evan1010 commented 6 years ago

@proficnc Hitech, or Hitec? I read the same a couple days back, but was only able to find large RS485 (optional) servos on their commercial site. If you don't mind, do you have a part number or information link? I'll contact Hitec directly either way, but I'd rather be prepared when speaking to them.

@pavel-kirienko Thank you for the lead - I've contacted them for more information.

proficnc commented 6 years ago

Hitec sorry about the spelling! They announced them at AUVSI

proficnc commented 6 years ago

MD250MW-CAN

Evan1010 commented 6 years ago

@proficnc Thank you, with the part number I was able to find a document with a product listing.

proficnc commented 6 years ago

Awesome

antoinealb commented 6 years ago

Where did you find that this is a UAVCAN servo? I only found that it supports CAN. Also do you know when they should be available?

Evan1010 commented 6 years ago

@antoinealb I just started talking to them, and they have responded positively that UAVCAN is an available option. That said, it could still be a misunderstanding (i.e., they referred to them at one point as "CAN UAV servos"). I likely can't quote availability, but they've been very good about responding.

@pavel-kirienko @proficnc I was asked "the CAN Sample Point and the Baud-Rate" for the system. I understand that Baud-Rate is probably 1 Mbit/s, but I can't think what "CAN Sample Point" means. Perhaps its PWM Resolution (i.e., 2048 steps), or update rate (50Hz - 300Hz)?

pavel-kirienko commented 6 years ago

@Evan1010 it is the point within a single bit when the signal is sampled.

image

The recommendations for UAVCAN are listed here: http://uavcan.org/Specification/8._Hardware_design_recommendations/#can-bus-parameters

JamesStewy commented 6 years ago

Hi @tridge, sorry for disappearing. Thank you for fixing this issue.

Do you currently have a time frame for a beta of ArduPlane 3.9 with this fix in it? If not, would it be possible to backport it to AdruPlane 3.8?

Thanks, James