Closed JamesStewy closed 6 years ago
I also found my USBTin adapter, so now I can see the error messages in the GUI as well
after discussions with @pavel-kirienko I've changed the timeout to 1ms
@tridge I tried #7882 using px4-v3
, the port labeled as CAN 1
and a 300Hz loop rate and it definitely is a big improvement however after a few minutes the message rate still slows to almost nothing due to message corruption and the gui tool still eventually crashes.
Out of intrest, where was the memory leak?
@JamesStewy it was a tricky thing where resends would get queued unnecessarily
I've made a fair bit more progress today. I ended up writing new two UAVCAN sniffers. One is here: http://uav.tridgell.net/UAVCAN/uavcan_listen.py that works on Linux and displays message counts. It seemingly showed a very high rate of message loss. I was a bit suspicious, so I wrote a new sniffer that uses a Pixhawk1. I've checked that into libraries/AP_UAVCAN/examples/UAVCAN_sniffer in the #7882 The PH1 based sniffer is a much better way of looking at this issue, as it bypasses any problems with queue length on Linux can devices. It gives results like this:
uavcan.equipment.air_data.StaticPressure: 29 uavcan.equipment.air_data.StaticTemperature: 29 uavcan.equipment.ahrs.MagneticFieldStrength: 20 uavcan.protocol.NodeStatus: 6 uavcan.equipment.gnss.Fix: 10 uavcan.equipment.gnss.Auxiliary: 1 uavcan.equipment.actuator.ArrayCommand: 45 uavcan.equipment.esc.RawCommand: 368
that is printed at 1Hz, showing the count of message types.
I have pushed some improvements to the PR #7882 that prevent the issues that you have been seeing. It now works nicely on HAL_ChibiOS and HAL_PX4 (its better on ChibiOS, but HAL_PX4 isn't too bad). One key change I made is to add a parameter CAN_D1_UC_SRV_RT which is the maximum rate in Hz that servo msgs are sent. It defaults to 50. That leaves more bandwidth for ESCs.
I've run with the new PR for quite a long time under both HAL_PX4 and HAL_ChibiOS with both servo and ESC bitmasks at 255. It seems reliable for me.
I'm going to close this now, as I believe all the required fixes are merged in ArduPilot master. To summaries what the problem was: 1) the Linux based CAN sniffers could not keep up with the packet rates involved in ESC and servo output 2) we were sending ESC and servo msgs very inefficiently in AP_UAVCAN (eg. holding semaphores while doing bus operations) 3) the spin times and delays were much too long 4) the ChibiOS wait() method in libuavcan was rounding down to nearest ms, so when it asked for 200usec (for example) it actually did a poll, which meant the thread sucked up all the available cpu time 5) we were inappropriately sending servo msgs at 400Hz. We now have a parameter for max servo rate (default 50Hz)
With all the changes I can run a copter with HAL_ChibiOS (fmuv3) with SCHED_LOOP_RATE=800 and get 800Hz CAN ESC msgs out reliably with both CAN ports enabled, and still have plenty of spare CPU left. I've left it to run for over an hour with no issues. I can do the same with HAL_PX4, although it does get a fair number of scheduling misses at 800Hz. The CAN ESC msgs do get out though (at around 790Hz).
It would be nice to work out how to make the Linux sniffers handle high packet rates, but that isn't an ArduPilot bug. Meanwhile we can use the UAVCAN_sniffer firmware I put in AP_UAVCAN/examples to sniff at high rates without msg loss.
@JamesStewy many thanks for reporting this, it was a complex problem and led to some really significant improvements in our uavcan support
Perhaps slightly off-topic, but has this been tested with live UAVCAN servos/actuators? I've lately hunted for UAVCAN servos/actuators, but to no avail. Are the implementations you use all external, DIY control boards for regular PWM servos? Or just SITL nodes?
The closest I've found is the CBS-15 produced by Currawong, which was originally going to include support for UAVCAN, but after a lack of customer interest, UAVCAN was dropped (I am currently querying the potential for a continuation, however).
Just let me know if this question is better suited to the discussion forum.
@Evan1010 I heard that these support UAVCAN https://www.uavcomp.com/electronics/micro-uartcan-servos/
Hitech make UAVCAN servos
@proficnc Hitech, or Hitec? I read the same a couple days back, but was only able to find large RS485 (optional) servos on their commercial site. If you don't mind, do you have a part number or information link? I'll contact Hitec directly either way, but I'd rather be prepared when speaking to them.
@pavel-kirienko Thank you for the lead - I've contacted them for more information.
Hitec sorry about the spelling! They announced them at AUVSI
MD250MW-CAN
@proficnc Thank you, with the part number I was able to find a document with a product listing.
Awesome
Where did you find that this is a UAVCAN servo? I only found that it supports CAN. Also do you know when they should be available?
@antoinealb I just started talking to them, and they have responded positively that UAVCAN is an available option. That said, it could still be a misunderstanding (i.e., they referred to them at one point as "CAN UAV servos"). I likely can't quote availability, but they've been very good about responding.
@pavel-kirienko @proficnc I was asked "the CAN Sample Point and the Baud-Rate" for the system. I understand that Baud-Rate is probably 1 Mbit/s, but I can't think what "CAN Sample Point" means. Perhaps its PWM Resolution (i.e., 2048 steps), or update rate (50Hz - 300Hz)?
@Evan1010 it is the point within a single bit when the signal is sampled.
The recommendations for UAVCAN are listed here: http://uavcan.org/Specification/8._Hardware_design_recommendations/#can-bus-parameters
Hi @tridge, sorry for disappearing. Thank you for fixing this issue.
Do you currently have a time frame for a beta of ArduPlane 3.9 with this fix in it? If not, would it be possible to backport it to AdruPlane 3.8?
Thanks, James
Issue details
We are experiencing an issue where UAVCAN messages sent from ArduPilot have a very high corruption rate as seen on a Zubax Babel. Exactly why this starts happening is a little unclear however when it starts working incorrectly it does so with a small percentage of messages sent from ArduPilot corrupting and over a period of about 1-10 seconds this rate of corruption increases to nearly every message. Once the message corruption starts we have not found a way to get it to stop apart from a cold reboot. This corruption appears to affect all sent messages from ArduPilot including
uavcan.equipment.actuator.ArrayCommand
,uavcan.equipment.esc.RawCommand
,uavcan.protocol.NodeStatus
anduavcan.protocol.GetNodeInfo
responses. Corrupted messages appear in the UAVCAN GUI Tool Bus Monitor as messages whose value is not present and replaced withTransfer could not be decoded
along with errors such asCRC mismatch: expected xxxx, got xxxx for payload byte
,Toggle bit value x incorrect on frame x
, orEOT not found
.We have found that this issue does not occur as described above in
Manual
mode, but testing inAuto
,FBWA
andRTL
has all resulted in corruption occurring. We have also found that theSCHED_LOOP_RATE
parameter has a big effect. WithSCHED_LOOP_RATE=300
(recommend value for quadplanes), the safety switch in and with constant RC inputs, ArduPilot puts out around 1800 CAN frames per second and will fail normally within a minute (although we have had it take up to a few minutes to fail). The same test but withSCHED_LOOP_RATE=50
results in ArduPilot only outputting around 300 CAN frames per second and it does not fail (we ran multiple tests including one over 40 minutes and everything continued to work). In these tests constant RC inputs are important as the esc/actuator message output rate from ArduPilot appears to be proportional to the amount of ‘activity’ that is occurring.All our testing to date has been on multiple Pixhawk 2.1s running ArduPlane 3.8.4. To confirm this isn’t an issue with the CAN network we have tested on a network with other UAVCAN devices all of which continue to work even when ArduPilot starts sending corrupted messages. We have also tested ArduPilot alone with just a Babel connected and the corruption still occurs.
Steps to replicate
SCHED_LOOP_RATE
to300
.FBWA
.Another Issue
This issue is potentially unrelated but, in our testing, we also found that with UAVCAN enabled it was possible to force ArduPlane into io failsafe due to fmu shutdown by pressing the safety switch too soon after booting. We found that waiting 10 seconds after all the starting noises had completed was enough to mitigate the issue.
Version
ArduPlane 3.8.4
Platform
[ ] All [ ] AntennaTracker [ ] Copter [x] Plane [ ] Rover [ ] Submarine
Airframe type
Quadplane
Hardware type
Pixhawk 2.1