ArduPilot / ardupilot

ArduPlane, ArduCopter, ArduRover, ArduSub source
http://ardupilot.org/
GNU General Public License v3.0
10.73k stars 17.18k forks source link

UAVCAN message corruption on ArduPlane #8166

Closed JamesStewy closed 6 years ago

JamesStewy commented 6 years ago

Issue details

We are experiencing an issue where UAVCAN messages sent from ArduPilot have a very high corruption rate as seen on a Zubax Babel. Exactly why this starts happening is a little unclear however when it starts working incorrectly it does so with a small percentage of messages sent from ArduPilot corrupting and over a period of about 1-10 seconds this rate of corruption increases to nearly every message. Once the message corruption starts we have not found a way to get it to stop apart from a cold reboot. This corruption appears to affect all sent messages from ArduPilot including uavcan.equipment.actuator.ArrayCommand, uavcan.equipment.esc.RawCommand, uavcan.protocol.NodeStatus and uavcan.protocol.GetNodeInfo responses. Corrupted messages appear in the UAVCAN GUI Tool Bus Monitor as messages whose value is not present and replaced with Transfer could not be decoded along with errors such as CRC mismatch: expected xxxx, got xxxx for payload byte, Toggle bit value x incorrect on frame x, or EOT not found.

We have found that this issue does not occur as described above in Manual mode, but testing in Auto, FBWA and RTL has all resulted in corruption occurring. We have also found that the SCHED_LOOP_RATE parameter has a big effect. With SCHED_LOOP_RATE=300 (recommend value for quadplanes), the safety switch in and with constant RC inputs, ArduPilot puts out around 1800 CAN frames per second and will fail normally within a minute (although we have had it take up to a few minutes to fail). The same test but with SCHED_LOOP_RATE=50 results in ArduPilot only outputting around 300 CAN frames per second and it does not fail (we ran multiple tests including one over 40 minutes and everything continued to work). In these tests constant RC inputs are important as the esc/actuator message output rate from ArduPilot appears to be proportional to the amount of ‘activity’ that is occurring.

All our testing to date has been on multiple Pixhawk 2.1s running ArduPlane 3.8.4. To confirm this isn’t an issue with the CAN network we have tested on a network with other UAVCAN devices all of which continue to work even when ArduPilot starts sending corrupted messages. We have also tested ArduPilot alone with just a Babel connected and the corruption still occurs.

Steps to replicate
  1. Enable UAVCAN actuator and/or ESC outputs in the parameters.
  2. Set SCHED_LOOP_RATE to 300.
  3. Set mode to FBWA.
  4. Press in safety switch.
  5. Provide constant RC inputs until UAVCAN messages start corrupting.
Another Issue

This issue is potentially unrelated but, in our testing, we also found that with UAVCAN enabled it was possible to force ArduPlane into io failsafe due to fmu shutdown by pressing the safety switch too soon after booting. We found that waiting 10 seconds after all the starting noises had completed was enough to mitigate the issue.

Version

ArduPlane 3.8.4

Platform

[ ] All [ ] AntennaTracker [ ] Copter [x] Plane [ ] Rover [ ] Submarine

Airframe type

Quadplane

Hardware type

Pixhawk 2.1

auturgy commented 6 years ago

@EShamaev @magicrub

hdtechk commented 6 years ago

If can H and can L wires are not the exact same length the signal from can H and can L will be out of sync and messages will be corrupted. I see this on motorcycles all the time when someone puts tall handlebars on there bikes. Not saying that is your issue. But it's worth mentioning

pavel-kirienko commented 6 years ago

If can H and can L wires are not the exact same length

It has nothing to do with corrupted CAN frames; if that were the case, you wouldn't see anything on the bus monitor in the first place. My bet is that the problem is caused by the TX buffer getting overrun. When that happens, the CAN stack has to triage the outgoing CAN frames, letting higher priority frames out while sacrificing lower priority ones, hence broken transfers. The TX queue is being read by the driver from one end and written by the stack from the other end; apparently the read/write rate difference is such that it takes about a minute for the queue to overrun (judging by the description from @JamesStewy ). I can propose two solutions:

JamesStewy commented 6 years ago

apparently the read/write rate difference is such that it takes about a minute for the queue to overrun

This does line up with our experience that their is a cumulative nature to it.

My bet is that the problem is caused by the TX buffer getting overrun

Is there a particular mechanism which would cause this to become a permanent issue? As soon as the issue occurs, even if you back off the data rate, it does not recover until a reboot.

pavel-kirienko commented 6 years ago

Could be that the driver's inner state gets somehow damaged. I think you should get a debugger and make sure that the driver handles queue overrun properly.

magicrub commented 6 years ago

This should be tested on a ChibiOS build. The CPU usage is much lower, I bet the problem goes away. However, if we are hitting the queue limits then it may not matter. Just too much data and the bus is busy.

Question: how many motors is it considered for?

jschall commented 6 years ago

I have seen issues with the UAVCAN GUI tool where it cannot keep up and starts to behave this way. This is due to a pyuavcan bug where the data type signature is computed on every transfer (https://github.com/UAVCAN/pyuavcan/pull/31)

Even after fixing this bug, at very high bus utilization (4000+ frames per second) I've seen the UAVCAN GUI tool start to display node errors. Restarting the GUI tool seems to fix it temporarily.

Let's confirm that this is actually an ardupilot issue.

JamesStewy commented 6 years ago

@pavel-kirienko

get a debugger and make sure that the driver handles queue overrun properly

We can absolutly do this test but what should we be looking out for? How will we know if the driver (I assume that means libuavcan) handles queue overrun successfully?

@magicrub

how many motors is it considered for?

All of our testing was done with 8 outputs including one fixed wing motor and 4 quad motors.

This should be tested on a ChibiOS build.

What versions of ArduPlane should we use to test with ChibiOS?

I bet the problem goes away

We have already seen that the problem goes away if you reduce the data rate (and presumably the CPU load as well at the same time). However, the lingering threat of the CAN side of ardupilot failing terminally worries us.

@jschall

Restarting the GUI tool seems to fix it temporarily

This doesn't line up with our testing. Restarting the GUI tool did not subside the errors even temporarily. It should also be noted that while the issue is occuring, the GUI tool sometimes crashes due to collecting to many CAN errors.

Let's confirm that this is actually an ardupilot issue

We have also tested using a simple board with an ST microcontroller running libuavcan connected on the bus. The board converts the last received uavcan.equipment.actuator.ArrayCommand message into a PWM signal to actuate a servo. When the issue starts occuring you can see the same thing that is suggested by the GUI tool. Only a very small fraction of messages get through intact. This test has been completed with and without a Babel attached to the network and it makes no difference.

However, despite saying all this, when I next get a chance I will run the GUI tool with (UAVCAN/pyuavcan#31) applied and report back if it makes a difference.

jschall commented 6 years ago

What versions of ArduPlane should we use to test with ChibiOS?

Per @magicrub: should just be master, built for fmuv2

jschall commented 6 years ago

Can we get your parameters? We are trying to reproduce your problem.

JamesStewy commented 6 years ago

These parameters cause the issue for us: quadplane - high SCHED_LOOP_RATE.txt

The same parameters but with SCHED_LOOP_RATE lowered to 50: quadplane - low SCHED_LOOP_RATE.txt

JamesStewy commented 6 years ago

Tested using the GUI tool with (UAVCAN/pyuavcan#31) applied and it didn't make a difference.

Also tested using master ardupilot and it ultimately didn't fix the problem. A combination of providing constant transmitter inputs and shaking the pixhawk in FWBA a bit was enough to cause the failure. Over multiple tests it did appear to take longer (maybe 50% longer) to fail.

JamesStewy commented 6 years ago

Completed some more tests.

ArduCopter 3.5.5 and 3.6.0-rc1 both have the same issue (@rmackay9).

Tried ArduPlane 3.7.1 as well as using PX4 but both did not output 'uavcan.equipment.actuator.ArrayCommand' messages which likely reduced the amount of CAN frames enough for the issue to not occur.

Is there anything else that anyone would like to see tested to try to narrow down on the issue?

magicrub commented 6 years ago

just to confirm if it's a bandwdith vs cpu problem, can you reduce the CAN_D1_UC_ESC_BM and CAN_D1_UC_SRV_BM masks. Right now they're both set to forward any servo data to the UAVCAN bus. It is a bitfield. Can you try reducing the bits down one by and see if it suddenly stops? if so, it's probably a bandwidth issue.

Also, are you using terminators on your bus?

JamesStewy commented 6 years ago

Tested by shaking constantly shaking the Pixhawk in FBWA runnning master ArduPlane.

SCHED_LOOP_RATE Both bitmasks Message output rate CAN frames/sec Test duration Resulted in failure
50 255 50Hz (steady) 400 >10 min No
300 255 270-295Hz (varies) 2400 <1 min Yes
300 1 300Hz (steady) 600 >15 min No
50 65535 50Hz (steady) 800 ~6 min Yes

are you using terminators on your bus?

Yes

magicrub commented 6 years ago

Next test would be to use ChibiOS. It's a new OS we're switching too which is much faster.

http://firmware.us.ardupilot.org/Plane/2018-04/2018-04-27-21:04/fmuv3/

Use the apj file.

Make sure to back up your params before uploading this firmware. I'm not sure anyone has done a vtol flight with it so I'd suggest just a bench test for now.

JamesStewy commented 6 years ago

I loaded the apj file using Mission Planner and set all the CAN_* parameters as normal but when I safety in there are no ESC or Actuator CAN messages.

magicrub commented 6 years ago

And you've loaded all your old params and rebooted?

JamesStewy commented 6 years ago

My steps:

  1. Use Mission Planner to flash the latest stable ArduCopter (to reset the params)
  2. Wait for tones then hard reboot
  3. Flash apj file using Mission Planner
  4. Wait for tones then hard reboot (one thing is that only the boot tone sounds not the completed IO firmware upload tone)
  5. Set ARMING_CHECK = 0, SCHED_LOOP_RATE = 300, CAN_P1_DRIVER = 1 and CAN_P2_DRIVER = 1
  6. Hard reboot
  7. Check all the CAN parameters have appeared
  8. Press safety switch

On the GUI Tool org.ardupilot appears on the node list but it does not output the ESC or Actuator messages on safety in. Another note is that when I boot the apj build the boot tone sounds slightly wrong and it doesn't make the ready to fly tone despite arming checks being disabled.

Is there a difference between the apj file and building and flashing ardupilot master using waf?

magicrub commented 6 years ago

Depends how you're building. We are moving away from using NuttX OS, which is the "px4-v2" (or v3) target and changing it to "fmuv2" (or v3) which is the ChibiOS target. The ChibiOS build will have a different file extension but otherwise should run the same way.

JamesStewy commented 6 years ago

Ok. That makes sense. Up until now we have been building master against NuttX with px4-v3.

I rebuilt using fmuv3 and flashed using waf and had the same issue as described in my previous comment. I connected up a debugger and found that the ChibiOS version doesn't appear to send the channel values though to the AP_UAVCAN stuff. So I implemented that based off what is in AP_HAL_PX4: https://github.com/MonashUAS/ardupilot/tree/uavcan_esc_chibios.

Running that new version with SCHED_LOOP_RATE=300 is definitely the best I have seen. For a brief period of time (~2-4 minutes) after pressing the safety switch the ESC and Actuator messages come at a steady 300Hz but it ultimately still failed with the same issue as all other versions I have tried.

tridge commented 6 years ago

@JamesStewy thanks for your patch! It did make me think of a problem though. Just as HAL_PX4, this unconditionally sends all the UAVCAN actuator messages. I assume that means it uses the bus bandwidth for them even if you don't have any CAN ESCs. For example, our OctaQuad quadplane has 13 actuators, 8 of which are ESCs. The ESCs are not on CAN, but we have UAVCAN GPS and compass. Would this change mean that the bus could get clogged up badly with UAVCAN ESC messages and the GPS and mag could end up failing. Unless there is something else that prevents this in the uavcan layer, I think we need a "we have CAN ESCs and want to output actuators at X Hz" parameter, or auto-detect based on what we find on the bus.

JamesStewy commented 6 years ago

My patch, like HAL_PX4, unconditionally sends all the channel values to AP_UAVCAN however the CAN_D1_UC_ESC_BM and CAN_D1_UC_SRV_BM bitmask parameters filter these channels. So it is currently possible to configure AP to only output a subset of ESC and Actuator messages to the bus.

I do agree that a parameter to set the CAN actuator and ESC output rate would be useful.

magicrub commented 6 years ago

Agreed, an output rate would be nice. Doesn't really solve the problem though because, like in Tridge's example but with CAN ESCs, the vtol motors would probably want to update faster than the fwd motors so they would all end up running at 300 anyway.

If the problem only happens after a few minutes, and nuttxx Gail's sooner than ChibiOS, it smells like a memory leak. Tridge, can you check the mem/stack usage with UAVCAN sending to an esc?

tridge commented 6 years ago

@JamesStewy we should really set these automatically based on what nodes are detected on the bus.

tridge commented 6 years ago

@JamesStewy any chance you could test https://github.com/ArduPilot/ardupilot/pull/7882 ? That removes the need for your PR for HAL_ChibiOS/RCOutput

JamesStewy commented 6 years ago

@tridge I tried #7882 built for both ChibiOS and NuttX and neither resolved this issue.

However #7882 does look like a good idea and the conversation in this thread might suggests that this code should go in its own loop with a configurable rate.

tridge commented 6 years ago

@JamesStewy if you set either CAN_D1_UC_ESC_BM or CAN_D1_UC_SRV_BM to zero then does that prevent the issue? I'm thinking we should change the default to zero for both of those, and ask users of UAVCAN servos or ESCs to set them correctly.

JamesStewy commented 6 years ago

@tridge @magicrub @pavel-kirienko @EShamaev @jschall

We would really like to see this issue resolved as we at MonashUAS truly believe that CAN ESCs and actuators provide an enormous amount of flexibility.

However, currently it is our view that using CAN ESCs and Actuators with any build of ArduPilot is dangerous and we are not willing to fly with it until we can be relatively confident that the issues outlined in this thread are resolved. Current documentation suggests that

Copter, Plane and Rover support UAVCAN Electronic Speed Controllers (ESCs)

and as far as we can tell, anybody who takes this suggestion is unknowingly taking a significant risk and could experience a terminal failure, with very little warning, which results in a complete loss of control.

We have resources available which can be put towards solving this issue. However, @JMare and I are not as familiar with the ArduPilot and UAVCAN codebases as you are nor their design decisions and goals. We would really appreciate some guidance on how to test for and identify the root cause of this issue.

@pavel-kirienko suggested at the top of this thread that the UAVCAN driver could be overrunning and/or having its internal state damaged under high load. Is this code in ArduPilot or libuavcan?

@pavel-kirienko is there a way to test libuavcan in isolation to see if it is the cause of the issue?

@tridge @magicrub, as I have said before I think a fixed, configurable UAVCAN output loop is a good idea long term and it is something I am willing to implement. Should the loop be done using AP_SCHEDULER, a separate thread or something else? Should this be done on top of #7882? As, @tridge pointed out this isn’t a solution to this issue though as if it is set too high (say 300Hz) then the issue will still occur.

@tridge I saw your suggestion in the ArduPilot gitter that the ESC and Actuator bitmasks should default to 0 and I agree that is a good solution for the short term.

Is there anything stopping #7882 getting merged that we can help with?

@jschall where you able to replicate the issue using the parameters I posted?

Thanks, James

pavel-kirienko commented 6 years ago

is there a way to test libuavcan in isolation to see if it is the cause of the issue?

This is highly unlikely because it is tested against its own unit testing suite, but you're welcome to review and amend it: https://github.com/UAVCAN/libuavcan/tree/master/libuavcan/test

It is more likely that the issue might be related to the libuavcan driver implemented in ArduPilot; I suspect it is not systematically tested at all (as in, there are no unit tests).

magicrub commented 6 years ago

@JamesStewy sorry to beat a dead hoarse here, but we're having trouble reproducing this. Can you please provide the following:

magicrub commented 6 years ago

Also, have you tried reducing your CAN esc/servo bitmask values? _SRV_BM and _ESC_BM How about when setting _SRV_BM=0?

JamesStewy commented 6 years ago

git hash of your version. If you compiled it yourself, please provide the make/waf command you used

I have tested many versions including versions downloaded through Mission Planner. Most of out original testing was done using ArduPlane 3.8.5.

When I have been building from source I have done the following from the root directory of ArduPilot:

./waf configure --board px4-v3           (or fmuv3 for ChibiOS)
./waf plane
./waf --targets bin/arduplane --upload

full param list

For parameters there is a parameter file attached in a previous comment of mine that contains extra configuration for quadplanes. None of that is necessary for me though to replicate the issue. All I have had to change from the default parameters is ARMING_CHECK = 0 (for bench testing), SCHED_LOOP_RATE = 300 (increase the CAN data rate), CAN_P1_DRIVER = 1 and CAN_P2_DRIVER = 1. So _SRV_BM and _ESC_BM are the default value of 255.

exact hardware&software you're using to test

Pixhawk 2.1 running some version of ArduPilot. A PC running Mission Planner and the UAVCAN GUI Tool.

list of all hardware attached including the autopilot hardware

Buzzer. Safety Switch. On the CAN bus, only a Zubax Babel which is in turn connected to the PC. Pixhawk powered and connected over USB to the PC (we have replicated this issue using a battery).

have you tried reducing your CAN esc/servo bitmask values?

This test was done using using master ArduPlane (at the time) built for px4-v3. I can't find the exact git commit I used unfortunately.

How about when setting _SRV_BM=0?

I can try that for you later today.

Replication steps:

  1. Use Mission Planner to flash the latest stable ArduCopter (to reset the params)
  2. Wait for tones then hard reboot
  3. Use Mission Planner to flash the latest stable ArduPlane
  4. Wait for tones then hard reboot
  5. Set ARMING_CHECK = 0, SCHED_LOOP_RATE = 300, CAN_P1_DRIVER = 1 and CAN_P2_DRIVER = 1
  6. Hard reboot
  7. Check all the CAN parameters have appeared
  8. Switch to FBWA
  9. Press safety switch
  10. Start the UAVCAN GUI tool and start a capture on the bus monitor and a subscriber for uavcan.equipment.actuator.ArrayCommand (you should see the messages/sec is around 20)
  11. Rotate the Pixhawk from side to side (what you should see is the messages/sec should increase dramatically towards the value of SCHED_LOOP_RATE)
  12. Keep going until the messages/sec starts dropping or messages in the bus monitor start decoding incorrectly (there is a list of the GUI tool error messages you can see in the original post). Alternatively you can just keep going until the GUI tool crashes from having to many transport errors (though you then loose the ability to see what happened).

EDIT: Full parameter list at step 7

jschall commented 6 years ago

@magicrub this sounds like exactly what we did. Do you see any differences?

JamesStewy commented 6 years ago

How about when setting _SRV_BM=0?

Tested using the methodology I described above but with the actuator mask set to 0.

SCHED_LOOP_RATE ESC bitmasks ESC Message output rate CAN frames/sec Test duration Resulted in failure
300 255 300Hz (steady) 900 >15 min No
300 65535 300Hz (steady) 1500 ~10 min Yes

In the second test when it fails the valid message rate drops to about 100-150 messages per second.

JamesStewy commented 6 years ago

Here is a video of us replicating the issue. Test was done with ArduPlane 3.8.5 and the video starts at step 7 with the parameters attached to this comment.

After the test we scrolled through the output in the bus monitor to the point at which the failure occurs. In this video you can see that there is a very distinct line where the message corruption really starts to occur.

OXINARF commented 6 years ago

Good news: I've been able to reproduce this and I've been able to narrow the issue down.

It's actually quite easy to see the errors, especially when looking at the UAVCAN GUI log file. To catch an error there isn't even a need to rotate the Cube that much, just picking it up is enough. With Copter I got frequent errors without even moving it at all.

Working around it for now is easy, use just one port. After setting CAN_P2_DRIVER to 0 I wasn't able to reproduce this in any way. @JamesStewy can you please test this and confirm you don't get any issues this way?

I'm a bit suspicious of this being an ArduPilot issue because the errors are 99.9% of the time about the toggle flag being wrong in the multiple frame transfers. If this was corruption caused by the ArduPilot CAN driver I'd expect to see CRC errors and I didn't catch a single one. Still, it's quite possible the bug is in ArduPilot as the code around interfaces is quite... complex. I also took a look at the libuavcan code and the multiple frame transfer code seems correct to me. Libuavcan is quite complex itself so it's also possible I missed something.

I will try to investigate this more this week.

OXINARF commented 6 years ago

I take back the part about CRC. I was searching the log files now and I just realized that last night I actually got CRC errors in the one time the UAVCAN GUI interface got completely frozen - that only happens by continuing to rotate the Pixhawk (if you put it down, it will eventually get back to no errors) and I couldn't reproduce it with Copter (that has a higher FPS than Plane).

There are CRC errors even in single frame messages so I'm much more suspicious of ArduPilot's code now.

pavel-kirienko commented 6 years ago

There are CRC errors even in single frame messages

Single frame messages are not CRC-protected by UAVCAN (they are protected by the CAN bus itself, but you won't see corrupted frames in the GUI tool, they are rejected by the hardware at the lowest level).

On Tue, May 15, 2018 at 1:53 PM, Francisco Ferreira < notifications@github.com> wrote:

I take back the part about CRC. I was searching the log files now and I just realized that last night I actually got CRC errors in the one time the UAVCAN GUI interface got completely frozen - that only happens by continuing to rotate the Pixhawk (if you put it down, it will eventually get back to no errors) and I couldn't reproduce it with Copter (that has a higher FPS than Plane).

There are CRC errors even in single frame messages so I'm much more suspicious of ArduPilot's code now.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ArduPilot/ardupilot/issues/8166#issuecomment-389125665, or mute the thread https://github.com/notifications/unsubscribe-auth/ADJUZEWEdcwMdVEdQ8OdcRPHmjr7-kRGks5tyrOggaJpZM4TU-88 .

JamesStewy commented 6 years ago

@OXINARF I tried setting CAN_P2_DRIVER to 0 with CAN_P1_DRIVER still set to 1 on ArduPlane 3.8.5 and it seemed to disable CAN entirely. When I plugged in the babel in to he port labeled CAN 1 on the pixhawk it showed that no node status messages, no esc messages. However, when I switched the babel over to the port labeled CAN 2 it started working again.

But even stranger, the corruption issue still occurs with the babel plugged into CAN 2 however it behaves differently. It only fails after about 30 sec of shaking but when it does, it sitll outputs at around 50 successful messages a second (rather than the normal <1). Furthermore, if you leave the pixhawk to sit for a bit (it anecdotally seems to be relative to how long you shook it initially) then normal functionality returns (which never happens on the port labeled CAN 1).

So I tried the same test I normally do, LOOP_RATE 300, P1_DRIVER 1, P2_DRIVER 1 and with the babel plugged into CAN 1. After a bit of shaking it fails like normal and is completely unrecoverable. But, without rebooting the pixhawk, if I move the babel to CAN 2 I again see the behaviour I described just above.

As I final test I tried switching around the P1_DRIVER and P2_DRIVER values and then CAN seemed to be entirely disabled again no matter which port I plugged the babel into.

pavel-kirienko commented 6 years ago

Beware that CAN1 and CAN2 are mislabeled on some revisions of the hardware (at least Pixhawk 2.1 has this issue).

Generally, STM32's bxCAN macrocell does not permit usage of CAN2 without CAN1 (unless specific workarounds are in place; I doubt they are).

OXINARF commented 6 years ago

Single frame messages are not CRC-protected by UAVCAN (they are protected by the CAN bus itself, but you won't see corrupted frames in the GUI tool, they are rejected by the hardware at the lowest level).

@pavel-kirienko Damn, then my search was bugged in some place. Anyway, I'm more convinced this is an ArduPilot issue.

@JamesStewy As @pavel-kirienko says, CAN1 and CAN2 are mislabeled in the Cube, use CAN2 port (I only used that by the way). Interesting that you are still seeing issues with P2_DRIVER set to 0, I couldn't reproduce it then but I will try again.

OXINARF commented 6 years ago

@JamesStewy I think I have good news.

One note first: yesterday, to make sure it wasn't a UAVCAN GUI issue, I used python-can and some methods in pyuavcan to check that I could also catch the errors there. Apparently python-can has bugs in its SLCAN driver so, because it was in Linux, I instead used SocketCAN on top of the SLCAN interface. It's a simple script, parsing the frames and transfers, but it worked for the purpose, it caught the errors.

I've re-tested again with P2_DRIVER set to 0:

I spent a bit of time trying to understand why wasn't the script working, thinking it certainly was my bug. After a while, I inverted my approach: why was UAVCAN GUI tool failing? What I soon realized was that, when using SLCAN (in Windows and Linux), I only had UAVCAN parsing errors after a driver error: SLCAN driver receive queue is full! After that UAVCAN errors start showing, which is expected, messages are being dropped due to a full queue.

So, in Linux, I went into pyuavcan SLCAN driver and increased the receive queue size. After 3 minutes of (almost steady) 2400 FPS I didn't have a single error.

@pavel-kirienko I understand there is an issue with the queue size in some OSes, but apparently not on Linux so I would recommend increasing it there. Also, either discarding all messages when queue is full or giving a more prominent warning would be nice.

One thing that I'm still trying to think about is that I don't get errors in my script using SocketCAN but I do get them very fast when using UAVCAN GUI. From my look at the pyuavcan SocketCAN driver it looks like it doesn't have a receive queue and depends purely on receive being called. I thought this might be the reason for the toggle errors (messages being lost), but I don't see how that explains getting CRC errors?

pavel-kirienko commented 6 years ago

@OXINARF are you referring to this constant? https://github.com/UAVCAN/pyuavcan/blob/0fd4122fe6e128aabd00d5ca88aba5cc82518c9c/uavcan/driver/slcan.py#L50

From my look at the pyuavcan SocketCAN driver it looks like it doesn't have a receive queue and depends purely on receive being called.

Yes, it relies on the RX queue maintained by the kernel. You can control its size manually if needed; for example:

ifconfig slcan0 txqueuelen 100
OXINARF commented 6 years ago

@pavel-kirienko Yes, that one. I've reverted and re-applied and behaviour was the same. Unless there is something else at play that I'm missing I'd say that missing messages was the current issue.

Yes, it relies on the RX queue maintained by the kernel. You can control its size manually if needed; for example: ifconfig slcan0 txqueuelen 100

Doesn't that increase just the transmit queue? That's what I found with my Googling. I tried to set socket option SO_RCVBUF (as well as SO_RCVBUFFORCE) but it didn't seem to make any difference.

Just to be cristal clear: using both ports (P1 and P2) isn't fixed and there is a bug (very likely in ArduPilot). Only using P1 seems to not have a problem.

pavel-kirienko commented 6 years ago

Doesn't that increase just the transmit queue?

That particular example does. There's a similar command for RX queue.

I've reverted and re-applied and behaviour was the same. Unless there is something else at play that I'm missing I'd say that missing messages was the current issue.

May I ask you to submit a pull request to PyUAVCAN increasing the maximum queue size to a much larger value (say, 1 million) for non-OSX systems?

OXINARF commented 6 years ago

That particular example does. There's a similar command for RX queue.

I didn't try with ifconfig, but tried with ip and it didn't have a rxqueuelen. Everything I found on the internet said that didn't exist and you had to use SO_RCVBUF.

May I ask you to submit a pull request to PyUAVCAN increasing the maximum queue size to a much larger value (say, 1 million) for non-OSX systems?

Sure, probably not today, I'm tired of looking at CAN 😄

OXINARF commented 6 years ago

@pavel-kirienko As requested: https://github.com/UAVCAN/pyuavcan/pull/33 (damn, that was a fast merge 😃)

I have more info in the SocketCAN case: I had an error in my code so it wasn't actually increasing the buffer. Unfortunately, even after increasing it by more than 38x I could still get errors after a while - it took much longer for that to happen though so it seems to be related to full buffer as well (when I comment the buffer increase code it just takes a second to get errors) - I should also mention that increasing the buffer size requires elevated privileges so it's not a good solution. I'll recommend that you add a reading thread in there too.

pavel-kirienko commented 6 years ago

I'll recommend that you add a reading thread in there too.

I encourage all to contribute that. Fast merge guaranteed. ;)

tridge commented 6 years ago

@JamesStewy would you be able to test the latest changes I made on PR #7882 The key change is to use setTxTimeout(0) on the periodic messages (ESC, servo and LEDs). That prevents the uavcan layer from keeping the messages queued. What seemed to be happening is we were pushing out messages faster than it could handle, and asking for the uavcan layer to keep them for 20ms. For repeated messages that really doesn't make sense. It led to a massive memory leak too. @pavel-kirienko are there any problems with using a timeout of 0?