ArduPilot / ardupilot

ArduPlane, ArduCopter, ArduRover, ArduSub source
http://ardupilot.org/
GNU General Public License v3.0
10.25k stars 16.87k forks source link

UAVCAN message deorder #12139

Open kh4 opened 4 years ago

kh4 commented 4 years ago

Bug report

Issue details

UAVCAN message fragments are occasionally sent in wrong order causing them to be discarded by receiving party.

The behavior is seemingly a race between software and hardware and happens when the operation flow is something like following

1) transmit mailbox 0 filled with first fragment 2) transmit mailbox 1 filled with second fragment 3) transmit mailbox 2 filled with third fragment

As end result the fragments are sent out as 1,2,4,3.

It is a bit tricky to reproduce this but it happens regularly when there is a bit of misconfiguration (P2_DRIVER set to 2) Setup: CAN_D1_PROTOCOL 1 CAN_D1_UC_NODE 10 CAN_D1_UC_SRV_VM 31 ## five servos need to be sent to make 4 fragment message CAN_D1_US_SRV_RT 10 CAN_D2_PROTOCOL 0 CAN_P1_DRIVER 1 CAN_P2_DRIVER 2

I believe the problem could be corrected by enabling TXFP bit in CAN_MCR register to make the tx mailboxes act as FIFO.

Version CubeBlack 00320024 3036510C 31353833 ChibiOS: d2030d88 ArduCopter V3.6.10 (1c04a91e)

Platform [ ] All [ ] AntennaTracker [ X] Copter [ ] Plane [ ] Rover [ ] Submarine

Airframe type Hex

Hardware type Cube Black

Logs

The following is a dump of CAN packets using some test code, the xx are results from canardHandleRxFrame()

Packet capture msec EID data.... 84090: 8103f20a 42 d4 01 00 00 bc 02 8f 84091: 8103f20a 00 00 bc 03 00 00 bc 2f 84092: 8103f20a 04 00 00 bc 05 00 00 0f 84092: 8103f20a bc 6f 84190: 8103f20a 42 d4 01 00 00 bc 02 90 84191: 8103f20a 00 00 bc 03 00 00 bc 30 84192: 8103f20a bc 70 -14 84193: 8103f20a 04 00 00 bc 05 00 00 10 84208: 9001550a cf 0e 00 00 00 00 00 cf NS10 84292: 8103f20a 42 d4 01 00 00 bc 02 91 -14 84293: 8103f20a 00 00 bc 03 00 00 bc 31 -15 84294: 8103f20a 04 00 00 bc 05 00 00 11 -14 84295: 8103f20a bc 71 -15 84392: 8103f20a 42 d4 01 00 00 bc 02 92 84393: 8103f20a 00 00 bc 03 00 00 bc 32 84394: 8103f20a 04 00 00 bc 05 00 00 12 84395: 8103f20a bc 72 84494: 8103f20a 42 d4 01 00 00 bc 02 93 84495: 8103f20a 00 00 bc 03 00 00 bc 33 84496: 8103f20a 04 00 00 bc 05 00 00 13 84497: 8103f20a bc 73 84595: 8103f20a 42 d4 01 00 00 bc 02 94 84595: 8103f20a 00 00 bc 03 00 00 bc 34 84596: 8103f20a 04 00 00 bc 05 00 00 14 84597: 8103f20a bc 74 84695: 8103f20a 42 d4 01 00 00 bc 02 95 84696: 8103f20a 00 00 bc 03 00 00 bc 35 84697: 8103f20a bc 75 -14 84698: 8103f20a 04 00 00 bc 05 00 00 15 84795: 8103f20a 42 d4 01 00 00 bc 02 96 -14 84796: 8103f20a 00 00 bc 03 00 00 bc 36 -15 84797: 8103f20a bc 76 -15 84797: 8103f20a 04 00 00 bc 05 00 00 16 -14 84896: 8103f20a 42 d4 01 00 00 bc 02 97 84897: 8103f20a 00 00 bc 03 00 00 bc 37 84898: 8103f20a 04 00 00 bc 05 00 00 17 84899: 8103f20a bc 77

Pulseview capture on CANBUS ardu_uavcan_deorder.zip

OXINARF commented 4 years ago

Can you test this on master to make sure the problem is also reproducible there? Can you also test this with NuttX on Copter 3.6.x?

I'm pretty sure this is an issue with ChibiOS CAN driver on 3.6 branch and it is fixed on master, but I'd like confirmation this isn't something else.

It is a bit tricky to reproduce this but it happens regularly when there is a bit of misconfiguration (P2_DRIVER set to 2)

Just to clarify that's not a misconfiguration, it's a perfectly acceptable configuration.

I believe the problem could be corrected by enabling TXFP bit in CAN_MCR register to make the tx mailboxes act as FIFO.

We definitely do not want that, CAN messages need to sent based on priority (given by their ID).

kh4 commented 4 years ago

Compiled and loaded up latest from master Frame: UNKNOWN fmuv3 00320024 3036510C 31353833 ChibiOS: 0997003f ArduCopter V3.7.0-dev (4954c190)

The behavior is quite different, I do not clearly see wrong order, however some fragments are simple not sent at all (this happens quite a lot, maybe 5% of all frames). This now with pretty much as sane CAN config as possible CAN_D1_PROTOCOL | 1 CAN_D1_UC_ESC_BM | 0 CAN_D1_UC_NODE | 10 CAN_D1_UC_SRV_BM | 31 CAN_D1_UC_SRV_RT | 25 CAN_D2_PROTOCOL | 0 CAN_P1_BITRATE | 1000000 CAN_P1_DRIVER | 1 CAN_P2_BITRATE | 1000000 CAN_P2_DRIVER | 1

As enabling TXFP is not possible some other way of enforcing order on UAVCAN message fragments is needed, the hardware is not able to help in this case.

From STM32 reference manual

Transmit priority

By identifier

When more than one transmit mailbox is pending, the transmission order is given by the identifier of the message stored in the mailbox. The message with the lowest identifier value has the highest priority according to the arbitration of the CAN protocol. If the identifier values are equal, the lower mailbox number will be scheduled first.

By transmit request order

The transmit mailboxes can be configured as a transmit FIFO by setting the TXFP bit in the CAN_MCR register. In this mode the priority order is given by the transmit request order. This mode is very useful for segmented transmission.

OXINARF commented 4 years ago

Ok, that's totally different. Do you have CAN peripherals connected on both ports?

As enabling TXFP is not possible some other way of enforcing order on UAVCAN message fragments is needed, the hardware is not able to help in this case.

We have, messages with same ID (fragments) aren't allowed at the same time.

kh4 commented 4 years ago

Ok, that's totally different. Do you have CAN peripherals connected on both ports?

In this case no since I am developing a single port peripherial, I can have a second one on the other port.

We have, messages with same ID (fragments) aren't allowed at the same time.

Yes to solve the 3 mailbox race the minimal fix would be to defer loading message with eid equal to a pending message in mbox3.

OXINARF commented 4 years ago

In this case no since I am developing a single port peripherial, I can have a second one on the other port.

OK, can you test setting CAN_P2_DRIVER to 0 and see if you still have loss of messages?

Yes to solve the 3 mailbox race the minimal fix would be to defer loading message with eid equal to a pending message in mbox3.

Yes, as I said, we already have that - but the CAN driver in Copter 3.6 ChibiOS likely has bugs.

kh4 commented 4 years ago

Same behavior with P2 disabled, short capture from the bus below. I have also confirmed with logic analyser that the packets are not at all on the bus (to isolate rx errors)

tsmsec: EID result | data... 372364: 8103f20a 0 | 42 d4 01 00 00 bc 02 84 372364: 8103f20a 0 | 00 00 bc 03 00 00 bc 24 372364: 8103f20a 0 | 04 00 00 bc 05 00 00 04 372365: 8103f20a 0 | bc 64 372405: 8103f20a 0 | 42 d4 01 00 00 bc 02 85 372407: 8103f20a 0 | 00 00 bc 03 00 00 bc 25 372407: 8103f20a 0 | 04 00 00 bc 05 00 00 05 fragment 4 lost 372447: 8103f20a -14 | 42 d4 01 00 00 bc 02 86 372447: 8103f20a -15 | 00 00 bc 03 00 00 bc 26 372447: 8103f20a -14 | 04 00 00 bc 05 00 00 06 372447: 8103f20a -15 | bc 66 372487: 8103f20a 0 | 42 d4 01 00 00 bc 02 87 372487: 8103f20a 0 | 00 00 bc 03 00 00 bc 27 372487: 8103f20a 0 | 04 00 00 bc 05 00 00 07 372487: 8103f20a 0 | bc 67 372529: 8103f20a 0 | 42 d4 01 00 00 bc 02 88 372529: 8103f20a 0 | 00 00 bc 03 00 00 bc 28 372529: 8103f20a 0 | 04 00 00 bc 05 00 00 08 372530: 8103f20a 0 | bc 68 372569: 8103f20a 0 | 42 d4 01 00 00 bc 02 89 372569: 8103f20a 0 | 00 00 bc 03 00 00 bc 29 372570: 8103f20a 0 | 04 00 00 bc 05 00 00 09 372570: 8103f20a 0 | bc 69 372610: 8103f20a 0 | 42 d4 01 00 00 bc 02 8a 372610: 8103f20a 0 | 00 00 bc 03 00 00 bc 2a fragments 3 and 4 lost 372652: 8103f20a -15 | 42 d4 01 00 00 bc 02 8b 372652: 8103f20a -14 | 00 00 bc 03 00 00 bc 2b 372654: 8103f20a -15 | 04 00 00 bc 05 00 00 0b fragment 4 lost 372694: 8103f20a 0 | 42 d4 01 00 00 bc 02 8c 372694: 8103f20a 0 | 00 00 bc 03 00 00 bc 2c 372694: 8103f20a 0 | 04 00 00 bc 05 00 00 0c 372695: 8103f20a 0 | bc 6c 372734: 8103f20a 0 | 42 d4 01 00 00 bc 02 8d 372735: 8103f20a 0 | 00 00 bc 03 00 00 bc 2d 372735: 8103f20a 0 | 04 00 00 bc 05 00 00 0d 372735: 8103f20a 0 | bc 6d 372777: 8103f20a 0 | 42 d4 01 00 00 bc 02 8e 372777: 8103f20a 0 | 00 00 bc 03 00 00 bc 2e 372777: 8103f20a 0 | 04 00 00 bc 05 00 00 0e 372777: 8103f20a 0 | bc 6e 372819: 8103f20a 0 | 42 d4 01 00 00 bc 02 8f 372819: 8103f20a 0 | 00 00 bc 03 00 00 bc 2f 372819: 8103f20a 0 | 04 00 00 bc 05 00 00 0f 372820: 8103f20a 0 | bc 6f 372860: 8103f20a 0 | 42 d4 01 00 00 bc 02 90 fragment 2,3,4 lost 372902: 8103f20a -14 | 42 d4 01 00 00 bc 02 91 372902: 8103f20a -15 | 00 00 bc 03 00 00 bc 31 372902: 8103f20a -14 | 04 00 00 bc 05 00 00 11 372902: 8103f20a -15 | bc 71 372942: 8103f20a 0 | 42 d4 01 00 00 bc 02 92 372942: 8103f20a 0 | 00 00 bc 03 00 00 bc 32 372942: 8103f20a 0 | 04 00 00 bc 05 00 00 12 fragment 4 lost 372982: 8103f20a -14 | 42 d4 01 00 00 bc 02 93 372982: 8103f20a -15 | 00 00 bc 03 00 00 bc 33 372982: 8103f20a -14 | 04 00 00 bc 05 00 00 13 372984: 8103f20a -15 | bc 73 373024: 8103f20a 0 | 42 d4 01 00 00 bc 02 94 373024: 8103f20a 0 | 00 00 bc 03 00 00 bc 34 373024: 8103f20a 0 | 04 00 00 bc 05 00 00 14 373025: 8103f20a 0 | bc 74

kh4 commented 4 years ago

FYI: I am using this on top of 3.6.10 as a workaround.

0001-Workaround-for-UAVCAN-message-deorder.patch.txt