eclipse-cyclonedds / cyclonedds

Eclipse Cyclone DDS project
https://projects.eclipse.org/projects/iot.cyclonedds
Other
891 stars 363 forks source link

Malformed packets when streaming large messages over Ethernet #2074

Closed mode89 closed 2 months ago

mode89 commented 3 months ago

I'm trying to send large messages (~3 MB, 10 Hz) over Ethernet between computers A and B.

Initially, both computers were configured with MaxMessageSize of 65000B, A was a publisher and B was a subscriber. Everything worked fine.

Then I tried switching roles of computers: B is a publisher and A is a subscriber. In that setting, subscriber spams a lot of recvUC: malformed packet received from vendor ... messages. Despite of the presence of this warning, the ROS callback is triggered normally, and no ROS messages is lost. The warnings disappear only after I reduce MaxMessageSize down to 2783B.

Using ROS Humble with CycloneDDS 0.10.4. Verbosity set to none.

mode89 commented 3 months ago

Also tried two other computers (let's call them C and D) in the role of subscriber, and neither of them report this warning.

eboasson commented 3 months ago

Then I tried switching roles of computers: B is a publisher and A is a subscriber. In that setting, subscriber spams a lot of recvUC: malformed packet received from vendor ... messages. Despite of the presence of this warning, the ROS callback is triggered normally, and no ROS messages is lost. The warnings disappear only after I reduce MaxMessageSize down to 2783B.

The default behaviour of Cyclone is to use MaxMessageSize only for the initial transmission of the message, and to use MaxRexmitMessageSize for retransmits. The latter is by default small enough to not require IP fragmentation. I suspect that the spam happens on the initial transmission, and the data gets through thanks to very conservative settings it uses for retransmits.

The FragmentSize setting controls the size of the data fragments, it defaults to 1344 bytes. There is some overhead (this is the fragment size of the data, but you also have various (sub)message headers that need to go into the "Message" but I think it would squeeze two fragments into one Message still. It does on master: I end up with datagrams with payloads of 2756 and 2552 bytes.

I would expect those to still require IP fragmentation, so it seems that it can't be that on one machine something is horribly broken only for fragmented IP packets. Still, I suspect it has something to do with the IP stack misbehaving or truncating packets.

It would be interesting if you could capture the malformed ones with Wireshark and could get a Cyclone trace on the subscriber. With a bit of luck that will help understanding the reason this is happening.

mode89 commented 3 months ago

Thanks for explanation :)

Here are the traces.

eboasson commented 3 months ago

Thanks @mode89 ! I have never seen anything quite like it, so I am really glad I am able to have a look at.

Every Ethernet frame from offset 1456 (or so) to the end looks like garbage, e.g.:

No.     Time           Source                SPort  Destination           DPort  Protocol Length Info
   2285 1.819608930    192.168.42.1                 192.168.42.87                IPv4     1514   Fragmented IP protocol (proto=UDP 17, off=0, ID=
7fb2) [Reassembled in #2294]
0570  87 00 99 23 cb 00 68 e4 40 af de a9 07 35 1b b1   ...#..h.@....5..
0580  0a 7b d5 cb 90 ee 98 93 38 9a 13 65 70 a5 36 16   .{......8..ep.6.
0590  f1 09 e1 d0 e0 e9 d4 6f e0 00 7b 41 d7 f9 0c b2   .......o..{A....
05a0  35 41 7a da b7 15 2c e4 e3 ed 52 e4 ee 05 01 08   5Az...,...R.....
05b0  ff ff df df ff ff ff ff ff f7 f7 ff fe ff ff ef   ................
05c0  ff ff 7f ff ff fb ff ff ff ff db ff ff ff ff ff   ................
05d0  ff ff bf ff db ff ff ff fd ff ff df ff ff df f7   ................
05e0  fb 7d ff ff f3 5f ff ff ff ef                     .}..._....
No.     Time           Source                SPort  Destination           DPort  Protocol Length Info
   2286 1.819609034    192.168.42.1                 192.168.42.87                IPv4     1514   Fragmented IP protocol (proto=UDP 17, off=1480, 
ID=7fb2) [Reassembled in #2294]
0570  bf be d6 dd f8 7b c3 ba f2 dd 03 3c b7 ce 0c 38   .....{.....<...8
0580  9a ab 57 bc 0b b8 a6 7e 9d 19 4c e7 e5 f0 07 65   ..W....~..L....e
0590  ce cc bf 22 06 88 f3 7a c8 e7 63 4f 35 ff d9 c7   ..."...z..cO5...
05a0  27 30 ff b7 12 ee 38 32 da 68 41 9a ee 05 01 08   '0....82.hA.....
05b0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
05c0  00 00 00 00 00 00 10 80 00 00 00 00 00 00 00 00   ................
05d0  00 00 04 00 02 08 00 00 00 02 04 00 00 00 00 00   ................
05e0  04 80 00 00 00 00 00 08 00 00                     ..........
No.     Time           Source                SPort  Destination           DPort  Protocol Length Info
   2287 1.819609086    192.168.42.1                 192.168.42.87                IPv4     1514   Fragmented IP protocol (proto=UDP 17, off=2960, ID=7fb2) [Reassembled in #2294]
0590  b9 01 40 bf 35 1a d9 48 92 68 c3 f8 b2 12 6c f4   ..@.5..H.h....l.
05a0  c5 95 61 69 24 41 52 c4 17 ad 34 83 ee 05 01 08   ..ai$AR...4.....
05b0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
05c0  03 00 00 00 00 00 00 00 00 00 00 80 00 00 00 00   ................
05d0  63 0a d3 8e ad 0a 77 37 00 00 00 00 00 00 00 00   c.....w7........
05e0  00 00 00 00 00 00 00 00 00 00                     ..........

(E.g., 2285 is a packet number in the Wireshark). I don't believe point cloud contents would synchronise with IP fragmentation like that.

There are other oddities, too, the first UDP datagram (so multiple IP fragments) of each point cloud ends with a DDS protocol control submessage of which a few unimportant bytes are garbage, the last UDP datagram of each point cloud ends with garbage (there should be a protocol control submessage there, too), and everything UDP datagram in between ends on a valid protocol control submessage.

The "malformed" message comes out because that final control message is garbled in a way that triggers the warning. By the time it detects that, it has already processed everything up to it in the UDP datagram, and so has already delivered the data.

About that data, well, the (apparent) garbage you see above also makes into the reassembled UDP datagrams, then into the deserializer and all the way into your application 🤯

I have no idea how this can happen, but with your observation that it really only happens with sending data from A to B, I would guess there's something wrong at a low-level, and my guess would be broken Ethernet hardware. I am guess hardware because it usually appears to be either: essentially all bits 0 with a handful of 1s sprinkled in; or essentially all bits 1 with a handful of 0s sprinkled in and that doesn't look software-y to me.

(The last fragment I copied above is different, the 63 0a d3 8e ad 0a 77 37 doesn't fit so well. Maybe that would give a clue to someone more knowledgeable about this phenomenon, but not to me.) I don't think I can do much about it (well, if I am right and the data is corrupted, then SROS will detect the message tampering 😎), and I wouldn't trust the data with MaxMessageSize reduced to 2783 bytes either.

mode89 commented 2 months ago

That's a good observation!

The faulty computer is an embedded device prototype, so it sounds plausible, that there could be a hardware problem.

Thank you for help! )