eclipse-cyclonedds / cyclonedds

Eclipse Cyclone DDS project
https://projects.eclipse.org/projects/iot.cyclonedds

Huge RTPS packets with all 0's as data #1628

Open omertal88 opened 1 year ago

omertal88 commented 1 year ago

Hello there,

In the last few days I've been struggling with a weird issue. Every now and then I get repeating, huge packets of user traffic. In Wireshark I can see the many fragments sent, and the one packet that reassembles them all. I can't figure out the exact root cause, but I think there's some correlation with moments when the wireless connection is poor. In any case, this creates so much traffic that it basically takes all my bandwidth.

[screenshot: Wireshark capture showing the flood of fragments]

From the source port I can infer the node that is sending those packets, and from the destination port I can infer the receiving node. However, I can't for the life of me understand why the message contains nothing but zeros, and how it got to be so long. There's just no way that node sent it. Here's one of those packets:

[screenshot: one of the reassembled packets, all zeros]

You can take my word that every fragment here is all zeros. Obviously this message contains no information.

If it is a known issue, please let me know. If you need me to send that Wireshark capture file, I will. I haven't been able to reproduce it while running the node at the finest verbosity, but when it reproduces again I will have that too.

This occurs on ROS 2 Galactic, BTW.

BTW2, I previously reported a similar issue, but this feels like a different case: I don't delete any subscriptions, and the messages in that bug weren't all zeros.

Thank you very much. Omer.

eboasson commented 1 year ago

Hi @omertal88, I find this a most curious story, and I have no plausible idea for a cause yet. In any case,

[screenshot: hex dump of the DATA_FRAG submessage]

says:

```
16           DATA_FRAG
01           flags = 1 = little-endian
04 ff        octets-to-next-header = 0xff04 = 65284
00 00        extra flags = 0 (as always)
1c 00        octets-to-inline-QoS = 0x1c = 28
00 00 00 00  readerId = 0 = for all readers
00 00 4b 03  writerId = 0x4b03 = some application (or ROS 2 built-in) writer, no key
00 00 00 00  seq# high \
12 d7 00 00  seq# low   => 0xd712 = 55058
e2 00 00 00  starting fragment number = 0xe2 = 226
2d 00        fragments in submessage = 0x2d = 45
aa 05        fragment size = 0x5aa = 1450
18 02 06 00  sample size = 0x60218 = 393752
```
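
For anyone who wants to repeat this decoding on other captured packets, here is a small standalone sketch. It is not Cyclone's actual decoder, just hand-rolled field extraction following the RTPS 2.x DATA_FRAG layout (entity ids are 4-octet arrays, the other multi-byte fields follow the little-endian flag):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* little-endian and "as written" (big-endian) field extraction */
static uint16_t le16 (const uint8_t *p) { return (uint16_t) (p[0] | (p[1] << 8)); }
static uint32_t le32 (const uint8_t *p) { return p[0] | (p[1] << 8) | (p[2] << 16) | ((uint32_t) p[3] << 24); }
static uint32_t be32 (const uint8_t *p) { return ((uint32_t) p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3]; }

int main (void)
{
  /* the DATA_FRAG submessage bytes from the capture above */
  const uint8_t b[] = {
    0x16, 0x01, 0x04, 0xff,  /* submessage id, flags, octets-to-next-header */
    0x00, 0x00, 0x1c, 0x00,  /* extra flags, octets-to-inline-QoS */
    0x00, 0x00, 0x00, 0x00,  /* readerId */
    0x00, 0x00, 0x4b, 0x03,  /* writerId */
    0x00, 0x00, 0x00, 0x00,  /* seq# high */
    0x12, 0xd7, 0x00, 0x00,  /* seq# low */
    0xe2, 0x00, 0x00, 0x00,  /* starting fragment number */
    0x2d, 0x00, 0xaa, 0x05,  /* fragments in submessage, fragment size */
    0x18, 0x02, 0x06, 0x00   /* sample size */
  };
  printf ("octets-to-next-header = %" PRIu16 "\n", le16 (b + 2));      /* 65284 */
  printf ("writerId              = 0x%08" PRIx32 "\n", be32 (b + 12)); /* 0x00004b03 */
  printf ("seq# (low 32 bits)    = %" PRIu32 "\n", le32 (b + 20));     /* 55058 */
  printf ("starting fragment     = %" PRIu32 "\n", le32 (b + 24));     /* 226 */
  printf ("fragments in submsg   = %" PRIu16 "\n", le16 (b + 28));     /* 45 */
  printf ("fragment size         = %" PRIu16 "\n", le16 (b + 30));     /* 1450 */
  printf ("sample size           = %" PRIu32 "\n", le32 (b + 32));     /* 393752 */
  return 0;
}
```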

which all looks plausible for sending a new (not a retransmit) sample. I think that excludes the possibility of some issue with the retransmitting code.

If you can find out what topic is associated with this writer 0x4b03, that might help. I imagine a "finest" trace could be tricky to get, but you can make a "fine" trace instead: that will give you all the discovery activity but not the regular application data. I imagine that would not disturb the system much, and you could just leave it on while doing other things, waiting for it to reproduce.
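
In case it helps, a "fine" trace can be requested through the usual XML configuration pointed to by CYCLONEDDS_URI; something along these lines (paths to taste) should also work with Galactic's rmw_cyclonedds:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <Tracing>
      <!-- "fine" logs discovery activity but not the regular application data -->
      <Verbosity>fine</Verbosity>
      <OutputFile>/tmp/cyclonedds.log</OutputFile>
    </Tracing>
  </Domain>
</CycloneDDS>
```

and then `export CYCLONEDDS_URI=file:///path/to/that/file.xml` before starting the node.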

With the way Cyclone allocates entity ids, 0x4b03 would be the 75th reader/writer entity (or thereabouts) created by that ROS node. The writer for the discovery information that caused trouble in #1146 was created early on, with a lower entity id. So this detail supports your sense that this is a different case; it sounds more like a regular application topic.
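
For reference, the arithmetic behind that estimate (a sketch, not Cyclone's code: an RTPS entity id is three key octets followed by one kind octet):

```c
#include <stdint.h>
#include <stdio.h>

int main (void)
{
  const uint32_t entityid = 0x00004b03;
  printf ("key  = %u (so roughly the 75th reader/writer created)\n",
          (unsigned) (entityid >> 8));      /* 0x4b = 75 */
  printf ("kind = 0x%02x (user-defined writer, no key)\n",
          (unsigned) (entityid & 0xffu));   /* 0x03 */
  return 0;
}
```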

I don't think it has anything to do with services, because all service requests and responses have a header at the very beginning of the data and it is impossible [^1] for that to be all zero. I believe all the parameter manipulation happens through service invocations, so that would exclude those, too.

CDR (the serialisation format used by DDS) does have structure, and the only way you could get all zeros is when the type contains nothing but basic types (ints and floats) and structs and arrays of them: no sequences (or a great many empty sequences), no strings (not even empty ones). If you have such a type, the data could originate in the application. If you don't have such a type, it almost has to be the serialiser running amok. It has never done that to my knowledge, but ...
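
To make that concrete, a hand-written illustration (these bytes follow the plain CDR rules and are written out by hand, not produced by Cyclone's serialiser): zero-valued primitives and an empty sequence really do serialise to nothing but zeros, while even an empty string carries a non-zero length prefix.

```c
#include <stdint.h>
#include <stdio.h>

static void dump (const char *label, const uint8_t *p, size_t n)
{
  printf ("%-20s", label);
  for (size_t i = 0; i < n; i++)
    printf (" %02x", p[i]);
  printf ("\n");
}

int main (void)
{
  /* struct { int32 a; float b; } with a = 0, b = 0.0f */
  const uint8_t zeroed_primitives[8] = { 0 };
  /* sequence<int32> of length 0: just a zero 32-bit length */
  const uint8_t empty_sequence[4] = { 0 };
  /* string "" (little-endian CDR): length 1 (it counts the terminating NUL), then the NUL */
  const uint8_t empty_string[5] = { 0x01, 0x00, 0x00, 0x00, 0x00 };

  dump ("zeroed primitives:", zeroed_primitives, sizeof zeroed_primitives);
  dump ("empty sequence:", empty_sequence, sizeof empty_sequence);
  dump ("empty string:", empty_string, sizeof empty_string);
  return 0;
}
```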

Something else to look at: are you using the Iceoryx integration? That adds more paths in the code: it might be publishing the data via Iceoryx and sending it out on the network at the same time, which is supported and does get tested, but it is a less-common situation.

The same holds for publishing serialized data via the ROS 2 API, because that is also less well tested. I guess you're not replaying a rosbag recording, which is the common reason for publishing serialized data.

Finally, if the size is always the same (or if this is the only huge sample), then you could try setting a conditional breakpoint, for example here: https://github.com/ros2/rmw_cyclonedds/blob/1d7c8d1f9172d467f34111c535f57d9f858ea3c6/rmw_cyclonedds_cpp/src/serdata.cpp#L123
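
If you do, something like the following gdb session is what I have in mind. Note that the variable name in the condition is just a placeholder; you'd have to check what the serialized size is actually called at that line (393752 is the sample size from the capture above):

```
(gdb) break serdata.cpp:123 if size == 393752
(gdb) run
```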

[^1]: Can't be bothered right now to check whether it is truly impossible or merely verging on it.