Hi @omertal88, let me start with the self-help section of the answer:
I have a Wireshark capture where you can see how the traffic starts to climb up. Is there a way I can use Wireshark and the trace log to find the publisher (or topic name) which creates this traffic?
Yes. Each DATA(_FRAG) message (or really any message, but these are the ones that triggered it) contains a writer GUID. You have to look at the dissection of the packet: usually the full GUID is formed by the guidPrefix in the message header (*) and the writerEntityId in the DATA submessage header.
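If you prefer the command line, a tshark one-liner along these lines dumps the guid prefix and writer entity id of every DATA submessage so you can match them against the trace (just a sketch: capture.pcapng is a placeholder for your capture file, and the field names are those of Wireshark's RTPS dissector):
tshark -r capture.pcapng -Y 'rtps.sm.id == 0x15' -T fields -e frame.number -e rtps.guidPrefix -e rtps.sm.wrEntityId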
Now, I assume you mean this (there are many more following, I think this is the first):
and so for the selected message, the full writer GUID is c5501001:e6776757:1d006724:403. Those GUIDs also occur in the Cyclone DDS trace, formatted like I did just now.
Wireshark tries to map them to topic names, but it can only do that for the "built-in ones" (because they have hard-coded entity ids, and Wireshark shows them as, for example, writerEntityId: ENTITYID_BUILTIN_SUBSCRIPTIONS_WRITER (0x000004c2)) and for those application writers for which the capture happens to include the SEDP message describing the writer. In those cases, Wireshark prints the topic name in the summary:
65772 350.983387360 192.0.5.15 192.0.5.35 RTPS 146 INFO_TS, DATA -> rt/drn6b5a05eb/rc/override/raw
as well as including the topic and type names and a reference to the SEDP message in the dissection. For the messages you are asking about, the writer's SEDP message is clearly not in the capture, and hence Wireshark won't help.
The Cyclone trace always starts at the beginning and so it always has the information (not doing log rotation can be an advantage). It also includes the topic name for each message because it makes life much more convenient while adding (relatively) little text to the (already gigantic) trace files. You may also want to have a look at this if you intend to also dive into the Cyclone DDS trace: https://gist.github.com/eboasson/56aa337c558151c7d9598214418f2fed which I started doing and meant to extend a bit further before turning it into a PR on Cyclone — I probably should just add it as it is ...
Sometimes you can make an informed guess and this is one of those cases. ROS 2 used to map each ROS 2 node to a DDS DomainParticipant, but most DDS implementations have horrendous overhead when you do that. Cyclone is a notable exception, but I failed to convince them not to change it and so now it uses a single DDS DomainParticipant per process (or more accurately "context" but they are almost synonymous). (https://github.com/ros2/design/pull/250 is a good starting point if you're curious.)
For the most part, this was a trivial change, but not for introspecting the ROS 2 graph: it used to be that you could derive all necessary information from the DDS discovery data, but that change means the mapping between readers/writers and ROS 2 nodes no longer exists in the discovery data. This was solved by adding a transient-local, keep-last-1 "ros_discovery_info" topic containing a list of all ROS 2 nodes and how the various endpoints relate to them. In short, it contains tons of GUIDs — for some reason stored as 24 byte arrays instead of 16 byte arrays ... The writer of this topic is one of the first writers created when a ROS application starts, and so writer entity id 403 is a good candidate. Furthermore, if you look at the contents of the data, there is a clear pattern which fits with it containing many GUIDs.
This topic gets published every time the application creates or deletes a ROS 2 subscription or publisher. If you look at the other traffic around the time this data is being published in the trace, you can see it correlates with DATA(r[UD])
and DATA(w[UD])
messages. That's how Wireshark indicates that it is a reader/writer SEDP message with flags "unregister" and "dispose", code-speak for "endpoint deleted". So that's what you are looking at here.
The end-result of trying to work around a nasty performance problem in some DDS implementations by changing the ROS 2 node-to-DDS DomainParticipant mapping is therefore a significant increase in data associated with discovery and that's what you're seeing here. As predicted. And your case is actually a mild case ...
(*) The protocol allows data from multiple DomainParticipants to be combined in a single packet, which is quite common in OpenSplice, but not with most implementations, and certainly not with ROS 2 because it only uses a single DDS DomainParticipant. There exists a special INFO_SRC
submessage that changes the source guidPrefix
for subsequent submessages in the same message.
Hello @eboasson
First allow me to show my deepest appreciation for your detailed answer.
Secondly, what did you mean by
changing the ROS 2 node-to-DDS DomainParticipant mapping
I do assign ParticipantIDs statically for each node that needs to communicate over a wireless channel, to actually reduce the discovery traffic. Just wondering, is that what you meant? How did you know I'm doing so, and also, how come that creates an increase in discovery traffic instead of decreasing it? After all, each node only needs to send discovery data to pre-defined ports.
Since I'm using a wireless channel, I can't use multicast discovery. Would you say a better alternative is to use SPDP and just set MaxAutoParticipants to a reasonable number? I feel that is wasteful because most nodes don't need to interact with all the other nodes in my system. Edit: Just started reading ros2/design#250 and will probably know better after I'm done reading...
Thirdly, regarding:
This topic gets published every time the application creates or deletes a ROS 2 subscription or publisher. If you look at the other traffic around the time this data is being published in the trace, you can see it correlates with DATA(r[UD]) and DATA(w[UD]) messages. That's how Wireshark indicates that it is a reader/writer SEDP message with flags "unregister" and "dispose", code-speak for "endpoint deleted". So that's what you are looking at here.
I don't believe I create or delete any publisher / subscriber in my app during that time. Could it be that it happened on the other end?
Thank you again for your help. Omer
I do assign ParticipantIDs statically for each node that needs to communicate over a wireless channel, to actually reduce the discovery traffic. Just wondering is that what you meant? How did you know I'm doing so, and also, how come that creates an increase in discovery traffic instead of decreasing it? After all, each node only needs to send discovery data to pre-defined ports.
Since I'm using a wireless channel, I can't use multicast discovery, would you say a better alternative is to use SPDP and just set MaxAutoParticipants to a reasonable number? I feel that is wasteful because most nodes don't need to interact with all the other nodes in my system.
If you can't do multicast discovery, then what you're doing is the most efficient way. With the MaxAutoParticipants you indeed get a burst of traffic, which is usually more than you need.
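For reference, the kind of static unicast setup you describe can be sketched like this (purely a sketch: the ParticipantIndex value and the ros2 run placeholder are made up, and the peer addresses are the two clients from the capture):
CYCLONEDDS_URI='<Discovery><ParticipantIndex>10</ParticipantIndex><Peers><Peer address="192.0.5.35"/><Peer address="192.0.5.75"/></Peers></Discovery>' ros2 run <your_package> <your_node>
With address-only peers, Cyclone still probes a range of participant indices (ports) at each address, bounded by Discovery/MaxAutoParticipantIndex (the MaxAutoParticipants mentioned above), which is where that burst of traffic comes from.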
How did you know I'm doing so,
If it had been what I meant ... I probably would have been able to tell from looking at the contents of the SPDP message (DATA(p) in Wireshark): in that case I would have noticed that everything is unicast and that all the destination port numbers are in a small range. But that isn't what I meant, and I didn't pay attention to those details 🙂
I don't believe I create or delete any publisher / subscriber in my app during that time
If you filter on (rtps.sm.wrEntityId==0x3c2||rtps.sm.wrEntityId==0x4c2)&&rtps.sm.id==0x15 you get all the SEDP DATA messages — so readers and writers in DDS. There are quite a few of them, among other times around 350s, all of them from 192.0.5.15. Merely seeing those messages tells you there is a lot of discovery traffic; what it doesn't tell you is why that traffic exists: it could be because readers/writers are being created at that time, or it could be because other processes are joining and need to discover the existing readers/writers. The thing to look at is the sequence number (or you could look at ACKNACK messages to see if these are caused by retransmit requests).
If you see many of them in a single message (e.g., frame 59633), chances are they are retransmits, more likely than not because a new remote node was discovered (or an existing one rediscovered; if you have moving robots, that's perfectly reasonable).
The time stamps inside those messages tend to be Feb 13, 2022 13:35:57.641412128 UTC for the ones that are definitely being retransmitted, with low sequence numbers. The very last one has a time stamp of Feb 13, 2022 14:45:16.716417248 UTC and sequence number 1168. I haven't studied them all, but it sure suggests something is being created/deleted. E.g., frame 66624 is a new reader for topic rt/drn9d246540/rc/override/raw (there is some name mangling done by ROS, so this is not exactly the topic name in the application).
You see them go out in bursts of 4, two to .35 and two to .75, but that must be because both nodes have two processes.
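For reference, the same filter can be applied from the command line as well, e.g. (again assuming the capture is saved as capture.pcapng):
tshark -r capture.pcapng -Y '(rtps.sm.wrEntityId==0x3c2||rtps.sm.wrEntityId==0x4c2)&&rtps.sm.id==0x15' -T fields -e frame.number -e frame.time_relative -e ip.src
That lists frame number, capture-relative time and source address for every SEDP DATA, which makes it easy to see where the bursts around 350s come from.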
OK yeah, that actually makes perfect sense to me, because I think around t=350 I swapped the publishers / subscriptions of two devices (.35 and .75). I do that from time to time (there's a use case for it), and usually, as you can see below, the traffic goes down right after it. You can see that after t=351 there are no more messages like that. During the period t>351, no new publishers / subscribers were created, and yet the traffic is very high.
I've noticed this pattern but I'm not sure if it means anything: an ACKNACK which keeps saying the same sample is missing, followed by a big packet of data to the ACKNACK sender. The same sample, 1545, is missing in all the ACKNACK messages I see.
As can be seen here:
This ACKNACK packet was sent long after the one above...
(This is taken from a different Wireshark capture after I reproduced the issue)
This is the repeating pattern in the log file matching what I saw in Wireshark:
1644848885.586969 [0] recvUC: HDR(10f1562:3c0ac58b:2000000 vendor 1.15) len 68 from udp/192.0.5.35:51442
1644848885.586986 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.587070 [0] recvUC: ACKNACK(#6467:1545/1:1 L(:1c1 25871.840551) 10f1562:3c0ac58b:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.587104 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.587215 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015bc8:64 0x7ef008f260:1452 [ udp/192.0.5.35:7423@3 ]
1644848885.587227 [0] tev: traffic-xmit (1) 1536
1644848885.587258 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f1562:3c0ac58b:2000000:204 - queue for transmit
1644848885.587273 [0] recvUC: )
1644848885.587282 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 3c2690177a441a35 - queue for transmit
1644848885.587314 [0] recvUC: non-timed queue now has 1 items
1644848885.587339 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.587386 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015bc8:48 [ udp/192.0.5.35:7423@3 ]
1644848885.587397 [0] tev: traffic-xmit (1) 68
1644848885.588364 [0] recvUC: HDR(10f739d:4d0a2125:1000000 vendor 1.15) len 100 from udp/192.0.5.75:58073
1644848885.588396 [0] recvUC: INFOTS(1631044034.633058760)
1644848885.588430 [0] recvUC: DATA(10f739d:4d0a2125:1000000:203 -> 0:0:0:0 #52050 L(:1c1 25871.841963) => EVERYONE
1644848885.588512 [0] recvUC: data(application, vendor 1.15): 10f739d:4d0a2125:1000000:203 #52050: ST0 rt/drn9d246540/flight_status/xtra_diagnostic_msgs::msg::dds_::FlightStatusStamped_:{{{1631044034,631644538},""},{3,1613,71,0,2,16,0}})
1644848885.589365 [0] tev: heartbeat(wr 16dc1001:7ca56c8b:6113ee07:403) suppressed, resched in 2 s (min-ack 1544, avail-seq 1545, xmit 1545)
1644848885.593823 [0] recvUC: HDR(10f1562:3c0ac58b:2000000 vendor 1.15) len 68 from udp/192.0.5.35:51442
1644848885.593859 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.593922 [0] recvUC: ACKNACK(#6468:1545/1:1 L(:1c1 25871.847425) 10f1562:3c0ac58b:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.594114 [0] recv: HDR(9ed91001:8569ed6:c6fe21e2 vendor 1.16) len 268 from udp/192.0.5.15:49517
1644848885.594189 [0] recv: INFOTS(1639147732.144081792)
1644848885.594226 [0] recv: DATA(9ed91001:8569ed6:c6fe21e2:100c2 -> 0:0:0:0 #1)
1644848885.594245 [0] recv: thread_cputime 0.624000000
1644848885.594375 [0] dq.builtin: data(builtin, vendor 1.16): 0:0:0:0 #1: ST0 /ParticipantBuiltinTopicData:{user_data=10<"enclave=/;">,protocol_version=2:1,vendorid=1:16,participant_lease_duration=10000000000,participant_guid={9ed91001:8569ed6:c6fe21e2:1c1},builtin_endpoint_set=3135,domain_id=0,default_unicast_locator={udp/192.0.5.15:7573},metatraffic_unicast_locator={udp/192.0.5.15:7572},adlink_participant_version_info=0:44:0:0:0:"xtend-gcu/0.8.0/Linux/Linux",cyclone_receive_buffer_size=425984}
1644848885.594404 [0] dq.builtin: SPDP ST0 9ed91001:8569ed6:c6fe21e2:1c1 (known) L(:1c1 25861.847970)
1644848885.594417 [0] dq.builtin: thread_cputime 0.396000000
1644848885.594531 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.594649 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015bc8:64 0x7ef008f260:1452 [ udp/192.0.5.35:7423@3 ]
1644848885.594661 [0] tev: traffic-xmit (1) 1536
1644848885.593937 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f1562:3c0ac58b:2000000:204 - queue for transmit
1644848885.594711 [0] recvUC: )
1644848885.594721 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 3c2690177a441a35 - queue for transmit
1644848885.594761 [0] recvUC: non-timed queue now has 1 items
1644848885.594786 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.594849 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015bc8:48 [ udp/192.0.5.35:7423@3 ]
1644848885.594860 [0] tev: traffic-xmit (1) 68
1644848885.596683 [0] tev: heartbeat(wr 16dc1001:7ca56c8b:6113ee07:2703) suppressed, resched in inf s (min-ack 9223372036854775807, avail-seq -1, xmit 1339)
1644848885.598937 [0] recv: HDR(9ed91001:8569ed6:c6fe21e2 vendor 1.16) len 268 from udp/192.0.5.15:49517
1644848885.598974 [0] recv: INFOTS(1639147732.144081792)
1644848885.599013 [0] recv: DATA(9ed91001:8569ed6:c6fe21e2:100c2 -> 0:0:0:0 #1)
1644848885.599272 [0] recvUC: HDR(10f739d:4d0a2125:2000000 vendor 1.15) len 68 from udp/192.0.5.75:35766
1644848885.599306 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.599382 [0] dq.builtin: data(builtin, vendor 1.16): 0:0:0:0 #1: ST0 /ParticipantBuiltinTopicData:{user_data=10<"enclave=/;">,protocol_version=2:1,vendorid=1:16,participant_lease_duration=10000000000,participant_guid={9ed91001:8569ed6:c6fe21e2:1c1},builtin_endpoint_set=3135,domain_id=0,default_unicast_locator={udp/192.0.5.15:7573},metatraffic_unicast_locator={udp/192.0.5.15:7572},adlink_participant_version_info=0:44:0:0:0:"xtend-gcu/0.8.0/Linux/Linux",cyclone_receive_buffer_size=425984}
1644848885.599408 [0] recvUC: ACKNACK(#7795:1545/1:1 L(:1c1 25871.852872) 10f739d:4d0a2125:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.599421 [0] dq.builtin: SPDP ST0 9ed91001:8569ed6:c6fe21e2:1c1 (known) L(:1c1 25861.852986)
1644848885.599446 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.599564 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015bc8:64 0x7ef008f260:1452 [ udp/192.0.5.75:7423@3 ]
1644848885.599577 [0] tev: traffic-xmit (1) 1536
1644848885.599611 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f739d:4d0a2125:2000000:204 - queue for transmit
1644848885.599627 [0] recvUC: )
1644848885.599636 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 2b88dc2333c9b24 - queue for transmit
1644848885.599683 [0] recvUC: non-timed queue now has 1 items
1644848885.599708 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.599788 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015bc8:48 [ udp/192.0.5.75:7423@3 ]
1644848885.599799 [0] tev: traffic-xmit (1) 68
1644848885.600729 [0] recvUC: HDR(88731001:c738445:f5822862 vendor 1.16) len 116 from udp/192.0.5.15:49931
1644848885.600757 [0] recvUC: INFOTS(1644848885.599981744)
1644848885.600792 [0] recvUC: DATA(88731001:c738445:f5822862:2b03 -> 0:0:0:0 #2567023 L(:1c1 25861.854324)
1644848885.600872 [0] recvUC: data(application, vendor 1.16): 88731001:c738445:f5822862:2b03 #2567023: ST0 rt/xtra_controller_abstraction/xtra_osd/geometry_msgs::msg::dds_::Point_:{0.000000,720.000000,0.000000} =>16dc1001:7ca56c8b:6113ee07:2404
1644848885.601707 [0] recvUC: )
1644848885.601735 [0] recvUC: HEARTBEAT(F#2567023:2567024..2567023 88731001:c738445:f5822862:2b03 -> 0:0:0:0:88731001:c738445:f5822862:2b03 -> 0:0:0:0 no-reliable-readers)
1644848885.606944 [0] recvUC: HDR(88731001:c738445:f5822862 vendor 1.16) len 116 from udp/192.0.5.15:49931
1644848885.606997 [0] recvUC: INFOTS(1644848885.606089424)
1644848885.607039 [0] recvUC: DATA(88731001:c738445:f5822862:2b03 -> 0:0:0:0 #2567024 L(:1c1 25861.860563)
1644848885.607092 [0] recvUC: data(application, vendor 1.16): 88731001:c738445:f5822862:2b03 #2567024: ST0 rt/xtra_controller_abstraction/xtra_osd/geometry_msgs::msg::dds_::Point_:{0.000000,720.000000,0.000000} =>16dc1001:7ca56c8b:6113ee07:2404
1644848885.608099 [0] recvUC: )
1644848885.608146 [0] recvUC: HEARTBEAT(F#2567024:2567025..2567024 88731001:c738445:f5822862:2b03 -> 0:0:0:0:88731001:c738445:f5822862:2b03 -> 0:0:0:0 no-reliable-readers)
1644848885.609381 [0] recvUC: HDR(10f1562:3c0ac58b:2000000 vendor 1.15) len 68 from udp/192.0.5.35:51442
1644848885.609413 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.609493 [0] recvUC: ACKNACK(#6470:1545/1:1 L(:1c1 25871.862980) 10f1562:3c0ac58b:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.609527 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.609645 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015bc8:64 0x7ef008f260:1452 [ udp/192.0.5.35:7423@3 ]
1644848885.609657 [0] tev: traffic-xmit (1) 1536
1644848885.609691 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f1562:3c0ac58b:2000000:204 - queue for transmit
1644848885.609723 [0] recvUC: )
1644848885.609732 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 3c2690177a441a35 - queue for transmit
1644848885.609759 [0] recvUC: non-timed queue now has 1 items
1644848885.609784 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.609865 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015bc8:48 [ udp/192.0.5.35:7423@3 ]
1644848885.609877 [0] tev: traffic-xmit (1) 68
1644848885.616711 [0] recvUC: HDR(88731001:c738445:f5822862 vendor 1.16) len 116 from udp/192.0.5.15:49931
1644848885.616773 [0] recvUC: INFOTS(1644848885.616086352)
1644848885.616816 [0] recvUC: DATA(88731001:c738445:f5822862:2b03 -> 0:0:0:0 #2567025 L(:1c1 25861.870338)
1644848885.616873 [0] recvUC: data(application, vendor 1.16): 88731001:c738445:f5822862:2b03 #2567025: ST0 rt/xtra_controller_abstraction/xtra_osd/geometry_msgs::msg::dds_::Point_:{0.000000,720.000000,0.000000} =>16dc1001:7ca56c8b:6113ee07:2404
1644848885.617280 [0] recvUC: )
1644848885.617312 [0] recvUC: HEARTBEAT(F#2567025:2567026..2567025 88731001:c738445:f5822862:2b03 -> 0:0:0:0:88731001:c738445:f5822862:2b03 -> 0:0:0:0 no-reliable-readers)
1644848885.619332 [0] recvUC: HDR(10f1562:3c0ac58b:2000000 vendor 1.15) len 68 from udp/192.0.5.35:51442
1644848885.619387 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.619485 [0] recvUC: ACKNACK(#6471:1545/1:1 L(:1c1 25871.872950) 10f1562:3c0ac58b:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.619552 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.619706 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015bc8:64 0x7ef008f260:1452 [ udp/192.0.5.35:7423@3 ]
1644848885.619718 [0] tev: traffic-xmit (1) 1536
1644848885.619753 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f1562:3c0ac58b:2000000:204 - queue for transmit
1644848885.619769 [0] recvUC: )
1644848885.619778 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 3c2690177a441a35 - queue for transmit
1644848885.619807 [0] recvUC: non-timed queue now has 1 items
1644848885.619832 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.619876 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015bc8:48 [ udp/192.0.5.35:7423@3 ]
1644848885.619887 [0] tev: traffic-xmit (1) 68
1644848885.626251 [0] recvUC: HDR(88731001:c738445:f5822862 vendor 1.16) len 116 from udp/192.0.5.15:49931
1644848885.626302 [0] recvUC: INFOTS(1644848885.626078864)
1644848885.626344 [0] recvUC: DATA(88731001:c738445:f5822862:2b03 -> 0:0:0:0 #2567026 L(:1c1 25861.879868)
1644848885.626417 [0] recvUC: data(application, vendor 1.16): 88731001:c738445:f5822862:2b03 #2567026: ST0 rt/xtra_controller_abstraction/xtra_osd/geometry_msgs::msg::dds_::Point_:{0.000000,720.000000,0.000000} =>16dc1001:7ca56c8b:6113ee07:2404
1644848885.626538 [0] recvUC: )
1644848885.626559 [0] recvUC: HEARTBEAT(F#2567026:2567027..2567026 88731001:c738445:f5822862:2b03 -> 0:0:0:0:88731001:c738445:f5822862:2b03 -> 0:0:0:0 no-reliable-readers)
1644848885.636503 [0] recvUC: HDR(88731001:c738445:f5822862 vendor 1.16) len 116 from udp/192.0.5.15:49931
1644848885.636549 [0] recvUC: INFOTS(1644848885.636343824)
1644848885.636587 [0] recvUC: DATA(88731001:c738445:f5822862:2b03 -> 0:0:0:0 #2567027 L(:1c1 25861.890115)
1644848885.636638 [0] recvUC: data(application, vendor 1.16): 88731001:c738445:f5822862:2b03 #2567027: ST0 rt/xtra_controller_abstraction/xtra_osd/geometry_msgs::msg::dds_::Point_:{0.000000,720.000000,0.000000} =>16dc1001:7ca56c8b:6113ee07:2404
1644848885.636721 [0] recvUC: )
1644848885.636737 [0] recvUC: HEARTBEAT(F#2567027:2567028..2567027 88731001:c738445:f5822862:2b03 -> 0:0:0:0:88731001:c738445:f5822862:2b03 -> 0:0:0:0 no-reliable-readers)
1644848885.638380 [0] recvUC: HDR(10f739d:4d0a2125:2000000 vendor 1.15) len 68 from udp/192.0.5.75:35766
1644848885.638417 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.638611 [0] recvUC: ACKNACK(#7797:1545/1:1 L(:1c1 25871.891981) 10f739d:4d0a2125:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.638638 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f739d:4d0a2125:2000000:204 - queue for transmit
1644848885.638652 [0] recvUC: )
1644848885.638661 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 2b88dc2333c9b24 - queue for transmit
1644848885.638660 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.638673 [0] recvUC: non-timed queue now has 1 items
1644848885.638912 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015bc8:64 0x7ef008f260:1452 [ udp/192.0.5.75:7423@3 ]
1644848885.638925 [0] tev: traffic-xmit (1) 1536
1644848885.638940 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015960 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.638984 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015a48:48 [ udp/192.0.5.75:7423@3 ]
1644848885.638995 [0] tev: traffic-xmit (1) 68
1644848885.639320 [0] recvUC: HDR(10f1562:3c0ac58b:2000000 vendor 1.15) len 68 from udp/192.0.5.35:51442
1644848885.639368 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.639422 [0] recvUC: ACKNACK(#6473:1545/1:1 L(:1c1 25871.892937) 10f1562:3c0ac58b:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.639443 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f1562:3c0ac58b:2000000:204 - queue for transmit
1644848885.639455 [0] recvUC: )
1644848885.639463 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 3c2690177a441a35 - queue for transmit
1644848885.639463 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015960 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.639486 [0] recvUC: non-timed queue now has 1 items
1644848885.639559 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015a48:64 0x7ef008f260:1452 [ udp/192.0.5.35:7423@3 ]
1644848885.639605 [0] tev: traffic-xmit (1) 1536
1644848885.639617 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.639651 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015bc8:48 [ udp/192.0.5.35:7423@3 ]
1644848885.639662 [0] tev: traffic-xmit (1) 68
1644848885.646542 [0] recvUC: HDR(88731001:c738445:f5822862 vendor 1.16) len 116 from udp/192.0.5.15:49931
1644848885.646590 [0] recvUC: INFOTS(1644848885.646323344)
1644848885.646639 [0] recvUC: DATA(88731001:c738445:f5822862:2b03 -> 0:0:0:0 #2567028 L(:1c1 25861.900156)
1644848885.646696 [0] recvUC: data(application, vendor 1.16): 88731001:c738445:f5822862:2b03 #2567028: ST0 rt/xtra_controller_abstraction/xtra_osd/geometry_msgs::msg::dds_::Point_:{0.000000,720.000000,0.000000} =>16dc1001:7ca56c8b:6113ee07:2404
1644848885.646744 [0] recvUC: )
1644848885.646761 [0] recvUC: HEARTBEAT(F#2567028:2567029..2567028 88731001:c738445:f5822862:2b03 -> 0:0:0:0:88731001:c738445:f5822862:2b03 -> 0:0:0:0 no-reliable-readers)
1644848885.653026 [0] recvUC: HDR(10f739d:4d0a2125:2000000 vendor 1.15) len 68 from udp/192.0.5.75:35766
1644848885.653072 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.653168 [0] recvUC: ACKNACK(#7799:1545/1:1 L(:1c1 25871.906637) 10f739d:4d0a2125:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.653204 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.653321 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015bc8:64 0x7ef008f260:1452 [ udp/192.0.5.75:7423@3 ]
1644848885.653334 [0] tev: traffic-xmit (1) 1536
1644848885.653364 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f739d:4d0a2125:2000000:204 - queue for transmit
1644848885.653380 [0] recvUC: )
1644848885.653389 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 2b88dc2333c9b24 - queue for transmit
1644848885.653416 [0] recvUC: non-timed queue now has 1 items
1644848885.653440 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.653490 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015bc8:48 [ udp/192.0.5.75:7423@3 ]
1644848885.653500 [0] tev: traffic-xmit (1) 68
1644848885.656638 [0] recvUC: HDR(88731001:c738445:f5822862 vendor 1.16) len 116 from udp/192.0.5.15:49931
1644848885.656677 [0] recvUC: INFOTS(1644848885.656321872)
1644848885.656716 [0] recvUC: DATA(88731001:c738445:f5822862:2b03 -> 0:0:0:0 #2567029 L(:1c1 25861.910244)
1644848885.656884 [0] recvUC: data(application, vendor 1.16): 88731001:c738445:f5822862:2b03 #2567029: ST0 rt/xtra_controller_abstraction/xtra_osd/geometry_msgs::msg::dds_::Point_:{0.000000,720.000000,0.000000} =>16dc1001:7ca56c8b:6113ee07:2404
1644848885.656929 [0] recvUC: )
1644848885.656946 [0] recvUC: HEARTBEAT(F#2567029:2567030..2567029 88731001:c738445:f5822862:2b03 -> 0:0:0:0:88731001:c738445:f5822862:2b03 -> 0:0:0:0 no-reliable-readers)
1644848885.657001 [0] recvUC: HDR(10f1562:3c0ac58b:2000000 vendor 1.15) len 68 from udp/192.0.5.35:51442
1644848885.657014 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.657066 [0] recvUC: ACKNACK(#6475:1545/1:1 L(:1c1 25871.910583) 10f1562:3c0ac58b:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.657080 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f1562:3c0ac58b:2000000:204 - queue for transmit
1644848885.657092 [0] recvUC: )
1644848885.657100 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 3c2690177a441a35 - queue for transmit
1644848885.657110 [0] recvUC: non-timed queue now has 2 items
1644848885.657290 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.657430 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015bc8:64 0x7ef008f260:1452 [ udp/192.0.5.35:7423@3 ]
1644848885.657446 [0] tev: traffic-xmit (1) 1536
1644848885.657462 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015960 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.657499 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015a48:48 [ udp/192.0.5.35:7423@3 ]
1644848885.657510 [0] tev: traffic-xmit (1) 68
1644848885.666834 [0] recvUC: HDR(88731001:c738445:f5822862 vendor 1.16) len 116 from udp/192.0.5.15:49931
1644848885.666890 [0] recvUC: INFOTS(1644848885.666669360)
1644848885.666932 [0] recvUC: DATA(88731001:c738445:f5822862:2b03 -> 0:0:0:0 #2567030 L(:1c1 25861.920456)
1644848885.666986 [0] recvUC: data(application, vendor 1.16): 88731001:c738445:f5822862:2b03 #2567030: ST0 rt/xtra_controller_abstraction/xtra_osd/geometry_msgs::msg::dds_::Point_:{0.000000,720.000000,0.000000} =>16dc1001:7ca56c8b:6113ee07:2404
1644848885.667051 [0] recvUC: )
1644848885.667163 [0] recvUC: HEARTBEAT(F#2567030:2567031..2567030 88731001:c738445:f5822862:2b03 -> 0:0:0:0:88731001:c738445:f5822862:2b03 -> 0:0:0:0 no-reliable-readers)
1644848885.667969 [0] recvUC: HDR(10f739d:4d0a2125:2000000 vendor 1.15) len 68 from udp/192.0.5.75:35766
1644848885.667997 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.668062 [0] recvUC: ACKNACK(#7801:1545/1:1 L(:1c1 25871.921563) 10f739d:4d0a2125:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.668079 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f739d:4d0a2125:2000000:204 - queue for transmit
1644848885.668092 [0] recvUC: )
1644848885.668101 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 2b88dc2333c9b24 - queue for transmit
1644848885.668110 [0] recvUC: non-timed queue now has 2 items
1644848885.670705 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015960 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.670871 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015a48:64 0x7ef008f260:1452 [ udp/192.0.5.75:7423@3 ]
1644848885.670891 [0] tev: traffic-xmit (1) 1536
1644848885.670906 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.670951 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015bc8:48 [ udp/192.0.5.75:7423@3 ]
1644848885.670962 [0] tev: traffic-xmit (1) 68
1644848885.676468 [0] recvUC: HDR(88731001:c738445:f5822862 vendor 1.16) len 116 from udp/192.0.5.15:49931
1644848885.676513 [0] recvUC: INFOTS(1644848885.676319088)
1644848885.676555 [0] recvUC: DATA(88731001:c738445:f5822862:2b03 -> 0:0:0:0 #2567031 L(:1c1 25861.930078)
1644848885.676661 [0] recvUC: data(application, vendor 1.16): 88731001:c738445:f5822862:2b03 #2567031: ST0 rt/xtra_controller_abstraction/xtra_osd/geometry_msgs::msg::dds_::Point_:{0.000000,720.000000,0.000000} =>16dc1001:7ca56c8b:6113ee07:2404
1644848885.677094 [0] recvUC: )
1644848885.677124 [0] recvUC: HEARTBEAT(F#2567031:2567032..2567031 88731001:c738445:f5822862:2b03 -> 0:0:0:0:88731001:c738445:f5822862:2b03 -> 0:0:0:0 no-reliable-readers)
1644848885.678704 [0] recvUC: HDR(10f1562:3c0ac58b:2000000 vendor 1.15) len 68 from udp/192.0.5.35:51442
1644848885.678730 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.678800 [0] recvUC: ACKNACK(#6477:1545/1:1 L(:1c1 25871.932295) 10f1562:3c0ac58b:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.678817 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f1562:3c0ac58b:2000000:204 - queue for transmit
1644848885.678830 [0] recvUC: )
1644848885.678838 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 3c2690177a441a35 - queue for transmit
1644848885.678850 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.678864 [0] recvUC: non-timed queue now has 1 items
1644848885.679025 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015bc8:64 0x7ef008f260:1452 [ udp/192.0.5.35:7423@3 ]
1644848885.679036 [0] tev: traffic-xmit (1) 1536
1644848885.679050 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015960 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.679096 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015a48:48 [ udp/192.0.5.35:7423@3 ]
1644848885.679106 [0] tev: traffic-xmit (1) 68
1644848885.686555 [0] recvUC: HDR(88731001:c738445:f5822862 vendor 1.16) len 116 from udp/192.0.5.15:49931
1644848885.686606 [0] recvUC: INFOTS(1644848885.686313200)
1644848885.687023 [0] recvUC: DATA(88731001:c738445:f5822862:2b03 -> 0:0:0:0 #2567032 L(:1c1 25861.940172)
1644848885.687088 [0] recvUC: data(application, vendor 1.16): 88731001:c738445:f5822862:2b03 #2567032: ST0 rt/xtra_controller_abstraction/xtra_osd/geometry_msgs::msg::dds_::Point_:{0.000000,720.000000,0.000000} =>16dc1001:7ca56c8b:6113ee07:2404
1644848885.687194 [0] recvUC: )
1644848885.687213 [0] recvUC: HEARTBEAT(F#2567032:2567033..2567032 88731001:c738445:f5822862:2b03 -> 0:0:0:0:88731001:c738445:f5822862:2b03 -> 0:0:0:0 no-reliable-readers)
1644848885.690659 [0] recvUC: HDR(10f739d:4d0a2125:2000000 vendor 1.15) len 68 from udp/192.0.5.75:35766
1644848885.690709 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.690840 [0] recvUC: ACKNACK(#7803:1545/1:1 L(:1c1 25871.944256) 10f739d:4d0a2125:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.690917 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015960 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.691027 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015a48:64 0x7ef008f260:1452 [ udp/192.0.5.75:7423@3 ]
1644848885.691076 [0] tev: traffic-xmit (1) 1536
1644848885.691103 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f739d:4d0a2125:2000000:204 - queue for transmit
1644848885.691119 [0] recvUC: )
1644848885.691128 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 2b88dc2333c9b24 - queue for transmit
1644848885.691159 [0] recvUC: non-timed queue now has 1 items
1644848885.691183 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015960 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.691234 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015a48:48 [ udp/192.0.5.75:7423@3 ]
1644848885.691582 [0] tev: traffic-xmit (1) 68
1644848885.694805 [0] tev: heartbeat(wr 16dc1001:7ca56c8b:6113ee07:403) suppressed, resched in 2 s (min-ack 1544, avail-seq 1545, xmit 1545)
1644848885.696802 [0] recvUC: HDR(10f1562:3c0ac58b:2000000 vendor 1.15) len 68 from udp/192.0.5.35:51442
1644848885.698388 [0] recvUC: INFODST(16dc1001:7ca56c8b:6113ee07)
1644848885.698589 [0] recvUC: ACKNACK(#6479:1545/1:1 L(:1c1 25871.951943) 10f1562:3c0ac58b:2000000:204 -> 16dc1001:7ca56c8b:6113ee07:403 RX1545non-timed queue now has 1 items
1644848885.698661 [0] recvUC: rexmit#1 maxseq:1545<1545<=1545defer_heartbeat_to_peer: 16dc1001:7ca56c8b:6113ee07:403 -> 10f1562:3c0ac58b:2000000:204 - queue for transmit
1644848885.698678 [0] recvUC: )
1644848885.698699 [0] recvUC: send_deferred_heartbeat: af09861aec84a6e1 -> 3c2690177a441a35 - queue for transmit
1644848885.699313 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015960 0(rexmit(16dc1001:7ca56c8b:6113ee07:403:#1545/1)): niov 0 sz 0 => now niov 3 sz 1536
1644848885.699427 [0] tev: nn_xpack_send 1536: 0x7f2800138c:20 0x7f20015a48:64 0x7ef008f260:1452 [ udp/192.0.5.35:7423@3 ]
1644848885.699439 [0] tev: traffic-xmit (1) 1536
1644848885.699669 [0] recvUC: non-timed queue now has 1 items
1644848885.699698 [0] tev: xpack_addmsg 0x7f28001380 0x7f20015ae0 0(control): niov 0 sz 0 => now niov 2 sz 68
1644848885.699749 [0] tev: nn_xpack_send 68: 0x7f2800138c:20 0x7f20015bc8:48 [ udp/192.0.5.35:7423@3 ]
1644848885.699759 [0] tev: traffic-xmit (1) 68
1644848885.699805 [0] recvUC: HDR(88731001:c738445:f5822862 vendor 1.16) len 116 from udp/192.0.5.15:49931
1644848885.699818 [0] recvUC: INFOTS(1644848885.696739024)
1644848885.699852 [0] recvUC: DATA(88731001:c738445:f5822862:2b03 -> 0:0:0:0 #2567033 L(:1c1 25861.953387)
1644848885.699911 [0] recvUC: data(application, vendor 1.16): 88731001:c738445:f5822862:2b03 #2567033: ST0 rt/xtra_controller_abstraction/xtra_osd/geometry_msgs::msg::dds_::Point_:{0.000000,720.000000,0.000000} =>16dc1001:7ca56c8b:6113ee07:2404
1644848885.700261 [0] recvUC: )
I wrote what's below before you added the excerpt from the Cyclone trace. Thanks for sharing that, it confirms some of the things I could only suspect before. Now we have to figure out whether my suspicions are correct :)
I have a suspicion. There are basically four cases in requesting retransmits:
1. ACKNACK requests retransmit of a "small" sample, followed by a DATA in response
2. ACKNACK requests retransmit of a "large" sample, followed by all DATA_FRAGs making up the sample in response
3. ACKNACK requests retransmit of a "large" sample, followed by some but not all DATA_FRAGs in response
4. NACK_FRAG requests retransmits of some (perhaps all) fragments of a "large" sample, followed by ≥1 DATA_FRAG in response
Naturally, some or all of the packets in the response can be lost in transmission. There are recovery mechanisms to deal with that, and they start with a HEARTBEAT message that suggests/requires the readers to acknowledge what they have received and request retransmissions for what is missing.
From packets 166835+166837 and 166843+166846, and knowing Cyclone, I suspect you may be looking at case 3. When a remote reader requests a retransmission of a large sample, Cyclone only retransmits the first fragment and relies on the reader to request the remaining fragments via NACK_FRAG. (I learnt the hard way how much of a difference this makes in the predictability of the system when samples vary in size from a few bytes to tens of megabytes.) If the other side doesn't request the remaining fragments using NACK_FRAG but simply asks for a full retransmit using an ACKNACK message, then this leads to a cycle.
The little bit I can see fits with this pattern. This is all very much a grey area in the specs: it doesn't say anywhere that you must use NACK_FRAG, though I do think it is daft not to use it.
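A quick way to check whether the other side ever sends NACK_FRAG at all is to filter on the submessage ids (a sketch; capture.pcapng is a placeholder, and 0x06, 0x12 and 0x16 are the standard RTPS submessage ids for ACKNACK, NACK_FRAG and DATA_FRAG):
tshark -r capture.pcapng -Y 'rtps.sm.id==0x06 || rtps.sm.id==0x12 || rtps.sm.id==0x16' -T fields -e frame.number -e ip.src -e rtps.sm.id
If the requesting side only ever shows up with 0x06 and never 0x12 while the other end keeps answering with 0x16, that fits case 3 degenerating into a cycle.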
Two things are interesting now:
1. Does the other side actually receive and process the DATA_FRAG sent in response to its ACKNACK? (I suspect so, otherwise nothing would work, but it is best to check the basics.)
2. Is anything creating or deleting readers/writers (ROS 2 publishers/subscriptions) around the time this starts?
If it is what I suspect, a quick work-around that should at least work for a while is to raise the maximum message size and maximum fragment size in Cyclone. In the capture you shared everything is <16kB, so setting General/MaxMessageSize to 65500B and General/FragmentSize to 62kB would avoid all fragmentation and presumably work around the problem.
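In Cyclone configuration terms that would look something like this (a sketch in the same CYCLONEDDS_URI style as the listener test further down; the ros2 run part is a placeholder):
CYCLONEDDS_URI='<General><MaxMessageSize>65500B</MaxMessageSize><FragmentSize>62000B</FragmentSize></General>' ros2 run <your_package> <your_node>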
But I'd be grateful if you could have a look into those two questions as well. Work-arounds are not solutions, and this one falls apart for anything over 62kB.
Hey again,
So I'm not familiar with Fast-RTPS logs so I'll need some time to dig into it. In the meanwhile here's the answer to your 2nd question: In our system, we have one main device and multiple client devices which can connect to it. Only one client can be "active" at a time. This means it is receiving video (RTSP) from the client + sending / receiving control messages & status (DDS). When we wish to switch between clients (that is, making the other active), we connect to the new RTSP server of the other client, delete all publishers / subscribers of the old client, and create new ones of the new client. I've noticed that when we do this switch repeatedly, then the network traffic starts to climb. So to answer your question: Yes. It happens when we delete old publishers / subscribers and create new ones. After many iterations - this issue will occur. Hope it helps.
One more observation that I noticed... I think that in most cases when I reproduce the issue and can identify the first packet that started the loop, I see that a TCP packet was sent / received just a moment before. I don't see the connection yet, but this is too much of a coincidence to ignore.
Also, I find it weird that both clients lose the same sample and both get stuck in that loop. This must be an issue on the sender's side.
Edit: I don't think it's related to fragmentation, because I observed that the same loop occurs even with a full sample. I didn't want to flood this comment with snapshots, but you can take my word that the same sample was missing in all ACKNACKs (sample #667).
Thank you once again :)
I think I may have a theory... You said to increase the packet size... I think maybe the solution would be to lower the packet size. I notice that only large packets fail to be sent to the other end. Maybe the issue is the MTU. I'm trying to lower the MaxMessageSize / MaxRexmitSize and reproduce the issue.
Will update on the results...
Edit: Unfortunately it didn't help. The retransmit loop still occurs.
@eboasson,
I've noticed in my trace file that the first sample of ros_discovery_info was:
1644848544.774573 [0] xtra_osd_n: write_sample 16dc1001:7ca56c8b:6113ee07:403 #1: ST0 ros_discovery_info/rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_:{{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,1,193,0,0,0,0,0,0,0,0}}},{{"/","xtra_osd",{},{}}}}
Then the second was:
1644848544.779657 [0] xtra_osd_n: write_sample 16dc1001:7ca56c8b:6113ee07:403 #2: ST0 ros_discovery_info/rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_:{{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,1,193,0,0,0,0,0,0,0,0}}},{{"/","xtra_osd",{},{{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,6,3,0,0,0,0,0,0,0,0}}}}}}}
And the 10th:
1644848544.818250 [0] xtra_osd_n: write_sample 16dc1001:7ca56c8b:6113ee07:403 #10: ST0 ros_discovery_info/rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_:{{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,1,193,0,0,0,0,0,0,0,0}}},{{"/","xtra_osd",{{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,8,4,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,10,4,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,12,4,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,14,4,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,16,4,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,18,4,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,20,4,0,0,0,0,0,0,0,0}}}},{{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,6,3,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,7,3,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,9,3,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,11,3,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,13,3,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,108,139,97,19,238,7,0,0,15,3,0,0,0,0,0,0,0,0}}}{{{22,220,16,1,124,165,
So it seems that when I remove publishers / subscriptions in my ROS node, it doesn't delete their readers / writers, so that message keeps growing until it's too big to be sent.
Could that be the cause?
I would expect that deleting a publisher / subscription would also remove its reader / writer.
Thanks again. Omer
When you create a node, several subscriptions and publishers (for parameters and the like) get created, and this fits with that pattern. I did a quick test by just starting (a derivative of) "listener" and stopping it again (it stops cleanly when it receives ^C) and that looks perfectly normal: first you see all the entities get created (with as many ros_discovery_info samples, with the samples growing ever larger), then you see them getting destroyed (again with ros_discovery_info samples, but now getting smaller again).
CYCLONEDDS_URI="$CYCLONEDDS_URI,<Tr><V>finest</><Out>cdds.log</>" build/hacked_demo/listener64 & sleep 1 ; kill -INT $!
rg 'WRITER|READER|write_sample.*ros_discovery_info|delete_(reader|writer)\(guid' cdds.log > filtered.txt
tail filtered.txt:
1645106437.913739 [0] 6465549: delete_writer(guid 1103c14:58d4043a:239086f6:903) - no unack'ed samples
1645106437.913898 [0] 6465549: write_sample 1103c14:58d4043a:239086f6:403 #20: ST0 ros_discovery_info/rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_:{{{{1,16,60,20,88,212,4,58,35,144,134,246,0,0,1,193,0,0,0,0,0,0,0,0}}},{{"/","listener",{{{{1,16,60,20,88,212,4,58,35,144,134,246,0,0,20,4,0,0,0,0,0,0,0,0}}}{{{1,16,60,20,88,212,4,58,35,144,134,246,0,0,21,4,0,0,0,0,0,0,0,0}}}},{{{{1,16,60,20,88,212,4,58,35,144,134,246,0,0,19,3,0,0,0,0,0,0,0,0}}}}}}}
1645106437.914020 [0] 6465549: delete_writer(guid 1103c14:58d4043a:239086f6:703) ...
1645106437.914027 [0] 6465549: delete_writer(guid 1103c14:58d4043a:239086f6:703) - no unack'ed samples
1645106437.914171 [0] 6465549: write_sample 1103c14:58d4043a:239086f6:403 #21: ST0 ros_discovery_info/rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_:{{{{1,16,60,20,88,212,4,58,35,144,134,246,0,0,1,193,0,0,0,0,0,0,0,0}}},{{"/","listener",{{{{1,16,60,20,88,212,4,58,35,144,134,246,0,0,20,4,0,0,0,0,0,0,0,0}}}{{{1,16,60,20,88,212,4,58,35,144,134,246,0,0,21,4,0,0,0,0,0,0,0,0}}}},{}}}}
1645106437.914201 [0] 6465549: delete_writer(guid 1103c14:58d4043a:239086f6:1303) ...
1645106437.914207 [0] 6465549: delete_writer(guid 1103c14:58d4043a:239086f6:1303) - no unack'ed samples
1645106437.914376 [0] 6465549: write_sample 1103c14:58d4043a:239086f6:403 #22: ST0 ros_discovery_info/rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_:{{{{1,16,60,20,88,212,4,58,35,144,134,246,0,0,1,193,0,0,0,0,0,0,0,0}}},{}}
1645106437.914466 [0] 6465549: delete_writer(guid 1103c14:58d4043a:239086f6:403) ...
1645106437.914472 [0] 6465549: delete_writer(guid 1103c14:58d4043a:239086f6:403) - no unack'ed samples
It gets big, yes, but not too big to be sent. It does get too big to fit in one ethernet frame, then too big for what Cyclone by default uses as the limit before it starts fragmenting, &c., but the ones in the Wireshark capture would still fit in a single UDP datagram. Raising the message size limit in Cyclone would avoid DDSI-RTPS fragmentation, and that, I suspect, will work around the problem.
In theory all DDS implementations should interoperate perfectly with very large messages and fragmentation, but that's where I think you're running into some problem that hadn't surfaced before.
Hello @eboasson, I've changed the value and am now running my node and a script which repeatedly switches devices. After about 15 minutes of running I can see this message in my trace file:
1645109065.392445 [0] RosThread: write_sample 4cff1001:ac1b8e9e:6714e943:403 #2565: ST0 ros_discovery_info/rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_:{{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,1,193,0,0,0,0,0,0,0,0}}},{{"/","xtra_osd",{{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,8,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,10,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,12,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,14,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,16,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,18,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,20,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,23,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,25,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,27,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,29,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,31,4,0,0,0,0,0,0,0,0}}}{{{76,255,16,1,172,27,142,158,103,20,233,67,0,0,33,4,0,0,0,0,0,0,0,0}}}{{{76,255
1645109065.392529 [0] RosThread: xpack_addmsg 0x55ab568320 0x7f400382a0 0(data(4cff1001:ac1b8e9e:6714e943:403:#2565/1)): niov 0 sz 0 => now niov 3 sz 24672
1645109065.392549 [0] RosThread: writer_hbcontrol: wr 4cff1001:ac1b8e9e:6714e943:403 multicasting (rel-prd 19 seq-eq-max 17 seq 2565 maxseq 2548)
If I understand it correctly, the ros_discovery_info message reached the size of 24672 bytes... So if I'm not wrong, in about 30 minutes from now it will start the loop again. So it did delay the issue but not really solve it.
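(For reference, one way to keep an eye on how big this writer's samples get is to grep the trace for its packing lines, in the same spirit as the rg command above. A sketch, assuming the trace goes to cdds.log and that :403 is the ros_discovery_info writer as before:
rg 'xpack_addmsg.*:403:#' cdds.log | tail
Each matching line ends with the packed size, e.g. "sz 24672" here.)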
I did try to restart my node and I noticed the message size was reset. So I'm assuming that the test you made didn't exactly showcase what I'm experiencing.
Instead, if it's not too much trouble, can you please try to create a subscriber / publisher, and then do .reset()
on its shared_ptr (that's what I do to delete it), then create a new one, and repeat that process in an endless loop?
Thank you Omer
I don't think it really has anything to do with how you delete things, but I did reproduce it, and from there onwards it was quite trivial. If you could give https://github.com/ros2/rmw_cyclonedds/pull/373 a try ...
Thanks for catching that!
Thank you for all of your help... Now after a week of debugging, I think I earned myself a beer :)
Do you know when / if it will be part of galactic?
I think I earned myself a beer :)
Definitely. (I chose wine because squashing a bug leaves a red stain 😜)
I don't know yet when/if it will get backported (really: cherry-picked) to galactic, but I think it should be. Perhaps it would be a good idea to ask for that on the PR, I am sure everyone likes feedback from users that fixing stuff in patch releases is worth the effort.
Hehe :) I added my comment in the PR. I'm not sure if I should close this issue now, or wait until the PR gets merged.
Hello,
I'm using CycloneDDS as the RMW of the ROS 2 nodes in my system. I have an issue where, after my system runs for a long time, large packets of USER TRAFFIC are sent to another node (on a different machine, running micro ROS Agent).
Looking at the code, I can't find any reason why these packets would be sent. Also, when I take the agent down, the traffic goes back to normal. I have a Wireshark capture where you can see how the traffic starts to climb up. Is there a way I can use Wireshark and the trace log to find the publisher (or topic name) which creates this traffic?
Please see the attached Wireshark capture. You can see the spike in RTPS traffic at around time=350. wireshark-capture.zip
Another observation worth mentioning: I looked at one of the big packets sent from 192.0.5.15 to 192.0.5.35. In the serialized data section I expected to see random characters (as it is serialized)... But instead I saw an interesting, yet redundant pattern. For example, you can see xtra_osd, which is probably the name of the node which sent the message, then a repeating pattern of ÅPæwgWg$ followed by all the English alphabet characters, digits and keyboard symbols, one at a time.
Thank you. Omer