mkosobucki commented 4 years ago

When a Subscriber or Publisher leaves, the remaining participants (in my case the Discovery Service, Publisher, or Subscriber) are left with sockets stuck in CLOSE_WAIT. This suggests that cleanup did not happen after the departure was detected.

Expected Behavior

Sockets that were used to communicate with a participant that has left should be closed on ALL remaining participants.

Current Behavior

The remaining participants do not close and clear that socket; it stays in CLOSE_WAIT.

Steps to Reproduce

Start 3 processes on 3 different hosts (Discovery, Subscriber, Publisher)
Use the same Topic on Publisher and Subscriber, and use the Discovery Service so that they discover and match each other
Start sending 1 MB messages from Publisher to Subscriber every few seconds
Watch the output of "netstat -tanp" on all participants and identify the connection to the Participant you are about to stop
Gracefully stop Publisher or Subscriber
Watch the output of "netstat -tanp" on the remaining Participants; you will see the following:

On Discovery:

tcp 0 0 10.244.0.72:9843 10.244.0.73:42036 CLOSE_WAIT 8/disco

On Publisher (if that is remaining):

tcp 1 0 10.244.6.4:60036 10.244.0.73:48376 CLOSE_WAIT 12/pub
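Checking the netstat output by eye gets tedious when the test is repeated. The following is a minimal sketch of a filter that picks out the CLOSE_WAIT rows from text shaped like the lines above. The names `TcpRow` and `find_close_wait` are made up for illustration, and the column order (proto, recv-q, send-q, local, foreign, state, pid/program) is assumed from the sample output.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One row of `netstat -tanp` output, reduced to the columns used here.
// (Hypothetical helper type, not part of Fast-RTPS.)
struct TcpRow
{
    std::string local;
    std::string remote;
    std::string state;
    std::string program;  // "PID/name" column, empty if absent
};

// Returns the rows whose TCP state is CLOSE_WAIT.
// Assumed column order: proto recv-q send-q local foreign state [pid/program].
std::vector<TcpRow> find_close_wait(const std::string& netstat_output)
{
    std::vector<TcpRow> hits;
    std::istringstream lines(netstat_output);
    std::string line;
    while (std::getline(lines, line))
    {
        std::istringstream cols(line);
        std::string proto, recv_q, send_q;
        TcpRow row;
        if (!(cols >> proto >> recv_q >> send_q >> row.local >> row.remote >> row.state))
        {
            continue;  // too few columns: blank or malformed line
        }
        cols >> row.program;  // optional column
        // Only keep TCP rows (tcp, tcp6) in CLOSE_WAIT; this also skips headers.
        if (proto.rfind("tcp", 0) == 0 && row.state == "CLOSE_WAIT")
        {
            hits.push_back(row);
        }
    }
    return hits;
}
```

Feeding it the captured output of "netstat -tanp" before and after stopping a participant should show whether new CLOSE_WAIT entries appear for the peer that left.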

System information

Fast-RTPS version: 1.9.4
OS: Ubuntu 18.04
Network interfaces: eth0
ROS2: N/A

Additional context

Tested on bare metal and in Docker (same problem).

Additional resources

Wireshark capture (different IPs than above!). You can see that FP. was sent by a subscriber before it went down, but that did not cause the Pub or Disco to close their sockets:
21:59:32.123302 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [.], ack 5077, win 505, options [nop,nop,TS val 158643746 ecr 3915380301], length 0
21:59:33.590663 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [P.], seq 6444:6802, ack 5077, win 507, options [nop,nop,TS val 158645214 ecr 3915380301], length 358
21:59:33.590808 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [.], ack 6802, win 1179, options [nop,nop,TS val 3915381769 ecr 158645214], length 0
21:59:34.123560 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [P.], seq 5077:5359, ack 6802, win 1179, options [nop,nop,TS val 3915382302 ecr 158645214], length 282
21:59:34.123636 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [.], ack 5359, win 505, options [nop,nop,TS val 158645747 ecr 3915382302], length 0
21:59:35.590830 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [P.], seq 6802:7160, ack 5359, win 507, options [nop,nop,TS val 158647214 ecr 3915382302], length 358
21:59:35.590954 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [.], ack 7160, win 1201, options [nop,nop,TS val 3915383769 ecr 158647214], length 0
21:59:36.123824 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [P.], seq 5359:5641, ack 7160, win 1201, options [nop,nop,TS val 3915384302 ecr 158647214], length 282
21:59:36.123888 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [.], ack 5641, win 505, options [nop,nop,TS val 158647747 ecr 3915384302], length 0
21:59:37.590900 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [P.], seq 7160:7518, ack 5641, win 507, options [nop,nop,TS val 158649214 ecr 3915384302], length 358
21:59:37.591002 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [.], ack 7518, win 1224, options [nop,nop,TS val 3915385769 ecr 158649214], length 0
21:59:38.123905 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [P.], seq 5641:5923, ack 7518, win 1224, options [nop,nop,TS val 3915386302 ecr 158649214], length 282
21:59:38.123976 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [.], ack 5923, win 505, options [nop,nop,TS val 158649747 ecr 3915386302], length 0
21:59:39.591067 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [P.], seq 7518:7876, ack 5923, win 507, options [nop,nop,TS val 158651214 ecr 3915386302], length 358
21:59:39.591210 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [.], ack 7876, win 1247, options [nop,nop,TS val 3915387769 ecr 158651214], length 0
21:59:40.124017 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [P.], seq 5923:6205, ack 7876, win 1247, options [nop,nop,TS val 3915388302 ecr 158651214], length 282
21:59:40.124113 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [.], ack 6205, win 505, options [nop,nop,TS val 158651747 ecr 3915388302], length 0
21:59:41.325983 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [P.], seq 6205:6351, ack 7876, win 1247, options [nop,nop,TS val 3915389504 ecr 158651747], length 146
21:59:41.326044 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [.], ack 6351, win 504, options [nop,nop,TS val 158652949 ecr 3915389504], length 0
21:59:41.326524 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [P.], seq 7876:8006, ack 6351, win 507, options [nop,nop,TS val 158652949 ecr 3915389504], length 130
21:59:41.326619 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [.], ack 8006, win 1269, options [nop,nop,TS val 3915389504 ecr 158652949], length 0
21:59:41.326830 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [P.], seq 6351:6393, ack 8006, win 1269, options [nop,nop,TS val 3915389505 ecr 158652949], length 42
21:59:41.328643 IP 10.244.0.85.56490 > 10.244.0.80.9843: Flags [FP.], seq 6393:6611, ack 8006, win 1269, options [nop,nop,TS val 3915389506 ecr 158652949], length 218
21:59:41.329268 IP 10.244.0.80.9843 > 10.244.0.85.56490: Flags [.], ack 6612, win 512, options [nop,nop,TS val 158652952 ecr 3915389505], length 0
21:59:45.361228 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [S], seq 765080420, win 29200, options [mss 1460,sackOK,TS val 3915393539 ecr 0,nop,wscale 7], length 0
21:59:45.361309 IP 10.244.0.80.9843 > 10.244.0.85.56912: Flags [S.], seq 1011728825, ack 765080421, win 28960, options [mss 1460,sackOK,TS val 158656984 ecr 3915393539,nop,wscale 7], length 0
21:59:45.361387 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [.], ack 1, win 229, options [nop,nop,TS val 3915393539 ecr 158656984], length 0
21:59:45.362123 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [P.], seq 1:69, ack 1, win 229, options [nop,nop,TS val 3915393540 ecr 158656984], length 68
21:59:45.362201 IP 10.244.0.80.9843 > 10.244.0.85.56912: Flags [.], ack 69, win 227, options [nop,nop,TS val 158656985 ecr 3915393540], length 0
21:59:45.362304 IP 10.244.0.80.9843 > 10.244.0.85.56912: Flags [P.], seq 1:69, ack 69, win 227, options [nop,nop,TS val 158656985 ecr 3915393540], length 68
21:59:45.362430 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [.], ack 69, win 229, options [nop,nop,TS val 3915393540 ecr 158656985], length 0
21:59:45.362590 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [P.], seq 69:111, ack 69, win 229, options [nop,nop,TS val 3915393540 ecr 158656985], length 42
21:59:45.362703 IP 10.244.0.80.9843 > 10.244.0.85.56912: Flags [P.], seq 69:103, ack 111, win 227, options [nop,nop,TS val 158656985 ecr 3915393540], length 34
21:59:45.404680 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [.], ack 103, win 229, options [nop,nop,TS val 3915393582 ecr 158656985], length 0
21:59:45.463144 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [P.], seq 111:393, ack 103, win 229, options [nop,nop,TS val 3915393641 ecr 158656985], length 282
21:59:45.470634 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [.], seq 393:1841, ack 103, win 229, options [nop,nop,TS val 3915393648 ecr 158656985], length 1448
21:59:45.470675 IP 10.244.0.80.9843 > 10.244.0.85.56912: Flags [.], ack 1841, win 258, options [nop,nop,TS val 158657093 ecr 3915393641], length 0
21:59:45.470787 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [P.], seq 1841:1865, ack 103, win 229, options [nop,nop,TS val 3915393649 ecr 158657093], length 24
21:59:45.478317 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [.], seq 1865:3313, ack 103, win 229, options [nop,nop,TS val 3915393656 ecr 158657093], length 1448
21:59:45.478352 IP 10.244.0.80.9843 > 10.244.0.85.56912: Flags [.], ack 3313, win 280, options [nop,nop,TS val 158657101 ecr 3915393649], length 0
21:59:45.478454 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [P.], seq 3313:3341, ack 103, win 229, options [nop,nop,TS val 3915393656 ecr 158657101], length 28
21:59:45.485813 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [.], seq 3341:4789, ack 103, win 229, options [nop,nop,TS val 3915393664 ecr 158657101], length 1448
21:59:45.485840 IP 10.244.0.80.9843 > 10.244.0.85.56912: Flags [.], ack 4789, win 303, options [nop,nop,TS val 158657109 ecr 3915393656], length 0
21:59:45.485926 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [P.], seq 4789:4817, ack 103, win 229, options [nop,nop,TS val 3915393664 ecr 158657109], length 28
21:59:45.493347 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [.], seq 4817:6265, ack 103, win 229, options [nop,nop,TS val 3915393671 ecr 158657109], length 1448
21:59:45.493377 IP 10.244.0.80.9843 > 10.244.0.85.56912: Flags [.], ack 6265, win 326, options [nop,nop,TS val 158657116 ecr 3915393664], length 0
21:59:45.493464 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [P.], seq 6265:6293, ack 103, win 229, options [nop,nop,TS val 3915393671 ecr 158657116], length 28
21:59:45.500834 IP 10.244.0.85.56912 > 10.244.0.80.9843: Flags [.], seq 6293:7741, ack 103, win 229, options [nop,nop,TS val 3915393679 ecr 158657116], length 1448
21:59:45.500858 IP 10.244.0.80.9843 > 10.244.0.85.56912: Flags [.], ack 7741, win 348, options [nop,nop,TS val 158657124 ecr 3915393671], length 0
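
In the capture above, 10.244.0.85.56490 sends a segment with the FIN flag and the peer only ACKs it. A FIN never comes back from 10.244.0.80.9843 on that connection within the capture, which is what a socket sitting in CLOSE_WAIT looks like on the wire. As a rough sketch, the endpoints that sent a FIN can be pulled out of such a text dump like this; `fin_senders` is a made-up helper name, and it assumes each packet is on its own line in the "time IP src > dst: Flags [..]" layout shown here.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Returns the "address.port" endpoints that sent at least one segment with
// the FIN flag set. (Hypothetical helper; assumes one packet per line in the
// form: <time> IP <src> > <dst>: Flags [<flags>], ...)
std::set<std::string> fin_senders(const std::string& capture)
{
    std::set<std::string> senders;
    std::istringstream lines(capture);
    std::string line;
    while (std::getline(lines, line))
    {
        std::istringstream cols(line);
        std::string time, proto, src, arrow, dst, flags_kw, flags;
        if (!(cols >> time >> proto >> src >> arrow >> dst >> flags_kw >> flags))
        {
            continue;  // not a packet line
        }
        if (proto != "IP" || arrow != ">" || flags_kw != "Flags")
        {
            continue;  // different layout than assumed
        }
        // The flags token looks like "[FP.],"; an 'F' inside it marks FIN.
        if (flags.find('F') != std::string::npos)
        {
            senders.insert(src);
        }
    }
    return senders;
}
```

If only one side of a connection shows up in the result, the other side has not closed its end yet, at least not within the captured window.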

mkosobucki commented 4 years ago

We think it might be related to cancel being called unnecessarily or too early: https://github.com/eProsima/Fast-RTPS/blob/master/src/cpp/rtps/transport/TCPChannelResourceBasic.cpp#L116

mkosobucki commented 4 years ago

Any progress here?

Jimmy316 commented 4 years ago

Hi, we have also noticed a similar issue. Any update on this?

AshwinSreelal commented 4 years ago

I believe I have determined the root cause of the issue. When the UNBIND RTCP message is sent by the Participant that is disconnecting, it triggers the TCPChannelResource in the connected Participants to call the disconnect function linked by @mkosobucki.

However, since the original participant has already disconnected, the conditional on line 105, if (eConnecting < change_status(eConnectionStatus::eDisconnected) && alive()), evaluates to false because alive() is false. This means that while the original participant deletes its sockets, the connected participants do not clean up theirs.

I believe the correct fix is to change the conditional to if (eConnecting < change_status(eConnectionStatus::eDisconnected)), since the alive() check is not needed. I have verified with netstat that this change properly cleans up the sockets.
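
To make the difference between the two conditionals concrete, here is a small stand-alone model. It is not the Fast-RTPS code: `ModelChannel`, `alive_flag`, `socket_open`, and both `disconnect_*` functions are invented for illustration. It assumes that change_status() returns the previous status and that eDisconnected orders below eConnecting.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the channel discussed above; not Fast-RTPS code.
// Assumed ordering: eDisconnected < eConnecting < eConnected.
enum class Status { eDisconnected = 0, eConnecting, eConnected };

struct ModelChannel
{
    std::atomic<Status> status{Status::eConnected};
    bool alive_flag = true;   // what alive() is assumed to report
    bool socket_open = true;  // stands in for the real socket

    // Assumed behaviour: store the new status and return the previous one.
    Status change_status(Status s) { return status.exchange(s); }
    bool alive() const { return alive_flag; }

    // Conditional as quoted in the comment: cleanup also requires alive().
    void disconnect_with_alive_check()
    {
        if (Status::eConnecting < change_status(Status::eDisconnected) && alive())
        {
            socket_open = false;  // cancel/shutdown/close would go here
        }
    }

    // Proposed conditional: cleanup depends only on the previous status.
    void disconnect_without_alive_check()
    {
        if (Status::eConnecting < change_status(Status::eDisconnected))
        {
            socket_open = false;
        }
    }
};
```

With `alive_flag` set to false, `disconnect_with_alive_check()` moves the status to eDisconnected but leaves `socket_open` true, while `disconnect_without_alive_check()` closes it. In this model the status has already changed after the first call, so a later call would not clean up either, which would match sockets staying in CLOSE_WAIT.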

Jimmy316 commented 4 years ago

@richiware @MiguelCompany any updates on this?

richiware commented 4 years ago

We haven't had time to look into this issue. We will try to schedule it for next week.

JLBuenoLopez commented 2 years ago

Sorry @mkosobucki,

Fast DDS (formerly known as FastRTPS) 1.9.x has reached end of life. Also, the issue reported might be already solved by #2470 in the currently supported versions. I am going to close this issue but feel free to reopen it if you reproduce it in one of the supported Fast DDS versions.

eProsima / Fast-DDS

Service Discovery - sockets left in CLOSE_WAIT on remaining participants [8989] #966
