eProsima / Fast-DDS

The most complete DDS - Proven: Plenty of success cases. Looking for commercial support? Contact info@eprosima.com
https://eprosima.com
Apache License 2.0

Service Discovery - sockets left in CLOSE_WAIT on remaining participants [8989] #966

Closed mkosobucki closed 2 years ago

mkosobucki commented 4 years ago

When a Subscriber or Publisher leaves, the remaining participants (in my case the Discovery Service, the Publisher, or the Subscriber) are left with sockets stuck in CLOSE_WAIT. This suggests that cleanup did not happen after the remaining participants detected that the participant had left.

Expected Behavior

Sockets that were used to communicate with a participant that has left should be closed on ALL remaining participants.
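
For readers less familiar with the state: a TCP socket sits in CLOSE_WAIT after the peer has closed its end and until the local application calls close() on its own descriptor. The sketch below reproduces that sequence with plain POSIX sockets on the loopback interface. It is an illustration only and does not use Fast-RTPS; the function name recv_after_peer_close is hypothetical.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns the result of recv() on the accepted socket after the peer has
// closed its end: 0 means end-of-stream, -1 means a setup or recv error.
// Between the peer's close() and our own close(accepted), the accepted
// socket is the one netstat would list as CLOSE_WAIT.
long recv_after_peer_close()
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0) return -1;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; // let the kernel pick a free port
    socklen_t len = sizeof(addr);

    if (bind(listener, reinterpret_cast<sockaddr*>(&addr), len) < 0 ||
        listen(listener, 1) < 0 ||
        getsockname(listener, reinterpret_cast<sockaddr*>(&addr), &len) < 0)
    {
        close(listener);
        return -1;
    }

    int client = socket(AF_INET, SOCK_STREAM, 0);
    if (client < 0 ||
        connect(client, reinterpret_cast<sockaddr*>(&addr), len) < 0)
    {
        if (client >= 0) close(client);
        close(listener);
        return -1;
    }

    int accepted = accept(listener, nullptr, nullptr);
    if (accepted < 0)
    {
        close(client);
        close(listener);
        return -1;
    }

    close(client); // the peer leaves; its FIN moves 'accepted' to CLOSE_WAIT

    char buf[16];
    long n = recv(accepted, buf, sizeof(buf), 0); // 0 once the FIN is seen

    close(accepted); // the cleanup step that ends CLOSE_WAIT
    close(listener);
    return n;
}
```

If the final close(accepted) were skipped, the descriptor would stay in CLOSE_WAIT for as long as the process keeps it open, which matches the symptom described in this report.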

Current Behavior

The remaining participants do not close and clear that socket; it stays in CLOSE_WAIT.

Steps to Reproduce

  1. Start 3 processes on 3 different hosts (Discovery, Subscriber, Publisher)
  2. Use the same Topic on the Publisher and Subscriber, and use the Discovery Service so that they discover each other and match
  3. Start sending 1 MB messages from the Publisher to the Subscriber every few seconds
  4. Watch the output of "netstat -tanp" on all participants and identify the connection to the Participant you are about to stop
  5. Gracefully stop the Publisher or the Subscriber
  6. Watch the output of "netstat -tanp" on the remaining Participants; you will see the following:

On Discovery:

tcp 0 0 10.244.0.72:9843 10.244.0.73:42036 CLOSE_WAIT 8/disco

On Publisher (if that is remaining):

tcp 1 0 10.244.6.4:60036 10.244.0.73:48376 CLOSE_WAIT 12/pub
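
To make the check in steps 4 and 6 repeatable, the netstat output can be scanned for leaked sockets programmatically. The following is a minimal sketch, assuming the output has been captured into a string; count_close_wait and sample_output are hypothetical names, and the ESTABLISHED line in the sample is made up for contrast (the two CLOSE_WAIT lines are the ones reported above).

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Counts the lines of netstat-style text that contain a CLOSE_WAIT field.
std::size_t count_close_wait(const std::string& netstat_output)
{
    std::istringstream lines(netstat_output);
    std::string line;
    std::size_t count = 0;
    while (std::getline(lines, line))
    {
        // Split the line on whitespace and look for the state column.
        std::istringstream fields(line);
        std::string field;
        while (fields >> field)
        {
            if (field == "CLOSE_WAIT")
            {
                ++count;
                break; // count each line at most once
            }
        }
    }
    return count;
}

// Two lines from this report plus one invented ESTABLISHED line.
const std::string sample_output =
    "tcp 0 0 10.244.0.72:9843 10.244.0.73:42036 CLOSE_WAIT 8/disco\n"
    "tcp 1 0 10.244.6.4:60036 10.244.0.73:48376 CLOSE_WAIT 12/pub\n"
    "tcp 0 0 10.244.0.72:9843 10.244.0.74:42040 ESTABLISHED 8/disco\n";
```

A count that stays above zero after the departed participant has been detected would indicate the leak described here.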

Additional context

Tested in baremetal and dockerized (same problem).

mkosobucki commented 4 years ago

We think it might be related to cancel being called unnecessarily or too early. https://github.com/eProsima/Fast-RTPS/blob/master/src/cpp/rtps/transport/TCPChannelResourceBasic.cpp#L116

mkosobucki commented 4 years ago

Any progress here?

Jimmy316 commented 4 years ago

Hi, we have also noticed a similar issue. Any update on this?

AshwinSreelal commented 4 years ago

I believe that I have determined the root cause of the issue. When the disconnecting Participant sends the UNBIND RTCP message, it triggers the TCPChannelResource in each connected Participant to call the disconnect function linked by @mkosobucki.

However, since the original participant has already disconnected, the conditional in line 105, if (eConnecting < change_status(eConnectionStatus::eDisconnected) && alive()), evaluates to false because alive() is false. This means that while the original participant deletes its sockets, the connected participants do not clean up theirs.

I believe that the correct fix is to change the conditional to if (eConnecting < change_status(eConnectionStatus::eDisconnected)), since the check of alive() is not needed. I have verified with netstat that this change properly cleans up the sockets.
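
The effect of the two conditionals can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the Fast-RTPS source: ChannelModel, alive_flag, socket_open, and socket_left_open are hypothetical names, the enum is reduced to three values, and change_status is assumed to return the previous status, as the comparison in the quoted conditional implies.

```cpp
#include <cassert>

// Simplified stand-in for the connection states named in the conditional.
enum eConnectionStatus { eDisconnected = 0, eConnecting, eConnected };

// Hypothetical model of a channel; NOT the real TCPChannelResource.
struct ChannelModel
{
    eConnectionStatus status = eConnected;
    bool alive_flag = true;  // assumed to back alive()
    bool socket_open = true; // stands in for the OS socket

    bool alive() const { return alive_flag; }

    // Assumed behaviour: store the new status, return the previous one.
    eConnectionStatus change_status(eConnectionStatus s)
    {
        eConnectionStatus old = status;
        status = s;
        return old;
    }

    // Conditional as quoted from line 105.
    void disconnect_with_alive_check()
    {
        if (eConnecting < change_status(eConnectionStatus::eDisconnected) && alive())
        {
            socket_open = false; // cleanup happens only if still alive
        }
    }

    // Conditional as proposed in the comment above.
    void disconnect_without_alive_check()
    {
        if (eConnecting < change_status(eConnectionStatus::eDisconnected))
        {
            socket_open = false; // cleanup happens for any connected channel
        }
    }
};

// Returns true if the socket is still open after disconnecting a channel
// that is connected but already marked as not alive.
bool socket_left_open(bool use_alive_check)
{
    ChannelModel ch;
    ch.alive_flag = false; // assumption: the channel was already disabled
    if (use_alive_check)
    {
        ch.disconnect_with_alive_check();
    }
    else
    {
        ch.disconnect_without_alive_check();
    }
    return ch.socket_open;
}
```

In this model, socket_left_open(true) leaves the socket open while socket_left_open(false) closes it. Whether the real channel reaches the conditional with alive() already false depends on the library's internals, which the model does not reproduce.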

Jimmy316 commented 4 years ago

@richiware @MiguelCompany any updates on this?

richiware commented 4 years ago

We didn't have time to look over this issue. We will try to schedule it next week.

JLBuenoLopez commented 2 years ago

Sorry @mkosobucki,

Fast DDS (formerly known as FastRTPS) 1.9.x has reached end of life. Also, the reported issue might already be solved by #2470 in the currently supported versions. I am going to close this issue, but feel free to reopen it if you reproduce it in one of the supported Fast DDS versions.