Haivision / srt

Secure, Reliable, Transport
https://www.srtalliance.org
Mozilla Public License 2.0
3.06k stars 840 forks source link

[BUG] Not freeing system sockets when calling srt_close() #1822

Open andersc opened 3 years ago

andersc commented 3 years ago

Describe the bug

SRT got a problem freeing system-sockets the way it’s implemented. If I start one server and ->

st = srt_close(socket);
srt_cleanup();

The system-socket is released immediately. after srt_cleanup(); However we dynamically start and stop multiple instances of SRT servers all the time. Meaning we can’t call srt_cleanup(); Since we in the system got other active servers while we close a server. The problem then is that the srt_close(socket); seams just to mark the system-socket for deletion and your garbage collector that periodically runs is after a while actually closing the system-socket. The problem is then that programatically the user of SRT thinks the socket is free to be used but ‘bind’ will fail.

srt_close() should close the system-socket.

To Reproduce (This is verified on Ubuntu 20.04 and MacOS 11.2.1) There is a repo verifying this claim -> bugtest bugtestmain.cpp

run the bugtest executable and in the system also a terminal where in linux you check the process binding the port like this-> netstat -tulpen | grep 8000

Expected behavior The bugtest application and when the application prints -> Stopped and destroyed the server, now the port should be free. run the setstat command.. The socket is still bound to the application for a while (3-5 seconds).

Desktop (please provide the following information):

jeandube commented 3 years ago

I encountered this problem early in the development of SRT (circa 2013) and then implemented a workaround in my wrapper lib that was so effective that I forgot about it... until today. Basically, when srt_listen() fails with SRT_EDUPLISTEN the workaround enter a loop of 5 more srt_listen() attempts with 1 second period. I have never seen this error again. This timing and delay was (and still is) not critical to me.

andersc commented 3 years ago

The problem we got is that the user might select other protocols or features we do not control on that previously used SRT port so we cant fix this issue in any wrapper. We got a user interface where the user can select many different types of protocols where SRT is one of them. The user selects another protocol on the same port usually.. Then our system fails due to SRT still binds the port we did free.

Anyhow there are ugly workarounds to fix stuff like that but we boil down to how should it work and my view is that when closing a socket it should block the call until the system socket is released.

ethouris commented 3 years ago

@jeandube I fixed this as one of the first thing I did in SRT :) See CUDTUnited::close(CUDTSocket* s), the section with SRTS_LISTENING state.

That was, however, only the first problem, which was blocking the listener and made it unable to close one listener and immediately create another binding to the same port. Now it works, but it simply borrows the multiplexer from the closed-but-not-yet-recycled previous listener socket before it would lose all users and would be deleted.

What is likely required now is that the socket is disconnected from its multiplexer immediately when requested deletion. This is actually a bigger problem because in order to disconnect the socket from the multiplexer - and associated sender and receiver queues - first all activities concerning this socket must be stopped. The biggest problem actually is with these weird secondary lists that collect sockets to be operated with from another thread, which are not exactly well synchronized with socket deletion.

ethouris commented 3 years ago

@andersc Just note that what is really required here (and possible to be done) is to early disconnect the SRT socket from the multiplexer, not forcefully destroy the multiplexer. Multiplexer will be of course destroyed when disconnected from the last socket, but you cannot be ever sure it will happen. The multiplexer is shared between sockets in two cases:

To make sure that the multiplexer is destroyed, the application must track all sockets and know what to do to make it happen: for a caller socket do not bind explicitly (or at least close all sharing sockets) and for a listener, if you close the listener, you need to also close all sockets that were accepted off it. Only then will the listener's multiplexer be destroyed.

jeandube commented 3 years ago

The problem we got is that the user might select other protocols or features we do not control on that previously used SRT port so we cant fix this issue in any wrapper. We got a user interface where the user can select many different types of protocols where SRT is one of them. The user selects another protocol on the same port usually.. Then our system fails due to SRT still binds the port we did free.

Anyhow there are ugly workarounds to fix stuff like that but we boil down to how should it work and my view is that when closing a socket it should block the call until the system socket is released.

So my workaround was maybe not so effective after all. "C" workaround vs. C++ fix.

ethouris commented 3 years ago

This has actually nothing to do with C++. This is the UDT's design, which made the socket garbage-collected after being closed, while some activities still need to continue in other threads. But I think it should be possible to move the multiplexer-disconnection code to happen immediately when closing and actually in every place where the socket is moved from m_Sockets to m_ClosedSockets. The problem could be with the "reader-remaining" sockets, that is, sockets that were broken, but data they collected earlier are still not extracted by the application. Such a socket is not moved to the "trashcan" because these sockets are no longer dispatchable, and reading the remaining data wouldn't be possible. But then, this socket might remain connected to the multiplexer until it's explicitly closed.

jeandube commented 3 years ago
  • When you accept a connection from a listener socket (all accepted sockets out of a listener socket share its multiplexer)

@ethouris Don't you extend the definition of the Multiplexer too much here: I thought that all accepted sockets will share the local port but share the Multiplexer only if the are with the same peer.

ethouris commented 3 years ago

There's no other way in SRT to share the local port as sharing the Multiplexer. Multiplexer owns the Channel, Channel owns the UDP socket, UDP socket owns the port when binding with default SO_REUSEADDR=false.

Multiplexer also owns the sender and receiver queues. When you send a packet over a socket it lands in its sender queue - it doesn't matter that the same queue contain packets to be send to different peers, nor it matters that the receiver queue gets packets incoming through this same UDP socket from different peers.

lcksk commented 1 year ago

I also encountered the same problem. Our scenario is that the listener lives in a service process. Every time the socket returned by srt_accept() is closed, we observe that the internal resources are not released, so each time the following loop will cause the memory to continue to increase(about +100K ~ 200K)

...
srt_listen()

loop {
  fd = srt_accept()
  ...
  srt_recv(fd,...)
  ...
  srt_close(fd)
}
...

This is observed with some tools(https://github.com/bytedance/memory-leak-detector.git) that after each sequence of srt_accept()->srt_close() calls, the tool complains about no closed objects

  939,976   totals
      939,976   libsrt.so

0xc1211030, 409600, 1
0x00122c3d /lib/arm/libsrt.so (srt::CRcvLossList::CRcvLossList(int) + 64)
0x000d5f87 /lib/arm/libsrt.so (srt::CUDT::prepareConnectionObjects(srt::CHandShake const&, srt::HandshakeSide, srt::CUDTException*) + 422)

0xe48b7030, 196608, 1
0x00121883 /lib/arm/libsrt.so (srt::CSndLossList::CSndLossList(int) + 70)

0xc13c0010, 139776, 3
0x001299a7 /lib/arm/libsrt.so (srt::CUnitQueue::increase() + 358)

0xc2f10020, 65536, 1
0x000b72c1 /lib/arm/libsrt.so (srt::FixedArray<srt::CRcvBufferNew::Entry>::FixedArray(unsigned int) + 50)

0xe4d30050, 46592, 1
0x001296e9 /lib/arm/libsrt.so (srt::CUnitQueue::init(int, int, int) + 170)

0xc13e4040, 46592, 1
0x000b3947 /lib/arm/libsrt.so (srt::CSndBuffer::CSndBuffer(int, int) + 130)

0xc2557050, 20904, 1
0x00096049 /lib/arm/libsrt.so (srt::CUDTUnited::newConnection(int, srt::sockaddr_any const&, srt::CPacket const&, srt::CHandShake&, int&, srt::CUDT*&) + 1356)

0xebc4ce50, 6168, 3
0x00129951 /lib/arm/libsrt.so (srt::CUnitQueue::increase() + 272)

0xc252c850, 4096, 1
0x0012b3e9 /lib/arm/libsrt.so (srt::CHash::init(int) + 40)
0x0012d8a3 /lib/arm/libsrt.so (srt::CRcvQueue::init(int, unsigned int, int, int, srt::CChannel*, srt::sync::CTimer*) + 106)

0xebc4ded0, 2056, 1
0x00129695 /lib/arm/libsrt.so (srt::CUnitQueue::init(int, int, int) + 86)

0xebc562d0, 2048, 1
0x00129d97 /lib/arm/libsrt.so (srt::CSndUList::CSndUList(srt::sync::CTimer*) + 110)

Could you please tell me if there is any effective way to avoid this problem? The focus of our attention is that the memory should not continue to increase, because it is a service process, I can't call srt_cleanup directly. In addition, I noticed the following PR, is this related to this problem, is it stable enough to merge into the master state?

lcksk commented 1 year ago

In addition, after merging #1829 (some minor errors need to be corrected due to version changes), the memory still continues to grow.

maxsharabayko commented 9 months ago

Status Update

After PR #2643 (SRT v1.5.2) the closure time has been reduced from 4.5 s to 1.5s.

Setting the state to SRTS_CLOSING reduces the time to actually free the listener socket from 4.5 s to 1.5 s. Adding CSync::notify_one_relaxed(m_GCStopCond); at the end of CUDTUnited::close() further reduces the time to 1 second by notifying the GC thread to run a re-check.

maxsharabayko commented 9 months ago

Removing the artificial delay of 1s to close a socket makes it closed almost immediately, but still in the background thread. This artificial delay in case of a listener helps to gracefully reject or close connection requests if any in progress. Maybe some flag or a socket option could be used to tell to close immediately. In a way, the purpose of the SRTO_LINGER is close to this, and maybe should just be used without this additional 1s delay.

The Change

diff --git a/srtcore/api.cpp b/srtcore/api.cpp
index 7ec8ff57..e3015a97 100644
--- a/srtcore/api.cpp
+++ b/srtcore/api.cpp
@@ -2678,7 +2678,7 @@ void srt::CUDTUnited::checkBrokenSockets()
         // RcvUList
         const steady_clock::time_point now        = steady_clock::now();
         const steady_clock::duration   closed_ago = now - j->second->m_tsClosureTimeStamp;
-        if (closed_ago > seconds_from(1))
+        //if (closed_ago > seconds_from(1))
         {
             CRNode* rnode = j->second->core().m_pRNode;
             if (!rnode || !rnode->m_bOnList)

Unit Test

(Or use the bugtest provided by @andersc)

#include <gtest/gtest.h>
#include <thread>
#include <chrono>
#include <string>
#include <map>

#ifdef _WIN32
#define INC_SRT_WIN_WINTIME // exclude gettimeofday from srt headers
#endif

#include "srt.h"

TEST(Listener, Restart)
{
    srt_startup();
    srt_setloglevel(LOG_NOTICE);

    auto s = srt_create_socket();

    sockaddr_in bind_sa;
    memset(&bind_sa, 0, sizeof bind_sa);
    bind_sa.sin_family = AF_INET;
    ASSERT_EQ(inet_pton(AF_INET, "127.0.0.1", &bind_sa.sin_addr), 1);
    bind_sa.sin_port = htons(5555);

    EXPECT_NE(srt_bind(s, (sockaddr*)&bind_sa, sizeof bind_sa), -1);
    EXPECT_NE(srt_listen(s, 5), -1);

    std::this_thread::sleep_for(std::chrono::milliseconds(500));

    srt_close(s);

    s = srt_create_socket();
    int optval = 100;
    int optlen = sizeof optval;
    EXPECT_NE(srt_setsockflag(s, SRTO_IPTTL, (void*)&optval, optlen), SRT_ERROR) << srt_getlasterror_str();

    const auto time_start = std::chrono::steady_clock::now();
    while (srt_bind(s, (sockaddr*)&bind_sa, sizeof bind_sa) == -1)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }

    std::cerr << "Binding took " << std::chrono::duration_cast<std::chrono::milliseconds>((std::chrono::steady_clock::now() - time_start)).count() << '\n';

    //EXPECT_NE(srt_bind(s, (sockaddr*)&bind_sa, sizeof bind_sa), -1);
    EXPECT_NE(srt_listen(s, 5), -1);

    srt_close(s);
    srt_cleanup();
}