Haivision / srt

Secure, Reliable, Transport
https://www.srtalliance.org
Mozilla Public License 2.0

When there is no network, srt_connect() has no return #1952

Open BoleLiu opened 3 years ago

BoleLiu commented 3 years ago

On the Android platform, in blocking mode, if I turn off the network, srt_connect and srt_send never return. Is this the expected behavior, or is there a configuration option I forgot to set?

ethouris commented 3 years ago

What do you mean by "there is no return"? The blocking functions don't exit? The expected behavior is that if the connection function doesn't establish a connection within a predefined time, it exits with failure. This time is defined by SRTO_CONNTIMEO.

BoleLiu commented 3 years ago

"There is no return" means the call never returns and stays blocked there.

There are 2 problems:

  1. If I turn off the network during transmission, srt_send blocks and never returns. I worked around this by checking the socket state before sending and returning immediately if the socket is broken.
  2. If I turn off the network before calling srt_connect, srt_connect also blocks and never returns.

Any advice for the second problem? Thanks a lot.
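The workaround described in item 1 can be sketched roughly as follows. This is only an illustration under assumptions: `LinkState`, `guarded_send`, and `guarded_send_example` are hypothetical names, and the two callables stand in for the real socket calls so the sketch compiles without the SRT library.

```cpp
#include <cassert>

// Hypothetical stand-in for the socket states the application cares about.
enum class LinkState { Connected, Broken };

// Calls `send` only if `state` reports a live connection.
// Otherwise returns -1 immediately instead of entering a blocking send.
template <class StateFn, class SendFn>
int guarded_send(StateFn state, SendFn send, const char* buf, int len)
{
    if (state() != LinkState::Connected)
        return -1;
    return send(buf, len);
}

// Usage with fake callables: a broken link must never reach the send call.
inline void guarded_send_example()
{
    int calls = 0;
    auto fake_send = [&](const char*, int n) { ++calls; return n; };
    int r = guarded_send([] { return LinkState::Broken; }, fake_send, "x", 1);
    assert(r == -1 && calls == 0);
}
```

In a real application the state callable would presumably wrap `srt_getsockstate` and the send callable would wrap `srt_send`. Note that the check only narrows the window: the link can still break between the state check and the send, so it does not replace a fix in the library.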

BoleLiu commented 3 years ago

I got the following log, but srt_connect still does not return in blocking mode: D:SRT.cn: startConnect: TTL time 18733D 07:21:08.774405 [STDY] exceeded, TIMEOUT.

ethouris commented 3 years ago

What source version do you have (version in git)? I suspect this might be one of the old deadlock problems around registerConnector.

If you enable heavy logs at compile time (ENABLE_HEAVY_LOGGING in cmake) and enable them in the application (-loglevel debug) you should see this log:

    HLOGC(cnlog.Debug, log << "removeConnector: removing @" << id);

If the application hangs after displaying this, it could be this deadlock.

If you could help me by running this under a debugger and see where particular threads are hanging, it would be even more helpful.

BoleLiu commented 3 years ago

[screenshot] It does reach the log line above, and then it gets stuck there.

ethouris commented 3 years ago

Ah, so that's what I suspected. First, however, I need to know your version.

BoleLiu commented 3 years ago

How can I get the real version? The version in CMakeLists is 1.4.3, but in README.md it's 1.4.2. Besides, I compiled the library from the newest master code in the repo.

ethouris commented 3 years ago

Ok, there's a PR that is intended to fix things around there. Would you be able to take the code from the branch mentioned there and see if this fixes the problem? If you confirm it, we should be able to increase the priority for it.

https://github.com/Haivision/srt/pull/1844

BoleLiu commented 3 years ago

OK, I'll try it later. Also, which version is more stable for live stream transmission?

ethouris commented 3 years ago

For all I know, the latest master should be stable enough. Maybe @maxsharabayko can be more precise.

BoleLiu commented 3 years ago

ok, and for the first problem, do you have any advice? Have you encountered this problem before?

ethouris commented 3 years ago

That problem I haven't found, but we've encountered a suspected potential deadlock around this place with thread sanitizer; that's why I believe the PR I gave you may fix the problem.

BoleLiu commented 3 years ago

I pulled the PR into my local branch and recompiled the library, but it doesn't seem to work: the call still never returns.

ethouris commented 3 years ago

Would you be able to run it under a debugger? Unfortunately I don't have an Android platform at hand to test it...

Also, do you use your own application or one of those in SRT repo?

BoleLiu commented 3 years ago

I cannot run it under a debugger, but I can get the debug log. I use my own application to test it.

maxsharabayko commented 3 years ago

From your discussion, the only thing it can be hanging on is CRendezvousQueue::m_RIDVectorLock in CRcvQueue::removeConnector(..), with the lock probably taken by CRendezvousQueue::updateConnStatus(..). The latest screenshot seems to confirm this.

Update: However, the last message from there is "updateConnStatus: 0/1 sockets updated...", so the lock must have been released.

maxsharabayko commented 3 years ago

It would be very surprising if it is hanging here, but just to check @BoleLiu could you please add some logs around THREAD_PAUSED() and THREAD_RESUMED()?

    int CRcvQueue::recvfrom(int32_t id, CPacket& w_packet)
    {
        UniqueLock bufferlock (m_BufferLock);
        CSync buffercond    (m_BufferCond, bufferlock);

        map<int32_t, std::queue<CPacket *> >::iterator i = m_mBuffer.find(id);

        if (i == m_mBuffer.end())
        {
            THREAD_PAUSED();
            buffercond.wait_for(seconds_from(1));
            THREAD_RESUMED();

BoleLiu commented 3 years ago

[screenshot] @maxsharabayko It isn't hanging here; it appears to resume successfully.

maxsharabayko commented 3 years ago

I see... 🤔 Could you please add more logs around CRcvQueue::m_BufferLock then? The goal is to track where it stays locked, in case it is the cause of the deadlock.
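One way to add such logs around a lock is a small tracing guard. This is a minimal sketch using plain `std::mutex`, not SRT's own `UniqueLock` or logging macros; `TraceLog` and `TracedLock` are hypothetical helper names introduced only for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical thread-safe sink for trace lines (a real build would use the library's logger).
class TraceLog
{
public:
    void add(const std::string& line)
    {
        std::lock_guard<std::mutex> g(m_mutex);
        m_lines.push_back(line);
    }
    std::vector<std::string> snapshot() const
    {
        std::lock_guard<std::mutex> g(m_mutex);
        return m_lines;
    }
private:
    mutable std::mutex m_mutex;
    std::vector<std::string> m_lines;
};

// Hypothetical guard: logs "wait" before blocking on the mutex, "held" once it is
// acquired, and "free" just before it is released.
class TracedLock
{
public:
    TracedLock(std::mutex& m, const std::string& where, TraceLog& log)
        : m_where(where), m_log(log), m_lock(m, std::defer_lock)
    {
        m_log.add("wait " + m_where);
        m_lock.lock();
        m_log.add("held " + m_where);
    }
    ~TracedLock()
    {
        // The mutex itself is released right after this body, when m_lock is destroyed.
        m_log.add("free " + m_where);
    }
private:
    std::string m_where;
    TraceLog& m_log;
    std::unique_lock<std::mutex> m_lock;
};
```

With a guard like this at each lock site, a "wait X" line with no matching "held X" points at the thread that is stuck, and the most recent "held Y" without a "free Y" points at the thread that still owns the lock.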

BoleLiu commented 3 years ago

@maxsharabayko I added more logs and found that it doesn't block in removeConnector. It seems that the broken socket cannot be removed in checkBrokenSockets, so execution never leaves the loop in garbageCollect. Besides, I want to know: what is the expected behavior when I call srt_connect on a broken network in blocking mode?

ethouris commented 3 years ago

In blocking mode, the connecting function (CUDT::startConnect) runs a loop that sends and receives the packets needed for the handshake. If the network is cut off, it simply won't receive any response, and it should give up and exit with failure (throw an exception) after a timeout. The "registered connector" is required so that the facility knows a socket is connection-pending and therefore where to dispatch handshakes. When a removal happens, it means the function has given up and is about to return an error.
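The expected control flow can be illustrated with a simplified loop. This is a sketch under assumptions, not the actual CUDT::startConnect code: `blocking_connect`, `ConnectTimeout`, and the `try_handshake` callable are hypothetical, and the callable is assumed to send one handshake request and wait briefly for a reply.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>

// Hypothetical exception type signalling that the handshake never completed.
struct ConnectTimeout : std::runtime_error
{
    ConnectTimeout() : std::runtime_error("connection timed out") {}
};

// Repeats the handshake attempt until it succeeds or the deadline passes.
// Returns the number of attempts made; throws ConnectTimeout on expiry.
// try_handshake is assumed to block for a short interval while waiting for a
// reply, so the loop does not spin at full speed in real use.
inline int blocking_connect(const std::function<bool()>& try_handshake,
                            std::chrono::milliseconds timeout)
{
    using clock = std::chrono::steady_clock;
    const clock::time_point deadline = clock::now() + timeout;
    int attempts = 0;

    while (true)
    {
        ++attempts;
        if (try_handshake())
            return attempts;        // the peer answered: connected
        if (clock::now() >= deadline)
            throw ConnectTimeout(); // no answer in time: give up with failure
    }
}
```

In the library the timeout corresponds to SRTO_CONNTIMEO. The point of the sketch is that on a dead network every attempt fails, so the only legitimate exits are success or the timeout error; a call that never returns means something outside this loop, such as a lock or the cleanup path, is holding the thread.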