Haivision / srt

Secure, Reliable, Transport
https://www.srtalliance.org
Mozilla Public License 2.0

[BUG] SRTO_ENFORCEDENCRYPTION set to false despite default configuration documented as true #999

Closed · toots closed this issue 4 years ago

toots commented 4 years ago

Hi!

Describe the bug

We were investigating a connection freeze in our SRT setup and came across the following gdb stack trace:

Thread 26 (Thread 0x7fce47fff700 (LWP 9328)):
#0  futex_abstimed_wait_cancelable (private=0, abstime=0x7fce47ffd220, expected=0, futex_word=0x7fce38005694) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  __pthread_cond_wait_common (abstime=0x7fce47ffd220, mutex=0x7fce380056e8, cond=0x7fce38005668) at pthread_cond_wait.c:539
#2  __pthread_cond_timedwait (cond=0x7fce38005668, mutex=0x7fce380056e8, abstime=0x7fce47ffd220) at pthread_cond_wait.c:667
#3  0x00007fce90e11b7a in srt_logging::LogDispatcher::Proxy::operator<< <char [84]> (arg=..., this=0x524) at ./srtcore/core.cpp:2745
#4  CUDT::interpretSrtHandshake (this=0x7fce41562360, hs=..., hspkt=..., out_data=0x7fce47ffd308, out_len=0x7fce38005668) at ./srtcore/core.cpp:2712

This exact same trace was repeated across multiple instances.

Upon investigation, this appeared to be a live loop. Furthermore, the code around frame 4 looks like this:

            if (m_bOPT_StrictEncryption)
            {
                m_RejectReason = SRT_REJ_UNSECURE;
                LOGC(mglog.Error, log << "HS KMREQ: Peer declares encryption, but agent does not - rejecting per strict requirement");
                return false;
            }

            LOGC(mglog.Error, log << "HS KMREQ: Peer declares encryption, but agent does not - still allowing connection.");

Intrigued, we then checked the value of m_bOPT_StrictEncryption:

(gdb) f 4
#4  CUDT::interpretSrtHandshake (this=0x7fce41562360, hs=..., hspkt=..., out_data=0x7fce47ffd308, out_len=0x7fce38005668) at ./srtcore/core.cpp:2712
(gdb) p m_bOPT_StrictEncryption
$3 = false

This is despite never touching this value, and despite the documentation at https://github.com/Haivision/srt/blob/master/docs/API.md:

| OptName | Since | Binding | Type | Units | Default | Range |
| --- | --- | --- | --- | --- | --- | --- |
| SRTO_ENFORCEDENCRYPTION | 1.3.2 | pre | int (bool) | | true | false |

[SET] - This option enforces that both connection parties have the same passphrase set (including empty, that is, with no encryption), or otherwise the connection is rejected.
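
Side note: the documented default can also be read back through the public API on a freshly created socket; a minimal sketch (no error handling), just to double-check the value at runtime:

    #include <srt/srt.h>
    #include <cstdio>

    int main()
    {
        srt_startup();

        SRTSOCKET s = srt_create_socket();

        // Read SRTO_ENFORCEDENCRYPTION back; per the docs this should be 1 (true) by default.
        int enforced = -1;
        int len = sizeof enforced;
        srt_getsockopt(s, 0, SRTO_ENFORCEDENCRYPTION, &enforced, &len);
        std::printf("SRTO_ENFORCEDENCRYPTION default = %d\n", enforced);

        srt_close(s);
        srt_cleanup();
        return 0;
    }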

It looks like the value is set to true as expected, except possibly at line 336:

   m_bOPT_StrictEncryption = ancestor.m_bOPT_StrictEncryption;

I would have pushed a PR instead of an issue but I'm not sure what to do with this constructor case.

As a side note, there's another inconsistent spot in the code, in srtcore/core.h at line 566:

    bool m_bOPT_StrictEncryption;    // Off by default. When on, any connection other than nopw-nopw & pw1-pw1 is rejected.


maxsharabayko commented 4 years ago

Hi @toots, thanks for reporting. Do you think you could add or modify a test case to catch this issue? Tests related to enforced encryption are in test/test_enforced_encryption.cpp. The existing tests should cover what you describe, and they succeed, so probably some cases are missing.

The assignment m_bOPT_StrictEncryption = ancestor.m_bOPT_StrictEncryption; is applied to the accepted socket, where ancestor should correspond to the listener socket. So if the listener socket has m_bOPT_StrictEncryption enabled (as it is by default), it should be enabled on the accepted socket as well.
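
Something along these lines could be a starting point; this is a rough standalone sketch of the reported scenario (arbitrary port 5555, POSIX socket headers, not reusing the existing fixtures), not a drop-in test:

    #include <gtest/gtest.h>
    #include <srt/srt.h>
    #include <arpa/inet.h>   // POSIX; adjust for Windows
    #include <cstring>

    // Listener keeps the default SRTO_ENFORCEDENCRYPTION=true and no passphrase;
    // the caller sets a passphrase. The handshake should therefore be rejected.
    TEST(EnforcedEncryption, DefaultRejectsPassphraseMismatch)
    {
        ASSERT_GE(srt_startup(), 0);

        SRTSOCKET listener = srt_create_socket();
        SRTSOCKET caller   = srt_create_socket();

        sockaddr_in sa;
        std::memset(&sa, 0, sizeof sa);
        sa.sin_family = AF_INET;
        sa.sin_port = htons(5555);                       // arbitrary test port
        sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        ASSERT_NE(srt_bind(listener, (sockaddr*)&sa, (int)sizeof sa), SRT_ERROR);
        ASSERT_NE(srt_listen(listener, 1), SRT_ERROR);

        const char pass[] = "0123456789";                // >= 10 characters as required
        ASSERT_NE(srt_setsockopt(caller, 0, SRTO_PASSPHRASE, pass, (int)(sizeof pass - 1)), SRT_ERROR);

        // With enforced encryption on the listener (by default), the mismatch
        // (passphrase on caller, none on listener) must make the connection fail.
        EXPECT_EQ(srt_connect(caller, (sockaddr*)&sa, (int)sizeof sa), SRT_ERROR);

        srt_close(caller);
        srt_close(listener);
        srt_cleanup();
    }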

ethouris commented 4 years ago

This is something extremely weird.

There's no possible traceable call path from LogDispatcher::Proxy::operator<< to pthread_cond_timedwait. The former does call something that refers to pthreads: PrintLogLine, which then calls SendLogLine, which locks one mutex referring to the log configuration. But there's no condition variable in use in the logging system at all.

As for m_bOPT_StrictEncryption, it is set to true by default. (If there's a comment that declares something else, it's probably misleading; indeed, this option was false by default in some old version, but that was changed.) The line with ancestor simply propagates the option from the listener socket (where it's still true by default) to an accepted socket. So if you didn't set this option to false explicitly, it can't have any other value than true.

I don't know why you have these results in gdb, but there's something completely messed up there.

toots commented 4 years ago

Hi all, thanks for reviewing!

The pthread_cond_timedwait most likely comes from our own log callback implementation, so nothing is wrong there.

I'm not sure why m_bOPT_StrictEncryption was set to false, but it was repeatable and consistent across instances. Also, I believe we've seen a similar backtrace without our custom log handlers.

Another interesting fact is that most instances usually become unstuck -- at least for a short while -- after releasing the gdb hold. The same happens when sending a SIGINT to the process, I think (@mbugeia, correct?). This seems to indicate at least some interplay with signals, just as mentioned in https://github.com/Haivision/srt/issues/787

I will try to reproduce this one (#787 is still only a minor issue for us). It's a bit tricky because the implementation is buried in a big application, and also because of general busyness, but I'll do my best.

Thanks for the hard work guys!

mbugeia commented 4 years ago

> Another interesting fact is that most instances usually become unstuck -- at least for a short while -- after releasing the gdb hold. The same happens when sending a SIGINT to the process, I think (@mbugeia, correct?). This seems to indicate at least some interplay with signals, just as mentioned in #787

Releasing the gdb hold is different from sending a SIGINT (which stops the application). But I confirm that releasing gdb (probably a SIGCONT?) does unstick the application in our case.

toots commented 4 years ago

Thinking about this more, there's another interesting point: the code where this gets triggered reports a discrepancy in encryption configuration between the client and server sockets, but we never set encryption on either end.

This all sounds like some kind of memory corruption. Furthermore, this seems to happen on quick disconnection/reconnection on our end, and I believe this can lead to the listener socket being recreated as well.

Given that the accepted socket can inherit its properties from the listener socket, could there be a race condition between the cleanup of a listener socket and the instantiation of an accepted socket?
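
For context, the kind of sequence I mean is roughly the following; a simplified sketch, not our actual code, with an illustrative helper name and error handling omitted:

    #include <srt/srt.h>
    #include <arpa/inet.h>   // POSIX; adjust for Windows
    #include <cstring>

    // Close the current listener and immediately recreate one on the same port.
    // A handshake arriving in between may still be dispatched to the old
    // (closing) listener, which is the window I suspect.
    static SRTSOCKET recreate_listener(SRTSOCKET old_listener, int port)
    {
        if (old_listener != SRT_INVALID_SOCK)
            srt_close(old_listener);                     // old listener starts closing here

        sockaddr_in sa;
        std::memset(&sa, 0, sizeof sa);
        sa.sin_family = AF_INET;
        sa.sin_port = htons(port);
        sa.sin_addr.s_addr = htonl(INADDR_ANY);

        SRTSOCKET fresh = srt_create_socket();
        srt_bind(fresh, (sockaddr*)&sa, (int)sizeof sa); // rebinding the same port right away
        srt_listen(fresh, 5);
        return fresh;
    }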

ethouris commented 4 years ago

If you are talking about cleanup in the listener socket, which normally happens when it's being closed, something like that could possibly happen. But it would still be a valid listening socket, still existing, just no longer capable of accepting new connections, for a short time while all remaining operations finish. There's no cleanup that would set all fields to zero or anything like that.

BTW, note that using any condition-waiting function inside the log handler is an extremely bad idea. This way you can influence time dependencies within the whole library and change its behavior for the worse. If you can't handle the log request immediately, the log line should either be stored in some queue, from which it is then slowly picked up, or dropped.
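
For illustration, a handler of roughly this shape would satisfy that; just a sketch: it copies the line into a bounded queue under a short lock, drops it when the queue is full, and leaves the slow I/O to a separate thread:

    #include <srt/srt.h>
    #include <atomic>
    #include <chrono>
    #include <deque>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <cstdio>

    static std::mutex g_log_mx;
    static std::deque<std::string> g_log_q;     // bounded queue of formatted lines
    static const size_t MAX_QUEUED = 1024;
    static std::atomic<bool> g_log_run(true);

    // SRT log handler: must return quickly and must never wait on a condition variable.
    static void quick_log_handler(void* /*opaque*/, int level, const char* file,
                                  int line, const char* area, const char* message)
    {
        (void)level; (void)file; (void)line;
        std::string entry = std::string(area) + ": " + message;

        std::lock_guard<std::mutex> lk(g_log_mx);
        if (g_log_q.size() < MAX_QUEUED)        // drop instead of blocking when full
            g_log_q.push_back(std::move(entry));
    }

    // Consumer thread: drains the queue and does the slow I/O outside the library.
    static void log_consumer()
    {
        while (g_log_run.load())
        {
            std::deque<std::string> batch;
            {
                std::lock_guard<std::mutex> lk(g_log_mx);
                batch.swap(g_log_q);
            }
            for (const std::string& s : batch)
                std::fprintf(stderr, "%s\n", s.c_str());
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        }
    }

    int main()
    {
        srt_startup();
        std::thread consumer(log_consumer);
        srt_setloghandler(nullptr, &quick_log_handler);

        // ... application runs here ...

        srt_setloghandler(nullptr, nullptr);    // detach our handler before tearing down the consumer
        g_log_run = false;
        consumer.join();
        srt_cleanup();
        return 0;
    }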

toots commented 4 years ago

I am finally getting back to this! After looking at the code, here's the part that looks suspicious for a race condition: https://github.com/Haivision/srt/blob/master/srtcore/api.cpp#L958

      // Change towards original UDT: 
      // Leave all the closing activities for garbageCollect to happen,
      // however remove the listener from the RcvQueue IMMEDIATELY.
      // Even though garbageCollect would eventually remove the listener
      // as well, there would be some time interval between now and the
      // moment when it's done, and during this time the application will
      // be unable to bind to this port that the about-to-delete listener
      // is currently occupying (due to blocked slot in the RcvQueue).

      HLOGC(mglog.Debug, log << s->m_pUDT->CONID() << " CLOSING (removing listener immediately)");
      {
          CGuard cg(s->m_pUDT->m_ConnectionLock);
          s->m_pUDT->m_bListening = false;
          s->m_pUDT->m_pRcvQueue->removeListener(s->m_pUDT);
      }

It seems to me that there's a risk of a race condition here if a connection to the listener is already being processed before the listening socket is removed from the receiving loop.

I haven't been able to reproduce the exact issue, but by adding a usleep "at the wheel" and some debug logging, I was able to produce the following log:

19:23:50.788295 D:SRT.c: %264680785: CLOSING (removing from listening, closing CUDT)

**usleeping at the wheel before removing from RcvQueue..**

19:23:50.798402 D:SRT.c: @264680784: SOCKET broken, REMOVING FROM RCV QUEUE/MAP.
19:23:51.454229 D:SRT.c: GC: sleep 1 s
19:23:52.070049 D:SRT.c: INCOMING PACKET: BOUND=0.0.0.0:9000 TARGET=0 CONTROL: handshake size=48
19:23:52.070124 D:SRT.c: Got sockID=0 from 127.0.0.1:55966 - trying to resolve it as a connection request...
19:23:52.070161.N:SRT.c: PASSING request from: 127.0.0.1:55966 to agent:264680785
19:23:52.070190 D:SRT.c: processConnectRequest: received a connection request
19:23:52.070314 D:SRT.c: processConnectRequest: ... NOT. Rejecting because broken.
19:23:52.458608 D:SRT.c: GC/removeSocket: closing associated UDT %264680784
19:23:52.458700 D:SRT.c: GC/removeSocket: DELETING SOCKET %264680784
19:23:52.458883 D:SRT.c: GC: sleep 1 s
19:23:53.463007 D:SRT.c: GC: sleep 1 s

**Done sleeping before removing from RcvQueue..**

19:23:53.791872 D:SRT.c: %264680785: CLOSING (removing listener immediately)

264680784 is an old connected socket, so I wouldn't pay attention to that part. However, the log clearly shows that the connecting client was passed to the agent before the listener was removed from the receiving queue.

ethouris commented 4 years ago

Ok, I've checked things at the given location and it looks like there's a potential deadlock of the following form: the above-mentioned function puts a lock on two mutexes at a time.

As far as I have tried to trace the theoretical use of it, the locks should happen in this order. The call to setListener is another place involved, which happens through CUDT::setListenState, and CRcvQueue::worker_ProcessConnectionRequest is one that locks only the second one. It's hard to say, however, whether there could be a situation where the second one is locked and then m_ConnectionLock is taken; at least I can't see it directly. However, if this closing function happened to be called indirectly from there, it could indeed cause a deadlock. Not sure if that is the case.
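
To make the pattern explicit (a generic illustration only, not SRT's actual mutexes): if one path takes lock A and then lock B, while another path takes B and then needs A, the two threads can end up waiting on each other forever.

    #include <mutex>
    #include <thread>

    std::mutex A;   // think: something like m_ConnectionLock
    std::mutex B;   // think: the listener-related lock in the receiving queue

    void path_close()                        // e.g. the closing function
    {
        std::lock_guard<std::mutex> la(A);   // takes A first...
        std::lock_guard<std::mutex> lb(B);   // ...then B
    }

    void path_process_request()              // e.g. connection-request processing
    {
        std::lock_guard<std::mutex> lb(B);   // takes B first...
        std::lock_guard<std::mutex> la(A);   // ...then A -> possible deadlock
    }

    int main()
    {
        std::thread t1(path_close);
        std::thread t2(path_process_request);   // running both may hang: that's exactly the hazard
        t1.join();
        t2.join();
    }

Consistent lock ordering (or acquiring both locks atomically, e.g. with std::scoped_lock) removes that hazard; the thread logging mentioned below should show whether such an ordering actually occurs here.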

I have, however, something that can help track it down, but you'd have to merge one branch that I prepared just a moment ago: dev-thread-logging-logging (from my fork). It provides the ability to track every mutex/CV activity and record the order of locks and unlocks, so that you can trace deadlock cases. After merging these changes you have to configure again with the additional --enable-thread-logging option, then retry your test.

Can you help?

ethouris commented 4 years ago

Ok, in short: we are currently unable to reproduce it, but you can help us.

If you want, please merge the branch from the #1055 PR into your branch, then compile with --enable-thread-logging and rerun the test.

In the logs you should be able to see which thread most recently acquired the locks that are blocking the stuck thread, and which locks these are.

ethouris commented 4 years ago

@toots, when you have time, please reopen this and retry according to what was written above. The thread-logging feature had to be put aside for now; we might reinstate it at some point, but it won't be soon. The feature is still in that branch, so you might try to use it together with your changes, just note that it has now largely diverged from the latest master.

toots commented 4 years ago

Thank you, and thanks for the time spent looking at this issue. I'll let you know if I have more info or can reproduce it. So far, our setup is stable and we're happy with it.