Glimesh / janus-ftl-plugin

A plugin for the Janus WebRTC gateway to enable relaying of audio/video streams utilizing Mixer's FTL (Faster-Than-Light) protocol.
https://hayden.fyi/posts/2020-08-03-Faster-Than-Light-protocol-engineering-notes.html
GNU Affero General Public License v3.0
44 stars 11 forks source link

Possible fd data race: getaddrinfo calls can hang process #115

Closed haydenmc closed 3 years ago

haydenmc commented 3 years ago

Seeing hangs in production with the following stack:

(gdb) bt
#0  __libc_recvmsg (flags=0, msg=0x7fbd71ff9460, fd=24) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
#1  __libc_recvmsg (fd=fd@entry=24, msg=msg@entry=0x7fbd71ff9460, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:25
#2  0x00007fbe1210b24e in make_request (pid=1041291, fd=24) at ../sysdeps/unix/sysv/linux/check_pf.c:170
#3  __check_pf (seen_ipv4=seen_ipv4@entry=0x7fbd71ff96d2, seen_ipv6=seen_ipv6@entry=0x7fbd71ff96d3, in6ai=in6ai@entry=0x7fbd71ff96e0, in6ailen=in6ailen@entry=0x7fbd71ff96e8) at ../sysdeps/unix/sysv/linux/check_pf.c:329
#4  0x00007fbe120d49c9 in __GI_getaddrinfo (name=<optimized out>, name@entry=0x7fbd8411c318 "glimesh.tv", service=<optimized out>, hints=<optimized out>, hints@entry=0x7fbd71ff9c60, pai=pai@entry=0x7fbd71ff9c58) at ../sysdeps/posix/getaddrinfo.c:2305
[...]

Some advice from here indicates we might be double-closing some sockets, causing the recvmsg call to hang forever:

You need to find the bug in your application. Typically, it's something that closes the same socket descriptor twice.

application: closes socket glibc: creates netlink socket glibc: sends data on the socket application: closes the same socket again application: creates a new socket (for the same descriptor number) glibc: calls recvmsg on the socket

Looking through the code with @danstiner, one possible race we've identified is in FtlClient, which doesn't have any locking to protect closing of control/media sockets. In particular, the FtlClient::ConnectAsync(...) call will start a thread and may immediately call Stop(), resulting in closing of sockets even though the thread could have already closed them:

https://github.com/Glimesh/janus-ftl-plugin/blob/70799c0a37fb2a1b9d1917ad7691a8ed89ba06ab/src/FtlClient.cpp#L42-L59

danstiner commented 3 years ago

Possibly related to #112

danstiner commented 3 years ago

@haydenmc have we been seeing this anymore? I'm hoping #116 did fix this.

haydenmc commented 3 years ago

I haven't! I think we can close this as fixed. :)