Seeing hangs in production with the following stack:
(gdb) bt
#0 __libc_recvmsg (flags=0, msg=0x7fbd71ff9460, fd=24) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
#1 __libc_recvmsg (fd=fd@entry=24, msg=msg@entry=0x7fbd71ff9460, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:25
#2 0x00007fbe1210b24e in make_request (pid=1041291, fd=24) at ../sysdeps/unix/sysv/linux/check_pf.c:170
#3 __check_pf (seen_ipv4=seen_ipv4@entry=0x7fbd71ff96d2, seen_ipv6=seen_ipv6@entry=0x7fbd71ff96d3, in6ai=in6ai@entry=0x7fbd71ff96e0, in6ailen=in6ailen@entry=0x7fbd71ff96e8) at ../sysdeps/unix/sysv/linux/check_pf.c:329
#4 0x00007fbe120d49c9 in __GI_getaddrinfo (name=<optimized out>, name@entry=0x7fbd8411c318 "glimesh.tv", service=<optimized out>, hints=<optimized out>, hints@entry=0x7fbd71ff9c60, pai=pai@entry=0x7fbd71ff9c58) at ../sysdeps/posix/getaddrinfo.c:2305
[...]
Some advice from here indicates we might be double-closing some sockets, causing the recvmsg call to hang forever:
You need to find the bug in your application. Typically, it's something that closes the same socket descriptor twice.
application: closes socket
glibc: creates netlink socket
glibc: sends data on the socket
application: closes the same socket again
application: creates a new socket (for the same descriptor number)
glibc: calls recvmsg on the socket
Looking through the code with @danstiner, one possible race we've identified is in FtlClient, which doesn't have any locking to protect closing of control/media sockets. In particular, the FtlClient::ConnectAsync(...) call will start a thread and may immediately call Stop(), resulting in closing of sockets even though the thread could have already closed them:
Seeing hangs in production with the following stack:
Some advice from here indicates we might be double-closing some sockets, causing the recvmsg call to hang forever:
Looking through the code with @danstiner, one possible race we've identified is in
FtlClient
, which doesn't have any locking to protect closing of control/media sockets. In particular, theFtlClient::ConnectAsync(...)
call will start a thread and may immediately callStop()
, resulting in closing of sockets even though the thread could have already closed them:https://github.com/Glimesh/janus-ftl-plugin/blob/70799c0a37fb2a1b9d1917ad7691a8ed89ba06ab/src/FtlClient.cpp#L42-L59