LibreQoE / LibreQoS

A Quality of Experience and Smart Queue Management system for ISPs. Leverage CAKE to improve network responsiveness, enforce bandwidth plans, and reduce bufferbloat.
https://libreqos.io/
GNU General Public License v2.0

lqos_node_manager spins in futex #261

Open dtaht opened 1 year ago

dtaht commented 1 year ago

It looks like the socket fd went away and the process is not handling EAGAIN properly, so instead of sleeping on the futex or the epoll it loops. The loop is not rapid, however, and it does seem to stop a few minutes after the client goes away.

getpeername(484, {sa_family=AF_INET6, sin6_port=htons(52938), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_scope_id=0}, [128 => 28]) = 0
futex(0x7fc61ec09988, FUTEX_WAKE_PRIVATE, 1) = 1
accept4(9, {sa_family=AF_INET6, sin6_port=htons(52950), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_scope_id=0}, [128 => 28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 485
epoll_ctl(5, EPOLL_CTL_ADD, 485, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=2013265922, u64=2013265922}}) = 0
setsockopt(485, SOL_TCP, TCP_NODELAY, [1], 4) = 0
getpeername(485, {sa_family=AF_INET6, sin6_port=htons(52950), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_scope_id=0}, [128 => 28]) = 0
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
accept4(9, {sa_family=AF_INET6, sin6_port=htons(52966), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_scope_id=0}, [128 => 28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 486
epoll_ctl(5, EPOLL_CTL_ADD, 486, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=1375731718, u64=1375731718}}) = 0
setsockopt(486, SOL_TCP, TCP_NODELAY, [1], 4) = 0
getpeername(486, {sa_family=AF_INET6, sin6_port=htons(52966), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_scope_id=0}, [128 => 28]) = 0
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
accept4(9, 0x7ffe157f9900, [128], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fc61ec0a2b0, FUTEX_WAIT_BITSET_PRIVATE, 10810, NULL, FUTEX_BITSET_MATCH_ANY) = 0
accept4(9, {sa_family=AF_INET6, sin6_port=htons(51782), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_scope_id=0}, [128 => 28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 474
epoll_ctl(5, EPOLL_CTL_ADD, 474, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=2046820377, u64=2046820377}}) = 0
setsockopt(474, SOL_TCP, TCP_NODELAY, [1], 4) = 0
getpeername(474, {sa_family=AF_INET6, sin6_port=htons(51782), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_scope_id=0}, [128 => 28]) = 0
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
accept4(9, 0x7ffe157f9900, [128], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fc61ec0a2b0, FUTEX_WAIT_BITSET_PRIVATE, 10811, NULL, FUTEX_BITSET_MATCH_ANY

dtaht commented 1 year ago

lqos@lqos:/opt/libreqos/src/rust$ sudo strace -p 1895
strace: Process 1895 attached
futex(0x7f6981e181e0, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
accept4(48, {sa_family=AF_UNIX}, [110 => 2], SOCK_CLOEXEC|SOCK_NONBLOCK) = 50
epoll_ctl(5, EPOLL_CTL_ADD, 50, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=1325400065, u64=1325400065}}) = 0
futex(0x7f69807fd4e0, FUTEX_WAKE_PRIVATE, 1) = 1
accept4(48, 0x7ffe82d444f0, [110], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)

This is NOT a high-priority bug. I carry a personal scar from mishandling EAGAIN when deploying a new (Java-based) webserver. Under a production workload it leaked sockets at a slow rate and spun ever more, exactly like this, until it ran out of sockets. An individual instance would crash after about 3 hours and need to be restarted. Tracing it back to where it leaked the socket took some effort.
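For reference, here is a minimal sketch of the accept pattern the traces suggest: keep calling accept on a non-blocking listener until it reports EAGAIN, then go back to sleep. It uses only the Rust standard library and is not taken from lqos_node_manager, whose accept loop presumably lives inside its async runtime. The function name `drain_accept_queue` is hypothetical.

```rust
use std::io::{self, ErrorKind};
use std::net::{TcpListener, TcpStream};

/// Hypothetical helper: accept every connection currently queued on a
/// non-blocking listener. Returns once accept() reports WouldBlock
/// (EAGAIN), meaning the queue is empty and the caller should go back
/// to waiting for readiness instead of calling accept() again.
fn drain_accept_queue(listener: &TcpListener) -> io::Result<Vec<TcpStream>> {
    let mut accepted = Vec::new();
    loop {
        match listener.accept() {
            Ok((stream, _peer)) => {
                // If per-socket setup fails, `stream` is dropped at the end
                // of this arm, which closes its fd rather than leaking it.
                if stream.set_nodelay(true).is_ok() {
                    accepted.push(stream);
                }
            }
            // EAGAIN: nothing left to accept right now. Not an error.
            Err(e) if e.kind() == ErrorKind::WouldBlock => return Ok(accepted),
            // Anything else is a real failure; surface it to the caller.
            Err(e) => return Err(e),
        }
    }
}
```

A caller would run `listener.set_nonblocking(true)` once, then call this each time the listener is reported readable. Assuming the listener is registered edge-triggered, like the EPOLLET flags on the accepted sockets in the traces, draining until EAGAIN matters because readiness is only signalled again when new connections arrive. Every returned `TcpStream` closes its fd when dropped, so the slow leak described above only happens if something holds on to a stream for a client that has already gone away.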

dtaht commented 1 year ago

And several hundred servers in the total deployment, simultaneously crashing every few minutes.