romange opened this issue 2 months ago
They should reap eventually, but the putting is async, hence it can take a bit of time. Kernels newer than 6.2 will generally keep up better. I ran your example on 6.2 and it does reap as it goes, just a bit behind. Once I sleep after the loop, the rest goes pretty quickly.
How long should I wait?
We had a case in production where a server had hundreds of thousands of such sockets, which translated to gigabytes of memory in slab pages.
I added a sleep after the loop, but sockstat still shows over 100K sockets that do not go down.
It should not take that long, unless perhaps something is keeping everything busy on the server? Are you running mainline 6.2, or something else? 6.2-stable isn't a thing, it's been dead since May 2023. Or is it some weird distro thing?
I'd recommend going to 6.6-stable, as that's also an LTS release. This means that if there are fixes for io_uring (and other things) in 6.6, they will get fixed for years rather than be a dead end like 6.2 is.
@axboe please observe the run on 6.8 kernel (ubuntu 24.04).
https://github.com/user-attachments/assets/075f8385-d8fd-4933-ade3-23c70e0cbb8f
strace shows that the calls are consistent with my pseudo code:
....
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6) = 0
[pid 1808295] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000}, NULL) = 0
[pid 1808295] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6) = 0
[pid 1808295] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000}, NULL) = 0
[pid 1808295] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6) = 0
[pid 1808295] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000}, NULL) = 0
[pid 1808295] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6) = 0
[pid 1808295] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000}, NULL) = 0
[pid 1808295] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6) = 0
[pid 1808295] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000}, NULL) = 0
[pid 1808295] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6) = 0
....
I'll take a look. Can you share your .config?
And also please include details on the system in question, mostly just how many cores you have (and arch).
@axboe I can confirm that your code does not reproduce the problematic behavior. I will continue investigating until I find the differentiator that triggers it.
Seems the problem is on my side. Again!!! 🤦🏼♂️
Sorry for wasting your time. On the bright side, it seems that 6.2 is working perfectly well!!!
All good! But would you mind mentioning what the issue was on your side? Might help others that run into the same thing. Always nice to have some sort of conclusion to an issue filed.
I couldn't identify the root cause in my previous comment, but I have now. I'm pasting an example that reproduces the problem. It seems another socket is needed to trigger the leak of socket descriptors.
#include <stdio.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <assert.h>
#include <liburing.h>
#include <pthread.h>
#include <sched.h>
#include <malloc.h>
#include <poll.h>
#include <string.h>
#include <netinet/in.h>
int main(int argc, char *argv[])
{
	int fd, res, i, unreg_id, done = 0;
	struct io_uring ring;
	int listen_fd;
	struct sockaddr_in server_addr;
	struct io_uring_sqe *sqe;

	io_uring_queue_init(32, &ring, 0);
	io_uring_register_files_sparse(&ring, 2);

	listen_fd = socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP);
	res = io_uring_register_files_update(&ring, 0, &listen_fd, 1);
	assert(res == 1);

	memset(&server_addr, 0, sizeof(server_addr));
	server_addr.sin_family = AF_INET;
	server_addr.sin_port = htons(6666);
	server_addr.sin_addr.s_addr = INADDR_ANY;
	res = bind(listen_fd, (struct sockaddr *)&server_addr, sizeof(server_addr));
	assert(res == 0);
	res = listen(listen_fd, 32);
	assert(res == 0);

	// accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);

	// If I remove the part below - there is no leakage.
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_poll_add(sqe, 0, POLLIN);
	sqe->flags |= IOSQE_FIXED_FILE;
	io_uring_submit(&ring);
	///////////////////////

	for (i = 0; i < 10000; i++) {
		fd = socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP);
		assert(fd >= 0);

		// register
		res = io_uring_register_files_update(&ring, 1, &fd, 1);
		assert(res == 1);

		// unregister
		unreg_id = -1;
		res = io_uring_register_files_update(&ring, 1, &unreg_id, 1);
		done++;
		assert(res == 1);
		close(fd);
	}

	printf("registered %d sockets\n", done);
	sleep(60);
	io_uring_queue_exit(&ring);
	return 0;
}
That's what I suspected, I did manage to make a reproducer for this as well. They are freed in order, but stop if one isn't available to free. Let me poke a bit, don't think this needs to be the case.
Ah yours looks different - you are still holding the file open, because you have that poll operation pending. You'll want to cancel the operation(s) on that socket before the file will be fully closed. So don't think your reproducer is valid, it just shows leaving a reference to that fd/socket via a pending operation will prevent it from being closed.
But I think it's a combination of the two - you have the one that prevents it from being closed, and that'll prevent the others too. That is certainly not right, and this is the issue I mentioned in the previous-to-last comment.
But I do not want to close listen_fd. It's not closed by design, and it uses direct fd index 0.
I want to close the other 10000 sockets that come afterwards and occupy index 1.
Oh I agree, your example is way better than mine as it was all just the same. So just to make it clear, I agree this is an issue, it should not hold up other nodes from being freed.
I am just happy that I could be helpful this time! (even a broken clock is right twice a day)
In terms of a work-around, it may be a good idea to have long-standing things (like a listen) NOT be fixed files, at least for now. There should be limited utility, if any, in having those fixed, unless you have a high frequency of connections coming in. This doesn't totally solve the issue, as even homogeneous request types may have drastically different lifetimes, and one switch of nodes (e.g. when you update to -1) will potentially block if chronologically earlier entries are still active.
We will certainly fix this, but it may require a bit of work and hence isn't necessarily something that can go to a stable kernel in a short amount of time.
Yeah, that won't fly. We have high-frequency connection use cases, so I cannot rely on that for sockets. But it's fine - we have not been using direct fds so far, so it's not a blocker for us.
It seems OK for files though, and we have a big feature coming up related to SSD tiering. I am going to use direct fds there.
For finite lifetime requests, like any storage IO would be, then direct descriptors would always be fine. It's only these potentially infinite or long lived requests that are problematic, if you get stuck behind reclaim of them. Still pondering a way to make that better, but it's going to be a 6.12 thing, at the earliest.
The pseudo code below constantly increases the "sockets: used" metric in /proc/net/sockstat. This reproduces only when registering direct file descriptors (on kernel 6.2.0). What's interesting is that the number of file descriptors does not grow. Killing the process restores the socket counts in /proc/net/sockstat to a normal level.