axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

io_uring_queue_init_params() fails due to resource exhaustion/constraints? #1088

Open markpapadakis opened 4 months ago

markpapadakis commented 4 months ago

Attempting to initialize multiple rings via e.g. io_uring_queue_init_params(), setting the CQ size via IORING_SETUP_CQSIZE and cq_entries, fails depending on the number of rings and the size of the queues. E.g. with 16k cq_entries, this fails at ~4 io_urings per process.

Is there some per-process rlimit that prevents having more rings with large queues perhaps?

isilence commented 4 months ago

The CQ/SQ rings are contiguous in physical memory, so it might be hard for the kernel to allocate large rings, and the bigger they are the worse it gets. We should fall back to vmalloc in case of failure.

spikeh commented 4 months ago

Why is there a requirement for physically contiguous memory? Is it for performance reasons only, or are there other technical limitations?

isilence commented 4 months ago

Just convenience / performance. We can add a fallback and be less strict for large rings.

FWIW, it's not confirmed that this is the problem, 16K is not that much. @markpapadakis, do you have a repro? What error code does it return?

isilence commented 4 months ago

Well, a reproducer likely wouldn't be much help; it would depend on the system hw / config.

markpapadakis commented 4 months ago

Apologies for taking this long to respond.

I am getting -ENOMEM. This is a 32GB machine that's used almost exclusively for building (compiling) projects. It "shouldn't" have failed.

When setting cq_entries to 48k, it failed even for a single io_uring instance per process (irrespective of the SQ size specified in the io_uring_queue_init_params() call).

E.g. this fails with -ENOMEM for a single io_uring:

{
        struct io_uring        ring;
        struct io_uring_params params;

        memset(&params, 0, sizeof(params));
        params.flags      = IORING_SETUP_CQSIZE;
        params.cq_entries = 64 * 1024;

        if (const auto e = io_uring_queue_init_params(4096 /* SQ size */, &ring, &params); e) {
                fmt::print("failed {}", strerror(-e));
        }
}

Maybe there should be an io_uring_params flag for falling back to vmalloc() as you suggested?

Thank you

isilence commented 4 months ago

I missed that it's not 16KB but 16K entries = 256KB. That's most likely to fail, we'll take a look at what can be done.

On a side note, it's quite a lot of entries, I wonder why you would even need that much?

markpapadakis commented 4 months ago

@isilence I was experimenting, really. We operate an ads tech stack and need to support 100k RPS (many thousands of connections), so we have a reactor design where multiple OS threads multiplex I/O. The problem is that sometimes the SQ is not large enough, and can't be made large enough without hitting that limit, so submitting new SQEs becomes a bit involved because of LINK semantics. It's not a big deal in practice, I was just thinking about simplifying the SQE dispatcher.

BTW, if I may ask, what's the optimal way to "stream" file contents over a socket? Two pipes and splice with SPLICE_F_MOVE | SPLICE_F_NONBLOCK (file fd => pipe => socket fd)?

YoSTEALTH commented 4 months ago

I don't get why you have 4096 for entries when the max is 32768. Even if you are getting 100k RPS, with io_uring you should be able to submit multiples of 32768 within that second.

markpapadakis commented 4 months ago

@YoSTEALTH that's 100K, not 100.

YoSTEALTH commented 4 months ago

@markpapadakis typo, I was just about to edit it :p

isilence commented 4 months ago

> @isilence I was experimenting, really. We operate this ads tech stack, and we need to support 100k RPS (many thousands of connections), and so we have a reactor design where we have multiple OS threads multiplexing I/O.

That sounds exciting

> The problem is that sometimes the SQ is not large enough, and can't be made large enough without hitting that limit, so submitting new SQEs becomes a bit involved because of LINK semantics. It's not a big deal in practice, just was thinking about simplifying the dispatcher of SQEs.

I think the discussion above was about the CQ. I asked out of curiosity; I can imagine needing a large CQ, but I'm not sure how much of a problem there is for the SQ. What is your max link length? 128 should be more than enough for performance (in terms of syscall batching). And for the link problem, before you start assembling a link you can probably do the following:

if (io_uring_sq_space_left(&ring) < link_size) {
    io_uring_submit(&ring); // flush queued sqes, etc.
}

Does it work for your app?

markpapadakis commented 4 months ago

@isilence Yeah, that's what we do. If the SQ can't hold enough SQEs, we submit and drain the CQ; if there is still not enough room, we track them elsewhere (a vector) and move them back to the SQ when it can accommodate as many as we need. It works, but I wanted to understand the queue size issue (which I now do).

isilence commented 4 months ago

@markpapadakis, the idea is that the SQ is relatively small and there shouldn't be problems allocating it, whereas the CQ might need to be large. Anyway, let me try to improve the allocations.