axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

number of io_uring instances and its impact on parallelism and concurrency #495

Open kennthhz-zz opened 2 years ago

kennthhz-zz commented 2 years ago

Will more io_uring instances increase parallelism at the disk device level, or only increase concurrency? Is there any guideline on how many io_uring instances to create per device or per CPU? My understanding is that the number of instances only impacts concurrency, not throughput (by way of increased parallelism).

axboe commented 2 years ago

One io_uring instance can drive millions of requests, both submit and completion. It's not really a per-device or per-CPU thing; in general the recommendation is to avoid sharing a ring between threads if possible, since that requires serialization on the app side. Apart from that, you don't need multiple rings. For reference, the 13M IOPS/core numbers I generated were done using just 2 logical threads, with 1 ring per thread. Just 2 rings in total for that.
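
For illustration, a minimal sketch of the one-ring-per-thread pattern described above (the queue depth, thread count, and omitted I/O work are assumptions, not taken from the actual IOPS test app):

```c
/* Sketch: one io_uring per worker thread, so no ring is ever shared
 * and no app-side locking is needed. QUEUE_DEPTH and the thread count
 * are illustrative assumptions. */
#include <liburing.h>
#include <pthread.h>
#include <stdio.h>

#define QUEUE_DEPTH 256   /* assumed per-ring SQ/CQ size */

static void *worker(void *arg)
{
    struct io_uring ring;

    /* Each thread owns its own ring; nothing is shared between threads. */
    if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return NULL;
    }

    /* ... submit and reap I/O on this thread's private ring ... */

    io_uring_queue_exit(&ring);
    return NULL;
}

int main(void)
{
    pthread_t threads[2];   /* e.g. 2 logical threads, 1 ring each */

    for (int i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```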

kennthhz-zz commented 2 years ago

That makes sense. However, the polling of the CQ should happen in a different thread than SQ submission to avoid blocking the submission thread. Also, having 1 thread per vcore is better for cache locality. So wouldn't it be better to have 1 submission thread and 1 completion thread per vcore, and one ring per vcore? In essence, share nothing.

axboe commented 2 years ago

I suspect it depends on your use case. The way I wrote that test app, you'll generally run with QD X, and submit Y and reap Z requests at a time, where Y and Z are smaller than QD. It does mean that the device-seen queue depth will go as low as X - Z, but that's generally not a problem.

The submission side isn't blocked; it's just not running while we're reaping completions. If you share 2 threads on one vcore as well, then you do end up competing for CPU resources between the submitter and the completer.

So I suspect the answer is "it depends" :-)
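
A rough sketch of the submit-some / reap-some pattern described above: keep up to QD requests in flight, submit in batches of Y and reap up to Z completions per loop. The constants, fd, and buffer handling are illustrative assumptions, not details of the test app:

```c
/* Sketch of a submit/reap loop around a single ring. */
#include <liburing.h>

#define QD           128   /* target queue depth (X) */
#define SUBMIT_BATCH  32   /* submit Y at a time */
#define REAP_BATCH    32   /* reap Z at a time */

static void drive(struct io_uring *ring, int fd, char *buf)
{
    struct io_uring_cqe *cqes[REAP_BATCH];
    unsigned inflight = 0;

    for (;;) {   /* run forever for the sake of the sketch */
        /* Top up submissions while below the target queue depth. */
        while (inflight < QD) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            if (!sqe)
                break;                    /* SQ ring full */
            io_uring_prep_read(sqe, fd, buf, 4096, 0);
            inflight++;
            if (inflight % SUBMIT_BATCH == 0)
                io_uring_submit(ring);
        }
        io_uring_submit(ring);            /* flush any remainder */

        /* Reap a batch; in-flight depth may dip as low as X - Z. */
        unsigned got = io_uring_peek_batch_cqe(ring, cqes, REAP_BATCH);
        /* ... process cqes[0..got-1] here ... */
        io_uring_cq_advance(ring, got);
        inflight -= got;
    }
}
```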

kennthhz-zz commented 2 years ago

Sorry, I meant to say that completion can be blocked if I use a single thread per vcore. So I can use io_uring_peek_batch_cqe instead of io_uring_wait_cqe. The challenge of using a single thread per vcore for both submission and completion is how to arrange the task queue (two task types: submit and reap_complete). I need to insert the reap_complete task after inserting the submission task, but if it comes right after, the completion may not have finished by the time I execute reap_complete. So I need to insert another one further down the road. It can be done, but the app starts to play scheduler. Also, though the peek doesn't incur a syscall, if it comes up empty a lot it still costs CPU (like mindless polling). Does io_uring_wait_cqe internally do busy polling? I think not; it is notification-based, right? The notification is interrupt-based (non-IOPOLL mode). So io_uring_wait_cqe will block the thread without wasting CPU (apart from the initial syscall)?

axboe commented 2 years ago

Why do you think it will be blocked? Waiting for events in the kernel doesn't block new submissions. Waiting doesn't do busy polling; if you're entering the kernel, that's considered the slow path of event reaping.

Checking for events doesn't have to be busy polling, it can be done as needed. It's just a single memory read, seeing if there are new events.

I guess you mean that waiting for events will block? That's of course true, and it's the same as the example I gave higher up where you have to accept that blocking (or io-polling) for completions means that you don't submit at that time. If that's a concern for you, then just use two threads and split your submit and complete between them.

kennthhz-zz commented 2 years ago

I think your final paragraph captures what I am doing. I have a single thread per vcore (which also maps to a single ring), so my sqe tasks and cqe tasks are serialized onto a task queue; there is no blocking or sharing between threads/vcores. So if the cqe task is executing io_uring_wait_cqe, it will block. Anyway, this can be solved either single-threaded, by using the non-blocking peek API for completion and smartly positioning the completion task in the queue, or by using 2 threads. Now the hard part is knowing which one is better. I would think that if IO throughput is large, the peek might win out, because it won't waste CPU and won't incur the overhead of one extra thread.

Also, each call to io_uring_wait_cqe incurs a syscall, right, while io_uring_peek_cqe doesn't?

axboe commented 2 years ago

> Also, each call to io_uring_wait_cqe incurs a syscall, right, while io_uring_peek_cqe doesn't?

wait_cqe() will block if you ask for more events than are directly available at the time of checking. peek_cqe() will never enter the kernel; all it does is read the kernel CQ tail and see if we have an event available.
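
A small sketch contrasting the two calls (assuming the ring is already initialized and requests have been submitted): io_uring_peek_cqe() only reads the CQ ring from userspace, while io_uring_wait_cqe() enters the kernel and sleeps only when no completion is already available.

```c
#include <liburing.h>

static void reap_one(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    /* Non-blocking check: never enters the kernel. */
    if (io_uring_peek_cqe(ring, &cqe) == 0) {
        /* ... handle cqe->res ... */
        io_uring_cqe_seen(ring, cqe);
        return;
    }

    /* Nothing ready: may enter the kernel and sleep until one arrives. */
    if (io_uring_wait_cqe(ring, &cqe) == 0) {
        /* ... handle cqe->res ... */
        io_uring_cqe_seen(ring, cqe);
    }
}
```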

kennthhz-zz commented 2 years ago

Just to be clear, does io_uring_wait_cqe internally do a peek (which won't enter the kernel)? If at least one cqe is available, it will then enter the kernel to dequeue. If none is ready, it will block the thread without entering the kernel (and without busy waiting either; it will yield the thread). Is that true? Also, if I am using peek and then mark the cqe as seen, can I do completion without ever entering kernel mode?

axboe commented 2 years ago

Yes, if an event is available and you ask for one, wait_cqe will not enter the kernel. You never need to enter the kernel to dequeue, that's simply updating the cq head to mark it processed. If none are available, it'll enter the kernel and sleep for one.

> Also, if I am using peek and then mark the cqe as seen, can I do completion without ever entering kernel mode?

Correct
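
A sketch of draining completions entirely from userspace, per the answer above: io_uring_peek_cqe() sees new CQEs via a memory read, and io_uring_cqe_seen() just advances the CQ head, so no syscall is made on this path. The commented-out handler is an illustrative assumption.

```c
#include <liburing.h>

static unsigned drain_completions(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;
    unsigned handled = 0;

    /* Loop until the CQ ring is empty; never enters the kernel. */
    while (io_uring_peek_cqe(ring, &cqe) == 0) {
        /* cqe->res holds the request's result; user_data identifies it. */
        /* handle_completion(io_uring_cqe_get_data(cqe), cqe->res); */
        io_uring_cqe_seen(ring, cqe);   /* mark processed: bumps the CQ head */
        handled++;
    }
    return handled;
}
```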