axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Worker affinity does not work as expected with isolated CPUs #1017

Open noop-dev opened 8 months ago

noop-dev commented 8 months ago

When the affinity mask for worker threads (set via io_uring_register_iowq_aff) includes isolated CPUs, the workers never seem to use more than one isolated CPU, or they avoid the isolated CPUs entirely if the set also includes non-isolated CPUs. This behaviour is also affected by the affinity of the main (or SQPOLL) thread. It happens regardless of the ASYNC and SQPOLL flags, though without ASYNC we usually need to wait until some workers have been created. My test example uses file writes, but I originally observed this behaviour with socket send / write / send_zc operations. Tested with kernels up to 6.6.3 (AWS Linux & Fedora on an AWS instance). The example and a more detailed description are here: https://github.com/noop-dev/io-uring-affinity-example
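For readers not familiar with the API: the setup in question looks roughly like the sketch below. This is not the reporter's exact test (see the linked repository for that); it is a minimal sketch assuming a machine booted with, e.g., isolcpus=2,3, so the registered mask mixes isolated and non-isolated CPUs.

```c
/*
 * Minimal sketch (not the reporter's exact test; see the linked repo).
 * Registers an io-wq worker affinity mask that mixes isolated and
 * non-isolated CPUs, assuming a boot with isolcpus=2,3.
 */
#define _GNU_SOURCE
#include <liburing.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    cpu_set_t mask;
    int ret;

    ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %d\n", ret);
        return 1;
    }

    /* CPUs 0-3; assume 2 and 3 are isolated on this box */
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu < 4; cpu++)
        CPU_SET(cpu, &mask);

    /* Restrict io-wq workers to the mask above */
    ret = io_uring_register_iowq_aff(&ring, sizeof(mask), &mask);
    if (ret < 0)
        fprintf(stderr, "register_iowq_aff: %d\n", ret);

    /* ... submit writes / sends here and observe which CPUs the
       iou-wrk-* threads actually run on (e.g. via ps -eLo pid,psr,comm) ... */

    io_uring_queue_exit(&ring);
    return 0;
}
```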

I also observed something that looks like race condition / memory ordering issues on certain VMs and configs, but this can wait until later.

axboe commented 8 months ago

Can confirm I see the same here as you report. I added a bunch of debugging and all the masking is done correctly: both sqpoll and io-wq workers end up getting the correct mask set, as specified by the application, and none of them ever run on CPUs that are NOT in the mask given. But I do see weirdness when isolcpus= is used, which I can only chalk up to scheduling. E.g. if io-wq is asked to run on CPUs 0-3 and we have 0-1 isolated, then only CPU 0 is being used. Ditto for sqpoll.

andymalakov commented 8 months ago

Hello Jens, the idea is to bind a pool of N workers to N isolated cores with busy waiting (common practice for latency-sensitive applications). Currently, the library doesn't seem to optimize for a configuration with one worker per CPU. If the library could detect such setups and adjust the CPU mask of each worker so that it exclusively uses its dedicated isolated CPU, it would greatly improve performance.
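For the SQPOLL thread specifically, there is already a knob for this kind of dedicated-core setup: IORING_SETUP_SQ_AFF together with the sq_thread_cpu field of io_uring_params pins the poller to a single CPU. A minimal sketch, assuming CPU 2 is one of the isolated cores; note it does not help with io-wq workers, which share a single mask:

```c
/*
 * Sketch: pin the SQPOLL kernel thread to one dedicated CPU
 * (assumed here to be isolated CPU 2). io-wq workers are not
 * covered by this; they still share the single io-wq mask.
 */
#include <liburing.h>
#include <string.h>

static int setup_pinned_sqpoll(struct io_uring *ring)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
    p.sq_thread_cpu = 2;       /* assumed isolated CPU */
    p.sq_thread_idle = 2000;   /* ms of idle before the poller sleeps */

    return io_uring_queue_init_params(64, ring, &p);
}
```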

axboe commented 8 months ago

I don't think the library can handle this. The kernel scheduler doesn't seem to spread the love if the mask has isolated CPUs in it, and things like io-wq only have a single mask for all workers. The kernel could probably improve this by having the workers round-robin the CPUs at setup time if the mask includes isolated CPUs.
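As an illustration only of what "round-robin the CPUs at setup time" could mean (this is user-space pseudocode, not the actual io-wq kernel code): each new worker would pick the next set CPU from the registered mask in turn, rather than all workers inheriting the full mask.

```c
/*
 * Illustration only: a possible "round-robin at setup time" policy.
 * User-space pseudocode, not io-wq kernel code.
 */
#define _GNU_SOURCE
#include <sched.h>

/* Pick the n-th set CPU (wrapping) from the registered mask. */
static int pick_worker_cpu(const cpu_set_t *mask, int worker_idx)
{
    int set = CPU_COUNT(mask);
    int target, seen = 0;

    if (set == 0)
        return -1;              /* empty mask */

    target = worker_idx % set;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, mask)) {
            if (seen == target)
                return cpu;
            seen++;
        }
    }
    return -1;
}
```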

Though for most fast workloads, you really should not expect to see a lot of io-wq activity. I guess some may use it as a thread pool explicitly with IOSQE_ASYNC, so that may be the use case in question here.
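For reference, forcing a request to be punted to io-wq (the explicit thread-pool usage mentioned above) looks roughly like this; a minimal sketch where the ring is assumed to be already initialized and fd/buf/len are caller-supplied placeholders:

```c
/*
 * Sketch: force a write to be punted to io-wq with IOSQE_ASYNC,
 * using the worker pool as an explicit thread pool. The ring is
 * assumed initialized; fd, buf and len are placeholders.
 */
#include <errno.h>
#include <liburing.h>

static int submit_async_write(struct io_uring *ring, int fd,
                              const void *buf, unsigned len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return -EBUSY;                            /* SQ ring full */

    io_uring_prep_write(sqe, fd, buf, len, 0);
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);     /* skip inline issue, punt to io-wq */
    return io_uring_submit(ring);
}
```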

noop-dev commented 8 months ago

Well, in that case, at the very least the io_uring documentation should clearly state that the io_uring_register_iowq_aff API is incompatible with CPU isolation.

axboe commented 8 months ago

In case I wasn't clear, this isn't an io_uring bug, and nobody has reported issues with isolcpus before. In other words, this is all news to me and others. My best guess at what is happening is that the scheduler masks out CPUs that are isolated, which means that if you start on the first CPU in the mask, you'll never get migrated to any other CPU in the mask, as they are isolated. Hence you stick to the one you are on.

This is of course not ideal, but there's also not (to me, at least) a clear answer to what should happen here. You can argue either one of:

1) io_uring should not be able to bind to isolated CPUs

2) io_uring should be able to use isolated CPUs if asked to, and we'd then expect all CPUs in the mask to be utilized, regardless of whether they are isolated or not

Since isolated CPUs are a system property, the first option seems a lot saner to me, unless the user has privileges that indicate otherwise. Which would suggest that perhaps the right solution is option 3: basically option 2, but only allowing the use of isolated CPUs if the user is privileged enough to do so.

In any case, making scheduling possible on isolated CPUs still remains an issue. I'll have to poke a bit to see what is possible there; we do allow it for other system resources.