axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Too many worker threads created when setting cpu affinity of io worker #976

Closed richael02 closed 3 weeks ago

richael02 commented 1 year ago

Hi, I have an application that creates an io_uring instance and processes the SQ and CQ on cpu0 (neither an SQ thread nor IO poll mode is enabled). The kernel I use is 6.5-rc4. If I don't set the CPU affinity of the io workers, three worker threads are created in total and they all run on cpu0. But if I use io_uring_register_iowq_aff to set the workers' CPU affinity to another CPU, more than 400 io workers are created. Creating so many threads adds overhead.
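For readers who want to reproduce the setup, here is a minimal sketch of how the affinity mask might be built. The helper name build_worker_mask and the choice of a single target CPU are assumptions for illustration, not taken from the reporter's application.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper (not part of liburing): fill `mask` with exactly one
 * CPU, the CPU that the io workers should be pinned to.
 */
static void build_worker_mask(cpu_set_t *mask, int worker_cpu)
{
    /* Guard against CPU numbers that do not fit in a fixed-size cpu_set_t. */
    assert(worker_cpu >= 0 && worker_cpu < CPU_SETSIZE);

    CPU_ZERO(mask);            /* start from an empty set */
    CPU_SET(worker_cpu, mask); /* allow only the chosen CPU */
}
```

On an already initialized ring, the resulting mask would then be passed as io_uring_register_iowq_aff(&ring, sizeof(mask), &mask).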

I read part of the code. A new io worker should be created in the following two cases:
a. in io_wq_enqueue(), it tries to find and activate a free worker; if no free worker is activated and nr_running is 0, a new worker is created;
b. in io_wq_dec_running() (called when a worker thread is going to sleep or exit), if nr_running is 0 and there is work to process, a new worker is created.
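The two cases above can be written down as a pair of predicates. This is only a toy model of the description in this comment, not the kernel code: struct wq_state and both function names are made up for illustration, and the real logic in io-wq involves locking and per-account state that is left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical snapshot of the state the two checks look at. */
struct wq_state {
    int  nr_running;     /* workers currently running (not sleeping) */
    bool free_activated; /* did waking up a free worker succeed? */
    bool work_pending;   /* is there queued work left to process? */
};

/* Case a: the enqueue path creates a worker only if no free worker could
 * be activated and nothing is running. */
static bool enqueue_creates_worker(const struct wq_state *s)
{
    assert(s->nr_running >= 0);
    return !s->free_activated && s->nr_running == 0;
}

/* Case b: a worker going to sleep or exiting creates a replacement only if
 * it was the last running worker and work is still queued. Here nr_running
 * is the value after the decrement. */
static bool dec_running_creates_worker(const struct wq_state *s)
{
    assert(s->nr_running >= 0);
    return s->nr_running == 0 && s->work_pending;
}
```

In this model, both predicates depend on nr_running being 0, so any situation where running workers frequently block would satisfy them repeatedly.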

It seems that after setting the CPU affinity of the worker threads, condition a or b is triggered easily even though many workers have already been created.

I'm not familiar with the io_uring implementation in the kernel. Could you help explain this or give some suggestions?

axboe commented 1 year ago

What does your workload look like? In general, you really don't want that many active, as you note, and I'm wondering what you're doing that makes this happen.

richael02 commented 1 year ago

Hi @axboe, I'm running an SPDK target on cpu0, exposing one SPDK malloc bdev as a ublk device, and running an fio randwrite test on the ublk device (fio runs on other CPUs). When writing to the ublk device, SPDK uses the io_uring API ...readv to copy data to the SPDK buffer; these requests should be processed by io workers.

By default the io workers run on cpu0 too, and I observed only two io workers. But this brought many context switches. So I used the io_uring_register_iowq_aff API to set the workers' CPU affinity to another CPU, to see what the performance difference is when the context switches on cpu0 are reduced.

Then I observed that so many io workers were created.

I added the following code in io_wq_dec_running() to activate a free worker before calling io_queue_worker_create(). With it, far fewer io workers are created than before, about 40:

    rcu_read_lock();
    dont_create = io_wq_activate_free_worker(wq, acct);
    rcu_read_unlock();
    if (dont_create || atomic_read(&acct->nr_running))
        return;
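To make the effect of that change easier to see, here is a small userspace model of the patched decision. It is a sketch under stated assumptions: struct acct_model and both functions are hypothetical stand-ins, the work_pending check is assumed to come first, and activating a free worker is modeled only as consuming one entry from a free count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of one worker accounting bucket. */
struct acct_model {
    int  nr_free;      /* idle workers that could be woken up */
    int  nr_running;   /* workers currently running */
    bool work_pending; /* queued work still waiting */
};

/* Stand-in for io_wq_activate_free_worker(): wake one idle worker if any
 * exists. This model does not touch nr_running. */
static bool activate_free_worker(struct acct_model *a)
{
    if (a->nr_free == 0)
        return false;
    a->nr_free--;
    return true;
}

/* Patched decision: reuse an idle worker first, and create a new worker
 * only if none could be woken and nothing is running. */
static bool patched_dec_running_creates_worker(struct acct_model *a)
{
    assert(a->nr_free >= 0 && a->nr_running >= 0);

    if (!a->work_pending)
        return false;

    bool dont_create = activate_free_worker(a);
    if (dont_create || a->nr_running > 0)
        return false;

    return true;
}
```

With one idle worker available, the first call reuses it and creates nothing. A new worker is requested only once the idle pool is empty and nr_running is 0.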