axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Is it possible for nvme_poll to improve QD1 latency? #1221

Closed. ZiqiangYoung closed this issue 1 month ago.

ZiqiangYoung commented 1 month ago

Hello, I've been looking at QD1 test methodology recently, and I have some questions for the developers. I found that NVMe command completion raises an MSI-X interrupt to notify the host when the completion queue (CQ) is written, so I expected that polling instead of interrupts could reduce QD1 latency, which led me to io_uring.

For testing, I set nvme.poll_queues=64 (it used to be 0), bound the CPU to a specified core, and used the io_uring engine in fio on Fedora 40 (kernel 6.10). However, modifying this parameter does not reduce the clat.

I want to know what other related configuration options may reduce QD1 latency in io_uring or even libaio. Thanks.

Below is the test script I wrote, which binds the CPU to nvme0q32, a queue that shows no interrupt count in /proc/interrupts.

[global]
direct=1
ioengine=io_uring
filename=/dev/nvme0n1
group_reporting

[trim]
rw=trim
bs=1073741824
size=100%
stonewall

[write-4KiB-Q1-J1]
rw=write
bs=4096
iodepth=1
numjobs=1
cpus_allowed=20
ramp_time=120
runtime=180
time_based
stonewall

I am able to read and modify kernel source such as blk_poll or nvme_setup_irqs, etc. If you have any ideas for reducing QD1 latency, even at the expense of other metrics, please let me know.

Thank you.

axboe commented 1 month ago

You're not telling fio to use polled IO with that job. You'll want to add

hipri=1

to the write-4KiB-Q1-J1 job section to do that. Without it, you're doing regular IRQ-driven IO with that job.
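
For reference, fio's hipri=1 maps to creating the ring with the IORING_SETUP_IOPOLL flag. Below is a minimal liburing sketch of a single polled QD1 IO, assuming nvme.poll_queues is set as in the original report; it issues a read rather than a write so it can be run against the device without destroying data, and the device path and 4 KiB block size are taken from the fio job above.

/* Minimal sketch: one polled QD1 read via liburing.
 * IORING_SETUP_IOPOLL is what fio's hipri=1 enables; polled block IO
 * also requires O_DIRECT and nvme.poll_queues > 0. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    void *buf;
    int fd, ret;

    fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ret = io_uring_queue_init(1, &ring, IORING_SETUP_IOPOLL);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %d\n", ret);
        return 1;
    }

    /* O_DIRECT needs an aligned buffer */
    if (posix_memalign(&buf, 4096, 4096))
        return 1;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    /* With IOPOLL, this spins on the NVMe completion queue instead of
     * sleeping on an MSI-X interrupt, which is where the QD1 latency
     * win (and the 100% CPU on the pinned core) comes from */
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret == 0) {
        printf("read result: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return 0;
}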

axboe commented 1 month ago

Once you get polled IO working, registered buffers will further reduce latencies for O_DIRECT IO. You can use those in the fio job by setting

fixedbufs=1

in the job section. Fixed files will help a bit too, particularly in the real world where applications are threaded. For default fio, probably won't notice.
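
At the liburing level, fixedbufs=1 corresponds to io_uring_register_buffers() together with the *_fixed prep variants, and fixed files correspond to io_uring_register_files() plus the IOSQE_FIXED_FILE flag. Here is a hedged sketch combining both with the polled setup above (same placeholder device and block size; error handling trimmed for brevity).

/* Sketch: registered (fixed) buffer + fixed file on an IOPOLL ring,
 * i.e. the liburing equivalent of hipri=1 + fixedbufs=1 in fio. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/uio.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct iovec iov;
    void *buf;
    int fd;

    fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0 || io_uring_queue_init(1, &ring, IORING_SETUP_IOPOLL) < 0)
        return 1;
    if (posix_memalign(&buf, 4096, 4096))
        return 1;

    /* fixedbufs=1: register the buffer once up front so the kernel can
     * skip per-IO page pinning and mapping */
    iov.iov_base = buf;
    iov.iov_len = 4096;
    if (io_uring_register_buffers(&ring, &iov, 1) < 0)
        return 1;

    /* Fixed files: register the fd once and reference it by index,
     * avoiding per-IO fget/fput; this matters most when the fd is
     * shared across threads */
    if (io_uring_register_files(&ring, &fd, 1) < 0)
        return 1;

    sqe = io_uring_get_sqe(&ring);
    /* Second argument is the index into the registered file table,
     * last argument the index into the registered buffers */
    io_uring_prep_read_fixed(sqe, 0, buf, 4096, 0, 0);
    sqe->flags |= IOSQE_FIXED_FILE;
    io_uring_submit(&ring);

    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
        printf("read result: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return 0;
}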

ZiqiangYoung commented 1 month ago

You're not telling fio to use polled IO with that job. You'll want to add

hipri=1

to the write-4KiB-Q1-J1 job section to do that. Without it, you're doing regular IRQ-driven IO with that job.

I just wanted to extend a big thanks to you! I've observed that the CPU utilization on the pinned core has finally hit 100%, and it's clear that this time the changes have truly taken effect. The reduction in latency at QD1 matches exactly what I was expecting, which is fantastic.

Thank you so much again—you’re a real lifesaver!

ZiqiangYoung commented 1 month ago

Once you get polled IO working, registered buffers will further reduce latencies for O_DIRECT IO. You can use those in the fio job by setting

fixedbufs=1

in the job section. Fixed files will help a bit too, particularly in the real world where applications are threaded. For default fio, probably won't notice.

Your conclusion that "for default fio, probably won't notice" is correct. After testing, I found that setting this parameter in fio indeed does not yield any noticeable benefit. However, I'll keep your conclusion in mind. Thanks again for your time!