More plots showing that this issue remains:
[plot: 2 threads]
[plot: 8 threads]
So it is not limited to the first plot I showed.
Not sure about nvme over rdma, but do you have polling queues set up for nvme? Without that, you're not really polling for completions. I don't have an nvme-over-rdma setup so I can't test this myself - if you do have poll queues set up for nvme, then I think you'd want to email the nvme mailing list with your findings.
What do the results look like if you set hipri=0 instead?
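To spell out what that setup typically looks like: on a fabrics host the poll queues are normally requested at connect time, and on local PCIe via the nvme driver's module parameter. A rough sketch, assuming nvme-cli's --nr-poll-queues option and placeholder addresses (flag names may vary by version):
$ nvme connect -t rdma -a <target_ip> -s 4420 -n <subsystem_nqn> --nr-poll-queues=4
$ cat /sys/module/nvme/parameters/poll_queues
fio's hipri option relies on those queues being there, which is what the question above is getting at.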
Hi @axboe , thanks for replying :)
Regarding io_uring without polling:
I updated all my plots to include the interrupt-driven version of io_uring. As shown in the thread-scaling plot, interrupt-driven io_uring behaves similarly to libaio, with a significant number of interrupts.
Regarding poll-queues:
Yes, I set this parameter and verified it via dmesg. I also monitored the interrupts per second, as noted in the plot annotations. For io_uring with CQ polling, the number of interrupts is comparable to SPDK. However, in the cases where the issue occurs, io_uring CQ polling shows a high number of context switches and an unexpectedly high pgpgin bandwidth.
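For reference, the numbers behind those annotations come from procfs; a rough sketch of how they can be sampled (not my exact monitoring script):
$ vmstat 1        ("in" = interrupts/s, "cs" = context switches/s)
$ grep -E '^pgpg' /proc/vmstat        (cumulative pgpgin/pgpgout counters)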
More on pgpgin bandwidth:
It's important to note that pgpgin and pgpgout aren't directly about pages in memory; they represent sectors submitted to the block layer for reading and writing, respectively. These counters are only updated in the submit_bio function, defined in block/blk-core.c. I used bpftrace to trace the number of submit_bio calls, and the results align with the I/O requests issued by fio. So the inflated pgpgin bandwidth is still not clear to me.
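The tracing itself was along these lines (a simplified sketch, not my exact script; it counts submit_bio() calls per process once per second):
$ sudo bpftrace -e 'kprobe:submit_bio { @calls[comm] = count(); } interval:s:1 { print(@calls); clear(@calls); }'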
If you have higher context switches, you're likely not waiting on enough events. By default fio will wait for 1. You can set:
iodepth_batch=32
iodepth_batch_complete_min=16
iodepth_batch_complete_max=32
or something like that to reduce it.
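As a sketch, those would sit in the per-device fio job section; the ioengine and hipri lines below are assumptions about the rest of the job file, which isn't shown in full:
[filename0]
filename=/dev/nvme4n1
ioengine=io_uring
hipri=1
iodepth=256
iodepth_batch=32
iodepth_batch_complete_min=16
iodepth_batch_complete_max=32
The point is to submit and reap completions in batches, so each io_uring_enter() handles more than one event instead of waiting for a single completion.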
But I'm concerned about the pgpgin rates. cgroup memory charging is slow. Like really slow. io_uring generally sets 128 as the limit on what it'll cache, which is probably why you're seeing a drop-off there. Not sure if you have the ability to recompile the kernel, but if you do, changing IO_ALLOC_CACHE_MAX from 128 to 512 or something might really help. I suspect you're running into the high overhead of memcg charging here.
If changing IO_ALLOC_CACHE_MAX does help, then it may be worth making this configurable at runtime...
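If you do get access to a build tree, locating the constant is just a grep away (the file it lives in differs between kernel versions, so treat this as a sketch):
$ grep -rn IO_ALLOC_CACHE_MAX include/ io_uring/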
Thank you for suggesting these ideas; it is a new perspective that I hadn’t considered.
However, recompiling the kernel is quite challenging for me at the moment. Is there an alternative approach to verify this performance drop-off without needing to recompile the kernel?
Try and run the workload with iodepth=128 first and then do:
# perf record -g -p <pid of fio thread> -- sleep 3
and then run it with iodepth=256 and repeat the above perf to capture a new trace. Then do:
# perf diff
and that'll show you where the increases in cycles are being spent. For the pid, just pick one of the fio tasks, should not really matter.
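Put together, it would look something like this; the output file names are just placeholders to keep the two captures apart:
# perf record -g -o qd128.data -p <pid of fio thread> -- sleep 3    (while the iodepth=128 run is active)
# perf record -g -o qd256.data -p <pid of fio thread> -- sleep 3    (while the iodepth=256 run is active)
# perf diff qd128.data qd256.data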
Output of perf diff 4k_128_randread_4CPU_perfrecord.data 4k_256_randread_4CPU_perfrecord.data (output collapsed):
I found perf diff so helpful that I also did the following.
perf stat on cycles and context switches for the QD=128 run:
75.218.034.949 cycles
579.016 context-switches
5,413759857 seconds time elapsed
4,790074000 seconds user
15,871115000 seconds sys
perf report --sort symbol on cycles and on cs for the QD=128 run (output collapsed).

perf stat on cycles and context switches for the QD=256 run:
71.378.521.966 cycles
8.131.844 context-switches
5,414678953 seconds time elapsed
1,827919000 seconds user
18,829960000 seconds sys
perf report --sort symbol on cycles and on cs for the QD=256 run (output collapsed).

You can see the increase in psi_group_charge, which is the pressure stall. So it's allocating too much perhaps? I'm assuming this is also the path that leads to the increased schedules; you can check that by running:
perf report -g --no-children
with perf.data being the run from QD=256. But yeah, I'm very certain this issue is caused by psi/memcg being way too expensive for having 256 inflight vs just 128, where the 128 will end up being cached and recycled by io_uring.
You also look like you have io-wq activity when queue depth is 256, which could also lead to a slowdown as you're overloading the device. What does:
cat /sys/block/nvme0n1/queue/nr_requests
say for the device, provided that nvme0n1 is one of your targets?
nr_requests of the SSDs on the NVMe-oF target:
$ cat /sys/block/nvme0n1/queue/nr_requests
1023
This is the same for all 4 SSDs on the NVMe-oF target.
However, nr_requests does not exist on the NVMe-oF host for the connected remote SSD. The host-side queue directory only contains:
$ ls /sys/block/nvme4n1/queue/
add_random dma_alignment max_discard_segments nomerges virt_boundary_mask
chunk_sectors fua max_hw_sectors_kb nr_zones write_cache
dax hw_sector_size max_integrity_segments optimal_io_size write_same_max_bytes
discard_granularity io_poll max_sectors_kb physical_block_size write_zeroes_max_bytes
discard_max_bytes io_poll_delay max_segments read_ahead_kb zone_append_max_bytes
discard_max_hw_bytes iostats max_segment_size rotational zoned
discard_zeroes_data logical_block_size minimum_io_size stable_writes zone_write_granularity
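For what it's worth, the host-side queue directory does expose io_poll and io_poll_delay, so block-layer polling support can at least be checked there:
$ cat /sys/block/nvme4n1/queue/io_poll
$ cat /sys/block/nvme4n1/queue/io_poll_delay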
Hi @axboe , Thank you so much for your hands-on assistance. I wouldn't have been able to identify this pressure stall issue from PSI/memcg on my own. Honestly, I'm not very familiar with the underlying reasons, so I'll need to read up on them before drawing a conclusion. Thanks again for your help! Write me if more perf data is needed. :)
Hi @axboe ,
I'm still unclear about the cause of this issue. From my analysis with perf, it seems that the problem is related to psi_group_charge, which points to cgroup memory management.
What's puzzling me is why this issue specifically occurs in the NVMe over Fabrics cases. I ran performance tests using fio with io_uring polled mode on both local SSDs and NVMe-oF. The results show that the local SSD performs well with an iodepth of 256.
Could you help me understand why psi_group_charge might be impacting the NVMe-oF cases but not the local SSD cases? I'm not sure what specific factors are affecting NVMe-oF that are not affecting the local SSD. Any insights would be greatly appreciated.
About to be on a plane, but just look at the output of perf report. It'll tell you exactly where the psi call is happening from and will give you an idea of why.
As mentioned earlier, you also have io-wq activity, which should not be happening unless you're exceeding the requests available on that device. This is the iowq* stuff in your traces. This could indeed also make psi and memcg accounting WORSE, as you then have more threads competing for these resources.
When you say you're using QD=256, is that for ALL threads you are using, or is that per-thread? Because if the queue is 1023 entries in size, then 4 threads would already end up hitting io-wq and more threads is just going to make that much worse. You're overloading the device.
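A quick way to see that io-wq activity directly is to look for the worker threads while the run is active; they show up in the process list named iou-wrk-* (iou-sqp-* would be SQPOLL threads):
$ ps -eT | grep -E 'iou-wrk|iou-sqp'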
Based on the FIO configuration shared in my first message, I have four NVMe-oF SSDs in total; in the (4 threads, iodepth 256 per device) case, each is configured as:
[filename0]
filename=/dev/nvme4n1
iodepth=256
numjobs=1
cpus_allowed=18
numa_mem_policy=prefer:1
Each SSD operates with a dedicated iodepth of 256 and a dedicated core on NUMA node 1 (where the host's NIC is located). I am using the same FIO script across all I/O engines: SPDK NVMe-oF, libaio NVMe-oF, io_uring NVMe-oF, and io_uring polled PCIe-local.
Interestingly, the io-wq and psi_group_charge behaviors are only observed in the io_uring polled NVMe-oF setup and not in the other configurations. Thank you for your attention!
Interestingly, the io-wq and psi_group_charge behaviors are only observed in the io_uring polled NVMe-oF setup and not in the other configurations.
I think this is because the psi_group_charge overhead is increased quite a bit because io-wq workers are seen. I looked at the rdma code briefly, and it's setting up a pool based on the SQ size of the target. If we can't allocate from this pool to issue a new io, -EAGAIN is returned, and this in turn forces io_uring to punt this request to io-wq.
I don't know too much about nvme over rdma, but this is a target issue. You need to ensure that the SQ size depth is large enough on the target side to avoid running out for your test. Maybe this is configurable, I don't really know. I think you need to ask this question on the nvme mailing list. It's not an io_uring/liburing bug or issue, io_uring is simply running into a target that has a lower queue depth than you would like for your test, and hence performance goes to shit.
Thanks. I will check my RDMA SQ size and possibly ask on the mailing list. Will keep you updated.
Hi @axboe
I’ve done some research on the NVMe-oF driver code and reviewed discussions in the nvme mailing list. In NVMe-oF, the target refers to the server where the SSD is physically attached, while the host is the I/O issuer (such as FIO) interacting with the remote SSD.
One significant difference I noticed between PCIe-attached and NVMe-oF-attached block devices is the queue size, which is evident in the performance plots I shared earlier.
sqsize of the PCIe-attached SSDs (on the target):
$ cat /sys/class/nvme/*/sqsize
1023
1023
1023
1023
sqsize of the NVMe-oF-attached SSDs (on the host):
$ cat /sys/class/nvme/*/sqsize
127
127
127
127
I also explored whether it's possible to increase the NVMe-oF sqsize. Unfortunately, it appears that this is restricted in the kernel driver by a constant macro, NVME_RDMA_MAX_QUEUE_SIZE [1]. There was a patch released this year that raised NVME_RDMA_MAX_QUEUE_SIZE from 128 to 256 [2], but my kernel version predates this patch. As a result, my setup is limited to 127 (or 128) for the queue size on the NVMe-oF host.
I wasn't initially aware of this upper limit, so I ran FIO experiments with an I/O depth of 256 and even higher, exceeding the limit reported by /sys/class/nvme/*/sqsize on the host, which led to the performance issues in the plots above.
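For anyone who wants to check their own kernel tree, the limit can be found with a grep (the header it lives in may differ between kernel versions):
$ grep -rn NVME_RDMA_MAX_QUEUE_SIZE include/ drivers/nvme/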
Do you think the queue depth limit could be contributing to the performance drop-off? I’m planning to share these findings on the NVMe mailing list, as it seems there are still issues when using a fixed queue depth in NVMe-oF RDMA.
In my setup:
[1]:
[2]:
Do you think the queue depth limit could be contributing to the performance drop-off?
Yes, this is pretty much what I've been saying all along, that this indeed is the issue. If it wasn't, you would not see io-wq activity, and this in turn leads to both inefficiencies around that, but also then related issues with the memcg charging.
I’m also wondering why this performance drop-off doesn’t occur with libaio or non-polled io_uring.
Could it be that polled io_uring leverages psi/memcg with io-wq workers to cache and recycle I/O depths higher than the sqsize, whereas non-polled io_uring does not?
From my understanding, other I/O engines will instead cache I/O depths higher than the sqsize in the block layer.
I’m also wondering why this performance drop-off doesn’t occur with libaio or non-polled io_uring.
For libaio, it'll just block, it doesn't attempt to handle this condition. So there it just violates the idea that it's an async API, when in fact it'll just wait on previous IO to complete. It will likely just plateau in terms of performance. For non-polled io_uring, it'll in fact still be less efficient, just less so as you don't end up with a bunch of io-wq workers that also poll. io-wq will keep retrying a submission until it succeeds, and hence effectively poll for the submission too.
End of the day, you're over-driving the nvme target. Different ways of doing that will yield different outcomes, but the core problem here is that you are indeed overloading it. If the nvme target had a higher depth, then it would work better.
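In practice, until the host and target support a deeper queue, the simplest workaround is presumably to keep the per-device iodepth at or below the fabric queue size, e.g. based on the job section shared earlier with iodepth capped at 128:
[filename0]
filename=/dev/nvme4n1
iodepth=128
numjobs=1
cpus_allowed=18
numa_mem_policy=prefer:1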
Hello, I am benchmarking NVMe-oF RDMA using fio with io_uring polled mode and have encountered a puzzling performance issue under high queue depth conditions.
Setup: four NVMe-oF RDMA SSDs, benchmarked with fio.
Issue: I am observing a strange performance drop specifically with io_uring in polled mode at high queue depths. With io_uring random read workloads, as in my plot, each SSD has its queue depth (QD) on the x-axis and a single core handling I/O per SSD (so 4 cores for 4 SSDs). Under these conditions, the performance suddenly drops from 8.40 GB/s to around 2.15 GB/s when the queue depth goes from 128 to 256. In the plot, I also annotate some details: #IRQ for average interrupts per second, #CS for average context switches per second, and PGPGIN BW for /proc/vmstat's pgpgin bandwidth. So it is clear that io_uring is really doing polling, given such a low #IRQ. However, with a queue depth of 256, the #CS increases a lot and the PGPGIN BW is much, much higher than the I/O bandwidth. (I did not plot the other I/O engines at QD 512 and 1024, because they do not show this strange performance drop.)
More Observations: io_uring in interrupt mode and libaio saturate the NIC bandwidth without this issue, confirming that the bottleneck is specific to io_uring polled mode.
Question: What could be causing this performance degradation in io_uring polled mode at high queue depths? Is there an interaction between io_uring polling and NVMe-oF that could explain the discrepancy between the pgpgin-reported bandwidth and the actual observed network traffic? Any insights or suggestions on potential areas to investigate would be greatly appreciated!
As for the thread-scaling plot, we can also see the strange trend:
System Details: