JigaoLuo commented 1 month ago

Hello, I am benchmarking NVMe-oF RDMA using fio with io_uring polled mode and have encountered a puzzling performance issue under high queue depth conditions.

Setup:

Environment: Two physical servers connected via 100 Gbit/s Mellanox ConnectX-5 NICs over RDMA (Infiniband).
Target: One server is configured as an NVMe-oF target providing access to 4 NVMe SSDs.
Host: The other server acts as the NVMe-oF host, connecting to the target's SSDs via RDMA. All I/O benchmarks are conducted on this host using fio.

Issue: I am observing a strange performance drop specifically with io_uring in polled mode at high queue depths. With io-uring random read workloads, as in my plot, each SSD has a queue depth (QD) as the x-axis, and a single core handling I/O per SSD (so 4 cores for 4 SSDs). Under these conditions, the performance suddenly drops from 8.40GB/s to around 2.15 GB/s, when queue depth from 128 to 256. In the plot, I also annotate some details: #IRQ for avg interrupt per second, #CS for avg context switch per second, PGPGIN BW for /proc/vmstat's pgpgin bandwidth. So it is clear that io-uring is really doing polling with such a low #IRQ. However, wtih queue depth with 256, the #CS increases a lot and the PGPGIN BW is much much higher than the I/O bandwidth. (I did not plot other io engines with QD 512 and 1024, because they do not have strange performance drop)

More Observations:

RDMA Traffic: The RDMA traffic between the host and target is always consistent with the observed I/O bandwidth.
Other Interfaces: Other I/O interfaces like SPDK, io_uring interrupt mode, and libaio saturate the NIC bandwidth without this issue, confirming that the bottleneck is specific to io_uring polled mode.

Question: What could be causing this performance degradation in io_uring polled mode at high queue depths? Is there an interaction between io_uring polling and NVMe-oF that could explain the discrepancy between pgpgin reported bandwidth and the actual observed network traffic? Any insights or suggestions on potential areas to investigate would be greatly appreciated!

-- As for thread-scaling plot, we can also see the strange trend:

System Details:

OS: Linux c08 6.6.1-zabbly+ (Ubuntu 22.04)
SSDs: 4x Samsung 980 Pro 1TB

NIC: Mellanox ConnectX-5

Fio script: I show the strange point case for example:

[global]
ioengine=io_uring
thread=1
direct=1
percentile_list=50:90:99:99.5:99.9:99.99:99.999
group_reporting=1
norandommap=1
rw=randread
bs=4k
time_based=1
runtime=100
fixedbufs=1
registerfiles=1
hipri=1
cpus_allowed=[18-21]
cpus_allowed_policy=split
[filename0]
filename=/dev/nvme4n1
iodepth=256
numjobs=1
cpus_allowed=18
numa_mem_policy=prefer:1
[filename1]
filename=/dev/nvme5n1
iodepth=256
numjobs=1
cpus_allowed=19
numa_mem_policy=prefer:1
[filename2]
filename=/dev/nvme6n1
iodepth=256
numjobs=1
cpus_allowed=20
numa_mem_policy=prefer:1
[filename3]
filename=/dev/nvme7n1
iodepth=256
numjobs=1
cpus_allowed=21
numa_mem_policy=prefer:1

JigaoLuo commented 1 month ago

more plots that this issue remains:

with 2 threads:
with 8 threads:

So it is not limited to the first plot I showed

axboe commented 1 month ago

Not sure on nvme over rdma, but do you have polling queues setup for nvme? Without that, you're not really polling for completions. Don't have a nvme-over-rdma setup so can't test this myself - if you do have poll queues setup for nvme, then I think you'd want to email the nvme mailing list with your findings.

What do the results look like if you set hipri=0 instead?

JigaoLuo commented 1 month ago

Hi @axboe , thanks for replying :)

Regarding io_uring without polling: I updated all my plots to include interrupt-driven version of io_uring. As shown in the thread scaling plot, the interrupt-driven io_uring behaves similarly to libaio with a significant number of interrupts.

Regarding poll-queues: Yes, I set this parameter and verified it via dmesg. For this, I also monitored the interrupts per second, as noted in the plot annotations. For io_uring with CQ polling, the number of interrupts is comparable to SPDK. However, in the cases where the issue occurs, io_uring CQ polling shows a high number of context switches and an unexpectedly high pgpgIn bandwidth.

More on pgpgin bandwidth: It's important to note that pgpgin and pgpgout aren’t directly related to pages in memory; they represent sectors submitted to the block layer for reading and writing, respectively. These counters are only updated in the submit_bio function, defined in block/blk-core.c. I used bpftrace to trace the number of submit_bio calls, and the results showed that they align with the I/O requests issued by fio. So it still not so clear to me.

axboe commented 1 month ago

If you have higher context switches, you're likely not waiting on enough events. By default fio will wait for 1. You can set:

iodepth_batch=32
iodepth_batch_complete_min=16
iodepth_batch_complete_min=32

or something like that to reduce it.

But I'm concerned about the pgpgin rates. cgroup memory charging is slow. Like really slow. io_uring generally sets 128 as the limit on what it'll cache, which is probably why you're seeing a drop-off there. Not sure if you have the ability to recompile the kernel, but if you do, changing IO_ALLOC_CACHE_MAX from 128 to 512 or something might really help. I suspect you're running into the high overhead of memcg charging here.

axboe commented 1 month ago

If changing IO_ALLOC_CACHE_MAX does help, then it may be worth making this configurable at runtime...

JigaoLuo commented 1 month ago

Thank you for suggesting these ideas; it is a new perspective that I hadn’t considered.

However, recompiling the kernel is quite challenging for me at the moment. Is there an alternative approach to verify this performance drop-off without needing to recompile the kernel?

axboe commented 1 month ago

Try and run the workload with iodepth=128 first and then do:

# perf record -g -p <pid of fio thread> -- sleep 3

and then run it with iodepth=256 and repeat the above perf to capture a new trace. Then do:

# perf diff

and that'll show you where the increases in cycles are being spent. For the pid, just pick one of the fio tasks, should not really matter.

JigaoLuo commented 1 month ago

Output of perf diff 4k_128_randread_4CPU_perfrecord.data 4k_256_randread_4CPU_perfrecord.data:

JigaoLuo commented 1 month ago

I did find perf diff so helpful, so I did followings.

iodepth=128 , 4 cores

`perf stat` on cycles and cs

    75.218.034.949      cycles                                                                
           579.016      context-switches                                                      
       5,413759857 seconds time elapsed
       4,790074000 seconds user
      15,871115000 seconds sys

`perf report --sort symbol` on cycles

`perf report --sort symbol` on cs

iodepth=256 , 4 cores

`perf stat` on cycles and cs

    71.378.521.966      cycles                                                                
         8.131.844      context-switches                                                      
       5,414678953 seconds time elapsed
       1,827919000 seconds user
      18,829960000 seconds sys

`perf report --sort symbol` on cycles

`perf report --sort symbol` on cs

axboe commented 1 month ago

You can see the increase in psi_group_charge, which is the pressure stall. So it's allocating too much perhaps? I'm assuming this is also the path that needs to the increased schedules, you can check that by running:

perf report -g --no-children

with perf.data being the run from QD=256. But yeah, I'm very certain this issue is caused by psi/memcg being way too expensive for having 256 inflight vs just 128, where the 128 will end up being cached and recycled by io_uring.

axboe commented 1 month ago

You also look like you have io-wq activity when queue depth is 256, which could also lead to a slowdown as you're overloading the device. What does:

cat /sys/block/nvme0n1/queue/nr_requests

say for the device, provided that nvme0n1 is one of your targets?

JigaoLuo commented 1 month ago

iodepth=128 , 4 cores

`perf report -g --no-children`

iodepth=256 , 4 cores

`perf report -g --no-children`

JigaoLuo commented 1 month ago

You also look like you have io-wq activity when queue depth is 256, which could also lead to a slowdown as you're overloading the device. What does:
cat /sys/block/nvme0n1/queue/nr_requests
say for the device, provided that nvme0n1 is one of your targets?

nr_requests of SSD in nvmeof target:

$ cat /sys/block/nvme0n1/queue/nr_requests
1023

This is the same for the total 4 SSDs on the nvmeof target.

(However, /sys/block/nvme0n1/queue/nr_requests does not exist on the nvmeof host when remote SSD connected. On the host are only:

$ ls /sys/block/nvme4n1/queue/
add_random            dma_alignment       max_discard_segments    nomerges             virt_boundary_mask
chunk_sectors         fua                 max_hw_sectors_kb       nr_zones             write_cache
dax                   hw_sector_size      max_integrity_segments  optimal_io_size      write_same_max_bytes
discard_granularity   io_poll             max_sectors_kb          physical_block_size  write_zeroes_max_bytes
discard_max_bytes     io_poll_delay       max_segments            read_ahead_kb        zone_append_max_bytes
discard_max_hw_bytes  iostats             max_segment_size        rotational           zoned
discard_zeroes_data   logical_block_size  minimum_io_size         stable_writes        zone_write_granularity

JigaoLuo commented 1 month ago

You can see the increase in psi_group_charge, which is the pressure stall. So it's allocating too much perhaps? I'm assuming this is also the path that needs to the increased schedules, you can check that by running:
perf report -g --no-children
with perf.data being the run from QD=256. But yeah, I'm very certain this issue is caused by psi/memcg being way too expensive for having 256 inflight vs just 128, where the 128 will end up being cached and recycled by io_uring.

Hi @axboe , Thank you so much for your hands-on assistance. I wouldn't have been able to identify this pressure stall issue from PSI/memcg on my own. Honestly, I’m not very familiar with the underlying reasons, so I’ll need to look them up to understand better and have a conclusion. Thanks again for your help! Write me if more perfdata needed. :)

JigaoLuo commented 1 month ago

Hi @axboe ,

I’m still unclear about the cause of this issue. From my analysis with perf, it seems that the problem is related to psi_group_charge, which indicates cgroup memory management.

What’s puzzling me is why this issue specifically occurs with NVMe over Fabrics cases. I ran performance tests using fio with iouring-polled mode on both local SSDs and NVMe-oF. The results show that the local SSD performs well with an iodepth of 256.

Could you help me understand why psi_group_charge might be impacting the NVMe-oF cases but not the local SSD cases? I’m not sure what specific factors are affecting the NVMe-oF that are not affecting the local SSD. Any insights would be greatly appreciated.

axboe commented 1 month ago

About to be on a plane, but just look at the output of perf report. It'll tell you exactly where the psi call is happening from and will give you an idea of why.

axboe commented 1 month ago

As mentioned earlier, you also have io-wq activity, which should not be happening unless you're exceeding the requests available on that device. This is the iowq* stuff in your traces. This could indeed also make psi and memcg accounting WORSE, as you then have more threads competing for these resources.

When you say you're using QD=256, is that for ALL threads you are using, or is that per-thread? Because if the queue is 1023 entries in size, then 4 threads would already end up hitting io-wq and more threads is just going to make that much worse. You're overloading the device.

JigaoLuo commented 1 month ago

Based on the FIO configuration shared in my first message, I have four NVMe-oF SSDs in total, each configured in the case of (4 threads, iodepth 256 per device) :

[filename0]
filename=/dev/nvme4n1
iodepth=256
numjobs=1
cpus_allowed=18
numa_mem_policy=prefer:1

Each SSD operates with a dedicated iodepth of 256, a dedicated core on NUMA node 1 (where the host's NIC is located). I am using the same FIO script across all IO engines: SPDK NVMe-oF, libaio NVMe-oF, io_uring NVMe-oF, and io_uring polled PCIe-local.

Interestingly, the io-wq and psi_group_charge behaviors are only observed in the io_uring polled NVMe-oF setup and not in the other configurations. Thank you for your attention!

axboe commented 1 month ago

Interestingly, the io-wq and psi_group_charge behaviors are only observed in the io_uring polled NVMe-oF setup and not in the other configurations.

I think this is because the psi_group_charge overhead is increased quite a bit because io-wq workers are seen. I looked at the rdma code briefly, and it's setting up a pool based on the SQ size of the target. If we can't allocate from this pool to issue a new io, -EAGAIN is returned, and this in turn forces io_uring to punt this request to io-wq.

I don't know too much about nvma over rdma, but this is a target issue. You need to ensure that the SQ size depth is large enough on the target side to avoid running out for your test. Maybe this is configurable, I don't really know. I think you need to ask this question on the nvme mailing list. It's not an io_uring/liburing bug or issue, io_uring is simply running into a target that has a lower queue depth than you would like for your test, and hence performance goes to shit.

JigaoLuo commented 1 month ago

thanks. I will check my RDMA SQ size and possibly ask in the mailing list. Will keep you updated

JigaoLuo commented 1 month ago

Hi @axboe

I’ve done some research on the NVMe-oF driver code and reviewed discussions in the nvme mailing list. In NVMe-oF, the target refers to the server where the SSD is physically attached, while the host is the I/O issuer (such as FIO) interacting with the remote SSD.

One significant difference I noticed between PCIe-attached and NVMeoF-attached block devices is the queue size, which is evident in the performance plots I shared earlier:

The four PCIe-attached SSD at the NVMe-oF target have sqsize:
```
cat /sys/class/nvme/*/sqsize
1023
1023
1023
1023
```
The four NVMeoF-attached SSD at the NVMe-oF host have sqsize:
```
cat /sys/class/nvme/*/sqsize
127
127
127
127
```

I also explored whether it’s possible to increase the NVMe-oF sqsize. Unfortunately, it appears that this is restricted by the kernel driver by a constant marco NVME_RDMA_MAX_QUEUE_SIZE [1]. There was a patch released this year that raised the NVME_RDMA_MAX_QUEUE_SIZE from 128 to 256 [2], but my kernel version predates this patch. As a result, my setup limited to 127 (or 128) for the queue size on the NVMe-oF host.

I wasn’t initially aware of this upper limit, so I ran FIO experiments with an I/O depth of 256 and even higher — exceeding the limit /sys/class/nvme/*/sqsize at the host —leading to the performance issues in plots above.

FIO accepted the I/O depth of 256 without issues, and the FIO report reflects the I/O depth of 256.
I understand that if the block device queues are full, the block layer on the host will queue the commands.

Do you think the queue depth limit could be contributing to the performance drop-off? I’m planning to share these findings on the NVMe mailing list, as it seems there are still issues when using a fixed queue depth in NVMe-oF RDMA.

In my setup:

kernel version: 6.6.1
target has four Samsung 980 Pro SSDs
NIC: Mellanox ConnectX-5 MT27800 NIC (InfiniBand EDR 4x, 100 Gbps)

[1]:

target driver: https://github.com/torvalds/linux/blob/master/drivers/nvme/target/rdma.c#L1945-L1951
host driver: https://github.com/torvalds/linux/blob/master/drivers/nvme/host/rdma.c#L1035-L1045

[2]:

axboe commented 1 month ago

Do you think the queue depth limit could be contributing to the performance drop-off?

Yes, this is pretty much what I've been saying all along, that this indeed is the issue. If it wasn't, you would not see io-wq activity, and this in turn leads to both inefficiencies around that, but also then related issues with the memcg charging.

JigaoLuo commented 1 month ago

I’m also wondering why this performance drop-off doesn’t occur with libaio or non-polled io_uring.

Could it be that polled io_uring leverages psi/memcg with io-wq workers to cache and recycle I/O depths higher than the sqsize, whereas non-polled io_uring does not? From my understanding, other I/O engines will cache I/O depths higher than the sqsize in the block layer instead.

axboe commented 1 month ago

I’m also wondering why this performance drop-off doesn’t occur with libaio or non-polled io_uring.

For libaio, it'll just block, it doesn't not attempt to handle this condition. So there it just violates the idea that it's an async API, when in fact it'll just wait on previous IO to complete. It will likely just plateau in terms of performance. For non-polled io_uring, it'll in fact still be less efficient, just less so as you don't end up with a bunch of io-wq workers that also poll. io-wq will keep retrying a submission until it succeeds, and hence effectively poll for the submission too.

axboe commented 1 month ago

End of the day, you're over-driving the nvme target. Different ways of doing that will yield different outcomes, but the core problem here is that you are indeed overloaded it. If the nvme target had a higher depth, then it would work better.

axboe / liburing

Performance Drop & Bandwidth Discrepancy in Polled `io_uring` with High Queue Depth in NVMe-oF RDMA #1228

iodepth=128 , 4 cores

`perf stat` on cycles and cs

`perf report --sort symbol` on cycles

`perf report --sort symbol` on cs

iodepth=256 , 4 cores

`perf stat` on cycles and cs

`perf report --sort symbol` on cycles

`perf report --sort symbol` on cs

iodepth=128 , 4 cores

`perf report -g --no-children`

iodepth=256 , 4 cores

`perf report -g --no-children`

axboe / liburing

Performance Drop & Bandwidth Discrepancy in Polled `io_uring` with High Queue Depth in NVMe-oF RDMA #1228

iodepth=128 , 4 cores

perf stat on cycles and cs

perf report --sort symbol on cycles

perf report --sort symbol on cs

iodepth=256 , 4 cores

perf stat on cycles and cs

perf report --sort symbol on cycles

perf report --sort symbol on cs

iodepth=128 , 4 cores

perf report -g --no-children

iodepth=256 , 4 cores

perf report -g --no-children

`perf stat` on cycles and cs

`perf report --sort symbol` on cycles

`perf report --sort symbol` on cs

`perf stat` on cycles and cs

`perf report --sort symbol` on cycles

`perf report --sort symbol` on cs

`perf report -g --no-children`

`perf report -g --no-children`