liuqinfei opened this issue 2 years ago
The utilization of the reactor cores depends on the number of IO queues you do IO to on the NVMeoF side. If you only do IO to a single queue, for example, you will most likely see only one reactor core loaded.
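A quick way to check how many IO queues the kernel initiator actually set up for a namespace is the blk-mq sysfs directory (a sketch; the device name below is just a placeholder for your setup):

```console
# one subdirectory per hardware (IO) queue the initiator created for this namespace
ls /sys/block/nvme1n1/mq/
```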
Thank you, I gave it a try but the result is not ideal. I run 2 reactor cores on the server and connect 8 NVMe devices on the client. The fio test case is numjobs=2, iodepth=64, bs=4k. I use spdk_top to capture the CPU utilization: one core is at ~100%, the other at ~9.5%. The IOPS result is ~53K. Do you have any suggestions for this test case?
In fact I just added "-m [1,2]" in nvme_gw_server.py, like in SPDK, to run more cores. So what is the correct way to run 2 reactor cores in ceph-nvmeof?
```python
def start_spdk(self):
    cmd = [spdk_cmd, "-m [1,2]", "-u", "-r", spdk_rpc_socket]
```
There is a tgt_cmd_extra_args param in nvme_gw.config, in the [spdk] section, where you may add "-m [1,2]":
```ini
[spdk]
...
tgt_cmd_extra_args = -m [1,2]
```
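To confirm that both reactors actually came up, you can query the running target over its RPC socket (a sketch assuming a stock SPDK checkout; use whatever path spdk_rpc_socket points to in your config):

```console
# lists the lcores/reactors the target is running and their thread assignment
scripts/rpc.py -s /var/tmp/spdk.sock framework_get_reactors
```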
Can you provide more information about:
- Which nvme initiator/client do you use? SPDK or kernel?
- What is your Ceph configuration (OSDs/drives etc.)? What performance do you get with e.g. fio/librbd, and what is your expectation?
- The exact fio test you run? i.e. are both jobs doing IO to all 8 drives?
- Is the GW on a machine with multiple numa nodes?
A1: Kernel initiator. I just want to reduce the overhead on the client.
A2: The test cluster has two nodes; one is the Ceph server, the other is the client. On the server node I deploy 24 OSDs on 2 NVMe drives.
A3: I deploy ceph-nvmeof with 2 reactor cores running on the server. I connect to 8 subsystems from the client and get 8 drives, then I use fio for benchmarking.
A4: No, the 2 reactor cores are running on the same NUMA node.
Given your Ceph deployment size/configuration I would not be surprised if you are limited by your Ceph backend. Try running some fio/librbd or fio/krbd experiments to see what kind of performance you get.
Just keep in mind that currently ceph-nvmeof only creates one io context with Ceph, whereas fio creates one for every job it runs. We will change this in the future, but this is the behavior for now.
Alternatively you can use SPDK bdevperf; it should be closer to the GW setup.
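A minimal bdevperf sketch against a single RBD image (pool/image names and the binary path are placeholders; older SPDK trees build bdevperf under test/bdev/bdevperf/ instead of build/examples/):

```console
# describe one rbd bdev in a JSON config (names here are made up for illustration)
cat > bdev.json <<'EOF'
{
  "subsystems": [
    {
      "subsystem": "bdev",
      "config": [
        {
          "method": "bdev_rbd_create",
          "params": { "name": "Ceph0", "pool_name": "rbd", "rbd_name": "image1", "block_size": 4096 }
        }
      ]
    }
  ]
}
EOF

# 4k random writes, queue depth 64, 60 seconds, against the bdev(s) from bdev.json
./build/examples/bdevperf --json bdev.json -q 64 -o 4096 -w randwrite -t 60
```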
Some perf tuning options you can try:
A1: Try limiting the number of IO queues with "-i". The kernel creates as many queues as there are cores, which can hurt your performance.
A2: 24 OSDs on 2 NVMe drives seems a lot and might hurt your performance. Also try disabling rbd_cache; it might hurt performance with fast backends.
A3: Consider using larger image sizes. Especially for throughput tests, 20GB is probably too small.
A4: The problem is that for the RBD bdev there are extra threads created in librbd, and those are unpinned by default. Consider calling SPDK with numactl to keep them on the right NUMA node (we are currently chasing an issue where some librbd threads don't seem to respect numactl parameters, but most do). See the sketch after this list for A1 and A4.
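A sketch of A1 and A4 combined (transport address, NQN, queue count, NUMA node and binary paths are placeholders; with ceph-nvmeof you would prefix whatever SPDK target command your gateway launches):

```console
# A4: start the SPDK target pinned to NUMA node 0 so librbd's extra threads stay local
numactl --cpunodebind=0 --membind=0 ./build/bin/nvmf_tgt -m '[1,2]' -u -r /var/tmp/spdk.sock

# A1: on the client, cap the IO queues per connection instead of one per core
nvme connect -t tcp -a 192.168.1.10 -s 4420 -n nqn.2016-06.io.spdk:cnode1 -i 4
```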
FYI the SPDK team identified a performance issue where only one reactor thread would handle RBD IO. You can try applying this patch to see if it fixes your issue: https://review.spdk.io/gerrit/c/spdk/spdk/+/10416
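One way to test-drive the change on top of an SPDK checkout (a sketch; the exact fetch URL/ref is shown on the change's Download menu, and <patchset> is whichever revision is current):

```console
# fetch the Gerrit change and apply it onto your current branch
git fetch https://review.spdk.io/spdk/spdk refs/changes/16/10416/<patchset>
git cherry-pick FETCH_HEAD
```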
I have verified the patch with 2 reactor cores and 4 reactor cores. The verification shows that the patch improves the load balance across the reactor cores for the SPDK rbd bdev. But the limitation of my cluster is the Ceph backend storage; I will improve the backend storage and try again.
@PepperJo @trociny Hi, PepperJo and trociny. I saw your #23 and your discussion about the limitation of the RPC thread number, which is set to 1. I did some Ceph NVMeoF performance tests locally, but the result is not ideal: the performance does not scale up and even decreases when I increase the number of images. I think this may be related to the limitation of the RPC thread number. Do you think so? If yes, will this limitation be addressed in the near future? My cluster consists of 5 nodes: Ceph (17.2.3) is deployed on the first 3 ARM nodes with 36 OSDs, the fourth ARM node is the NVMeoF gateway, and the fifth x86 node is the client running fio. I attach several images as rbd bdevs on the fourth node. I increase the number of images from 1 to 4 while the number of reactor cores stays the same (1).
The fio test case for one image:

```ini
[global]
ioengine=libaio
thread=1
numjobs=1
group_reporting=1
direct=1
norandommap=1
rw=randwrite
bs=4k
iodepth=128
time_based=1
ramp_time=10
runtime=300

[filename2]
filename=/dev/nvme2n1
```
I run several images in parallel like this: `fio --client=server1 nvme1.fio --client=server1 nvme2.fio`
Performance result:
- 1 image: 140K IOPS
- 2 images: 117K IOPS
- 3 images: 105K IOPS
- 4 images: 97K IOPS
@liuqinfei The number of RPC threads does not affect the data path (IO) but only how fast you can reconfigure SPDK via the gRPC API (e.g. create a subsystem, add a RBD image etc.), so the PR you referenced is unrelated to the performance problems you are seeing. Did you apply the patch as mentioned above? We do see performance improvements when using multiple images (randwrite @16KiB QD128):
Fine. No, I did not apply the patch #10416 when I got the following performance result.
- 1 image: 140K IOPS
- 2 images: 117K IOPS
- 3 images: 105K IOPS
- 4 images: 97K IOPS
I was misled because the CPU usage shown by spdk_top in v22.01 is not correct; the fix patch #12319 was only merged in v22.05. Now I see that the bottleneck is the reactor core, as its CPU usage is nearly 100% when I increase the number of rbd images to 2. With IO to 1 rbd image, the CPU usage of the reactor core is about 45%.
And yes, I tried the patch #10416 on my new cluster, and I see a performance improvement for multiple rbd images, as below. It looks almost okay.
- 1 image: 140K IOPS
- 2 images: 213K IOPS
- 3 images: 269K IOPS
- 4 images: 316K IOPS
So do you think this patch could be merged, given that the PR #10416 has been open for nearly 9 months? If not, should we find another way to fix this issue?
I agree that the patch should be merged. I will reach out to the SPDK team.
By the way, I do not understand the difference between the test cases "fio default, rbd_bdev librbd + spdk, gw (nvmf tcp gw)" in the table. Could you describe them in more detail?
Here's a new problem. The above performance results show that the PR #10416 improves the performance of several rbd images with several reactor cores, but one rbd image with several reactor cores does not benefit from this PR, as the following results show (gw: SPDK with NVMeoF + RBD). I change the number of io_context_pool and msgr-worker-* threads while keeping the number of reactor cores at 1; I just want to see the peak performance for one rbd image.
- io_context_pool + msgr-worker-* = 3+3: 121K IOPS, reactor 43.65%
- io_context_pool + msgr-worker-* = 4+4: 126K IOPS, reactor 97.68%
- io_context_pool + msgr-worker-* = 6+6: 100K IOPS, reactor 99%
We can see that the reactor core is the performance bottleneck. I increase the number of reactor cores to 2; however, the performance does not increase, and the CPU usage of the two reactor cores is about 9% + 100%.
If you want a single image to leverage multiple reactor cores you need to make sure you use multiple IO queues (resp. connections) on the NVMeoF side. The kernel initiator automatically sets up 1 IO queue per core but if you run only 1 fio job it will most likely only use 1. So try running multiple jobs against the same image and see if it improves your single image performance. (Note that randwrite single volume scaling might be limited and it always makes sense to double check performance of a single volume without the GW, i.e. using krbd to verify that you are not reaching Ceph limits)
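For example (a sketch; the device name, job count and runtime are placeholders), several jobs hammering the same namespace give the initiator a chance to spread IO across its queues and hence across reactors:

```console
# 4 jobs against the same namespace -> IO can land on up to 4 IO queues/reactors
fio --name=singlevol --filename=/dev/nvme1n1 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --group_reporting \
    --time_based --runtime=60
```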
Fine, I understand what you mean now: I need to use multiple IO queues to leverage multiple reactors. I tried this, and yes, it improves my single image performance. However, the benefit is limited, because the main_td is a single spdk_thread: although we can poll CQEs from several IO queues, in the end only the main_td executes _bdev_rbd_submit_request(). So the performance of one reactor is still a bottleneck.
I don't think it is clear that issuing IOs from multiple threads to a single RBD image (and single Ceph context) will benefit your performance. You obviously create a lot of contention in librbd when doing this. If we believe this is a problem we need to show that we can achieve significant performance improvements when using multiple threads issuing IO to a single image with librbd (again single Ceph context) in a stand-alone microbenchmark.
Yes, you are right. I also did not see any improvement in performance. I run fio on a single image with 1 fio numjob, and I purposely keep the polling on core 56 and the rbd main_td on core 48. I observed from spdk_top that the ratio of the CPU overhead of the thread poller (polling CQEs) to _bdev_rbd_submit_request is about 1:5. I tried to saturate the single image to see whether one reactor core becomes the bottleneck, so that I could then use another reactor core, but I failed because io_context_pool and msgr-worker-* already become the performance bottleneck.

| Core | Thread count | Poller count | Idle [us] | Busy [us] | Frequency [MHz] | Intr | CPU % |
|------|--------------|--------------|-----------|-----------|-----------------|------|-------|
| 48 | 2 | 2 | 271065 | 760270 | N/A | No | 73.72 |
| 52 | 1 | 2 | 1031324 | 12 | N/A | No | 0.00 |
| 56 | 1 | 2 | 888709 | 142627 | N/A | No | 13.83 |
> Just keep in mind that currently ceph-nvmeof only creates one io context with Ceph whereas fio creates one for every job it runs. We will change this in the future but this is the behavior for now.
Is there any plan for this change? I am hitting the multi-RBD performance scale-up issue: https://github.com/ceph/ceph-nvmeof/issues/939
I try to deploy ceph-nvmeof and find that one reactor core works well. However, when I increase the reactor cores to 2, I see a significant imbalance between the two cores: for example, one is at ~100% and the other at ~30%.