ceph / ceph-nvmeof

Service to provide Ceph storage over NVMe-oF/TCP protocol
GNU Lesser General Public License v3.0

Multi-RBD performance does not scale up as well as fio-rbd #939

Open xin3liang opened 1 week ago

xin3liang commented 1 week ago

We ran some 4K random read/write performance tests on the testbed below and found that the NVMe-oF gateway multi-RBD performance does not scale up as well as fio-rbd.

[performance result charts attached as images]

Hardware

Software

Deployment and Parameters Tuning

Fio Running Cmds and Configs

We run the fio tests on the client node with the following command:

RW=randwrite BS=4k IODEPTH=128 fio ./[fio_test-rbd.conf|fio_test-nvmeof.conf] --numjobs=1

(.venv) [root@client1 spdktest]# cat fio_test-rbd.conf
[global]
#stonewall
description="Run ${RW} ${BS} rbd test"
bs=${BS}
ioengine=rbd
clientname=admin
pool=nvmeof
#pool=test-pool
thread=1
group_reporting=1
direct=1
verify=0
norandommap=1
time_based=1
ramp_time=10s
runtime=60m
iodepth=${IODEPTH}
rw=${RW}
#numa_cpu_nodes=0

[test-job1]
rbdname=fio_test_image1

[test-job2]
rbdname=fio_test_image2

[test-job3]
rbdname=fio_test_image3

[test-job4]
rbdname=fio_test_image4

[test-job5]
rbdname=fio_test_image5
(.venv) [root@client1 spdktest]# cat fio_test-nvmeof.conf
[global]
#stonewall
description="Run ${RW} ${BS} NVMe ssd test"
bs=${BS}
#ioengine=libaio
ioengine=io_uring
thread=1
group_reporting=1
direct=1
verify=0
norandommap=1
time_based=1
ramp_time=10s
runtime=1m
iodepth=${IODEPTH}
rw=${RW}
#numa_cpu_nodes=0

[test-job1]
#filename=/dev/nvme2n1
filename=/dev/nvme2n2

[test-job2]
#filename=/dev/nvme2n3
#filename=/dev/nvme2n4
filename=/dev/nvme4n1
#filename=/dev/nvme4n2

#[test-job3]
#filename=/dev/nvme2n5
##filename=/dev/nvme2n6
#
#[test-job4]
#filename=/dev/nvme2n7
##filename=/dev/nvme2n8
#
#[test-job5]
#filename=/dev/nvme2n9
##filename=/dev/nvme2n10
xin3liang commented 1 week ago

We noticed that currently one ceph-nvmeof gateway creates only a single Ceph IO context (RADOS connection) to the cluster, whereas fio creates one Ceph IO context per running job.

Referring to the two performance-tuning guides below, a single Ceph IO context cannot handle read/write access to many RBD images well. The RBD grouping strategy (one Ceph IO context per group of images) might help multi-RBD performance scale up.

See pages 9-10 of: https://ci.spdk.io/download/2022-virtual-forum-prc/D2_4_Yue_A_Performance_Study_for_Ceph_NVMeoF_Gateway.pdf
RBD grouping strategy: https://www.intel.com/content/www/us/en/developer/articles/technical/performance-tuning-of-ceph-rbd.html
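
To make the grouping idea concrete, here is a minimal illustrative sketch using the librados/librbd Python bindings (not the gateway's actual code): it opens one RADOS connection per small group of images instead of sharing a single connection across all of them. The pool and image names match the fio config above; IMAGES_PER_CLUSTER is a hypothetical knob standing in for whatever group size a grouping strategy would use.

# Illustrative only: one RADOS connection (Ceph IO context) per group of images,
# instead of a single shared connection for every image.
import rados
import rbd

POOL = "nvmeof"                                    # pool used in the fio tests above
IMAGES = [f"fio_test_image{i}" for i in range(1, 6)]
IMAGES_PER_CLUSTER = 2                             # hypothetical group size

def new_cluster():
    # each librados handle is a separate Ceph IO context / RADOS connection
    c = rados.Rados(conffile="/etc/ceph/ceph.conf", rados_id="admin")
    c.connect()
    return c

clusters, handles = [], []
for idx, name in enumerate(IMAGES):
    if idx % IMAGES_PER_CLUSTER == 0:
        clusters.append(new_cluster())
    ioctx = clusters[-1].open_ioctx(POOL)
    handles.append((ioctx, rbd.Image(ioctx, name)))

# ... issue I/O against each rbd.Image handle here ...

for ioctx, image in handles:
    image.close()
    ioctx.close()
for c in clusters:
    c.shutdown()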

caroav commented 6 days ago

We currently create a cluster context for every X images. This is configurable via the "bdevs_per_cluster" parameter in ceph-nvmeof.conf. Note that currently this is done per ANA group (for reasons related to failback and blocklisting), but we are going to make it flat again. So you can set it to 1 if you want one Ceph IO context per image, or to a larger value. FYI @oritwas @leonidc @baum
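
For example, a minimal ceph-nvmeof.conf fragment might look like the following; the section name [spdk] is an assumption about where this knob lives, so check the sample config shipped with the gateway:

[spdk]
# 1 = one Ceph IO context (RADOS connection) per RBD bdev/image
bdevs_per_cluster = 1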

xin3liang commented 6 days ago

We currently create a cluster context for every X images. This is configurable via the "bdevs_per_cluster" parameter in ceph-nvmeof.conf. Note that currently this is done per ANA group (for reasons related to failback and blocklisting), but we are going to make it flat again. So you can set it to 1 if you want one Ceph IO context per image, or to a larger value. FYI @oritwas @leonidc @baum

Sounds cool, thanks @caroav. Will give it a try. BTW, regarding the configurable parameters in ceph-nvmeof.conf, we might need to document all of them somewhere, I think.

caroav commented 6 days ago

BTW, regarding the configurable parameters in ceph-nvmeof.conf, we might need to document all of them somewhere, I think.

Yes, I need to update the entire upstream nvmeof documentation. I will do it soon.