ceph / ceph-nvmeof

Service to provide Ceph storage over NVMe-oF/TCP protocol
GNU Lesser General Public License v3.0

Multi-RBD performance does not scale up as well as fio-rbd #939

Open xin3liang opened 1 week ago

xin3liang commented 1 week ago

We ran some 4K random read/write performance tests on the testbed below and found that the NVMe-oF gateway multi-RBD performance does not scale up as well as fio-rbd.

[performance result charts attached as images]

Hardware

Software

Deployment and Parameters Tuning

Fio Running Cmds and Configs

We run the fio tests on the client node with the following command:

RW=randwrite BS=4k IODEPTH=128 fio ./[fio_test-rbd.conf|fio_test-nvmeof.conf] --numjobs=1

(.venv) [root@client1 spdktest]# cat fio_test-rbd.conf
[global]
#stonewall
description="Run ${RW} ${BS} rbd test"
bs=${BS}
ioengine=rbd
clientname=admin
pool=nvmeof
#pool=test-pool
thread=1
group_reporting=1
direct=1
verify=0
norandommap=1
time_based=1
ramp_time=10s
runtime=60m
iodepth=${IODEPTH}
rw=${RW}
#numa_cpu_nodes=0

[test-job1]
rbdname=fio_test_image1

[test-job2]
rbdname=fio_test_image2

[test-job3]
rbdname=fio_test_image3

[test-job4]
rbdname=fio_test_image4

[test-job5]
rbdname=fio_test_image5
(.venv) [root@client1 spdktest]# cat fio_test-nvmeof.conf
[global]
#stonewall
description="Run ${RW} ${BS} NVMe ssd test"
bs=${BS}
#ioengine=libaio
ioengine=io_uring
thread=1
group_reporting=1
direct=1
verify=0
norandommap=1
time_based=1
ramp_time=10s
runtime=1m
iodepth=${IODEPTH}
rw=${RW}
#numa_cpu_nodes=0

[test-job1]
#filename=/dev/nvme2n1
filename=/dev/nvme2n2

[test-job2]
#filename=/dev/nvme2n3
#filename=/dev/nvme2n4
filename=/dev/nvme4n1
#filename=/dev/nvme4n2

#[test-job3]
#filename=/dev/nvme2n5
##filename=/dev/nvme2n6
#
#[test-job4]
#filename=/dev/nvme2n7
##filename=/dev/nvme2n8
#
#[test-job5]
#filename=/dev/nvme2n9
##filename=/dev/nvme2n10
xin3liang commented 1 week ago

We noticed that currently one ceph-nvmeof gateway creates only a single Ceph IO context (RADOS connection) to the cluster, whereas fio creates one Ceph IO context per running job.

Referring to the two performance-tuning guides below, a single Ceph IO context cannot handle read/write access to many RBD images well. The RBD grouping strategy (one Ceph IO context per group of images) might help multi-RBD performance scale up.

See pages 9-10 of: https://ci.spdk.io/download/2022-virtual-forum-prc/D2_4_Yue_A_Performance_Study_for_Ceph_NVMeoF_Gateway.pdf
RBD grouping strategy: https://www.intel.com/content/www/us/en/developer/articles/technical/performance-tuning-of-ceph-rbd.html
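
To make the grouping idea concrete, here is a minimal illustrative sketch using the librados/librbd Python bindings (not the gateway's actual code): it opens one RADOS connection per small group of images instead of sharing a single connection across all of them. The pool and image names match the fio config above; IMAGES_PER_CLUSTER is a hypothetical knob standing in for whatever group size a grouping strategy would use.

# Illustrative only: one RADOS connection (Ceph IO context) per group of images,
# instead of a single shared connection for every image.
import rados
import rbd

POOL = "nvmeof"                                    # pool used in the fio tests above
IMAGES = [f"fio_test_image{i}" for i in range(1, 6)]
IMAGES_PER_CLUSTER = 2                             # hypothetical group size

def new_cluster():
    # each librados handle is a separate Ceph IO context / RADOS connection
    c = rados.Rados(conffile="/etc/ceph/ceph.conf", rados_id="admin")
    c.connect()
    return c

clusters, handles = [], []
for idx, name in enumerate(IMAGES):
    if idx % IMAGES_PER_CLUSTER == 0:
        clusters.append(new_cluster())
    ioctx = clusters[-1].open_ioctx(POOL)
    handles.append((ioctx, rbd.Image(ioctx, name)))

# ... issue I/O against each rbd.Image handle here ...

for ioctx, image in handles:
    image.close()
    ioctx.close()
for c in clusters:
    c.shutdown()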

caroav commented 6 days ago

We currently create a cluster context for every X images. This is configurable via the "bdevs_per_cluster" parameter in ceph-nvmeof.conf. Note that currently this is done per ANA group (for reasons related to failback and blocklisting), but we are going to make it flat again. So you can set it to 1 if you want one Ceph IO context per image, or to a larger value. FYI @oritwas @leonidc @baum
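
For example, a minimal ceph-nvmeof.conf fragment might look like the following; the section name [spdk] is an assumption about where this knob lives, so check the sample config shipped with the gateway:

[spdk]
# 1 = one Ceph IO context (RADOS connection) per RBD bdev/image
bdevs_per_cluster = 1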

xin3liang commented 6 days ago

We currently create a cluster context for every X images. This is configurable via the "bdevs_per_cluster" parameter in ceph-nvmeof.conf. Note that currently this is done per ANA group (for reasons related to failback and blocklisting), but we are going to make it flat again. So you can set it to 1 if you want one Ceph IO context per image, or to a larger value. FYI @oritwas @leonidc @baum

Sounds cool, thanks @caroav. Will give it a try. BTW, regarding the configurable parameters in ceph-nvmeof.conf, we might need to document all of them somewhere, I think.

caroav commented 6 days ago

BTW, regarding the configurable parameters in ceph-nvmeof.conf, we might need to document all of them somewhere, I think.

Yes, I need to update the entire upstream nvmeof documentation. I will do it soon.