ceph / ceph-iscsi

rbd-target-api: concurrent requests are not supported #261

Open · lnsyyj opened 2 years ago

lnsyyj commented 2 years ago

Hi everyone,

When two or more clients send requests to rbd-target-api at the same time, the configuration object gateway.conf gets modified unpredictably, and at scale performance is slow. Will the community add support for concurrent requests?

lxbsz commented 2 years ago

> Hi everyone,
>
> When two or more clients send requests to rbd-target-api at the same time, the configuration object gateway.conf gets modified unpredictably, and at scale performance is slow. Will the community add support for concurrent requests?

Sorry for the late reply.

What do you mean by "modified randomly"? BTW, have you seen an actual issue? Currently, when changing the gateway.conf object, ceph-iscsi first acquires the exclusive lock from RADOS, and only the auth gateway node may change the corresponding disk config.

lnsyyj commented 2 years ago

[screenshot: the gateway.conf update steps discussed below]

Yes, but this exclusive lock does not serialize concurrent requests. We ran into a problem:

  1. Concurrent requests to different rbd-target-api instances: both rbd-target-apis reach step 3, so each holds its own modified copy of the config in memory. rbd-target-api-1 takes the lock, writes gateway.conf to RADOS successfully, and releases the lock. rbd-target-api-2 then takes the lock and writes its own, now stale, gateway.conf, overwriting the changes made by rbd-target-api-1 (see the sketch after this list).

  2. Concurrent requests to the same rbd-target-api instance hit the same problem.

I think the root causes are: 1) gateway.conf is read and written as one whole RADOS object, so the update granularity is too coarse; 2) rbd-target-api is not a distributed service.
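
A minimal sketch of the lost update, with plain Python threads standing in for the two gateways (the names and helpers here are illustrative, not the actual ceph-iscsi code):

```python
import json
import threading

# Stand-ins for the gateway.conf object in RADOS and its exclusive lock.
rados_object = {"disks": {}}
lock = threading.Lock()
barrier = threading.Barrier(2)

def add_disk(gateway, disk_name):
    # Read and modify a private copy BEFORE taking the lock (step 3 above).
    tmp_config = json.loads(json.dumps(rados_object))   # stale snapshot
    tmp_config["disks"][disk_name] = {"owner": gateway}
    barrier.wait()   # ensure both gateways now hold stale snapshots
    # The lock only serializes the write-back, so the last writer wins.
    with lock:
        rados_object.clear()
        rados_object.update(tmp_config)

t1 = threading.Thread(target=add_disk, args=("rbd-target-api-1", "rbd/disk1"))
t2 = threading.Thread(target=add_disk, args=("rbd-target-api-2", "rbd/disk101"))
t1.start(); t2.start(); t1.join(); t2.join()

# Only one of the two disks survives; the other update was lost.
print(sorted(rados_object["disks"]))
```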

lxbsz commented 2 years ago

Currently the sequence is:

1. acquire the exclusive lock
2. read the gateway.conf object into tmp_config
3. update tmp_config in memory
4. store tmp_config back to gateway.conf
5. release the exclusive lock

So by step 3 in your picture, the exclusive lock should already have been acquired. In code, the intended sequence looks roughly like the sketch below.
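
A rough sketch of that sequence using the rados Python bindings (the lock name, cookie, and the 'rbd' pool are assumptions based on the defaults; the point is that steps 2-4 all happen while the lock is held):

```python
import json
import rados

def update_gateway_conf(mutate):
    """Read-modify-write gateway.conf with the exclusive lock held throughout."""
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd')            # pool holding gateway.conf
        try:
            # 1. acquire the exclusive lock
            ioctx.lock_exclusive('gateway.conf', 'lock', 'config')
            try:
                # 2. read the object into tmp_config (only after locking!)
                size, _ = ioctx.stat('gateway.conf')
                tmp_config = json.loads(ioctx.read('gateway.conf', size))
                # 3. update tmp_config in memory
                mutate(tmp_config)
                # 4. store tmp_config back to gateway.conf
                ioctx.write_full('gateway.conf', json.dumps(tmp_config).encode())
            finally:
                # 5. release the exclusive lock
                ioctx.unlock('gateway.conf', 'lock', 'config')
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

As long as the read in step 2 happens after the lock in step 1, and no in-memory copy from before the lock is reused, the lost update above cannot occur.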

And also, for each section, such as:

 o- iscsi-targets ............................................................. [Targets: 1]
    o- iqn.2003-01.com.redhat.iscsi-gw:ceph-gw1 ................... [Auth: CHAP, Gateways: 2]
    | o- disks ................................................................... [Disks: 1]
    | | o- rbd/disk_1 .............................................. [Owner: rh7-gw2, Lun: 0]
    | o- gateways ..................................................... [Up: 2/2, Portals: 2]
    | | o- rh7-gw1 .................................................... [192.168.122.69 (UP)]
    | | o- rh7-gw2 .................................................... [192.168.122.14 (UP)]
      o- host-groups ........................................................... [Groups : 0]
      o- hosts ................................................ [Auth: ACL_ENABLED, Hosts: 1]

We can see that its auth line shows Gateways: 2, and ceph-iscsi will only allow the auth gateway to update its own section in gateway.conf, so there shouldn't be any conflict there; if there is, the corresponding code is buggy.
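
For reference, an approximate shape of the gateway.conf object, to make clear that these "sections" are all sub-trees of one single RADOS object (illustrative and heavily trimmed; exact fields vary across ceph-iscsi versions):

```python
# Approximate, trimmed shape of the gateway.conf object (illustrative only).
gateway_conf = {
    "disks": {
        "rbd/disk_1": {"owner": "rh7-gw2"},   # plus pool, size, controls, ...
    },
    "gateways": {
        "rh7-gw1": {},                        # per-gateway state
        "rh7-gw2": {},
    },
    "targets": {
        "iqn.2003-01.com.redhat.iscsi-gw:ceph-gw1": {
            "disks": {},                      # LUN mappings for this target
            "clients": {},                    # initiator ACLs
            "portals": {},                    # per-gateway portal info
        },
    },
}
```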

lnsyyj commented 2 years ago

Yes, we can test it. Similar configuration errors occur frequently when concurrently accessing the rbd-target-api services on different nodes, e.g.:

Jun 1 08:00:44 node51 rbd-target-api[2744]: KeyError: u'rbd/disk226'

The following two scripts simulate concurrent operation of the rbd-target-api services on different nodes, adding LUNs to the same target. (It is very easy to reproduce the problem.)

Access rbd-target-api-1:

```sh
for i in `seq 1 100`; do
    curl --insecure --user admin:admin -d mode=create -d create_image=true -d pool=rbd -d size=1T -X PUT http://192.168.122.52:5000/api/disk/rbd/disk$i
    curl --insecure --user admin:admin -d disk=rbd/disk$i -X PUT http://192.168.122.52:5000/api/targetlun/iqn.2003-01.com.redhat.iscsi-gw:ceph-gw1
    curl --insecure --user admin:admin -d disk=rbd/disk$i -X PUT http://192.168.122.52:5000/api/clientlun/iqn.2003-01.com.redhat.iscsi-gw:ceph-gw1/iqn.2022-05.com.xstor.client0005
done
```

Access rbd-target-api-2:

```sh
for i in `seq 101 200`; do
    curl --insecure --user admin:admin -d mode=create -d create_image=true -d pool=rbd -d size=1T -X PUT http://192.168.122.53:5000/api/disk/rbd/disk$i
    curl --insecure --user admin:admin -d disk=rbd/disk$i -X PUT http://192.168.122.53:5000/api/targetlun/iqn.2003-01.com.redhat.iscsi-gw:ceph-gw1
    curl --insecure --user admin:admin -d disk=rbd/disk$i -X PUT http://192.168.122.53:5000/api/clientlun/iqn.2003-01.com.redhat.iscsi-gw:ceph-gw1/iqn.2022-05.com.xstor.client0005
done
```
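
After both loops finish, a quick way to check for lost updates is to read gateway.conf back and count the disk entries (a sketch using the rados Python bindings; the 'rbd' pool is the ceph-iscsi default for the config object):

```python
import json
import rados

# Read gateway.conf back and count how many disk entries survived.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')        # pool holding gateway.conf
    try:
        size, _ = ioctx.stat('gateway.conf')
        config = json.loads(ioctx.read('gateway.conf', size))
        # The two loops created 200 disks; fewer entries here means
        # updates were lost to the race.
        print(len(config['disks']), 'disks recorded in gateway.conf')
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```
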
lxbsz commented 2 years ago

Cool, so it's a bug IMO. I've been busy with the cephfs project recently; since you can reproduce it, please feel free to raise a PR to fix it if you'd like.

lnsyyj commented 2 years ago

I think this problem will be very difficult to fix; it involves design issues, and the changes would be huge.