ClusterLabs / fence-agents

Fence agents

fence_scsi: fix registration handling in device 'off' workflows #558

Closed smohanan20 closed 9 months ago

smohanan20 commented 10 months ago

Problem:

When a device is powered off (preempted), the fence_scsi agent assumes that the client already has a registration on the device and sends a preempt-and-abort request against the key held by the other device. This fails with a reservation conflict if the host's registration has a conflicting ISID. (This is another manifestation of the problem in https://github.com/ClusterLabs/fence-agents/pull/529.)

Impact:

If the local host cannot preempt any other host because no registration matching the local host is found, the local host won't be able to start the resources.

Proposed Fix:

To fix this, the agent needs to register with the host key before it tries a preempt request.
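
A minimal sketch of the proposed ordering, not the actual patch in this PR: it only builds the two `sg_persist` command lines so the sequence (register first, then preempt-and-abort) is visible. The function name `build_off_sequence`, the example keys, the device path, and the reservation type 5 are assumptions made for illustration.

```python
import shlex
from typing import List


def build_off_sequence(sg_persist: str, local_key: str,
                       victim_key: str, device: str) -> List[List[str]]:
    """Return the commands for an 'off' action in the proposed order.

    Hypothetical helper: it builds argument lists only and runs nothing.
    """
    # Step 1: (re-)register the local key on the current I_T nexus,
    # ignoring any existing key, so a changed ISID cannot leave us unregistered.
    register = [
        sg_persist, "--no-inquiry", "--out", "--register-ignore",
        f"--param-sark={local_key}", f"--device={device}",
    ]
    # Step 2: only then preempt-and-abort the victim's key.
    # Type 5 (write exclusive, registrants only) is an assumption here.
    preempt = [
        sg_persist, "--no-inquiry", "--out", "--preempt-abort",
        "--prout-type=5", f"--param-rk={local_key}",
        f"--param-sark={victim_key}", f"--device={device}",
    ]
    return [register, preempt]


if __name__ == "__main__":
    # Example keys and device are placeholders, not values from this thread.
    for cmd in build_off_sequence("sg_persist", "1a2b0001", "1a2b0002", "/dev/sdb"):
        print(shlex.join(cmd))
```

Registering with "register and ignore existing key" rather than a plain register is one way to make the first step safe to repeat; whether the agent should do it this way is a design choice for the maintainers.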

knet-jenkins[bot] commented 10 months ago

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/fence-agents/job/fence-agents-pipeline/job/PR-558/1/input

oalbrigt commented 9 months ago

Do you have a reproducer? We're having issues reproducing this in our env.

smohanan20 commented 9 months ago

Steps that I’ve used to reproduce the ISID issue:

  1. Establish iSCSI sessions with a storage device that supports multiple LUNs per iSCSI target. We would also need multiple iSCSI target connections. Each session would have its own ISID.
  2. Use sg_persist/pacemaker to make registrations/reservations.
  3. Add a new LUN to the first iSCSI target and establish a session with client. The iSCSI initiator picks a new session id (large) to use for ISID.
  4. Use sg_persist to register the new device
  5. Reboot the initiator. This causes the connections to be re-established, and there is a good chance that the ISID for the iSCSI targets changes compared to the pre-reboot state (due to the order of the initiator's iSCSI login requests). This makes the old reservation/registration invalid.

oalbrigt commented 9 months ago

What do you use to create a 2nd session? We're unable to do it with targetcli.

smohanan20 commented 9 months ago

The initial repro was with a storage vendor that exposes the target/lun. But I can reproduce with targetcli post reboots:

  1. Create a bunch of targets with a single LUN each (unique names in a specific order: disk0, disk1, disk5 and disk6) and expose them to a client:

    create iqn.2003-01.org.linux-iscsi.localhost.x8664:disk<0-1,5-6>

  2. Establish iSCSI sessions from another CentOS client with iscsiadm:

    iscsiadm -m discovery -t st -p <ip>
    iscsiadm -m node -l
    iscsiadm -m session -P 3 | grep 'SID\|Target:'

    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk0 (non-flash)
    SID: 27
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk1 (non-flash)
    SID: 28
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk5 (non-flash)
    SID: 29
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk6 (non-flash)
    SID: 30

  3. Create another new iSCSI target, disk2, and establish a session the same way. disk2 gets SID 31:

    iscsiadm -m discovery -t st -p <ip>
    iscsiadm -m node -l
    iscsiadm -m session -P 3 | grep 'SID\|Target:'

    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk0 (non-flash)
    SID: 27
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk1 (non-flash)
    SID: 28
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk5 (non-flash)
    SID: 29
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk6 (non-flash)
    SID: 30
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk2 (non-flash)
    SID: 31

  4. I rebooted the client and listed the sessions:

    iscsiadm -m session -P 3 | grep 'SID\|Target:'

    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk6 (non-flash)
    SID: 10
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk2 (non-flash)
    SID: 11
    ...
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:sn.f703c1d91bd7 (non-flash)
    SID: 6
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk0 (non-flash)
    SID: 7
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk1 (non-flash)
    SID: 8
    Target: iqn.2003-01.org.linux-iscsi.localhost.x8664:disk5 (non-flash)
    SID: 9



The SIDs after the reboot do not match the pre-reboot values. A registration made over a session with one SID before the reboot may end up on a session with a new SID after the reboot (or after any disconnect, for that matter), i.e., the ISID changes, which makes the old registration obsolete.

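
To make the before/after comparison easier to repeat, here is a small sketch that parses the `Target:` / `SID:` pairs from the grep output above and reports which targets came back with a different SID. The helper names `parse_sessions` and `changed_sids` are made up for this example, and the sample data is copied from the listings in this comment.

```python
import re
from typing import Dict, Tuple

# Matches a "Target: <iqn> ..." entry followed by its "SID: <n>" entry,
# whether the two sit on one line or on consecutive lines.
PAIR = re.compile(r"Target:\s+(\S+).*?SID:\s+(\d+)", re.DOTALL)


def parse_sessions(text: str) -> Dict[str, int]:
    """Map each target IQN to the SID reported next to it."""
    return {target: int(sid) for target, sid in PAIR.findall(text)}


def changed_sids(before: Dict[str, int],
                 after: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {target: (old_sid, new_sid)} for targets whose SID differs."""
    return {t: (before[t], after[t])
            for t in before if t in after and before[t] != after[t]}


PREFIX = "iqn.2003-01.org.linux-iscsi.localhost.x8664:"

# Values taken from the pre-reboot and post-reboot listings in this thread.
before = parse_sessions("\n".join(
    f"Target: {PREFIX}{name} (non-flash)\nSID: {sid}"
    for name, sid in [("disk0", 27), ("disk1", 28), ("disk5", 29),
                      ("disk6", 30), ("disk2", 31)]))
after = parse_sessions("\n".join(
    f"Target: {PREFIX}{name} (non-flash)\nSID: {sid}"
    for name, sid in [("disk6", 10), ("disk2", 11), ("disk0", 7),
                      ("disk1", 8), ("disk5", 9)]))

if __name__ == "__main__":
    for target, (old, new) in sorted(changed_sids(before, after).items()):
        print(f"{target}: SID {old} -> {new}")
```

In practice the two inputs would be saved copies of the `iscsiadm -m session -P 3 | grep 'SID\|Target:'` output taken before and after the reboot.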
oalbrigt commented 9 months ago

retest this please

smohanan20 commented 9 months ago

@oalbrigt Did you have anything specific in mind for me to test?

oalbrigt commented 9 months ago

No. That was for our CI to run its tests.

oalbrigt commented 9 months ago

Thanks.