ClusterLabs / fence-agents

Fence agents

fence-scsi registration/reservation issue #517

Open: smohanan20 opened this issue 1 year ago

smohanan20 commented 1 year ago

Hi,

I'm facing an issue while setting up Pacemaker with fence_scsi, possibly caused by the way fence_scsi issues its registration. Some background:

Consider an iSCSI-backed SCSI device shared between two systems that run Pacemaker with fence_scsi as the fencing agent (the block device supports SCSI persistent reservations, SCSI-PR).

Because the device is backed by iSCSI, each reservation and registration is tied to an I_T nexus, i.e. {initiator name, target name, ISID, target portal group tag}. Every reservation or registration made by the fencing agent is therefore specific to its own I_T nexus.
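To make the identity concrete, here is a minimal sketch that assumes the iSCSI SCSI port-name format (`<initiator name>,i,0x<ISID>` and `<target name>,t,0x<TPGT>`). The `ITNexus` class, the IQNs and the ISID values are hypothetical illustration values. It shows that two sessions differing only in ISID are distinct I_T nexuses, so a registration made through one does not carry over to the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ITNexus:
    """Hypothetical model of an iSCSI I_T nexus (illustration only)."""
    initiator_name: str
    isid: int          # 48-bit initiator session ID
    target_name: str
    tpgt: int          # 16-bit target portal group tag

    def initiator_port(self) -> str:
        # Initiator port name: <initiator name>,i,0x<ISID as 12 hex digits>
        return f"{self.initiator_name},i,0x{self.isid:012x}"

    def target_port(self) -> str:
        # Target port name: <target name>,t,0x<TPGT as 4 hex digits>
        return f"{self.target_name},t,0x{self.tpgt:04x}"


before = ITNexus("iqn.2023-01.example:node1", 0x00023D000001,
                 "iqn.2023-01.example:storage", 1)
# Same initiator and target after a reboot, but a different ISID.
after = ITNexus("iqn.2023-01.example:node1", 0x00023D000002,
                "iqn.2023-01.example:storage", 1)

print(before.initiator_port())  # iqn.2023-01.example:node1,i,0x00023d000001
print(before == after)          # False: a different I_T nexus
```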

Under certain conditions, after I reboot a system, it comes back up and reconnects to the same iSCSI targets, but with a different ISID than before (the session ID portion of the ISID changes depending on how the connection is set up). This is the default behavior of iscsi-initiator-utils. In that case, instead of registering the new I_T nexus, fence_scsi skips registration because the reservation key reported by the device matches its own key, even though that key was registered through the old I_T nexus.

```python
def register_dev(options, dev):
    ...
        return True
    # <--------------- returns early; the new I_T nexus is never registered
    if get_reservation_key(options, dev, False) == options["--key"]:
        return True
    reset_dev(options, dev)
    cmd = options["--sg_persist-path"] + " -n -o -I -S " + options["--key"] + " -d " + dev
    cmd += " -Z" if "--aptpl" in options else ""
    #cmd return code != 0 but registration can be successful
```
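For readers less familiar with sg_persist, the registration command built in the excerpt can be expressed as an argument list. This is a minimal sketch: `build_register_cmd` is a hypothetical helper, not part of fence_scsi, and it only builds the argv without running it. The comments give the sg_persist long-option names for each short flag.

```python
def build_register_cmd(sg_persist: str, key: str, dev: str,
                       aptpl: bool = False) -> list[str]:
    """Hypothetical helper mirroring the string built in register_dev()."""
    cmd = [
        sg_persist,
        "-n",        # --no-inquiry: skip the standard INQUIRY
        "-o",        # --out: issue PERSISTENT RESERVE OUT
        "-I",        # --register-ignore: REGISTER AND IGNORE EXISTING KEY
        "-S", key,   # --param-sark: the key to register for this nexus
        "-d", dev,   # --device: target block device
    ]
    if aptpl:
        cmd.append("-Z")  # --param-aptpl: persist through power loss
    return cmd


print(build_register_cmd("sg_persist", "0x1a2b0000", "/dev/sdb"))
```

Because REGISTER AND IGNORE EXISTING KEY acts on whichever I_T nexus the command arrives on, this command would register the key on the new nexus even when the device already reports the same key; the excerpt simply never reaches it in that case.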

As a result, registration becomes a no-op and the system believes it holds the reservation, even though the key is registered through a different (stale) I_T nexus.
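The failure mode can be reproduced with a toy model. This is a hypothetical, heavily simplified sketch of persistent-reservation state (`ToyPRDevice`, `register_dev_old` and the nexus labels are illustrative names, not fence_scsi or sg3_utils APIs); it keeps only the early-return check from the quoted `register_dev()` and omits `reset_dev()`, multipath handling and error checking.

```python
class ToyPRDevice:
    """Hypothetical, simplified model of SCSI-3 PR state (illustration only)."""

    def __init__(self):
        self.registrations = {}   # I_T nexus label -> registered key
        self.holder = None        # I_T nexus currently holding the reservation

    def register_ignore(self, nexus, key):
        # REGISTER AND IGNORE EXISTING KEY: affects the calling nexus only.
        self.registrations[nexus] = key

    def reserve(self, nexus):
        if nexus in self.registrations:
            self.holder = nexus

    def reservation_key(self):
        # A reservation read reports the holder's key, not its nexus.
        return self.registrations.get(self.holder)


def register_dev_old(device, nexus, key):
    """Mirrors only the early-return check of the quoted register_dev()."""
    if device.reservation_key() == key:
        return True                    # skipped: `nexus` is never registered
    device.register_ignore(nexus, key)
    return True


KEY = "0x1a2b0000"
dev = ToyPRDevice()
dev.register_ignore("node1-isid-1", KEY)   # registration before the reboot
dev.reserve("node1-isid-1")

# After the reboot the node reconnects with a new ISID.
register_dev_old(dev, "node1-isid-2", KEY)
print(dev.reservation_key() == KEY)        # True: looks like we hold it
print("node1-isid-2" in dev.registrations) # False: new nexus is unregistered
```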

Instead of skipping registration, shouldn't this workflow preempt its own reservation/registration using the same key, so that the key is registered again on the current I_T nexus? The SCSI PR spec allows PREEMPT and PREEMPT AND ABORT to target a specific reservation key.
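A hypothetical sketch of the suggested register-then-self-preempt flow, using the same kind of toy model. It assumes a simplified reading of PREEMPT in which the issuing nexus keeps its own registration while every other nexus registered with the preempted key is removed, and the caller takes over the reservation if one of those nexuses held it. Real targets and the SPC text should be checked before relying on this, and the change actually proposed in the PR may differ. All names here are illustrative.

```python
class ToyPRDevice:
    """Hypothetical, simplified SCSI-3 PR model (illustration only)."""

    def __init__(self):
        self.registrations = {}   # I_T nexus label -> registered key
        self.holder = None        # I_T nexus holding the reservation

    def register_ignore(self, nexus, key):
        # REGISTER AND IGNORE EXISTING KEY on the calling nexus.
        self.registrations[nexus] = key

    def reserve(self, nexus):
        if nexus in self.registrations:
            self.holder = nexus

    def reservation_key(self):
        return self.registrations.get(self.holder)

    def preempt(self, nexus, key, sa_key):
        # Assumed semantics: caller must be registered with `key`; every OTHER
        # nexus registered with `sa_key` loses its registration, and the caller
        # takes over the reservation if one of them held it.
        if self.registrations.get(nexus) != key:
            raise PermissionError("caller is not registered with this key")
        stale = [n for n, k in self.registrations.items()
                 if k == sa_key and n != nexus]
        for n in stale:
            del self.registrations[n]
        if self.holder in stale:
            self.holder = nexus


def register_dev_proposed(device, nexus, key):
    """Hypothetical flow: always register, then preempt stale nexuses."""
    had_key = key in device.registrations.values()
    device.register_ignore(nexus, key)   # register the current nexus
    if had_key:
        device.preempt(nexus, key, key)  # drop stale nexuses using our key
    return True


KEY = "0x1a2b0000"
dev = ToyPRDevice()
dev.register_ignore("node1-isid-1", KEY)          # before the reboot
dev.register_ignore("node2-isid-1", "0x1a2b0001") # the other cluster node
dev.reserve("node1-isid-1")

register_dev_proposed(dev, "node1-isid-2", KEY)   # after the reboot
print(dev.registrations)  # {'node2-isid-1': '0x1a2b0001', 'node1-isid-2': '0x1a2b0000'}
print(dev.holder)         # node1-isid-2
```

The other node's registration survives because it uses a different key; only the stale nexus that shares this node's key is removed.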

Please let me know your thoughts.

Regards

smohanan20 commented 1 year ago

I've proposed a fix for this issue: https://github.com/ClusterLabs/fence-agents/pull/529