ewwhite / zfs-ha

ZFS High-Availability NAS

scsi reservations issue on failover #25

Open WanWizard opened 5 years ago

WanWizard commented 5 years ago

Hey, followed your great instructions to the letter, but I'm left with a situation that leaves me stumped.

I have a setup with two Supermicros, each connected to two 12-disk JBODs with SAS disks, but without a loop, so no multipath (and multipath is not installed). Both JBODs are used in mirrored vdevs, so I can lose an entire JBOD without much issue.

OS: CentOS Linux release 7.6.1810 (Core)
ZFS: 0.7.13, from the zfs-kmod repo

This setup works fine until Pacemaker decides it needs to fail over. It doesn't matter whether that is because the active node is put into standby, because the hardware is switched off, etc.

When Pacemaker fails over, the second node tries to import the pool, which fails because something on the first node has placed SCSI reservations on the disks:

[root@nas01 /]# sg_persist -r /dev/sdh
  NETAPP    X412_HVIPC560A15  NA02
  Peripheral device type: disk
  PR generation=0x1, Reservation follows:
    Key=0x666e0001
    scope: LU_SCOPE,  type: Write Exclusive, registrants only

As soon as the failover happens, the second node starts to log:

[ 5834.890588] sd 0:0:7:0: reservation conflict
[ 5834.890674] sd 0:0:7:0: reservation conflict
[ 5834.890693] sd 0:0:7:0: [sdh] Test Unit Ready failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 5834.891274] sd 0:0:7:0: reservation conflict
[ 5834.891369] sd 0:0:7:0: reservation conflict
[ 5834.891466] sd 0:0:7:0: reservation conflict
[ 5834.891560] sd 0:0:7:0: reservation conflict
[ 5834.921402]  sdh: sdh1 sdh9
[ 5834.922452] sd 0:0:7:0: reservation conflict
[ 5834.957331] sd 0:0:7:0: reservation conflict
[ 5834.958157] sd 0:0:7:0: reservation conflict
[ 5835.052881] sd 0:0:7:0: reservation conflict

This either causes the entire import to fail or, if the import succeeds, leaves disks offline due to excessive errors.

I've been pulling my hair out for about two weeks now, but I have no clue what sets these reservations, or how I can have them released on a cluster start or a cluster failover. There seem to be lots of people building Linux HA clusters with ZFS, judging by the discussions I found, but no one mentions this issue...
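(For anyone inspecting stale reservations by hand, a minimal sg_persist sequence from sg3_utils looks roughly like the sketch below. The device name is the one from the output above; the throwaway key is an arbitrary example, and forcibly clearing a reservation should only ever be done when you are certain no node is actively using the disk.)

# read registered keys and the current reservation (read-only)
sg_persist --in --read-keys /dev/sdh
sg_persist --in --read-reservation /dev/sdh

# example only: forcibly remove a stale reservation by registering a
# throwaway key and then clearing the reservation with it
sg_persist --out --register --param-sark=0xdeadbeef /dev/sdh
sg_persist --out --clear --param-rk=0xdeadbeef /dev/sdh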

ewwhite commented 5 years ago

Hello,

This doesn't follow the spirit of what I documented. The idea of my instructions is to use JBODs with dual-ported SAS disks and multipath cabling.

Why doesn't your design use multipath?

WanWizard commented 5 years ago

I don't have the space for a second HBA; the servers have only a few low-profile slots. One has the HBA, the other a dual 10G card for NFS connectivity.

This is a low-budget operation with second-hand hardware, paid for by donations, providing storage for the build/compile servers of an open source development team that I help out with infra and admin work.

I understand that two HBAs and multipathing would provide additional availability, but unfortunately it is what it is. Until a big sponsor comes along... ;-)

milleroff commented 5 years ago

We had the same problems with SCSI reservations; it never worked as expected. Sometimes when a failover happened, the new active server could not import the disks. We ended up setting up IPMI fencing: if something goes wrong, the second server shuts off the first server over IPMI and takes control of the disks.

WanWizard commented 5 years ago

Good to read I'm not alone. Not good that you needed to work around it like that. I'd hoped to avoid that.

ewwhite commented 5 years ago

I haven't had such issues with any deployment. You can use a single dual-port HBA in each host. Dual HBA cards are not required.

@milleroff Please make sure meta provides=unfencing, pcmk_monitor_action="metadata" and pcmk_host_list= are all populated in your SCSI stonith resource. Also, this absolutely requires dual-port SAS drives everywhere.
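(For reference, on an already-created fence_scsi resource these settings can be applied with something like the following; the resource and node names are placeholders and the exact pcs syntax varies between versions:)

pcs stonith update fence-vol1 \
    pcmk_monitor_action="metadata" \
    pcmk_host_list="zfs-node1,zfs-node2"
pcs resource meta fence-vol1 provides=unfencing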

WanWizard commented 5 years ago

That is indeed how I've hooked it up now, one port going to each of the enclosures.

All my disks are Hitachi HUS156060VLS600, which are dual-port SAS drives. I don't have any SCSI fencing active; removing that was my first step in trying to find the problem.

ewwhite commented 5 years ago

SCSI fencing is crucial to what you're doing. That's how the failover and pool import work.

WanWizard commented 5 years ago

So what is setting the reservations, as it's not the fence_scsi agent?

I get that in production you need it to avoid imports on both nodes (which would be utter horror), but if it doesn't work in a controlled failover (where the pools are cleanly exported and a node is cleanly shut down to trigger a failover), I don't see how adding another layer of complexity will fix the issue.

ewwhite commented 5 years ago

Controlled and uncontrolled failovers work in the setup I've described and documented.

I do not know what's unique about your environment, but removing critical components of the design isn't going to help the situation. What is the output of zpool status -v, lsscsi, and multipath -ll?

This high-availability design assumes:

- Multipath SAS cabling
- Dual-ported disks
- Multipath service enabled
- ZFS pool creation using dm-multipath devices (not individual /dev/sdX SCSI disks)
- SCSI reservation fencing

WanWizard commented 5 years ago

I get that. I'd rather work with a documented (and if possible supported) environment as well. But as said, it is what it is. ;-)

[root@nas01 /]# zpool status -v
  pool: sas01
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        sas01                       ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000cca01fa899a4  ONLINE       0     0     0
            wwn-0x5000cca01fa81cfc  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x5000cca02a6a7e6c  ONLINE       0     0     0
            wwn-0x5000cca01fcf0128  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            wwn-0x5000cca01f8f1394  ONLINE       0     0     0
            wwn-0x5000cca0411f4c48  ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            wwn-0x5000cca01fcc73ec  ONLINE       0     0     0
            wwn-0x5000cca02a018bf8  ONLINE       0     0     0
          mirror-4                  ONLINE       0     0     0
            wwn-0x5000cca01f47b644  ONLINE       0     0     0
            wwn-0x5000cca018511124  ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            wwn-0x5000cca01fda745c  ONLINE       0     0     0
            wwn-0x5000cca02a6b3e20  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            wwn-0x5000cca01fa6e548  ONLINE       0     0     0
            wwn-0x5000cca018d80018  ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            wwn-0x5000cca02a67e1e0  ONLINE       0     0     0
            wwn-0x5000cca02a0070e0  ONLINE       0     0     0

errors: No known data errors
[root@nas01 /]# lsscsi
[0:0:0:0]    disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdb
[0:0:1:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdc
[0:0:2:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdd
[0:0:3:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sde
[0:0:4:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdf
[0:0:5:0]    disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdg
[0:0:6:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdh
[0:0:7:0]    disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdi
[0:0:8:0]    enclosu LSI      SAS2X28          0e12  -
[0:0:9:0]    disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdj
[0:0:10:0]   disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdk
[0:0:11:0]   disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdl
[0:0:12:0]   disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdm
[0:0:13:0]   disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdn
[0:0:14:0]   disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdo
[0:0:15:0]   disk    NETAPP   X412_HVIPC560A15 NA02  /dev/sdp
[0:0:16:0]   disk    NETAPP   X412_HVIPC560A15 NA04  /dev/sdq
[0:0:17:0]   enclosu LSI      SAS2X28          0e12  -
[2:0:0:0]    disk    ATA      SAMSUNG SSD CM85 3D2Q  /dev/sda

The SSD disk is a 32GB SATA DOM the server boots from.

I don't have a multipath setup, so no multipath service is installed, and therefore there is no multipath output and there are no dm-multipath devices.

I know my setup isn't as documented, but I was still hoping someone knew where these reservations came from, so I could work with/around them. I have no issues writing a resource agent to deal with those if needed, if that is what it takes.

Thanks so far. I have reinstated the fence_scsi agent now and will try another failover tomorrow; I'm in GMT+2 and it's getting late here... ;-)

ewwhite commented 5 years ago

I'd enable the multipath daemon, re-create the pool with the resulting /dev/mapper devices and add a fencing resource containing those /dev/mapper devices.

pcs stonith create fence-vol1 fence_scsi pcmk_monitor_action="metadata" pcmk_host_list="zfs-node1,zfs-node2" devices="/dev/mapper/35000c500236061b3,/dev/mapper/35000c500236032f7,/dev/mapper/35000c5007772e5ff,/dev/mapper/35000c50023614aef,/dev/mapper/35000a7203008de44,/dev/mapper/35000c500236004a3,/dev/mapper/35000c5002362ffab,/dev/mapper/35000c500236031ab,/dev/mapper/35000c50023605c1b,/dev/mapper/35000c500544508b7,/dev/mapper/35000c5002362f347" meta provides=unfencing

Your NetApp shelves should allow you to do this.
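(On CentOS 7 the first two steps would look roughly like the sketch below; the pool layout is abbreviated and the multipath device names are placeholders, since they will be WWID-based names or mpathX aliases depending on the user_friendly_names setting.)

mpathconf --enable --with_multipathd y    # write /etc/multipath.conf and start multipathd
multipath -ll                             # confirm the dm-multipath devices exist

zpool create sas01 \
    mirror /dev/mapper/35000cca01fa899a4 /dev/mapper/35000cca01fa81cfc \
    mirror /dev/mapper/35000cca02a6a7e6c /dev/mapper/35000cca01fcf0128
    # ...remaining mirrors follow the same pattern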

WanWizard commented 5 years ago

Ok, will do that tomorrow evening. Thanks for the help so far.

WanWizard commented 5 years ago

I decided to restart the project completely: formatted and cleared everything, and reinstalled CentOS and ZFS, following the wiki virtually to the letter.

The difference is that I decided not to use multipath. I had a chat with some Red Hat DC guys today, and they all advised me not to use it when there are no multiple paths in use, to avoid another layer of complexity. So I followed their advice and used "/dev/disk/by-id" instead of "/dev/mapper". It seems to work fine so far.

Just tested a few failovers by switching nodes to standby and back and faking network issues, and that seems to work fine now, including the SCSI fencing. Happy days.

Only one issue left: when ZFS fails over, the shares aren't activated after the failover, and I need to do a zfs share -a to get them active again. I put one node in standby and restarted the active node, after which the same issue occurred. /etc/default/zfs has ZFS_SHARE='yes', and zfs-share.service is enabled.

No idea where to look next. I didn't have this problem before, so I seem to roll from one issue to the next...
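(A few read-only checks that can help narrow this kind of thing down; the pool name and service names here are assumptions based on the setup described above:)

zfs get -r sharenfs sas01       # confirm the property is set on the datasets
systemctl status nfs-server     # was NFS already running when the pool was imported?
systemctl status zfs-share      # the unit that runs "zfs share -a" at boot
exportfs -v                     # what the kernel NFS server is actually exporting
zfs share -a                    # manual workaround until the ordering is fixed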

ewwhite commented 5 years ago

I don't understand what you are trying to do by avoiding multipath, as it is a key element of this design.

I understand you're seeking assistance, but you have not clearly articulated the reasoning behind not using multipath devices. If there's an architectural issue preventing multipath cabling, please explain.

WanWizard commented 5 years ago

I don't have multiple paths to my storage, so it is totally pointless to install and use multipath; every device has only one path.

I get that it is a key element of your design, but I don't have the hardware to match; I have already explained that to you (I only have one HBA per server, and only room for one).

ewwhite commented 5 years ago

I'm sorry, but the guidelines are very clear. Single HBAs aren't a problem if they have two external ports. Your equipment choices and the crafting of a workaround are not a valid support issue.

WanWizard commented 5 years ago

What an attitude. Disappointing.

I have servers with a single HBA. They have two ports each. One port is connected to shelf A, one port is connected to shelf B. The shelves themselves also only have two ports, so I CAN'T create a multipath, even if I wanted to. As I wrote yesterday, it is what it is, and then you didn't have a problem with it.

Besides that, the fact that NFS shares don't become available after a zpool import has absolutely zero to do with whether multipath is in use or not. It doesn't work either if I boot up one node while the other is switched off...

ewwhite commented 5 years ago

This is outside the scope of support because your solution is not built properly.

Regarding ZFS shares, filesystem exports are shared automatically on zpool import. sharenfs is a ZFS filesystem property, so if the filesystem is present and mounted, the sharing should work.

I suspect that your pools aren't actually exporting/importing, since the servers and disks have no knowledge of each other because there's no use of multipath devices/device names in your zpool.

rcproam commented 5 years ago

@ewwhite Thanks so much for your excellent and hard work on this project! If you're interested, I was hoping to share some work I've done to integrate your design with ZnapZend, which (as I'm sure you're aware) stores all of the snapshot & replication configuration within properties of the ZFS filesystem itself. In my testing (with minor modification) ZnapZend meshes well with your design :-)

We had the same problems with SCSI reservations; it never worked as expected. Sometimes when a failover happened, the new active server could not import the disks.

@milleroff I've encountered a very similar issue with nodes not releasing SCSI reservations during graceful failover. However, my design is based on Debian Stretch (stable), which unfortunately at present only includes older versions of pacemaker and fence-agents compared to CentOS 7. As such, I was thinking that the failover/fencing issue I'm encountering is related to this issue Red Hat documented: "RHEL 7 High Availability and Resilient Storage Pacemaker cluster experiences a fence race condition between nodes during network outages while using fence_scsi with multipath storage" https://access.redhat.com/solutions/3201072 ...but as I don't have a RHEL account I can't see which versions of pacemaker and fence-agents are affected. If you have time, might you be able to share the versions of OS, pacemaker, and fence-agents used in your implementation?

@WanWizard If you have each SAS HBA port connected to a separate shelf on both server nodes, then the physical SAS connectivity IS already multipathed. However, for this ZFS-HA design to function, and as @ewwhite has described in the wiki, it is essential to install and configure device-mapper-multipath, and use the /dev/mapper/ device IDs for the vdevs when you create the ZFS pool. Hope this helps.

ewwhite commented 5 years ago

Here are the notes from the Red Hat support article linked above:

Resolution:
- Utilizing fence_mpath instead of fence_scsi should prevent this race condition from occurring
- Ensure storage is always appropriately zoned so that all paths are functional

Root Cause: This issue occurs in the rare circumstance where your nodes have a faulty path to multipath storage but are unaware that the path is not currently offline. This can be as a result of improper zoning of devices from the storage side, misconfiguration of FCOE storage, or just a path failure occurring at the right time. If the node is unaware that one path is not accepting I/O to the device, and it hasn't been otherwise determined that the device is unavailable, it may temporarily prevent a node from adjusting scsi reservations on the device long enough for another node in the cluster to fence that device.

rcproam commented 5 years ago

Thanks for the prompt response and helpful info @ewwhite :-)

It seems the fence_mpath agent is a little more complex to setup, and requires "that /etc/multipath.conf be configured with a unique reservation_key hexadecimal value on each node, either in the defaults or in a multipath block for each cluster-shared device." https://access.redhat.com/articles/3078811
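(For context, the per-node key from that article ends up in /etc/multipath.conf roughly as sketched below; the key values are arbitrary examples and merely have to differ between the two nodes.)

# node 1: /etc/multipath.conf
defaults {
    reservation_key 0x1
}

# node 2: /etc/multipath.conf
defaults {
    reservation_key 0x2
}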

Have you tested using the fence_mpath agent with your design BTW?

WanWizard commented 5 years ago

For those finding this issue because of a similar issue:

Multipath is not a requirement in my setup: creating (and failing over) a zpool with vdevs using WWNs works fine, as WWNs are fixed and unique. I had this confirmed by my company's Red Hat system engineer.

And I've checked the ZoL code: zpool import only does a zfs share -a on import if it detects that NFS is already running. That wasn't the case for me due to resource order constraints (I have the nfsserver status directory on a dataset in the zpool so I can fail over NFS state, so ZFS must start before NFS).

I addressed this issue by modifying the nfsserver heartbeat script, adding a zfs share -a just before ocf_log info "NFS server started".
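(Roughly, that local change looks like the pseudo-diff below; the surrounding context is illustrative, since the agent differs between resource-agents versions, and the path assumes the standard OCF location /usr/lib/ocf/resource.d/heartbeat/nfsserver.)

--- nfsserver.orig
+++ nfsserver
@@ (inside the start handler) @@
+       # export any ZFS datasets whose pool was imported before NFS came up
+       zfs share -a
        ocf_log info "NFS server started"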

ewwhite commented 5 years ago

@rcproam No, I have not had a need to use fence_mpath. I don't encounter, and have not encountered, SCSI reservation problems in my builds. Definitely make sure meta provides=unfencing, pcmk_monitor_action="metadata" and pcmk_host_list= are populated in your SCSI stonith resource.

The other thing that I do these days is ensure there's a discrete heartbeat network path between nodes. I've been using a simple USB transfer cable between hosts to provide this additional link as the alternate Corosync ring.

I found this to be necessary in environments where I have MLAG/MC-LAG switches and multi-chassis LACP from the server to the switches. A switch failure with collapsed VLANs for data, heartbeat, etc. would kill all of the network links, including the Corosync rings.

That's the only other modification I've needed. I don't suspect that SCSI reservation issues are commonplace.
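(A second Corosync ring over a dedicated link is declared in the totem section of /etc/corosync/corosync.conf along these lines; the cluster name, addresses, and rrp_mode value are assumptions for a Corosync 2.x / CentOS 7 setup, not taken from this design.)

totem {
    version: 2
    cluster_name: zfs-ha
    rrp_mode: passive                 # use the second ring only if the first fails
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0         # primary (switched) network
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.255.0    # point-to-point heartbeat link (e.g. the USB NIC)
    }
}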

@WanWizard - I advise leaving the NFS service running full time on both nodes. ZFS takes care of the rest. There's no need to start/stop that service for this purpose. Note that there's no NFS server resource. Just ZFS zpool, STONITH and IP address.


 Resource Group: group-vol1
     vol1   (ocf::heartbeat:ZFS):   Started zfs1-node1
     vol1-ip    (ocf::heartbeat:IPaddr2):   Started zfs1-node1

WanWizard commented 5 years ago

As I wrote, I have nfs_shared_infodir configured to point to a ZFS dataset, so I can fail over NFS state information.

This is at the suggestion of that same Red Hat engineer, and documented here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/s1-resourcegroupcreatenfs-haaa

Doing so requires the zpool and its datasets to be available before NFS starts, and that can only be achieved using the nfsserver resource in combination with order constraints.

It does create a chicken-and-egg problem, I understand that now. Red Hat's examples are based on DRBD, which doesn't have this problem. I worked around it, and I don't have a problem with that. I just wanted to report it back for future reference.
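(The ordering and colocation described here are typically expressed with pcs constraints along these lines; the resource names are placeholders, not the actual ones from this cluster.)

pcs constraint order start vol1 then nfs-daemon
pcs constraint colocation add nfs-daemon with vol1 INFINITY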

rcproam commented 5 years ago

@ewwhite Thanks again for the prompt reply and helpful tips! :-)

As the fence_scsi vs fence_mpath agent topic is out of scope for this issue, I've opened a new issue to track if it resolves the fencing problem with my particular implementation:

#26

rbicelli commented 5 years ago

Hello, I think I'm in the same situation as @WanWizard. I have a Dell MD1220, which basically has 2 EMM controllers with 2 SAS ports each. Unfortunately the 2nd port is unusable for multipath because it is reserved for the daisy chain between enclosures only. So I'm forced to connect only one SAS cable per EMM (EMM1 on host1 HBA, EMM2 on host2 HBA). Disks are SAS dual port.

I have multipath enabled, but obviously multipath -ll shows a single path for each disk on each host.

However it seems that failover is working with no issues.

My question is: can I stay safe with this setup, or do I have to migrate to a full dual-port SAS solution?

WanWizard commented 5 years ago

The only additional risk you run is that a cable or connection issue between the active node and one of the enclosures will trigger a failover, whereas with multipath the active node would remain active and would use the second path.

It depends on your situation, but in my case everything runs in a locked rack that nobody ever opens, the chance of connection issues is very slim, and a failover because of one is not a problem (that is why I have two nodes, right?). In my case the replacement costs far outweigh the risks.

ewwhite commented 5 years ago

I just read through the technical guidebook for the Dell MD1220. The manual says that clustering is not supported on the enclosure.

What happens if you create a SAS Multipath ring and use the Out ports on the EMM? The manual says the ports may be disabled depending on the enclosure mode (split/unified). If this doesn’t work, I guess that means this Dell is not an ideal enclosure for ZFS clustering purposes.


On Apr 29, 2019, at 11:25 AM, Riccardo wrote:

Hello, I think I'm in the same situation as @WanWizard. I have a Dell MD1220 which basically has 2 EMM controllers with 2 SAS ports. Unfortunately the 2nd port is unusable for multipath because is reserved for the daisy chain between enclosures only.

So I'm looking for a way to preserve my MD1220 and have a reliable system even in case of hard failure.


rbicelli commented 5 years ago

By connecting a SAS HBA to the Out port on the EMM, nothing happens; multipath -ll says there's only one possible path.

However, I tried some failover tests and everything seemed fine.

I also think that the Dell MD1200 should be removed from the wiki, since it is equipped with the same EMM controllers as the MD1220.


ewwhite commented 5 years ago

@rbicelli The point of the dual cabling is to provide HBA, port, cable and controller resilience. I suppose you could have a situation where you lose a cable, and that's tantamount to losing the entire node. So technically, things would work. It just means a cluster failover is triggered in more circumstances.

The limitation of the MD1220 controller is disappointing to see.

Nooby1 commented 2 years ago

I am having this issue as well. My setup:

- two HP BL460c G8 blades with HP P721m RAID controllers in HBA mode, 8.32 firmware
- two 6G SAS blade switches (latest firmware), SAS switch zoned in bay mode
- D2700 plugged in with top and bottom module in left and right SAS switch
- HP D2700 DAS (150 firmware) with 20 600GB 10K SAS disks

Red Hat 8, ZFS 2.05, encrypted zpool, imported with -l and -d /dev/multipath/ and the key on a different disk.

multipath -ll shows two paths for each disk. I populated the STONITH resource with meta provides=unfencing, pcmk_monitor_action="metadata" and pcmk_host_list=.

multipathd was initially trying to use TUR (Test Unit Ready) as the path checker; this was giving reservation errors on the passive node every time it tried to check whether a path was available. I changed it to use directio, and this stopped the errors on the passive node during path checks.
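(For anyone wanting the same change, the path checker is set in /etc/multipath.conf, for example in the defaults section; this is a sketch of that one setting, not a complete configuration.)

defaults {
    path_checker directio    # avoid TUR, which can trip over reservations held by the other node
}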

However, on failover I still get reservation issues causing a failure to fail over:

sd 1:0:62:0: reservation conflict
hpsa 0000:21:00.0 cp xxxxxxxx has status 0x18 sense: 0xff, ASC: 0xff, ASCQ: 0xff, Returning result: 0x18
zio pool=zpool1 dev=/dev/mapper/disk error=52 type=2 offset=78633883483416 size=8192 flags=b08c1

sg_persist also shows that the disk has reservations, like the original poster's.