ewwhite / zfs-ha

ZFS High-Availability NAS

Stonith breaks group resources #10

Closed: funnymanva closed this issue 6 years ago

funnymanva commented 7 years ago

First off, thanks for this guide. I've been looking at building the same kind of system on my own, and you've jump-started that process. I'm using a Dell MD3200 disk shelf since that's what I had lying around. I'm not sure whether that's causing my issues, but I had to completely disable STONITH, or the group resources always show as stopped when one node is down (or rebooted). I can watch the ZFS pool move properly to the second server when the first is rebooted, but when the first comes back, the resources all go to stopped. I followed the guide completely, so I used the same setup and configuration. I don't fully understand what the SCSI STONITH agent is supposed to do, so I have some research to do there.

Also, just an FYI: I did a minimal install of CentOS 7 with the latest CD, and to get NFS to work properly I had to enable and start both nfs-lock and nfs-idmap. You may want to add that to your guide.
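For reference, a minimal sketch of the commands this comment describes (service names taken from the comment itself; a later comment in this thread notes that enabling and starting nfs was what was actually required):

```sh
# CentOS 7: enable and start the NFS helper services mentioned above.
# On many CentOS 7 builds these are static units pulled in by the nfs
# service, so enabling/starting nfs itself may be all that is needed.
systemctl enable nfs-lock nfs-idmap
systemctl start nfs-lock nfs-idmap
```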

ewwhite commented 7 years ago

Are you using SAS disks?

You can try pcs resource cleanup. But we may have to look at your logs.
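For anyone following along, this is the stock pcs cleanup workflow; the resource group name below is a placeholder, not taken from this setup:

```sh
# Clear failed actions so Pacemaker re-evaluates resource placement.
pcs resource cleanup                 # clean up all resources
pcs resource cleanup group-zfs-nfs   # or a specific resource/group (hypothetical name)
pcs status                           # verify the group starts on the expected node
```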

funnymanva commented 7 years ago

I have SAS disks in the Dell MD3200 enclosure, but it doesn't support JBOD, so I had to create 10 single-disk RAID 0 logical disks. I don't have a log or cache drive. The main thing I see in the logs is the following from fence_scsi:

Jul 19 15:52:52 pac-storage-01 stonith-ng[3044]: warning: fence_scsi[3347] stderr: [ ]
Jul 19 15:52:53 pac-storage-01 fence_scsi: Failed: nodename or key is required
Jul 19 15:52:53 pac-storage-01 fence_scsi: Please use '-h' for usage
Jul 19 15:52:53 pac-storage-01 stonith-ng[3044]: warning: fence_scsi[3346] stderr: [ Failed: nodename or key is required ]
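For context, "nodename or key is required" generally means the fence_scsi agent could not work out which cluster node or reservation key it was acting for. A minimal sketch of how such a stonith resource is commonly defined with pcs; the resource name and device path are assumptions, and the second node name is guessed from the first:

```sh
# Sketch only: register a fence_scsi stonith device that knows the cluster
# node names and the shared disks holding the SCSI-3 reservations.
pcs stonith create fence-scsi fence_scsi \
    pcmk_host_list="pac-storage-01 pac-storage-02" \
    devices="/dev/disk/by-id/EXAMPLE-DISK-ID" \
    meta provides=unfencing
```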

Not sure if any of my setup above is what keeps this from working. I've since figured out that I just needed systemctl enable nfs and systemctl start nfs to get the NFS share working properly, not the nfs-lock and nfs-idmap services I mentioned above; those are just static units pulled in by nfs.

However, I do have one strange anomaly: after a fresh reboot of both servers (say, a total power loss), the share isn't present. It will move from one server to the other, but it isn't available at boot if it's not failing over. Is there something I've missed to make sure the NFS share for ZFS is created properly at boot rather than inherited from the last active server when it fails?
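Not part of the original exchange, but a hedged sketch of the usual checks for a share that disappears after a cold boot, assuming a placeholder pool name of tank shared via the ZFS sharenfs property:

```sh
# Make sure the NFS server starts on every boot on BOTH nodes, so that
# whichever node imports the pool can actually export the share.
systemctl enable nfs

# After the cluster imports the pool, confirm ZFS re-applied the share.
zfs get sharenfs tank   # 'tank' is a placeholder pool name
zfs share -a            # re-share all ZFS datasets if needed
exportfs -v             # verify the export is visible to clients
```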

ewwhite commented 7 years ago

None of this would work with a hardware RAID solution. What type of controllers/HBAs do you have installed?

funnymanva commented 7 years ago

It's a Dell PERC 6/i. The 10 single-disk RAID 0s make the drives appear to the OS as 10 individual disks; it seems to be the only way around not having JBOD available. I'm getting the 6G disk shelf from your write-up and the same HBA card, which I'll be setting up as well; this was just my first attempt to get this working with the parts I had on hand.

And it is working. I have a Proxmox VE server connected via NFS, and I can power off one of the storage servers and the other takes over almost immediately. I disabled the fence_scsi STONITH resource, and I'm not sure whether that will be an issue later. I know the Dell MD3200 disk shelf allows access from multiple machines, so maybe it isn't a problem for my test setup to run without STONITH at all. I'm not really 100% sure what the fence_scsi STONITH agent is needed for, since I'm just getting my feet wet with HA.

The zpool lives only on the active server and moves properly to the second one on power-off, so that all works, once I figured out I needed the nfs service started but not driven directly by /etc/exports, since I'm using the ZFS sharenfs property for that. It's just not persistent when I'm starting cold with no server currently running.
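An aside on the STONITH question: fence_scsi works by placing SCSI-3 persistent reservations on the shared disks, which virtual disks presented by a hardware RAID controller such as the PERC 6/i typically do not pass through (which lines up with the earlier comment that none of this works with hardware RAID). A quick way to check, with sg3_utils installed and a placeholder device path:

```sh
# Query SCSI-3 persistent reservation support on a shared disk.
# /dev/sdb is a placeholder; point this at a disk backing the pool.
sg_persist --in --read-keys --device=/dev/sdb
sg_persist --in --read-reservation --device=/dev/sdb
```

If these commands fail or report that reservations are unsupported, fence_scsi cannot protect the pool on that hardware, and a different fencing method would be needed before relying on the cluster for anything beyond testing.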