ewwhite / zfs-ha

ZFS High-Availability NAS

Split Brain when logged in user CWDed into ZFS volume #38

Open rbicelli opened 3 years ago

rbicelli commented 3 years ago

Hi, consider this scenario: a user is logged in on the active node with their current working directory inside the mounted ZFS volume when a failover is initiated.

I observed that a fence action is triggered.

The worst part is that the fence action doesn't work as expected: the volume stays mounted on both nodes, causing ZFS errors (and file corruption). I assume the SCSI reservations are somehow not honored.

I triple-checked the configuration and it looks OK.
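
For what it's worth, this is roughly how I would sanity-check that the SCSI-3 persistent reservations are actually in place on the shared disks (a minimal sketch, assuming fence_scsi/fence_mpath-style reservations, sg3_utils installed, and /dev/mapper/mpatha as a placeholder for one of the shared devices):

# List the registration keys present on a shared device (placeholder path)
sg_persist --in --read-keys --device=/dev/mapper/mpatha

# Show the active reservation (type and holder), if any
sg_persist --in --read-reservation --device=/dev/mapper/mpatha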

Since I'm planning to add sanoid/syncoid for snapshot/replica send, I would like to avoid a split brain if a failover happens while a process on a node is still using the filesystem.

I think this behaviour is easily reproducible.
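
A rough way to reproduce it (just a sketch, assuming a Pacemaker/pcs cluster with a pool/resource named vol1 mounted at /vol1; the names are placeholders):

# On the active node, keep a shell parked inside the ZFS mountpoint
cd /vol1

# From another shell, force a failover of the resource to the other node
pcs resource move vol1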

rbicelli commented 3 years ago

The relevant portion of the log is (sorry for the cut, but I was in a split-screened shell):

Apr 17 16:51:25 zsan02 crmd[3513]:  notice: Result of stop operation for vol1-ip on zsan02: 0 (ok)                                                                                    
Apr 17 16:51:25 zsan02 lrmd[3510]:  notice: vol1_stop_0:34973:stderr [ /usr/lib/ocf/resource.d/heartbeat/ZFS: line 35: [: : integer expression expected ]                             
Apr 17 16:51:25 zsan02 lrmd[3510]:  notice: vol1_stop_0:34973:stderr [ umount: /vol1: target is busy. ]                                                                               
Apr 17 16:51:25 zsan02 lrmd[3510]:  notice: vol1_stop_0:34973:stderr [         (In some cases useful info about processes that use ]                                                  
Apr 17 16:51:25 zsan02 lrmd[3510]:  notice: vol1_stop_0:34973:stderr [          the device is found by lsof(8) or fuser(1)) ]                                                         
Apr 17 16:51:25 zsan02 lrmd[3510]:  notice: vol1_stop_0:34973:stderr [ cannot unmount '/vol1': umount failed ]                                                                        
Apr 17 16:51:25 zsan02 lrmd[3510]:  notice: vol1_stop_0:34973:stderr [ /usr/lib/ocf/resource.d/heartbeat/ZFS: line 35: [: : integer expression expected ]                             
Apr 17 16:51:25 zsan02 crmd[3513]:  notice: Result of stop operation for vol1 on zsan02: 1 (unknown error)                                                                            
Apr 17 16:51:25 zsan02 crmd[3513]:  notice: zsan02-vol1_stop_0:63 [ /usr/lib/ocf/resource.d/heartbeat/ZFS: line 35: [: : integer expression expected\numount: /vol1: target is busy.\n        (In some cases useful info about processes that use\n         the device is found by lsof(8) or fuser(1))\ncannot unmount '/vol1': umount failed\n/usr/lib/ocf/resource.d/heartbeat/ZFS: line 35: [: : integer expression expected\n ]
Apr 17 16:51:25 zsan02 stonith-ng[3509]:  notice: fence-vol1 can fence (reboot) zsan02: static-list                                                                                   
Apr 17 16:51:25 zsan02 stonith-ng[3509]:  notice: fence-vol2 can fence (reboot) zsan02: static-list                                                                                   
Apr 17 16:51:25 zsan02 stonith-ng[3509]:  notice: fence-vol3 can fence (reboot) zsan02: static-list                                                                                   
Apr 17 16:51:26 zsan02 stonith-ng[3509]:  notice: Operation 'reboot' targeting zsan02 on zsan01 for crmd.3980@zsan01.36ba1933: OK

It looks like that when something is using the filesystem locally, the resource agent is unable to stop the filesystem, fails, and triggers a fence event. The fencing then doesn't actually happen (I've configured iDRAC, but it doesn't power-cycle the node when I trigger a fence), but that's another story.

The same behaviour occurs with a zfs send in progress.
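
The umount error above already points at the diagnosis: something is holding /vol1 busy. For example, the offending processes can be listed with (assuming /vol1 is the pool's mountpoint):

# Show processes holding the mountpoint busy (fuser is from psmisc)
fuser -vm /vol1

# Or list open files on the filesystem mounted at /vol1
lsof /vol1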

In order to mitigate this issue I wrote a helper script, which I put in /usr/lib/ocf/lib/heartbeat/helpers/zfs-helper:

#!/bin/bash
# Pre-export script for a ZFS pool.
# Kills any process still using files in the zpool so that the
# resource agent can unmount/export it cleanly.
# Requires lsof, ps, awk, sed

zpool_pre_export () {

        # Forcibly terminate all PIDs using the zpool
        ZPOOL=$1
        # Exit gracefully anyway, for now
        RET=0

        # Kill everything holding files open under the pool's mountpoint
        lsof /"$ZPOOL"{*,/*} 2>/dev/null | awk '{print $2}' | sed -e "1d" | sort -u | \
        while read PID
        do
                echo "Terminating PID $PID"
                kill -9 "$PID" 2>/dev/null
        done

        # Also kill blocking ZFS operations that reference the pool on
        # their command line, such as a running "zfs send ...",
        # taking care not to match the grep itself or this very script
        ps aux | grep "$ZPOOL" | grep -v grep | awk '{print $2}' | \
        while read PID
        do
                [ "$PID" = "$$" ] && continue
                echo "Terminating PID $PID"
                kill -9 "$PID" 2>/dev/null
        done

        exit $RET
}

case $1 in
        pre-export)
                zpool_pre_export "$2"
                ;;
esac
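
It can be exercised by hand before wiring it into the resource agent (here "vol1" is a placeholder pool name; note that it really does kill -9 anything it matches):

# Manual invocation of the helper's pre-export action
bash /usr/lib/ocf/lib/heartbeat/helpers/zfs-helper pre-export vol1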
intentions commented 3 years ago

Wouldn't using the multihost protection prevent the second host from mounting the pool?

rbicelli commented 3 years ago

Wouldn't using the multihost protection prevent the second host from mounting the pool?

I wasn't aware of this feature. I've enabled it and am testing it.
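
For the record, enabling it looks roughly like this (assuming the pool is named vol1; multihost also requires each node to have a unique, stable hostid, which on OpenZFS can be generated with zgenhostid):

# Give each node a unique, persistent hostid (writes /etc/hostid)
zgenhostid

# Enable MMP (multi-modifier protection) on the pool
zpool set multihost=on vol1

# Verify the property
zpool get multihost vol1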

Nooby1 commented 2 years ago

I have put it in /usr/lib/ocf/lib/heartbeat/zfs-helper.sh, as there is no helper directory in RHEL8 and there are other scripts in this directory.

Does anything else have to be done for this on RHEL8?

rbicelli commented 2 years ago

I don't remember, since months have passed, but it's possible that I needed to create the required directory.
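
If anyone wants to follow the original layout, something like this should do it (the path is the one from my earlier comment; adjust it to wherever your ZFS resource agent expects the helper):

# Create the helpers directory and install the script with execute permissions
mkdir -p /usr/lib/ocf/lib/heartbeat/helpers
install -m 0755 zfs-helper /usr/lib/ocf/lib/heartbeat/helpers/zfs-helper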