ClusterLabs / fence-agents

Fence agents

fence_scsi is not rebooting the node #530

Open calippus opened 1 year ago

calippus commented 1 year ago

Hello,

I am having a problem with fencing in our environment.

When I manually fence node1 from node2:

[root@clus2 ~]# pcs stonith fence clus1
Node: clus1 fenced
[root@clus2 ~]# 

The fence operation reports "OK", but the node does not reboot: Pacemaker shuts down and then the node stays alive. See the logs from node1:

Mar 03 07:49:14 cn1 pacemaker-fenced[1018]:  notice: scsi is eligible to fence (reboot) cn1: static-list
Mar 03 07:49:14 cn1 pacemaker-fenced[1018]:  notice: Operation 'reboot' targeting cn1 by cn2 for stonith_admin.1940@cn2: OK (complete)
Mar 03 07:49:14 cn1 pacemaker-controld[1022]:  crit: We were allegedly just fenced by cn2 for cn2!
Mar 03 07:49:14 cn1 pacemaker-execd[1019]:  warning: new_event_notification (/dev/shm/qb-1019-1022-7-wtDpTb/qb): Bad file descriptor (9)
Mar 03 07:49:14 cn1 pacemaker-execd[1019]:  warning: Could not notify client crmd: Bad file descriptor
Mar 03 07:49:14 cn1 pacemaker-based[1017]:  warning: new_event_notification (/dev/shm/qb-1017-1022-11-mvjvmC/qb): Broken pipe (32)
Mar 03 07:49:14 cn1 pacemaker-based[1017]:  warning: Could not notify client crmd: Broken pipe
Mar 03 07:49:14 cn1 pacemakerd[1012]:  warning: Shutting cluster down because pacemaker-controld[1022] had fatal failure
Mar 03 07:49:14 cn1 pacemakerd[1012]:  notice: Shutting down Pacemaker
Mar 03 07:49:14 cn1 pacemakerd[1012]:  notice: Stopping pacemaker-schedulerd
Mar 03 07:49:14 cn1 pacemaker-schedulerd[1021]:  notice: Caught 'Terminated' signal
Mar 03 07:49:14 cn1 pacemakerd[1012]:  notice: Stopping pacemaker-attrd
Mar 03 07:49:14 cn1 pacemaker-attrd[1020]:  notice: Caught 'Terminated' signal
Mar 03 07:49:14 cn1 pacemakerd[1012]:  notice: Stopping pacemaker-execd
Mar 03 07:49:14 cn1 pacemaker-execd[1019]:  notice: Caught 'Terminated' signal
Mar 03 07:49:14 cn1 pacemakerd[1012]:  notice: Stopping pacemaker-fenced
Mar 03 07:49:14 cn1 pacemaker-fenced[1018]:  notice: Caught 'Terminated' signal
Mar 03 07:49:14 cn1 pacemakerd[1012]:  notice: Stopping pacemaker-based
Mar 03 07:49:14 cn1 pacemaker-based[1017]:  notice: Caught 'Terminated' signal
Mar 03 07:49:14 cn1 pacemaker-based[1017]:  notice: Disconnected from Corosync
Mar 03 07:49:14 cn1 pacemaker-based[1017]:  notice: Disconnected from Corosync
Mar 03 07:49:14 cn1 pacemakerd[1012]:  notice: Shutdown complete
Mar 03 07:49:14 cn1 pacemakerd[1012]:  notice: Shutting down and staying down after fatal error
Mar 03 07:49:14 cn1 systemd[1]: pacemaker.service: Succeeded.
Mar 03 07:49:14 cn1 corosync[926]:   [CFG   ] Node 1 was shut down by sysadmin
Mar 03 07:49:14 cn1 corosync[926]:   [SERV  ] Unloading all Corosync service engines.
Mar 03 07:49:14 cn1 corosync[926]:   [QB    ] withdrawing server sockets
Mar 03 07:49:14 cn1 corosync[926]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Mar 03 07:49:14 cn1 corosync[926]:   [QB    ] withdrawing server sockets
Mar 03 07:49:14 cn1 corosync[926]:   [SERV  ] Service engine unloaded: corosync configuration map access
Mar 03 07:49:14 cn1 corosync[926]:   [QB    ] withdrawing server sockets
Mar 03 07:49:14 cn1 corosync[926]:   [SERV  ] Service engine unloaded: corosync configuration service
Mar 03 07:49:14 cn1 corosync[926]:   [QB    ] withdrawing server sockets
Mar 03 07:49:14 cn1 corosync[926]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 03 07:49:14 cn1 corosync[926]:   [QB    ] withdrawing server sockets
Mar 03 07:49:14 cn1 corosync[926]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 03 07:49:14 cn1 corosync[926]:   [SERV  ] Service engine unloaded: corosync profile loading service
Mar 03 07:49:15 cn1 corosync[926]:   [MAIN  ] Corosync Cluster Engine exiting normally
Mar 03 07:49:15 cn1 systemd[1]: corosync.service: Control process exited, code=exited status=1
Mar 03 07:49:15 cn1 systemd[1]: corosync.service: Failed with result 'exit-code'.

This is the stonith configuration:

[root@cn2 ~]# pcs stonith config
Resource: scsi (class=stonith type=fence_scsi)
 Attributes: scsi-instance_attributes
   debug_file=/root/fence.debug
   devices=/dev/sdb
   pcmk_host_list=cn1,cn2
   pcmk_reboot_action=off
   verbose=yes
 Meta Attributes: scsi-meta_attributes
   provides=unfencing
 Operations:
   monitor: scsi-monitor-interval-60s
     interval=60s
[root@cn2 ~]#

I couldn't find out why the node is not rebooting. Could you please help with this? Thanks in advance.

oalbrigt commented 1 year ago

fence_scsi is meant to cut off access to shared storage, so that e.g. your database or other resource(s) aren't able to write to it when the node fails.

To reboot the node you should use one of the redfish/ipmilan agents (for iLO, iDRAC, etc.), or fence_xvm with fence-virtd on the host node for virtual machines.
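
For example, on hardware with a BMC, something like this (just a sketch; the addresses, credentials and resource names are placeholders you'd replace with your own):

# hypothetical BMC addresses/credentials; one fence device per node
pcs stonith create ipmi-cn1 fence_ipmilan ip=10.0.0.101 username=admin password=secret lanplus=1 pcmk_host_list=cn1
pcs stonith create ipmi-cn2 fence_ipmilan ip=10.0.0.102 username=admin password=secret lanplus=1 pcmk_host_list=cn2
# prefer running each fence device away from the node it fences
pcs constraint location ipmi-cn1 avoids cn1
pcs constraint location ipmi-cn2 avoids cn2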

For other scenarios you can use fence_sbd with a poison pill on shared storage.
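
A rough sketch of the poison-pill variant (untested here; /dev/sdc stands in for a small dedicated shared disk):

# initialize the shared SBD device (placeholder /dev/sdc)
sbd -d /dev/sdc create
# enable the sbd service cluster-wide (the cluster must be stopped for this)
pcs stonith sbd enable device=/dev/sdc
# create the poison-pill fence device
pcs stonith create sbd-fence fence_sbd devices=/dev/sdc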

calippus commented 1 year ago

This is a VMware environment (so no ipmilan agents), and I also don't have access to use fence_vmware_soap.

One can use fence_scsi as a stonith device, and by the definition of stonith, fence_scsi should do the job. I have used this agent before in a different system, and it was working.

Anyway, as shown in the log, the 'reboot' operation was started, so it should be able to reboot the node.

wenningerk commented 1 year ago

If you want a "real" reboot you could still go for an SBD setup.

SBD relies heavily on a reliable watchdog, which makes SBD on VMware a bit critical: below vSphere 7 the only option was softdog, and from vSphere 7 on there is a virtual watchdog implementation. To my current knowledge, both have issues guaranteeing a reliable reboot within a defined timeout in certain scenarios (migration, pausing, ...). With that in mind, you could still go for an SBD setup depending on what your cluster is intended for (a test cluster, ...).

Since you have SCSI fencing set up already, you could try putting fence_scsi and fence_sbd into a fencing topology (fence_scsi first and fence_sbd second, on the same level). If you keep e.g. your database on the SCSI device, this guarantees protection against database corruption and still gives you a fairly reliable reboot of the fenced node. I haven't done any testing with this setup, but it should work.
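
Roughly something like this, assuming a fence_sbd device called sbd-fence already exists (untested sketch):

# put both devices on the same topology level so that fencing only succeeds
# once the SCSI reservation is revoked AND the poison pill reboots the node
pcs stonith level add 1 cn1 scsi sbd-fence
pcs stonith level add 1 cn2 scsi sbd-fence
# older pcs versions expect the device list comma-separated: scsi,sbd-fence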

From the fencing configuration above it looks as if you're running a 2-node cluster. That basically gives you two options for SBD: poison pill (fence_sbd) with a shared disk, or watchdog fencing if you add either qdevice or a third node for real quorum forming. If your Pacemaker version is recent enough (the easiest check is the existence of /usr/sbin/fence_watchdog), you can use the watchdog in a topology in the same way as fence_sbd with a poison pill.
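
If fence_watchdog is available, the watchdog variant could look roughly like this (untested sketch; the timeouts and resource name are placeholders, and quorum via qdevice or a third node is assumed to be in place):

# run sbd in watchdog-only mode (no shared device) and tell Pacemaker about the watchdog timeout
pcs stonith sbd enable SBD_WATCHDOG_TIMEOUT=10
pcs property set stonith-watchdog-timeout=20
# expose the watchdog as an explicit fence device so it can sit in a topology level next to fence_scsi
pcs stonith create watchdog fence_watchdog
pcs stonith level add 1 cn1 scsi watchdog
pcs stonith level add 1 cn2 scsi watchdog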

calippus commented 1 year ago

Thanks a lot for the information and explanation. I read that SBD is not supported on VMware; that's the reason why I didn't try it (see https://access.redhat.com/articles/3131271).

But I will try to configure it now; let's see.

Meanwhile, I have installed the watchdog daemon together with fence_scsi. The first tests have been successful so far.
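
For reference, my setup roughly follows the documented fence_scsi/watchdog integration (the script path is what the fence-agents-scsi package ships on my system and may differ elsewhere):

# the check script reboots the node via the watchdog daemon once the node's
# SCSI reservation key has been removed from the shared device
cp /usr/share/cluster/fence_scsi_check /etc/watchdog.d/
systemctl enable --now watchdog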

wenningerk commented 1 year ago

That is why I was very careful about suggesting SBD for your scenario - but since it had already been mentioned ...

You are right - for just getting the "real" reboot, using the watchdog daemon in combination with fence_scsi should be a possibility. Personally I have no experience with that combination and I haven't looked into how it is done in detail - neither the setup nor the implementation. But I would assume that you might have to be careful with recovery of resources that don't use the disk (like an IP address), as I'm not sure there is a mechanism that guarantees enough time is left for the watchdog to trigger (leaving aside, of course, the uncertainty of whether the watchdog triggers within the given timeout at all).