SUSE / SAPHanaSR

SAP HANA System Replication Resource Agent for Pacemaker Cluster
GNU General Public License v2.0

Unexpected DB outage when cluster is removed from maintenance mode after cluster service restart. #121

Open sairamgopal opened 2 years ago

sairamgopal commented 2 years ago

Issue: On a fully operational cluster, when the cluster is put into maintenance mode and the Pacemaker/cluster service is restarted, then after removing the cluster from maintenance mode the DB on the primary is stopped and started again, which results in an outage for the customers.

Recreate the issue with the steps below:

  1. Make sure the cluster is fully operational, with one promoted and one demoted node and HSR in sync.
  2. Put the cluster into maintenance mode ( crm configure property maintenance-mode=true )
  3. Stop the cluster service on both nodes ( crm cluster stop )
  4. Start the cluster service on both nodes ( crm cluster start )
  5. Take the cluster out of maintenance mode ( crm configure property maintenance-mode=false )

After step 5, the DB on the primary is restarted, or sometimes a failover is triggered.

Reason: If you attempt to start cluster services on a node while the cluster or node is in maintenance mode, Pacemaker initiates a single one-shot monitor operation (a "probe") for every resource to evaluate which resources are currently running on that node. However, it takes no further action other than determining the resources' status.

So after step 4, a probe is initiated for the SAPHana and SAPHanaTopology resources.

In SAPHanaTopology, when the monitor clone function identifies the call as a probe, it only checks and sets the attribute for the HANA version; it does not check the current cluster state at all. Because of this, the "hana__roles" and "master-rscSAPHana_HDB42" attributes are not set on the cluster primary.
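
(For context, an OCF resource agent typically recognizes a probe as a one-shot monitor, i.e. a monitor call with interval 0. A minimal sketch of that check, following the standard ocf-shellfuncs convention; the exact code inside SAPHanaTopology may differ:)

# sketch: how OCF agents commonly detect a probe (ocf-shellfuncs convention)
ocf_is_probe() {
    # a probe is a one-shot monitor operation, i.e. a monitor with interval 0
    [ "$__OCF_ACTION" = "monitor" ] && [ "$OCF_RESKEY_CRM_meta_interval" = 0 ]
}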

Also, during the probe the SAPHana resource agent tries to read the role attribute (which has not been set by that time) and sets the score to 5. Later, when the cluster is removed from maintenance mode, the resource agent checks the roles attribute and its score; because those values are not as expected, the agent tries to repair the cluster, and a DB stop/start happens.

Resolution: If we add a check to identify the status of the primary node and set the "hana__roles" attribute during the probe, then when the cluster is removed from maintenance mode it will see an operational primary node and will not try to stop and start the DB or trigger a failover.
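
A rough sketch of that idea, with hypothetical variable names and a simplified primary check (the real agent derives sidadm, the attribute name and the roles value through its own helpers):

# sketch of the proposed probe-time check (hypothetical names, simplified)
if ocf_is_probe; then
    # ask HANA directly whether this node is a running system replication primary
    sr_mode=$(su - "$sidadm" -c "hdbnsutil -sr_stateConfiguration" | awk '/^mode:/ {print $2}')
    if [ "$sr_mode" = "primary" ]; then
        # publish the roles attribute so that, after maintenance-mode=false, the cluster
        # sees an operational primary instead of restarting the DB
        # ("4:P:master1:master:worker:master" is an example value of a healthy scale-up primary)
        crm_attribute -N "$NODENAME" -n "hana_${sid}_roles" -v "4:P:master1:master:worker:master" -l reboot
    fi
fi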

I have already modified the code and tested multiple scenarios; cluster functionality is not disturbed and the mentioned issue is resolved. I don't think these changes to the SAPHana resource agent will cause additional issues because, during the probe, we set the attributes only if we identify the primary node. But I need your expertise to check and finalize whether this approach can be used, or to suggest another alternative/fix for the issue described above.

sairamgopal commented 2 years ago

I am not able to push my code to a develop branch, so I am uploading the modified code as a .md file.

I used the code from Jun 30, 2022, which is working, because the code committed on Sep 5th seems to have a bug. SAPHana_Jun30.md

The modified code runs from line 2713 to line 2780.

BTW, SUSE support case 00360697 was already opened for the same issue, but the solution provided does not address the root cause.

ksanjeet commented 2 years ago

Hello @sairamgopal

The maintenance procedure that you seem to follow looks incomplete. The maintenance procedure defined in https://github.com/SUSE/SAPHanaSR/blob/master/man/SAPHanaSR_maintenance_examples.7, in the section "Overview on maintenance procedure for Linux, HANA remains running, on pacemaker-2.0.", requires that the resources be refreshed so that their current status is known. I am quoting below from the manpage:

"6. Let Linux cluster detect status of HANA resource, on either node.

crm resource refresh cln_...

crm resource refresh msl_..."

Moreover, the maintenance procedure has been updated to set maintenance only on the msl_ resource and not on the whole cluster. Please refer to section "11.3.2 Updating SAP HANA - seamless SAP HANA maintenance" of https://documentation.suse.com/sbp/all/single-html/SLES4SAP-hana-sr-guide-costopt-15/#cha.hana-sr.administrate

sairamgopal commented 2 years ago

Hi @ksanjeet, thanks for your update. I performed the steps below, but the issue still exists (the DB was stopped and started on the primary):

1. crm maintenance on
2. crm cluster stop ( on both nodes )
3. crm cluster start ( on both nodes )
4. crm resource refresh cln_... 
5. crm resource refresh msl_...
6. crm maintenance off

The crm resource refresh commands didn't make any difference in the SAPHana-showAttr command output:

[root@sl12hans ~]# crm resource refresh cln_SAPHanaTopology_S4C_HDB42
Cleaned up rsc_SAPHanaTopology_S4C_HDB42:0 on sl12hans-ha
Cleaned up rsc_SAPHanaTopology_S4C_HDB42:1 on sl12hans
Waiting for 2 replies from the CRMd.. OK

[root@sl12hans ~]# crm resource refresh msl_SAPHana_S4C_HDB42
Cleaned up rsc_SAPHana_S4C_HDB42:0 on sl12hans-ha
Cleaned up rsc_SAPHana_S4C_HDB42:1 on sl12hans
Waiting for 2 replies from the CRMd.. OK

After running the refresh commands, the output of the SAPHana-showAttr command is below:
--------------------------------------------------------------------------------------------
Global cib-time                 maintenance
--------------------------------------------
global Mon Sep 26 09:40:36 2022 true
Resource              maintenance
----------------------------------
msl_SAPHana_S4C_HDB42 false
Hosts       clone_state lpa_s4c_lpt node_state remoteHost  score     site  srah srmode standby version                vhost
----------------------------------------------------------------------------------------------------------------------------------
sl12hans    DEMOTED     30          online     sl12hans-ha -INFINITY NODEA -    sync   off     2.00.053.00.1605092543 sl12hans
sl12hans-ha             1664184691  online     sl12hans    -1        NODEB -    sync   off     2.00.053.00.1605092543 sl12hans-ha

#######################################################################################

2. The maintenance procedure has been updated to only set maintenance on the msl_ resource and not on the whole cluster. Please refer to section "11.3.2 Updating SAP HANA - seamless SAP HANA maintenance"

Yes, I have gone through those steps. A SUSE support engineer provided that document before, but I feel that the procedure of setting only the msl resource to maintenance is just a workaround for the issue.

As per the Pacemaker documentation in section 17.9, https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-maintenance.html#sec-ha-maint-shutdown-node-maint-mode:

If you attempt to start cluster services on a node while the cluster or node is in maintenance mode, Pacemaker will initiate a single one-shot monitor operation (a “probe”) for every resource to evaluate which resources are currently running on that node.

It says the node can be in maintenance mode (not only a specific resource) and Pacemaker will initiate a probe for every resource. But this is not happening: a probe is not initiated for the msl resource. That is why we have the issue, and why we have the workaround steps mentioned in the section "Updating SAP HANA - seamless SAP HANA maintenance".

If we include a check for the primary node status and set the role attribute during the probe in the SAPHana resource agent, then the SAPHana resource agent will set the score based on the role attribute, and that resolves the issue. This will also not disturb cluster functionality, because we do this check only during the probe and set the attribute only when we identify a fully operational primary.

ksanjeet commented 2 years ago

Dear @sairamgopal ,

As per the output of SAPHana-showAttr, I would have expected the "score" attribute to have been updated, and it seems that you may not have waited long enough for the cluster to stabilize before running commands one after the other (I am just making a guess). You can check whether the cluster has stabilized by running "cs_clusterstate -i" and looking for the value "S_IDLE". If the value is anything different, such as "UNKNOWN", "S_TRANSITION" or "S_POLICY_ENGINE", then you should not run a new command.
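
(For example, a simple way to wait for a stable cluster between administrative steps, using the tool mentioned above:)

# wait until the cluster reports S_IDLE before issuing the next admin command
while ! cs_clusterstate -i | grep -q "S_IDLE"; do
    sleep 10
done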

I have tested these procedures many many times and your result seems unexpected to me.

Although the maintainer will look at your valuable patch in due course, may I ask you to look at my blog, which details these procedures and is available at: https://www.suse.com/c/sles-for-sap-os-patching-procedure-for-scale-up-perf-opt-hana-cluster/

and at an important blog about checking prerequisites before starting a maintenance procedure, available at: https://www.suse.com/c/sles-for-sap-hana-maintenance-procedures-part-1-pre-maintenance-checks/

I am quite sure these will help you define a procedure for your organization and give you an optimal maintenance experience.

sairamgopal commented 2 years ago

Hi @PeterPitterling,

I saw that you made a commit on Sep 5th to get rid of the HDBSettings.sh call in favor of cdpy; python.

The command below works fine if we are not changing to the python directory using cdpy, but if we use cdpy then it fails to run the cdpy command, saying "No such file or directory", and exits with exit code 2.

output=$(timeout --foreground -s 9 "$timeOut" $pre_cmd "($pre_script; timeout -s 9 $timeOut $cmd > $cmd_out_log) >& $cmd_err_log" 2>"$su_err_log"); rc=$?

Example: command without cdpy

[root@hanode suse]# timeout --foreground -s 9 60 su - s4cadm -c "(true; timeout -s 9 60 hdbnsutil -sr_stateConfiguration)"

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

mode: primary
site id: 2
site name: NODEB
done.

command with cdpy

[root@hanode suse]# timeout --foreground -s 9 60 su - s4cadm -c "(true; timeout -s 9 60 cdpy; python landscapeHostConfiguration.py --sapcontrol=1)"         
timeout: failed to run command ‘cdpy’: No such file or directory
python: can't open file 'landscapeHostConfiguration.py': [Errno 2] No such file or directory

I tried this from my end on two different clusters, SLES12 and SLES15, and on both clusters I got this error. I am sure that you have tested this, but could you please check it again from your end?

sairamgopal commented 2 years ago

Hi @ksanjeet ,

Thank you very much for your feedback. I waited long enough, more than 15 min, but the result is the same; it is not happening on just one or two clusters, I see this behavior on all our clusters. If possible, and if you have any test clusters for HANA, I request you to test the steps below by putting the entire cluster into maintenance mode.

1. Make sure the cluster is fully operational, with one promoted and one demoted node and HSR in sync.
2. Put the cluster into maintenance mode ( crm configure property maintenance-mode=true )
3. Stop the cluster service on both nodes ( crm cluster stop )
4. Start the cluster service on both nodes ( crm cluster start )
5. Take the cluster out of maintenance mode ( crm configure property maintenance-mode=false )

If you do not face any issues with the above-mentioned procedure, then we can discuss or check other cluster parameters.

If you face the same issue, then you can try using SAPHana_Jun30.md by renaming it to SAPHana, replacing your current SAPHana resource agent with it, and testing the same steps mentioned above.

Thank you very much in advance.

PeterPitterling commented 2 years ago

Hi @sairamgopal,

cdpy is an alias which is defined by the hdbenv.sh script. This script is invoked via the following chain: sidadm home directory (/usr/sap/SID/home) --> .profile --> .bashrc --> .sapenv.sh --> HDBSettings.sh --> hdbenv.sh

What shell are you using for your s4cadm user? Is the alias defined once you log on as the s4cadm user?

su - s4cadm
alias
cdpy

BR

sairamgopal commented 2 years ago

Hello @PeterPitterling,

Alias is defined under s4cadm user.

s4cadm@hanode:/usr/sap/S4C/HDB42> echo $SHELL
/bin/sh
s4cadm@hanode:/usr/sap/S4C/HDB42> alias | grep cdpy
cdpy='cd $DIR_INSTANCE/exe/python_support'

Running the cdpy command (or any alias) from root without timeout works fine.

[root@hanode ~]# timeout --foreground -s 9 60 su - s4cadm -c "(cdpy; python landscapeHostConfiguration.py --sapcontrol=1)"
SAPCONTROL-OK: <begin>
.
.
SAPCONTROL-OK: <end>

Running the cdpy command with timeout works, but only when we run another command before the alias:

[root@hanode ~]# timeout --foreground -s 9 60 su - s4cadm -c "(true; timeout -s 9 60 pwd;cdpy; python landscapeHostConfiguration.py --sapcontrol=1)"
/sapmnt/shared/S4C/HDB42
SAPCONTROL-OK: <begin>
.
.
SAPCONTROL-OK: <end>

Running the cdpy command with timeout fails:

[root@hanode ~]# timeout --foreground -s 9 60 su - s4cadm -c "(true; timeout -s 9 60 cdpy; python landscapeHostConfiguration.py --sapcontrol=1)"
timeout: failed to run command ‘cdpy’: No such file or directory
python: can't open file 'landscapeHostConfiguration.py': [Errno 2] No such file or directory

Moving $pre_script to after the inner timeout value works:

[root@sl12hans-ha ~]# timeout --foreground -s 9 60 su - s4cadm -c "(timeout -s 9 60 true;cdpy; python landscapeHostConfiguration.py --sapcontrol=1)"
SAPCONTROL-OK: <begin>
.
.
SAPCONTROL-OK: <end>

So maybe the output command should be like this? output=$(timeout --foreground -s 9 "$timeOut" $pre_cmd "(timeout -s 9 $timeOut $pre_script; $cmd > $cmd_out_log) >& $cmd_err_log" 2>"$su_err_log"); rc=$?
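
A possible explanation for this pattern (my interpretation, not confirmed elsewhere in the thread): timeout is an external binary that executes its COMMAND argument directly, so a shell alias such as cdpy can never be the command given to timeout; the alias is only expanded when the login shell itself parses it, which is why the call works as soon as something else becomes timeout's command.

# fails: cdpy is the COMMAND of the inner timeout, and timeout cannot exec an alias
timeout --foreground -s 9 60 su - s4cadm -c "(true; timeout -s 9 60 cdpy; python landscapeHostConfiguration.py --sapcontrol=1)"
# works: 'true' is now the inner timeout's COMMAND, so cdpy is parsed by the login shell itself
timeout --foreground -s 9 60 su - s4cadm -c "(timeout -s 9 60 true; cdpy; python landscapeHostConfiguration.py --sapcontrol=1)"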

ksanjeet commented 2 years ago

Hello @sairamgopal ,

The document you are referring to, https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-maintenance.html#sec-ha-maint-shutdown-node-maint-mode, is generic HA documentation and not for SLES for SAP for HANA, for which the maintenance procedure is defined at https://documentation.suse.com/sbp/all/single-html/SLES4SAP-hana-sr-guide-costopt-15/#cha.hana-sr.administrate

Setting maintenance on the msl resource instead of the whole cluster is not a workaround. The logic is that you need to have the SAPHanaTopology resource (the cln resource) running to check the current system replication status and the landscape configuration details. What do you think is the role of a second resource agent (SAPHanaTopology) for HANA, which you don't see for other databases like Sybase? This resource needs to be running, and when it is, you don't see the problem of unexpected node attributes and an eventual restart of the HANA DB on the primary node.

As mentioned previously, you need to adapt your procedure by accommodating two changes (a command sketch follows the list):

  1. Set maintenance on the msl_ resource rather than the whole cluster (this is a documented and supported procedure).
  2. Refresh the cln and msl resources before unsetting maintenance on the msl_ resource once your maintenance is over.
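
Condensed into commands, the procedure looks roughly like this (a sketch with placeholder resource names; the authoritative steps are in the SAPHanaSR_maintenance_examples.7 manpage and the guide linked above):

crm resource maintenance msl_SAPHana_<SID>_HDB<nr> on      # only the msl_ resource, not the whole cluster
# ... perform the OS / cluster maintenance, e.g. crm cluster stop / crm cluster start on both nodes ...
crm resource refresh cln_SAPHanaTopology_<SID>_HDB<nr>     # let the cluster re-detect the topology status
crm resource refresh msl_SAPHana_<SID>_HDB<nr>             # and the status of the SAPHana resource
crm resource maintenance msl_SAPHana_<SID>_HDB<nr> off     # release the resource from maintenance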

There can be many ways to solve a problem. It is good to discuss better approaches, and I agree that we should go through your patch and give it careful consideration. However, the currently supported procedure requires that you set maintenance only on the msl_ resource and not on the whole cluster.

sairamgopal commented 2 years ago

Hello @ksanjeet, understood, thank you.

Hi @PeterPitterling, @angelabriel, @fdanapfel, @fmherschel

Could you please provide your thoughts or suggestions on the comment below: if we add a check to identify the status of the primary node and set the "hana__roles" attribute during the probe in the SAPHana resource agent, then even when the entire cluster is set to maintenance, the cluster service is restarted on both nodes, and the cluster is removed from maintenance, the cluster will not attempt to stop and start the DB or to trigger a failover.

The modified/additional code is from line 2713 to 2780. I also modified line 668 to output=$(timeout --foreground -s 9 "$timeOut" $pre_cmd "(timeout -s 9 $timeOut $pre_script; $cmd > $cmd_out_log) >& $cmd_err_log" 2>"$su_err_log"); rc=$?
I understand that the timeout is not applied to $cmd, but for now this is working.

SAPHana.md

PeterPitterling commented 2 years ago

@sairamgopal I can confirm that the inner timeout command is failing and not executing cdpy .. that is indeed a bit strange.

Nevertheless, just putting true; in front of cdpy will fix it:

# timeout --foreground -s 9 60 su - s4cadm -c "(true; timeout -s 9 60 true; cdpy; python landscapeHostConfiguration.py --sapcontrol=1)"

This would need to be added in the calling function. Currently it is not clear whether this inner timeout is required at all .. we will align and come back; see #122

Btw, the repo version is currently not shipped, so you are working more or less with an in-development version ..

sairamgopal commented 2 years ago

Hi @PeterPitterling, thanks for the confirmation.

Also, could you please check and consider the other point, adding a check to identify the status of the primary node and setting the "hana__roles" attribute during the probe in the SAPHana resource agent, if that is valid for the next release? I have tested it and it is working. I included the code in the comment above.

PeterPitterling commented 2 years ago

> Also, could you please check and consider the other point, adding a check to identify the status of the primary node and setting the "hana__roles" attribute during the probe in the SAPHana resource agent, if that is valid for the next release?

Please create a separate issue for this request.