SUSE / ha-sap-terraform-deployments

Automated SAP/HA Deployments in Public/Private Clouds
GNU General Public License v3.0

S/4HANA HA environment HANA System Replication Failure #803

Closed ab-mohamed closed 2 years ago

ab-mohamed commented 2 years ago

Used cloud platform GCP

Used SLES4SAP version SLES4SAP 15 SP2

Used client machine OS Google Cloud Shell

Expected behavior vs. observed behavior HANA System Replication status is SFAIL while it should be SOK.

How to reproduce

Using the current master branch, start a new S/4HANA deployment.

Troubleshooting Steps

  1. The Monitoring Server dashboard (-> SAP HANA) shows a failed HANA system replication (screenshot omitted).

  2. Confirming the same result from the command line:

demo1-hana01:~ # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Mon Dec  6 16:48:31 2021

Resource                      is-managed
-----------------------------------------
cln_SAPHanaTopology_PRD_HDB00 true

Sit srHook
-----------
FRA SFAIL

Hosts        clone_state lpa_prd_lpt node_state op_mode   remoteHost   roles                            score     site srmode sync_state version                vhost
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
demo1-hana01 PROMOTED    1638809311  online     logreplay demo1-hana02 4:P:master1:master:worker:master 150       NUE  sync   PRIM       2.00.052.00.1599235305 demo1-hana01
demo1-hana02 DEMOTED     10          online     logreplay demo1-hana01 4:S:master1:master:worker:master -INFINITY FRA  sync   SFAIL      2.00.052.00.1599235305 demo1-hana02
demo1-hana01:prdadm> HDBSettings.sh systemReplicationStatus.py
| Database | Host         | Port  | Service Name | Volume ID | Site ID | Site Name | Secondary    | Secondary | Secondary | Secondary | Secondary     | Replication | Replication | Replication                                                                   |
|          |              |       |              |           |         |           | Host         | Port      | Site ID   | Site Name | Active Status | Mode        | Status      | Status Details                                                                |
| -------- | ------------ | ----- | ------------ | --------- | ------- | --------- | ------------ | --------- | --------- | --------- | ------------- | ----------- | ----------- | ----------------------------------------------------------------------------- |
| SYSTEMDB | demo1-hana01 | 30001 | nameserver   |         1 |       1 | NUE       | demo1-hana02 |     30001 |         2 | FRA       | YES           | SYNC        | ACTIVE      |                                                                               |
| PRD      | demo1-hana01 | 30007 | xsengine     |         2 |       1 | NUE       | demo1-hana02 |     30007 |         2 | FRA       | YES           | SYNC        | ERROR       | Connection refused: Primary needs initial data backup for system replication! |
| PRD      | demo1-hana01 | 30040 | docstore     |         5 |       1 | NUE       | demo1-hana02 |     30040 |         2 | FRA       | YES           | SYNC        | ERROR       | Connection refused: Primary needs initial data backup for system replication! |
| PRD      | demo1-hana01 | 30003 | indexserver  |         3 |       1 | NUE       | demo1-hana02 |     30003 |         2 | FRA       | YES           | SYNC        | ERROR       | Connection refused: Primary needs initial data backup for system replication! |
| PRD      | demo1-hana01 | 30011 | dpserver     |         4 |       1 | NUE       | demo1-hana02 |     30011 |         2 | FRA       | YES           | SYNC        | ERROR       | Connection refused: Primary needs initial data backup for system replication! |

status system replication site "2": ERROR
overall system replication status: ERROR

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: NUE
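As a side note, `systemReplicationStatus.py` also communicates the overall state through its exit code, which is handier for scripting than parsing the table. A minimal sketch, assuming the exit-code convention documented for the script (15 = active, 14 = syncing, 13 = initializing, 12 = unknown, 11 = error, 10 = no system replication); run as prdadm on the primary:

```shell
#!/bin/sh
# Sketch: script-friendly SR health check via the exit code of
# systemReplicationStatus.py. The guard lets this run harmlessly on a
# host without HANA installed.
if command -v HDBSettings.sh >/dev/null 2>&1; then
  HDBSettings.sh systemReplicationStatus.py >/dev/null 2>&1
  rc=$?
else
  rc=10   # not a HANA host; treat as "no system replication"
  echo "HDBSettings.sh not found; sketch only"
fi

case "$rc" in
  15) echo "replication ACTIVE" ;;
  14) echo "replication SYNCING" ;;
  *)  echo "replication not healthy (rc=$rc)" ;;
esac
```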

It seems that the initial backup for the primary HANA was not executed, as shown in the replication status details:

Connection refused: Primary needs initial data backup for system replication!
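To confirm from SQL that no complete data backup exists yet, the backup catalog can be queried directly. A hedged sketch — the instance number follows this deployment, and the SYSTEM credentials are an assumption (hdbsql will prompt for the password here):

```shell
#!/bin/sh
# Sketch: look for the most recent complete data backup in the catalog.
INSTANCE="00"

if command -v hdbsql >/dev/null 2>&1; then
  hdbsql -d SYSTEMDB -u SYSTEM -i "$INSTANCE" -x \
    "SELECT TOP 1 ENTRY_TYPE_NAME, UTC_START_TIME, STATE_NAME
       FROM SYS.M_BACKUP_CATALOG
      WHERE ENTRY_TYPE_NAME = 'complete data backup'
      ORDER BY UTC_START_TIME DESC"
else
  echo "hdbsql not found on this host; sketch only"
fi
# An empty result means no full data backup exists, which is exactly
# what blocks system replication registration here.
```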
  3. Despite the above-mentioned HANA system replication error, the monitoring server reports that the HANA HA cluster status is OK (screenshot omitted). The same also applies to the crm_mon output:
    
    demo1-hana01:~ # crm_mon -rnf1
    Cluster Summary:
    * Stack: corosync
    * Current DC: demo1-hana01 (version 2.0.4+20200616.2deceaa3a-3.12.1-2.0.4+20200616.2deceaa3a) - partition with quorum
    * Last updated: Mon Dec  6 16:59:40 2021
    * Last change:  Mon Dec  6 16:59:18 2021 by root via crm_attribute on demo1-hana01
    * 2 nodes configured
    * 8 resource instances configured

Node List:

  4. Stopping the cluster (and with it the HANA database) on both nodes:

    demo1-hana01:~ # crm cluster stop
    demo1-hana02:~ # crm cluster stop
  5. Backing up the primary DB:

    demo1-hana01:~ # su - prdadm
    demo1-hana01:prdadm> HDB start
    demo1-hana01:prdadm> sapcontrol -nr 00 -function GetProcessList | column -t
    06.12.2021         17:05:21
    GetProcessList
    OK
    name,             description,                              dispstatus, textstatus, starttime,           elapsedtime, pid
    hdbdaemon,        HDB Daemon,                               GREEN,      Running,    2021 12 06 17:03:43, 0:01:38,     29360
    hdbcompileserver, HDB Compileserver,                        GREEN,      Running,    2021 12 06 17:03:53, 0:01:28,     29604
    hdbdiserver,      HDB Deployment Infrastructure Server-PRD, GREEN,      Running,    2021 12 06 17:04:35, 0:00:46,     30571
    hdbdocstore,      HDB DocStore-PRD,                         GREEN,      Running,    2021 12 06 17:03:53, 0:01:28,     29649
    hdbdpserver,      HDB DPserver-PRD,                         GREEN,      Running,    2021 12 06 17:03:53, 0:01:28,     29652
    hdbindexserver,   HDB Indexserver-PRD,                      GREEN,      Running,    2021 12 06 17:03:53, 0:01:28,     29655
    hdbnameserver,    HDB Nameserver,                           GREEN,      Running,    2021 12 06 17:03:44, 0:01:37,     29378
    hdbpreprocessor,  HDB Preprocessor,                         GREEN,      Running,    2021 12 06 17:03:53, 0:01:28,     29607
    hdbwebdispatcher, HDB Web Dispatcher,                       GREEN,      Running,    2021 12 06 17:04:35, 0:00:46,     30574
    hdbxsengine,      HDB XSEngine-PRD,                         GREEN,      Running,    2021 12 06 17:03:53, 0:01:28,     29658
    
    demo1-hana01:prdadm> hdbsql -t -d SYSTEMDB -u system -p <PASSWORD> -i 00 "backup data using file ('full')"
    0 rows affected (overall time 10.645387 sec; server time 10.643820 sec)

    demo1-hana01:prdadm> hdbsql -t -d prd -u system -p <PASSWORD> -i 00 "backup data using file ('full')"
    0 rows affected (overall time 342.329219 sec; server time 342.327767 sec)

    demo1-hana01:prdadm> hdbsql -u system -p <PASSWORD> -i 00 "select value from "SYS"."M_INIFILE_CONTENTS" where key='log_mode'"
    VALUE
    "normal"


6. Starting the cluster again:

demo1-hana01:~ # crm cluster start
demo1-hana02:~ # crm cluster start


7. Checking the HANA system replication:

demo1-hana01:~ # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Mon Dec  6 17:46:16 2021

Resource                      is-managed
-----------------------------------------
cln_SAPHanaTopology_PRD_HDB00 true

Sit srHook
-----------
FRA SOK

Hosts        clone_state lpa_prd_lpt node_state op_mode   remoteHost   roles                            score site srmode sync_state version                vhost
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
demo1-hana01 PROMOTED    1638812776  online     logreplay demo1-hana02 4:P:master1:master:worker:master 150   NUE  sync   PRIM       2.00.052.00.1599235305 demo1-hana01
demo1-hana02 DEMOTED     30          online     logreplay demo1-hana01 4:S:master1:master:worker:master 100   FRA  sync   SOK        2.00.052.00.1599235305 demo1-hana02


8. Checking the HANA HA cluster status:

demo1-hana01:~ # crm_mon -rnf1
Cluster Summary:

Node List:

Inactive Resources:

Migration Summary:

Failed Resource Actions:

yeoldegrove commented 2 years ago

The SR is OK initially after the HANA installation. In the S/4HANA case, a large backup is imported into the HANA before the PAS instance is installed. After this import, the HANA is in the described state and needs a second initial backup.

The node that knows best that this new backup is needed is the PAS node, so it would be logical to initiate the backup from that host. @arbulu89 What do you think of the idea of implementing a query_remote module inside salt-shaptools which would then run on the PAS node to run a backup query on the HANA? Is such a way of interacting wanted at all? The wait_for_connection module already does something similar and is only missing the query part.
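For illustration only, such a PAS-initiated backup could boil down to a remote hdbsql call. Everything below (the host name, the hdbuserstore key BACKUPKEY, the port scheme) is a hypothetical placeholder, not an existing salt-shaptools interface:

```shell
#!/bin/sh
# Hypothetical sketch: the PAS node triggers the missing initial data
# backup on the HANA primary after the DB import. HANA_HOST and the
# hdbuserstore key BACKUPKEY are illustrative assumptions.
HANA_HOST="demo1-hana01"
INSTANCE="00"

if command -v hdbsql >/dev/null 2>&1; then
  # 3<NR>13 is the SYSTEMDB SQL port of a multitenant HANA system;
  # hdbuserstore keeps the SYSTEM password out of the state files.
  hdbsql -U BACKUPKEY -n "${HANA_HOST}:3${INSTANCE}13" \
    "BACKUP DATA USING FILE ('full')"
else
  echo "hdbsql not found on this host; sketch only"
fi
```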

yeoldegrove commented 2 years ago

https://github.com/SUSE/sapnwbootstrap-formula/pull/95 and https://github.com/SUSE/ha-sap-terraform-deployments/pull/811 add a feature to back up HANA after the initial DB import (if SR is enabled). This should fix the issue.
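For reference, the behaviour those PRs describe roughly corresponds to this conditional sketch; the hdbuserstore key BACKUPKEY and the tenant name are illustrative assumptions, not the formula's actual implementation:

```shell
#!/bin/sh
# Sketch: after the initial DB import, take full backups only when this
# node is the system replication primary. BACKUPKEY and TENANT are
# assumptions for illustration.
TENANT="PRD"

if command -v hdbnsutil >/dev/null 2>&1 \
   && hdbnsutil -sr_state | grep -q "mode: primary"; then
  # Back up SYSTEMDB and the tenant via the SYSTEMDB connection.
  hdbsql -U BACKUPKEY "BACKUP DATA USING FILE ('full')"
  hdbsql -U BACKUPKEY "BACKUP DATA FOR ${TENANT} USING FILE ('full')"
else
  echo "not an SR primary (or hdbnsutil missing); skipping backup"
fi
```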