HANA HA on GCP - DB doesn't start using the SAPHana resource agent

ab-mohamed commented 2 years ago

Used cloud platform GCP Used SLES4SAP version SLSES 15 SP3 for SAP Applciations

Used client machine OS macOS 13.4

Expected behaviour vs observed behaviour I expected that HANA would start successfully using the SAPHana resource agent, but it did not work

How to reproduce

Use the most recent master branch to deploy HANA (2.0 SPS05) HA using the terraform apply --auto-approve command.

HANA DB did not work on ab-vmhana01 host, and the HSR failed:


ab-vmhana01:~ # crm_mon -rnf1
Cluster Summary:
* Stack: corosync
* Current DC: ab-vmhana01 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Wed Jun  1 08:04:37 2022
* Last change:  Wed Jun  1 08:03:49 2022 by root via crm_attribute on ab-vmhana02
* 2 nodes configured
* 8 resource instances configured

Node List:

Node ab-vmhana01: online:
- Resources:
  - rsc_gcp_stonith_PRD_HDB00_ab-vmhana01 (stonith:fence_gce): Started
  - rsc_SAPHanaTopology_PRD_HDB00 (ocf::suse:SAPHanaTopology): Started
  - rsc_gcp_stonith_PRD_HDB00_ab-vmhana02 (stonith:fence_gce): Started
Node ab-vmhana02: online:
- Resources:
  - rsc_ip_PRD_HDB00 (ocf::heartbeat:IPaddr2): Started
  - rsc_socat_PRD_HDB00 (ocf::heartbeat:anything): Started
  - rsc_SAPHana_PRD_HDB00 (ocf::suse:SAPHana): Master
  - rsc_SAPHanaTopology_PRD_HDB00 (ocf::suse:SAPHanaTopology): Started

Inactive Resources:

Clone Set: msl_SAPHana_PRD_HDB00 [rsc_SAPHana_PRD_HDB00] (promotable):
- Masters: [ ab-vmhana02 ]
- Stopped: [ ab-vmhana01 ]

Migration Summary:

Node: ab-vmhana01:
- rsc_SAPHana_PRD_HDB00: migration-threshold=5000 fail-count=1000000 last-failure='Wed Jun 1 08:01:55 2022'

Failed Resource Actions:

rsc_SAPHana_PRD_HDB00_start_0 on ab-vmhana01 'not running' (7): call=32, status='complete', exitreason='', last-rc-change='2022-06-01 08:01:53Z', queued=0ms, exec=2482ms
ab-vmhana02:~ # SAPHanaSR-showAttr perl: warning: Setting locale failed. perl: warning: Please check that your locale settings: LANGUAGE = (unset), LC_ALL = (unset), LC_CTYPE = "UTF-8", LANG = (unset) are supported and installed on your system. perl: warning: Falling back to the standard locale ("C"). Global cib-time

global Wed Jun 1 08:04:53 2022

Resource is-managed

cln_SAPHanaTopology_PRD_HDB00 true

Sites srHook

FRA PRIM

Hosts clone_state lpa_prd_lpt node_state op_mode remoteHost roles score site srmode sync_state version vhost

ab-vmhana01 UNDEFINED 10 online logreplay ab-vmhana02 1:P:master1::worker: -9000 NUE sync SFAIL 2.00.052.00.1599235305 ab-vmhana01 ab-vmhana02 PROMOTED 1654070693 online logreplay ab-vmhana01 4:P:master1:master:worker:master 150 FRA sync PRIM 2.00.052.00.1599235305 ab-vmhana02

ab-vmhana02:~ # su - prdadm prdadm@ab-vmhana02:/usr/sap/PRD/HDB00> HDBSettings.sh systemReplicationStatus.py; echo RC:$? there are no secondary sites attached

Local System Replication State


mode: PRIMARY
site id: 2
site name: FRA
RC:10
```

3. Stop the cluster services, start the DB manually, and enable the HSR. It was working fine:
````
ab-vmhana01:~ # crm cluster stop
ab-vmhana02:~ # crm cluster stop
```
```
ab-vmhana01:~ # su - prdadm
prdadm@ab-vmhana01:/usr/sap/PRD/HDB00> HDB start

prdadm@ab-vmhana01:/usr/sap/PRD/HDB00> HDB info
USER          PID     PPID  %CPU        VSZ        RSS COMMAND
prdadm       1769     1767   0.0      18412       5280 -sh
prdadm       6587     1769  25.0      14536       3972  \_ /bin/sh /usr/sap/PRD/HDB00/HDB info
prdadm       6618     6587   0.0      38204       3932      \_ ps fx -U prdadm -o user:8,pid:8,ppid:8,pcpu:5,vsz:10,rss:10,args
prdadm       5859        1   0.0      23628       3176 sapstart pf=/hana/shared/PRD/profile/PRD_HDB00_ab-vmhana01
prdadm       5866     5859   1.0     461420      71040  \_ /usr/sap/PRD/HDB00/ab-vmhana01/trace/hdb.sapPRD_HDB00 -d -nw -f /usr/sap/PRD/HDB00/ab-vmhana01/daemon.ini pf=/usr/sap/PRD/SYS/profile/PRD_HDB00_ab-vmhana01
prdadm       5884     5866   279    7221876    3531128      \_ hdbnameserver
prdadm       6103     5866   1.3    2145956     145788      \_ hdbcompileserver
prdadm       6106     5866   1.7    2675160     177504      \_ hdbpreprocessor
prdadm       6150     5866   294    7780172    4185480      \_ hdbindexserver -port 30003
prdadm       6153     5866  19.7    5976692    1238496      \_ hdbxsengine -port 30007
prdadm       6524     5866  14.8    4397176     455544      \_ hdbwebdispatcher
prdadm      18753        1   0.0     716448      51440 hdbrsutil  --start --port 30003 --volume 3 --volumesuffix mnt00001/hdb00003.00003 --identifier 1654068741
prdadm      18422        1   0.0     716388      51840 hdbrsutil  --start --port 30001 --volume 1 --volumesuffix mnt00001/hdb00001 --identifier 1654068709
prdadm      18200        1   0.0     634324      31500 /usr/sap/PRD/HDB00/exe/sapstartsrv 
pf=/hana/shared/PRD/profile/PRD_HDB00_ab-vmhana01 -D -u prdadm
```
```
prdadm@ab-vmhana02:/usr/sap/PRD/HDB00> hdbnsutil -sr_register --remoteHost=ab-vmhana01 --remoteInstance=00 --replicationMode=sync --operationMode=logreplay --name=FRA
adding site ...
nameserver ab-vmhana02:30001 not responding.
collecting information ...
updating local ini files ...
done.
```
```
prdadm@ab-vmhana02:/usr/sap/PRD/HDB00> HDB start
```
```
prdadm@ab-vmhana01:/usr/sap/PRD/HDB00> HDBSettings.sh systemReplicationStatus.py; echo RC:$?
| Database | Host        | Port  | Service Name | Volume ID | Site ID | Site Name | Secondary   | Secondary | Secondary | Secondary | Secondary     | Replication | Replication | Replication    |
|          |             |       |              |           |         |           | Host        | Port      | Site ID   | Site Name | Active Status | Mode        | Status      | Status Details |
| -------- | ----------- | ----- | ------------ | --------- | ------- | --------- | ----------- | --------- | --------- | --------- | ------------- | ----------- | ----------- | -------------- |
| SYSTEMDB | ab-vmhana01 | 30001 | nameserver   |         1 |       1 | NUE       | ab-vmhana02 |     30001 |         2 | FRA       | YES           | SYNC        | ACTIVE      |                |
| PRD      | ab-vmhana01 | 30007 | xsengine     |         2 |       1 | NUE       | ab-vmhana02 |     30007 |         2 | FRA       | YES           | SYNC        | ACTIVE      |                |
| PRD      | ab-vmhana01 | 30003 | indexserver  |         3 |       1 | NUE       | ab-vmhana02 |     30003 |         2 | FRA       | YES           | SYNC        | ACTIVE      |                |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State

mode: PRIMARY site id: 1 site name: NUE RC:15

prdadm@ab-vmhana01:/usr/sap/PRD/HDB00> hdbnsutil -sr_stateConfiguration --sapcontrol=1 SAPCONTROL-OK: mode=primary site id=1 site name=NUE SAPCONTROL-OK: done.

prdadm@ab-vmhana02:/usr/sap/PRD/HDB00> hdbnsutil -sr_stateConfiguration --sapcontrol=1 SAPCONTROL-OK: mode=sync site id=2 site name=FRA active primary site=1 primary masters=ab-vmhana01 SAPCONTROL-OK: done.


4. Stop the DB:

prdadm@ab-vmhana02:/usr/sap/PRD/HDB00> HDB stop prdadm@ab-vmhana01:/usr/sap/PRD/HDB00> HDB stop


5. Downgrade the `SAPHanaSR` RPM:

ab-vmhana01:~ # rpm -qa | grep SAPHanaSR-0 SAPHanaSR-0.155.0-4.17.1.noarch

ab-vmhana01:~ # zypper in --oldpackage SAPHanaSR-0.154.1-4.14.1 ab-vmhana02:~ # zypper in --oldpackage SAPHanaSR-0.154.1-4.14.1


6. Start the cluster services:

ab-vmhana01:~ # crm cluster start ab-vmhana02:~ # crm cluster start


DB still not working on `ab-vmhana01`:

ab-vmhana01:~ # crm_mon -rnf1 Cluster Summary:

Stack: corosync
Current DC: ab-vmhana01 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
Last updated: Wed Jun 1 08:33:46 2022
Last change: Wed Jun 1 08:25:26 2022 by root via crm_attribute on ab-vmhana02
2 nodes configured
8 resource instances configured

Node List:

Node ab-vmhana01: online:
- Resources:
  - rsc_gcp_stonith_PRD_HDB00_ab-vmhana01 (stonith:fence_gce): Started
  - rsc_ip_PRD_HDB00 (ocf::heartbeat:IPaddr2): Started
  - rsc_socat_PRD_HDB00 (ocf::heartbeat:anything): Started
  - rsc_SAPHana_PRD_HDB00 (ocf::suse:SAPHana): FAILED
  - rsc_SAPHanaTopology_PRD_HDB00 (ocf::suse:SAPHanaTopology): Started
  - rsc_gcp_stonith_PRD_HDB00_ab-vmhana02 (stonith:fence_gce): Started
Node ab-vmhana02: online:
- Resources:
  - rsc_SAPHana_PRD_HDB00 (ocf::suse:SAPHana): Starting
  - rsc_SAPHanaTopology_PRD_HDB00 (ocf::suse:SAPHanaTopology): Started

Inactive Resources:

No inactive resources

Migration Summary:

Node: ab-vmhana01:
- rsc_SAPHana_PRD_HDB00: migration-threshold=5000 fail-count=1000000 last-failure='Wed Jun 1 08:32:24 2022'

Failed Resource Actions:

rsc_SAPHana_PRD_HDB00_start_0 on ab-vmhana01 'not running' (7): call=36, status='complete', exitreason='', last-rc-change='2022-06-01 08:32:21Z', queued=0ms, exec=2378ms
```
I collected the following logs:
```
ab-vmhana01:~ # journalctl -f -u pacemaker -u corosync […] Jun 01 08:34:14 ab-vmhana01 pacemaker-schedulerd[25531]: warning: Unexpected result (not running) was recorded for start of rsc_SAPHana_PRD_HDB00:0 on ab-vmhana01 at Jun 1 08:32:21 2022 Jun 01 08:34:14 ab-vmhana01 pacemaker-schedulerd[25531]: warning: Unexpected result (not running) was recorded for start of rsc_SAPHana_PRD_HDB00:0 on ab-vmhana01 at Jun 1 08:32:21 2022 Jun 01 08:34:14 ab-vmhana01 pacemaker-schedulerd[25531]: warning: Forcing msl_SAPHana_PRD_HDB00 away from ab-vmhana01 after 1000000 failures (max=5000) Jun 01 08:34:14 ab-vmhana01 pacemaker-schedulerd[25531]: warning: Forcing msl_SAPHana_PRD_HDB00 away from ab-vmhana01 after 1000000 failures (max=5000)

Cleaning up the failed resource did not fix the issue:

ab-vmhana01:~ # crm resource cleanup rsc_SAPHana_PRD_HDB00 ab-vmhana01

ab-vmhana01:~ # journalctl -f -u pacemaker -u corosync […]
Jun 01 08:35:20 ab-vmhana01 pacemaker-schedulerd[25531]:  warning: Unexpected result (not running) was recorded for start of rsc_SAPHana_PRD_HDB00:0 on ab-vmhana01 at Jun  1 08:35:18 2022
Jun 01 08:35:20 ab-vmhana01 pacemaker-schedulerd[25531]:  warning: Unexpected result (not running) was recorded for start of rsc_SAPHana_PRD_HDB00:0 on ab-vmhana01 at Jun  1 08:35:18 2022
Jun 01 08:35:20 ab-vmhana01 pacemaker-schedulerd[25531]:  notice:  * Recover    rsc_SAPHana_PRD_HDB00:0                   (             Slave ab-vmhana01 )
Jun 01 08:35:20 ab-vmhana01 pacemaker-schedulerd[25531]:  notice: Calculated transition 6, saving inputs in /var/lib/pacemaker/pengine/pe-input-60.bz2
Jun 01 08:35:20 ab-vmhana01 pacemaker-schedulerd[25531]:  warning: Unexpected result (not running) was recorded for start of rsc_SAPHana_PRD_HDB00:0 on ab-vmhana01 at Jun  1 08:35:18 2022
Jun 01 08:35:20 ab-vmhana01 pacemaker-schedulerd[25531]:  warning: Unexpected result (not running) was recorded for start of rsc_SAPHana_PRD_HDB00:0 on ab-vmhana01 at Jun  1 08:35:18 2022
Jun 01 08:35:20 ab-vmhana01 pacemaker-schedulerd[25531]:  warning: Forcing msl_SAPHana_PRD_HDB00 away from ab-vmhana01 after 1000000 failures (max=5000)
Jun 01 08:35:20 ab-vmhana01 pacemaker-schedulerd[25531]:  warning: Forcing msl_SAPHana_PRD_HDB00 away from ab-vmhana01 after 1000000 failures (max=5000)

Used terraform.tfvars

$ grep -v "#" terraform.tfvars| awk NF
project = "<PROJECT ID>"
gcp_credentials_file = "<SA KEY>"
region = "us-west1"
os_image = "suse-sap-cloud/sles-15-sp3-sap"
public_key  = "/Users/ab/.ssh/gcp_key.pub"
private_key = "/Users/ab/.ssh/gcp_key"
cluster_ssh_pub = "salt://sshkeys/cluster.id_rsa.pub"
cluster_ssh_key = "salt://sshkeys/cluster.id_rsa"
ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:ha-clustering:sap-deployments:v8/"
provisioning_log_level = "info"
pre_deployment = true
bastion_enabled = false
machine_type = "n1-highmem-16"
hana_inst_master = "<GCP BUCKET>"
hana_master_password = "<PASSWORD>"
hana_primary_site = "NUE"
hana_secondary_site = "FRA"

ab-mohamed commented 2 years ago

I have found the following two constraints in the cluster configurations:

location SAPHanaTopology_PRD_HDB00_not_on_majority_maker cln_SAPHanaTopology_PRD_HDB00 -inf: None
location SAPHana_PRD_HDB00_not_on_majority_maker msl_SAPHana_PRD_HDB00 -inf: None

I did not see anylocation constraints before for HANA scale-Up Perf-Opt HA.

@yeoldegrove Can you please check the cluster configurations? I believe there are some added configurations in the last release. Maybe during the scale-out architecture testing?

yeoldegrove commented 2 years ago

@ab-mohamed The majority_maker constraints are already fixed in https://github.com/SUSE/saphanabootstrap-formula/releases/tag/0.9.1.

Let me check if the updated version fixes the issue and come back.

yeoldegrove commented 2 years ago

I can confirm that the saphanabootstrap-formula >= 0.9.1 fixes the issue about the unwanted majority_maker constraints. It will be available via ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:ha-clustering:sap-deployments:v8/ (#864) in the next days. Currently setting ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:ha-clustering:sap-deployments:devel/ should fix it.

ab-mohamed commented 2 years ago

@yeoldegrove Using ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:ha-clustering:sap-deployments:devel did not fix the issue.

I still can see the following wrong constraints:

location SAPHanaTopology_PRD_HDB00_not_on_majority_maker cln_SAPHanaTopology_PRD_HDB00 -inf: None
location SAPHana_PRD_HDB00_not_on_majority_maker msl_SAPHana_PRD_HDB00 -inf: None

Also, the srHook SUDO entries are not there.

yeoldegrove commented 2 years ago

@ab-mohamed Please use latest release 8.1.3 and ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:/ha-clustering:/sap-deployments:/v8". A lot of fixes are in the latest release.

ab-mohamed commented 2 years ago

@yeoldegrove I did use the latest release as discussed. I still see the incorrect cluster configurations:

location SAPHanaTopology_PRD_HDB00_not_on_majority_maker cln_SAPHanaTopology_PRD_HDB00 -inf: None
location SAPHana_PRD_HDB00_not_on_majority_maker msl_SAPHana_PRD_HDB00 -inf: None

ab-mohamed commented 2 years ago

Also, something happened to the srHook. Searching for the srHook script shows nothing :

$ su - prdadm -c "cdtrace;grep HADR.*load.*SAPHanaS nameserver_*.trc"

In the previous release, it showed me output like the following:

ts-testing-vmhana01:~ # su - prdadm -c "cdtrace;grep SAPHanaSR.srConnectionChanged.*called nameserver_*.trc"
[...]
nameserver_ts-testing-vmhana01.30001.000.trc:[9975]{-1}[-1/-1] 2022-05-30 12:18:20.003529 i ha_dr_SAPHanaSR  SAPHanaSR.py(00115) : SAPHanaSR SAPHanaSR.srConnectionChanged method called with Dict={'status': 15, 'is_in_sync': True, 'timestamp': '2022-05-30T12:18:20.003137+00:00', 'database': 'PRD', 'siteName': 'FRA', 'service_name': 'xsengine', 'hostname': 'ts-testing-vmhana01', 'volume': 2, 'system_status': 13, 'reason': '', 'database_status': 13, 'port': '30007'} ###
nameserver_ts-testing-vmhana01.30001.000.trc:[11035]{-1}[-1/-1] 2022-05-30 12:18:42.115957 i ha_dr_SAPHanaSR  SAPHanaSR.py(00086) : SAPHanaSR (0.162.0) SAPHanaSR.srConnectionChanged method called with Dict={'status': 15, 'is_in_sync': True, 'timestamp': '2022-05-30T12:18:42.115672+00:00', 'database': 'PRD', 'siteName': 'FRA', 'service_name': 'indexserver', 'hostname': 'ts-testing-vmhana01', 'volume': 3, 'system_status': 15, 'reason': '', 'database_status': 15, 'port': '30003'}
nameserver_ts-testing-vmhana01.30001.000.trc:[11035]{-1}[-1/-1] 2022-05-30 12:18:42.148534 i ha_dr_SAPHanaSR  SAPHanaSR.py(00115) : SAPHanaSR SAPHanaSR.srConnectionChanged method called with Dict={'status': 15, 'is_in_sync': True, 'timestamp': '2022-05-30T12:18:42.115672+00:00', 'database': 'PRD', 'siteName': 'FRA', 'service_name': 'indexserver', 'hostname': 'ts-testing-vmhana01', 'volume': 3, 'system_status': 15, 'reason': '', 'database_status': 15, 'port': '30003'} ###

yeoldegrove commented 2 years ago

@ab-mohamed You are right. The majority maker fix is not yet working on GCP (only azure so far). We need another change https://github.com/SUSE/saphanabootstrap-formula/pull/142 to tackle this. The PR also includes a fix for a regression that came with https://github.com/SUSE/saphanabootstrap-formula/pull/136. This is the missing hook you are seeing.

As soon as the rpm is available, I will release it to the v8 repo.

yeoldegrove commented 2 years ago

@ab-mohamed I released https://github.com/SUSE/ha-sap-terraform-deployments/releases/tag/8.1.4 earlier today with the fixes mentioned above. Please retry.

ab-mohamed commented 2 years ago

@yeoldegrove I completed the deployment successfully. the location constraints were removed., but the DB was not working on the first node:

ab-vmhana01:~ # crm_mon -rnf1
Cluster Summary:
  * Stack: corosync
  * Current DC: ab-vmhana01 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
  * Last updated: Tue Jun  7 12:45:10 2022
  * Last change:  Tue Jun  7 12:44:40 2022 by root via crm_attribute on ab-vmhana02
  * 2 nodes configured
  * 8 resource instances configured

Node List:
  * Node ab-vmhana01: online:
    * Resources:
      * rsc_gcp_stonith_PRD_HDB00_ab-vmhana01   (stonith:fence_gce):     Started
      * rsc_SAPHanaTopology_PRD_HDB00   (ocf::suse:SAPHanaTopology):     Started
      * rsc_gcp_stonith_PRD_HDB00_ab-vmhana02   (stonith:fence_gce):     Started
  * Node ab-vmhana02: online:
    * Resources:
      * rsc_ip_PRD_HDB00    (ocf::heartbeat:IPaddr2):    Started
      * rsc_socat_PRD_HDB00 (ocf::heartbeat:anything):   Started
      * rsc_SAPHana_PRD_HDB00   (ocf::suse:SAPHana):     Master
      * rsc_SAPHanaTopology_PRD_HDB00   (ocf::suse:SAPHanaTopology):     Started

Inactive Resources:
  * Clone Set: msl_SAPHana_PRD_HDB00 [rsc_SAPHana_PRD_HDB00] (promotable):
    * Masters: [ ab-vmhana02 ]
    * Stopped: [ ab-vmhana01 ]

Migration Summary:
  * Node: ab-vmhana01:
    * rsc_SAPHana_PRD_HDB00: migration-threshold=5000 fail-count=1000000 last-failure='Tue Jun  7 12:43:05 2022'

Failed Resource Actions:
  * rsc_SAPHana_PRD_HDB00_start_0 on ab-vmhana01 'not running' (7): call=41, status='complete', exitreason='', last-rc-change='2022-06-07 12:43:03Z', queued=0ms, exec=2042ms

Refreshing the stopped DB resource did not fix the issue.

I stopped the cluster services, and I started the DB manually. Everything was OK:

prdadm@ab-vmhana01:/usr/sap/PRD/HDB00> HDBSettings.sh systemReplicationStatus.py; echo RC:$?
| Database | Host        | Port  | Service Name | Volume ID | Site ID | Site Name | Secondary   | Secondary | Secondary | Secondary | Secondary     | Replication | Replication | Replication    |
|          |             |       |              |           |         |           | Host        | Port      | Site ID   | Site Name | Active Status | Mode        | Status      | Status Details |
| -------- | ----------- | ----- | ------------ | --------- | ------- | --------- | ----------- | --------- | --------- | --------- | ------------- | ----------- | ----------- | -------------- |
| SYSTEMDB | ab-vmhana01 | 30001 | nameserver   |         1 |       1 | NUE       | ab-vmhana02 |     30001 |         2 | FRA       | YES           | SYNC        | ACTIVE      |                |
| PRD      | ab-vmhana01 | 30007 | xsengine     |         2 |       1 | NUE       | ab-vmhana02 |     30007 |         2 | FRA       | YES           | SYNC        | ACTIVE      |                |
| PRD      | ab-vmhana01 | 30003 | indexserver  |         3 |       1 | NUE       | ab-vmhana02 |     30003 |         2 | FRA       | YES           | SYNC        | ACTIVE      |                |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: NUE
RC:15

ab-mohamed commented 2 years ago

Interesting finding! When the cluster services are down, starting the DB manually on the second node failed!:

prdadm@ab-vmhana01:/usr/sap/PRD/HDB00> sapcontrol -nr 00 -function GetProcessList

07.06.2022 13:07:31
GetProcessList
OK
name, description, dispstatus, textstatus, starttime, elapsedtime, pid
hdbdaemon, HDB Daemon, GRAY, Stopped, , , 4078

prdadm@ab-vmhana02:/usr/sap/PRD/HDB00>  sapcontrol -nr 00 -function GetProcessList

07.06.2022 13:07:40
GetProcessList
OK
name, description, dispstatus, textstatus, starttime, elapsedtime, pid
hdbdaemon, HDB Daemon, GRAY, Stopped, , , 21468

prdadm@ab-vmhana02:/usr/sap/PRD/HDB00> HDB start

StartService
Impromptu CCC initialization by 'rscpCInit'.
  See SAP note 1266393.
OK
OK
Starting instance using: /usr/sap/PRD/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function StartWait 2700 2

07.06.2022 12:59:53
Start
OK

I found the following log entries:

==> nameserver_alert_ab-vmhana02.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:10:11.631644 e ServiceHandler   RequestHandlerRegistry.cpp(00033) : Could not register request handler with key: pers/ (already registered)
[23178]{-1}[-1/-1] 2022-06-07 13:10:12.858897 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:10:12.858916 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error
[23178]{-1}[-1/-1] 2022-06-07 13:10:12.858979 e sr_nameserver    TREXNameServer.cpp(11276) : source site not reachable, Trying to reconnect in 30 seconds (1)
==> nameserver_ab-vmhana02.30001.001.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:10:42.861335 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:10:42.861370 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error
[23178]{-1}[-1/-1] 2022-06-07 13:10:42.861381 e sr_nameserver    TREXNameServer.cpp(11276) : source site not reachable, Trying to reconnect in 30 seconds (2)
==> nameserver_alert_ab-vmhana02.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:10:42.861335 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:10:42.861370 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error
[23178]{-1}[-1/-1] 2022-06-07 13:10:42.861381 e sr_nameserver    TREXNameServer.cpp(11276) : source site not reachable, Trying to reconnect in 30 seconds (2)
==> nameserver_ab-vmhana02.30001.001.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:11:12.863711 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:11:12.863736 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error
[23178]{-1}[-1/-1] 2022-06-07 13:11:12.863745 e sr_nameserver    TREXNameServer.cpp(11276) : source site not reachable, Trying to reconnect in 30 seconds (3)
==> nameserver_alert_ab-vmhana02.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:11:12.863711 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:11:12.863736 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error
[23178]{-1}[-1/-1] 2022-06-07 13:11:12.863745 e sr_nameserver    TREXNameServer.cpp(11276) : source site not reachable, Trying to reconnect in 30 seconds (3)
==> nameserver_ab-vmhana02.30001.001.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:11:42.866176 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:11:42.866201 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error
[23178]{-1}[-1/-1] 2022-06-07 13:11:42.866212 e sr_nameserver    TREXNameServer.cpp(11276) : source site not reachable, Trying to reconnect in 30 seconds (4)
==> nameserver_alert_ab-vmhana02.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:11:42.866176 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:11:42.866201 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error
[23178]{-1}[-1/-1] 2022-06-07 13:11:42.866212 e sr_nameserver    TREXNameServer.cpp(11276) : source site not reachable, Trying to reconnect in 30 seconds (4)
==> nameserver_ab-vmhana02.30001.001.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:12:12.868629 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:12:12.868660 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error
[23178]{-1}[-1/-1] 2022-06-07 13:12:12.868671 e sr_nameserver    TREXNameServer.cpp(11276) : source site not reachable, Trying to reconnect in 30 seconds (5)
[23178]{-1}[-1/-1] 2022-06-07 13:12:12.868673 e sr_nameserver    TREXNameServer.cpp(11278) : Reconnecting is continued now every 30 seconds in the background, but no more trace entries are generated.
==> nameserver_alert_ab-vmhana02.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:12:12.868629 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:12:12.868660 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error
[23178]{-1}[-1/-1] 2022-06-07 13:12:12.868671 e sr_nameserver    TREXNameServer.cpp(11276) : source site not reachable, Trying to reconnect in 30 seconds (5)
[23178]{-1}[-1/-1] 2022-06-07 13:12:12.868673 e sr_nameserver    TREXNameServer.cpp(11278) : Reconnecting is continued now every 30 seconds in the background, but no more trace entries are generated.
==> nameserver_ab-vmhana02.30001.001.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:12:42.870941 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:12:42.870969 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error
==> nameserver_alert_ab-vmhana02.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:12:42.870941 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:12:42.870969 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error

==> nameserver_ab-vmhana02.30001.001.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:13:12.873143 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:13:12.873191 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error
==> nameserver_alert_ab-vmhana02.trc <==

[23178]{-1}[-1/-1] 2022-06-07 13:13:12.873143 e sr_nameserver    DRClient.cpp(00920) : Could not reach any host of site '1' to send request 'dr_replicatetopologyfromprimary'. Hosts: (ab-vmhana01:40002) Errors:
nameserver communication error;internal error(5500)
[23178]{-1}[-1/-1] 2022-06-07 13:13:12.873191 e sr_nameserver    TREXNameServer.cpp(10779) : communication to remote site return with an error: internal error

ab-mohamed commented 2 years ago

@yeoldegrove Using the updated main branch instead of the master one fixes the issue.

SUSE / ha-sap-terraform-deployments