SUSE / ha-sap-terraform-deployments

Automated SAP/HA Deployments in Public/Private Clouds

HANA native fencing resources fail to start - GCP deployment #839

Closed: busetde closed this issue 2 years ago

busetde commented 2 years ago

Hi,

Using the master branch and deploying HANA HA only, in a couple of deployment attempts the HANA database goes from GREEN to GRAY (and stays GRAY until the deployment fails). The CRM status shows:

Cluster Summary:
  * Stack: corosync
  * Current DC: budi-vmhana01 (version 2.0.4+20200616.2deceaa3a-3.15.1-2.0.4+20200616.2deceaa3a) - partition with quorum
  * Last updated: Tue Apr  5 14:31:16 2022
  * Last change:  Tue Apr  5 14:29:07 2022 by root via crm_attribute on budi-vmhana01
  * 1 node configured
  * 6 resource instances configured

Node List:
  * Node budi-vmhana01: online:
    * Resources:
      * rsc_ip_HDB_HDB00        (ocf::heartbeat:IPaddr2):        Started
      * rsc_socat_HDB_HDB00     (ocf::heartbeat:anything):       Started
      * rsc_SAPHanaTopology_HDB_HDB00   (ocf::suse:SAPHanaTopology):     Started

Inactive Resources:
  * rsc_gcp_stonith_HDB_HDB00_budi-vmhana01     (stonith:fence_gce):     Stopped
  * Clone Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00] (promotable):
    * rsc_SAPHana_HDB_HDB00     (ocf::suse:SAPHana):     Stopped budi-vmhana01 (Monitoring)
    * Stopped: [ budi-vmhana01 ]

Migration Summary:
  * Node: budi-vmhana01:
    * rsc_gcp_stonith_HDB_HDB00_budi-vmhana01: migration-threshold=5000 fail-count=1000000 last-failure='Tue Apr  5 14:28:59 2022'
    * rsc_SAPHana_HDB_HDB00: migration-threshold=5000 fail-count=1000000 last-failure='Tue Apr  5 14:29:43 2022'

Failed Resource Actions:
  * rsc_gcp_stonith_HDB_HDB00_budi-vmhana01_start_0 on budi-vmhana01 'error' (1): call=24, status='complete', exitreason='', last-rc-change='2022-04-05 14:28:54Z', queued=0ms, exec=5019ms
  * rsc_SAPHana_HDB_HDB00_start_0 on budi-vmhana01 'not running' (7): call=36, status='complete', exitreason='', last-rc-change='2022-04-05 14:29:41Z', queued=0ms, exec=2148ms
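
For reference, the failed starts and their fail counts can be inspected with standard Pacemaker tooling; a minimal sketch, using the resource and node names from the output above:

# one-shot cluster status including inactive resources and fail counts
crm_mon -1rf

# fail count of the GCP fencing resource on node 1
crm resource failcount rsc_gcp_stonith_HDB_HDB00_budi-vmhana01 show budi-vmhana01

# pacemaker log entries around the failed fence_gce start
journalctl -u pacemaker --since "2022-04-05 14:28" | grep -i fence_gce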

If HANA on node 1 is then started manually, the deployment continues and completes successfully.
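
For reference, a manual start of HANA on node 1 would look roughly like this; a minimal sketch, assuming SID HDB and instance number 00 as implied by the resource names:

# start HANA as the <sid>adm user (hdbadm for SID HDB)
su - hdbadm -c "HDB start"

# wait until all processes report GREEN again
su - hdbadm -c "sapcontrol -nr 00 -function GetProcessList"

# clear the recorded start failure so the cluster manages the resource again
crm resource cleanup rsc_SAPHana_HDB_HDB00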

But CRM shows that HANA node 2 has suddenly taken over, as below:

Cluster Summary:
  * Stack: corosync
  * Current DC: budi-vmhana01 (version 2.0.4+20200616.2deceaa3a-3.15.1-2.0.4+20200616.2deceaa3a) - partition with quorum
  * Last updated: Tue Apr  5 14:44:31 2022
  * Last change:  Tue Apr  5 14:44:09 2022 by root via crm_attribute on budi-vmhana02
  * 2 nodes configured
  * 8 resource instances configured

Node List:
  * Online: [ budi-vmhana01 budi-vmhana02 ]

Active Resources:
  * Resource Group: g_ip_HDB_HDB00:
    * rsc_ip_HDB_HDB00  (ocf::heartbeat:IPaddr2):        Started budi-vmhana02
    * rsc_socat_HDB_HDB00       (ocf::heartbeat:anything):       Started budi-vmhana02
  * Clone Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00] (promotable):
    * Masters: [ budi-vmhana02 ]
  * Clone Set: cln_SAPHanaTopology_HDB_HDB00 [rsc_SAPHanaTopology_HDB_HDB00]:
    * Started: [ budi-vmhana01 budi-vmhana02 ]

Failed Resource Actions:
  * rsc_gcp_stonith_HDB_HDB00_budi-vmhana01_start_0 on budi-vmhana01 'error' (1): call=24, status='complete', exitreason='', last-rc-change='2022-04-05 14:28:54Z', queued=0ms, exec=5019ms
  * rsc_SAPHana_HDB_HDB00_start_0 on budi-vmhana01 'not running' (7): call=36, status='complete', exitreason='', last-rc-change='2022-04-05 14:29:41Z', queued=0ms, exec=2148ms
  * rsc_gcp_stonith_HDB_HDB00_budi-vmhana02_start_0 on budi-vmhana01 'error' (1): call=42, status='complete', exitreason='', last-rc-change='2022-04-05 14:40:25Z', queued=0ms, exec=4667ms
  * rsc_gcp_stonith_HDB_HDB00_budi-vmhana01_start_0 on budi-vmhana02 'error' (1): call=24, status='complete', exitreason='', last-rc-change='2022-04-05 14:40:12Z', queued=0ms, exec=4883ms
  * rsc_gcp_stonith_HDB_HDB00_budi-vmhana02_start_0 on budi-vmhana02 'error' (1): call=33, status='complete', exitreason='', last-rc-change='2022-04-05 14:40:20Z', queued=0ms, exec=4599ms

SAP HANA system replication also shows SFAIL.
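
For reference, the replication state can be double-checked on both nodes; a minimal sketch, again assuming SID HDB and instance number 00:

# cluster-side view of the SAPHanaSR attributes (roles, sync state, scores)
SAPHanaSR-showAttr

# HANA-side view of system replication, run as the <sid>adm user
su - hdbadm -c "hdbnsutil -sr_state"
su - hdbadm -c "HDBSettings.sh systemReplicationStatus.py"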

Please kindly advise...

yeoldegrove commented 2 years ago

This is https://bugzilla.suse.com/show_bug.cgi?id=1198872

Short version: fence-agents-4.9.0+git.1624456340.8d746be9-150300.3.8.1 is broken because the --zone parameter is now mandatory. Workaround: downgrade to the working fence-agents-4.9.0+git.1624456340.8d746be9-3.5.1.
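
A minimal sketch of that downgrade with zypper (illustrative; use the exact version string available in your channels):

# downgrade to the known-good build
zypper install --oldpackage fence-agents-4.9.0+git.1624456340.8d746be9-3.5.1

# optionally prevent zypper from upgrading it back to the broken build
zypper addlock fence-agents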

yeoldegrove commented 2 years ago

This issue is fixed in the latest available RPMs:

SUSE Linux Enterprise High Availability 15-SP4 (src):    fence-agents-4.9.0+git.1624456340.8d746be9-150300.3.11.1
SUSE Linux Enterprise High Availability 15-SP3 (src):    fence-agents-4.9.0+git.1624456340.8d746be9-150300.3.11.1

Updates are available in the official channels.
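
A minimal sketch of moving to the fixed build and letting the cluster retry the fencing resources (the lock removal only applies if the package was locked for the workaround above):

# drop the workaround lock, if any, and pull the fixed package
zypper removelock fence-agents
zypper refresh
zypper update fence-agents

# clear the recorded stonith start failures so the resources start again
crm resource cleanup rsc_gcp_stonith_HDB_HDB00_budi-vmhana01
crm resource cleanup rsc_gcp_stonith_HDB_HDB00_budi-vmhana02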