SUSE / ha-sap-terraform-deployments

Automated SAP/HA Deployments in Public/Private Clouds

NetWeaver 7.5 failover failure. #780

Closed · ab-mohamed closed 3 years ago

ab-mohamed commented 3 years ago

Used cloud platform: GCP

Used SLES4SAP version: SLES15SP2 for SAP Applications

Used client machine OS: Google Cloud Shell

Expected behavior vs observed behavior

Expected behavior: a successful failover means that:

  1. The ASCS service moves from the first node default-netweaver01 to the second node, default-netweaver02.
  2. The ERS Service moves from the second node default-netweaver02 to the first node, default-netweaver01.

Observed behavior: after the failover completes and the move constraint is cleared, the ASCS service moves back to the first node, default-netweaver01 VM.

How to reproduce

  1. Clone the `master` branch.

  2. Configure the Terraform variables file.

  3. Execute the `terraform init` and `terraform apply --auto-approve` commands (see the sketch below).
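
For reference, a minimal command sketch of steps 1-3 (the `gcp` directory and the `terraform.tfvars.example` path are assumptions based on the repository layout; adjust to your checkout):

    git clone https://github.com/SUSE/ha-sap-terraform-deployments.git
    cd ha-sap-terraform-deployments/gcp            # assumed cloud folder for GCP
    cp terraform.tfvars.example terraform.tfvars   # then edit the Terraform variables
    terraform init
    terraform apply --auto-approve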

  4. The deployment completes successfully, and the cluster status shows that:

    • The ASCS service is working on default-netweaver01 VM.
    • The ERS service is working on default-netweaver02 VM. Here is the initial cluster status:
      
      default-netweaver01:~ # crm_mon -rnf1
      Cluster Summary:
      * Stack: corosync
      * Current DC: default-netweaver01 (version 2.0.4+20200616.2deceaa3a-3.12.1-2.0.4+20200616.2deceaa3a) - partition with quorum
      * Last updated: Mon Oct 18 13:28:19 2021
      * Last change: Mon Oct 18 10:15:10 2021 by root via cibadmin on default-netweaver02
      * 2 nodes configured
      * 10 resource instances configured

Node List:

Inactive Resources:

Migration Summary:

Failed Resource Actions:


5. Following `https://github.com/SUSE/ha-sap-terraform-deployments/issues/779` -> How to Reproduce -> Step 6, I updated the `SAPInstance` RA configuration to be:

primitive rsc_sap_HA1_ASCS00 SAPInstance \
  operations $id=rsc_sap_HA1_ASCS00-operations \
  op monitor interval=11 timeout=60 \
  op_params on-fail=restart \
  params InstanceName=HA1_ASCS00_sapha1as START_PROFILE="/sapmnt/HA1/profile/HA1_ASCS00_sapha1as" AUTOMATIC_RECOVER=false \
  meta resource-stickiness=5000 failure-timeout=60 migration-threshold=1 priority=10
primitive rsc_sap_HA1_ERS10 SAPInstance \
  operations $id=rsc_sap_HA1_ERS10-operations \
  op monitor interval=11 timeout=60 \
  op_params on-fail=restart \
  params InstanceName=HA1_ERS10_sapha1er START_PROFILE="/sapmnt/HA1/profile/HA1_ERS10_sapha1er" AUTOMATIC_RECOVER=false IS_ERS=true \
  meta priority=1000
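
As a side note for anyone reproducing this, a sketch of one way to apply such changes on a running cluster with the crm shell (the temporary file name is hypothetical):

    # Dump the two primitives, edit the meta attributes, and reload the result.
    default-netweaver01:~ # crm configure show rsc_sap_HA1_ASCS00 rsc_sap_HA1_ERS10 > /tmp/sapinstance.crm
    default-netweaver01:~ # vi /tmp/sapinstance.crm
    default-netweaver01:~ # crm configure load update /tmp/sapinstance.crm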


6. The cluster status remains the same as in step 4.

7. Move the `ASCS` service to the other node, `default-netweaver02`:

default-netweaver01:~ # crm resource move rsc_sap_HA1_ASCS00 force
INFO: Move constraint created for rsc_sap_HA1_ASCS00
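
Note that `crm resource move` pins the resource by creating a location constraint (named `cli-prefer-<resource>` by pacemaker) that stays in place until it is cleared. Its presence can be confirmed with something like:

    default-netweaver01:~ # crm configure show | grep cli-prefer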


8.  Wait until the `ASCS` service moves successfully to `default-netweaver02` VM and the `ERS` service moves to `default-netweaver01` VM:

default-netweaver01:~ # crm_mon -rnf1
Cluster Summary:

Node List:

Inactive Resources:

Migration Summary:

Failed Resource Actions:
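
A quick way to follow where the instances land during the failover (standard pacemaker tooling; the `watch` interval is arbitrary):

    default-netweaver01:~ # watch -n 5 'crm_mon -rnf1'
    default-netweaver01:~ # crm_resource --locate --resource rsc_sap_HA1_ASCS00
    default-netweaver01:~ # crm_resource --locate --resource rsc_sap_HA1_ERS10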

9. I noticed that the rsc_exporter_HA1_ASCS00 and rsc_exporter_HA1_ERS10 RAs are stopped now.

10. Clear the ASCS RA:

    default-netweaver01:~ # crm resource clear rsc_sap_HA1_ASCS00
    INFO: Removed migration constraints for rsc_sap_HA1_ASCS00

11. The ASCS service moves back to the first node, default-netweaver01:

    
    default-netweaver01:~ # crm_mon -rnf1
    Cluster Summary:
    * Stack: corosync
    * Current DC: default-netweaver01 (version 2.0.4+20200616.2deceaa3a-3.12.1-2.0.4+20200616.2deceaa3a) - partition with quorum
    * Last updated: Mon Oct 18 13:46:55 2021
    * Last change:  Mon Oct 18 13:41:17 2021 by root via crm_resource on default-netweaver01
    * 2 nodes configured
    * 10 resource instances configured

Node List:

Inactive Resources:

Migration Summary:

Failed Resource Actions:
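
This move-back is the unexpected part: with `resource-stickiness=5000` from step 5, the ASCS instance should stay on default-netweaver02 once the constraint is removed. A sketch for double-checking which stickiness the cluster actually applies (standard pacemaker commands):

    # Per-resource meta attribute (set to 5000 in step 5)
    default-netweaver01:~ # crm_resource --resource rsc_sap_HA1_ASCS00 --meta --get-parameter resource-stickiness
    # Cluster-wide default, if any
    default-netweaver01:~ # crm_attribute --type rsc_defaults --name resource-stickiness --query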

12. Repeat steps 7, 8, 9, and 10, this time while monitoring the pacemaker and corosync services:
    default-netweaver01:~ #  journalctl -u pacemaker -u corosync -f
    [...]
    Oct 18 14:10:06 default-netweaver01 pacemaker-controld[18686]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  warning: Unexpected result (not running) was recorded for monitor of rsc_sap_HA1_ERS10 on default-netweaver01 at Oct 18 13:42:00 2021
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  warning: Unexpected result (error) was recorded for start of rsc_exporter_HA1_ERS10 on default-netweaver01 at Oct 18 10:13:06 2021
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  warning: Unexpected result (error) was recorded for start of rsc_exporter_HA1_ASCS00 on default-netweaver02 at Oct 18 13:36:18 2021
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  warning: Unexpected result (not running) was recorded for monitor of rsc_sap_HA1_ERS10 on default-netweaver02 at Oct 18 13:36:05 2021
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  warning: Forcing rsc_exporter_HA1_ERS10 away from default-netweaver01 after 1000000 failures (max=3)
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  warning: Forcing rsc_exporter_HA1_ASCS00 away from default-netweaver02 after 1000000 failures (max=3)
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  notice:  * Move       rsc_ip_HA1_ASCS00                           ( default-netweaver02 -> default-netweaver01 )
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  notice:  * Move       rsc_fs_HA1_ASCS00                           ( default-netweaver02 -> default-netweaver01 )
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  notice:  * Move       rsc_sap_HA1_ASCS00                          ( default-netweaver02 -> default-netweaver01 )
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  notice:  * Start      rsc_exporter_HA1_ASCS00                     (                        default-netweaver01 )
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  notice:  * Move       rsc_ip_HA1_ERS10                            ( default-netweaver01 -> default-netweaver02 )
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  notice:  * Move       rsc_fs_HA1_ERS10                            ( default-netweaver01 -> default-netweaver02 )
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  notice:  * Move       rsc_sap_HA1_ERS10                           ( default-netweaver01 -> default-netweaver02 )
    Oct 18 14:10:06 default-netweaver01 pacemaker-schedulerd[18685]:  notice:  * Start      rsc_exporter_HA1_ERS10                      (                        default-netweaver02 )
    
    default-netweaver01:~ # crm_mon -rnf1
    Cluster Summary:
    * Stack: corosync
    * Current DC: default-netweaver01 (version 2.0.4+20200616.2deceaa3a-3.12.1-2.0.4+20200616.2deceaa3a) - partition with quorum
    * Last updated: Mon Oct 18 14:12:20 2021
    * Last change:  Mon Oct 18 14:10:06 2021 by root via crm_resource on default-netweaver01
    * 2 nodes configured
    * 10 resource instances configured

Node List:

Inactive Resources:

Migration Summary:

Failed Resource Actions:
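
The "1000000 failures" in the log is pacemaker's INFINITY fail count on the exporter resources, which is what forces them away from the nodes. A sketch for inspecting and resetting those counts (resource and node names taken from the log above):

    default-netweaver01:~ # crm_failcount --query --resource rsc_exporter_HA1_ASCS00 --node default-netweaver02
    default-netweaver01:~ # crm_failcount --query --resource rsc_exporter_HA1_ERS10 --node default-netweaver01
    # Reset the fail count and failed operation history for one resource:
    default-netweaver01:~ # crm resource cleanup rsc_exporter_HA1_ASCS00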

Best regards, Ab

yeoldegrove commented 3 years ago

This should be fixed in https://github.com/SUSE/sapnwbootstrap-formula/pull/88

yeoldegrove commented 3 years ago

related to #712

ab-mohamed commented 3 years ago

Thanks, @yeoldegrove, for the quick update.

How can I use the mentioned fix in my deployment? I am using the `master` branch and the ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:/ha-clustering:/sap-deployments:/v7/" repo.

Best regards, Ab

yeoldegrove commented 3 years ago

@ab-mohamed sapnwbootstrap-formula-0.6.7+git.1630666671.a8b69d3 was just released in "https://download.opensuse.org/repositories/network:/ha-clustering:/sap-deployments:/v7/". Please check whether this resolves your issue.
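
To verify that a deployed node actually picks up the new package from that repository, something like this should work (standard zypper/rpm commands, assuming the v7 repo is already configured on the node):

    default-netweaver01:~ # zypper refresh
    default-netweaver01:~ # zypper info sapnwbootstrap-formula   # should list 0.6.7+git.1630666671.a8b69d3 or newer
    default-netweaver01:~ # rpm -q sapnwbootstrap-formula        # currently installed version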

ab-mohamed commented 3 years ago

@yeoldegrove, Thank you for the release, which fixes this issue.

You can close this issue. :)

Best regards, Ab