SUSE / ha-sap-terraform-deployments

Automated SAP/HA Deployments in Public/Private Clouds
GNU General Public License v3.0

Deployment on azure arbitrarily fails with `Cluster is not created yet. Run cluster_initialized before` #713

Closed: rtorrero closed this issue 3 years ago

rtorrero commented 3 years ago

Used cloud platform: Azure

Used SLES4SAP version: SLES15SP2

Used client machine OS: Linux

Expected behaviour vs observed behaviour: While attempting to deploy a 2-node cluster (plus monitoring) in Azure, terraform will often fail near the end of the process with the following message:

2021-06-14 13:19:31,506 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command '/usr/sbin/crm' in directory '/root'
2021-06-14 13:19:31,981 [salt.loaded.int.module.cmdmod:836 ][ERROR   ][4417] Command '/usr/sbin/crm' failed with return code: 1
2021-06-14 13:19:31,982 [salt.loaded.int.module.cmdmod:838 ][ERROR   ][4417] stdout: Could not connect to the CIB: Transport endpoint is not connected
crm_mon: Error: cluster is not available on this node
ERROR: status: crm_mon (rc=102):
2021-06-14 13:19:31,982 [salt.loaded.int.module.cmdmod:842 ][ERROR   ][4417] retcode: 1
2021-06-14 13:19:31,983 [salt.state       :323 ][ERROR   ][4417] Cluster is not created yet. Run cluster_initialized before
2021-06-14 13:19:31,983 [salt.state       :2063][INFO    ][4417] Completed state [update] at time 13:19:31.983469 (duration_in_ms=478.278)
2021-06-14 13:19:31,984 [salt.state       :1881][INFO    ][4417] Running state [configure-cluster-properties] at time 13:19:31.984224
2021-06-14 13:19:31,984 [salt.state       :1914][INFO    ][4417] Executing state crm.cluster_properties_present for [configure-cluster-properties]
2021-06-14 13:19:31,986 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command '/usr/sbin/crm' in directory '/root'
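
For reference, the state the crm salt module is checking here can also be inspected manually on the affected node with the standard pacemaker/corosync tooling (these commands are a suggestion, not output from this run):

# on the failing node (vmhana01 in this run)
systemctl status corosync pacemaker    # confirm both daemons are actually up
crm status                             # same /usr/sbin/crm call path the salt state uses
crm_mon -1                             # one-shot status; fails with rc=102 while the stack is still down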

How to reproduce

  1. Move to the azure cloud providers folder
  2. Create the terraform.tfvars file based on shared file in this bug report
  3. Run the following terraform commands (a combined sketch is shown right after this list):
    terraform init
    terraform apply
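
Put together, the reproduction boils down to something like the following (a sketch; the azure folder name and placing terraform.tfvars inside it are taken from the steps above):

# from a checkout of ha-sap-terraform-deployments
cd azure                  # cloud provider folder (step 1)
# create terraform.tfvars here with the content shared below (step 2)
terraform init
terraform apply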

Used terraform.tfvars

pre_deployment         = true
provisioning_log_level = "info"

# Region where to deploy the configuration
az_region = "westeurope"
admin_user = "cloudadmin"

# Optional SUSE Customer Center Registration parameters
reg_code = "<cleared-for-bug-report>"
reg_email = "<cleared-for-bug-report>"

#reg_additional_modules = {
#    "sle-module-adv-systems-management/12/x86_64" = ""
#    "sle-module-containers/12/x86_64" = ""
#    "sle-ha-geo/12.4/x86_64" = "<<REG_CODE>>"
#}

# Public sles4sap image
os_image = "SUSE:sles-sap-15-sp2-byos:gen2:latest"
# You can use a PAYG image if you don't have the registration code above, though it is more expensive
# To do this, comment out the reg_code and reg_email above and use the os_image option below
#os_image = "SUSE:sles-sap-15-sp2:gen2:latest"

public_key      = "~/.ssh/id_rsa_cloud.pub"
private_key     = "~/.ssh/id_rsa_cloud"

# Azure storage account name
storage_account_name = "<cleared-for-bug-report>"

# Azure storage account secret key (key1 or key2)
storage_account_key = "<cleared-for-bug-report>"

# Number of nodes in the cluster
hana_master_password               = "<cleared-for-bug-report>"
hana_count                         = 2
hana_ha_enabled                    = true
hana_vm_size                       = "Standard_E4s_v3"
hana_enable_accelerated_networking = false
hana_active_active                 = false

# Azure storage account path
hana_inst_master = "<cleared-for-bug-report>/51052481"

# Local folder where HANA installation master will be mounted
hana_inst_folder = "/sapmedia/HANA"
hana_extract_dir = "/sapmedia_extract/HANA"

# Path to a custom ssh public key to upload to the nodes
# Used for cluster communication for example
cluster_ssh_pub = "salt://sshkeys/cluster.id_rsa.pub"

# Path to a custom ssh private key to upload to the nodes
# Used for cluster communication for example
cluster_ssh_key = "salt://sshkeys/cluster.id_rsa"

# HA packages Repository
ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:ha-clustering:sap-deployments:devel"

# Enable the host to be monitored by exporters
monitoring_enabled = true

# Enable drbd cluster
drbd_enabled = false

# Netweaver variables

netweaver_enabled          = false
netweaver_ha_enabled       = true
netweaver_master_password  = "<cleared-for-bug-report>"
netweaver_app_server_count = 2
netweaver_xscs_vm_size     = "Standard_E4s_v3"
netweaver_app_vm_size      = "Standard_E4s_v3"

netweaver_storage_account_key  = "<cleared-for-bug-report>"
netweaver_storage_account_name = "shapnetweaver"
netweaver_storage_account      = "//shapnetweaver.file.core.windows.net/netweaver"

# The next line would resolve to something like shapnetweaver/netweaver/SWPM_10_SP26_6
netweaver_swpm_folder = "SWPM_10_SP26_6"
netweaver_sapexe_folder   = "kernel_nw75_sar"
netweaver_additional_dvds = ["51050829_3", "51053787"]

bastion_enabled = false 

Logs

2021-06-14 13:19:29,443 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command systemctl in directory '/root'
2021-06-14 13:19:29,464 [salt.loaded.int.module.cmdmod:838 ][DEBUG   ][4417] stdout: * corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2021-06-14 13:19:02 UTC; 27s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 17570 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
 Main PID: 17592 (corosync)
    Tasks: 2
   CGroup: /system.slice/corosync.service
           `-17592 corosync
2021-06-14 13:19:29,466 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command systemctl in directory '/root'
2021-06-14 13:19:29,478 [salt.loaded.int.module.cmdmod:838 ][DEBUG   ][4417] stdout: active
2021-06-14 13:19:29,479 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command systemctl in directory '/root'
2021-06-14 13:19:29,499 [salt.loaded.int.module.cmdmod:838 ][DEBUG   ][4417] stdout: disabled
2021-06-14 13:19:29,500 [salt.loaded.int.module.cmdmod:842 ][DEBUG   ][4417] retcode: 1
2021-06-14 13:19:29,502 [salt.state       :321 ][INFO    ][4417] The service corosync is already running
2021-06-14 13:19:29,502 [salt.state       :2063][INFO    ][4417] Completed state [corosync] at time 13:19:29.502694 (duration_in_ms=60.424)
2021-06-14 13:19:29,503 [salt.state       :1881][INFO    ][4417] Running state [corosync] at time 13:19:29.503242
2021-06-14 13:19:29,504 [salt.state       :1914][INFO    ][4417] Executing state service.mod_watch for [corosync]
2021-06-14 13:19:29,506 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command systemctl in directory '/root'
2021-06-14 13:19:29,519 [salt.loaded.int.module.cmdmod:838 ][DEBUG   ][4417] stdout: active
2021-06-14 13:19:29,520 [salt.loaded.int.module.config:483 ][DEBUG   ][4417] key: systemd.scope, ret: _|-
2021-06-14 13:19:29,521 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command systemd-run in directory '/root'
2021-06-14 13:19:31,468 [salt.loaded.int.module.cmdmod:840 ][DEBUG   ][4417] stderr: Running scope as unit: run-r010249d97d254607b4cf57eb5d51ac6d.scope
2021-06-14 13:19:31,469 [salt.state       :321 ][INFO    ][4417] {'corosync': True}
2021-06-14 13:19:31,469 [salt.state       :2063][INFO    ][4417] Completed state [corosync] at time 13:19:31.469311 (duration_in_ms=1966.069)
2021-06-14 13:19:31,470 [salt.state       :1881][INFO    ][4417] Running state [/tmp/sbd.config] at time 13:19:31.469989
2021-06-14 13:19:31,471 [salt.state       :1914][INFO    ][4417] Executing state file.managed for [/tmp/sbd.config]
2021-06-14 13:19:31,477 [salt.fileclient  :1101][DEBUG   ][4417] In saltenv 'base', looking at rel_path 'cluster/templates/sbd_resource.j2' to resolve 'salt://cluster/templates/sbd_resource.j2'
2021-06-14 13:19:31,477 [salt.fileclient  :1109][DEBUG   ][4417] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/cluster/templates/sbd_resource.j2' to resolve 'salt://cluster/templates/sbd_resource.j2'
2021-06-14 13:19:31,477 [salt.fileclient  :1129][DEBUG   ][4417] Fetching file from saltenv 'base', ** attempting ** 'salt://cluster/templates/sbd_resource.j2'
2021-06-14 13:19:31,478 [salt.fileclient  :1157][DEBUG   ][4417] No dest file found
2021-06-14 13:19:31,479 [salt.fileclient  :1238][INFO    ][4417] Fetching file from saltenv 'base', ** done ** 'cluster/templates/sbd_resource.j2'
2021-06-14 13:19:31,480 [salt.utils.jinja :91  ][DEBUG   ][4417] Jinja search path: ['/var/cache/salt/minion/files/base']
2021-06-14 13:19:31,503 [salt.state       :321 ][INFO    ][4417] File changed:
New file
2021-06-14 13:19:31,503 [salt.state       :2063][INFO    ][4417] Completed state [/tmp/sbd.config] at time 13:19:31.503331 (duration_in_ms=33.342)
2021-06-14 13:19:31,505 [salt.state       :1881][INFO    ][4417] Running state [update] at time 13:19:31.505190
2021-06-14 13:19:31,505 [salt.state       :1914][INFO    ][4417] Executing state crm.cluster_configured for [update]
2021-06-14 13:19:31,506 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command '/usr/sbin/crm' in directory '/root'
2021-06-14 13:19:31,981 [salt.loaded.int.module.cmdmod:836 ][ERROR   ][4417] Command '/usr/sbin/crm' failed with return code: 1
2021-06-14 13:19:31,982 [salt.loaded.int.module.cmdmod:838 ][ERROR   ][4417] stdout: Could not connect to the CIB: Transport endpoint is not connected
crm_mon: Error: cluster is not available on this node
ERROR: status: crm_mon (rc=102):
2021-06-14 13:19:31,982 [salt.loaded.int.module.cmdmod:842 ][ERROR   ][4417] retcode: 1
2021-06-14 13:19:31,983 [salt.state       :323 ][ERROR   ][4417] Cluster is not created yet. Run cluster_initialized before
2021-06-14 13:19:31,983 [salt.state       :2063][INFO    ][4417] Completed state [update] at time 13:19:31.983469 (duration_in_ms=478.278)
2021-06-14 13:19:31,984 [salt.state       :1881][INFO    ][4417] Running state [configure-cluster-properties] at time 13:19:31.984224
2021-06-14 13:19:31,984 [salt.state       :1914][INFO    ][4417] Executing state crm.cluster_properties_present for [configure-cluster-properties]
2021-06-14 13:19:31,986 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command '/usr/sbin/crm' in directory '/root'
2021-06-14 13:19:32,482 [salt.loaded.int.module.cmdmod:838 ][DEBUG   ][4417] stdout: Critical: Unable to get stonith-history
Connection to the cluster-daemons terminated
Reading stonith-history failed
2021-06-14 13:19:32,483 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command '/usr/sbin/crm' in directory '/root'
2021-06-14 13:19:32,792 [salt.loaded.int.module.cmdmod:838 ][DEBUG   ][4417] stdout: true
2021-06-14 13:19:32,793 [salt.loaded.int.module.cmdmod:1197][DEBUG   ][4417] output: true
2021-06-14 13:19:32,793 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command '/usr/sbin/crm' in directory '/root'
2021-06-14 13:19:33,357 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command '/usr/sbin/crm' in directory '/root'
2021-06-14 13:19:33,695 [salt.loaded.int.module.cmdmod:838 ][DEBUG   ][4417] stdout: 60s
2021-06-14 13:19:33,696 [salt.loaded.int.module.cmdmod:1197][DEBUG   ][4417] output: 60s
2021-06-14 13:19:33,697 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command '/usr/sbin/crm' in directory '/root'
2021-06-14 13:19:34,119 [salt.state       :321 ][INFO    ][4417] {'stonith-enabled': True, 'stonith-timeout': '144s'}
2021-06-14 13:19:34,120 [salt.state       :2063][INFO    ][4417] Completed state [configure-cluster-properties] at time 13:19:34.120235 (duration_in_ms=2136.011)
2021-06-14 13:19:34,120 [salt.state       :1881][INFO    ][4417] Running state [/tmp/cluster.config] at time 13:19:34.120881
2021-06-14 13:19:34,121 [salt.state       :1914][INFO    ][4417] Executing state file.managed for [/tmp/cluster.config]
2021-06-14 13:19:34,125 [salt.fileclient  :1101][DEBUG   ][4417] In saltenv 'base', looking at rel_path 'hana/templates/scale_up_resources.j2' to resolve 'salt://hana/templates/scale_up_resources.j2'
2021-06-14 13:19:34,126 [salt.fileclient  :1109][DEBUG   ][4417] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/hana/templates/scale_up_resources.j2' to resolve 'salt://hana/templates/scale_up_resources.j2'
2021-06-14 13:19:34,126 [salt.fileclient  :1129][DEBUG   ][4417] Fetching file from saltenv 'base', ** attempting ** 'salt://hana/templates/scale_up_resources.j2'
2021-06-14 13:19:34,126 [salt.fileclient  :1157][DEBUG   ][4417] No dest file found
2021-06-14 13:19:34,127 [salt.fileclient  :1238][INFO    ][4417] Fetching file from saltenv 'base', ** done ** 'hana/templates/scale_up_resources.j2'
2021-06-14 13:19:34,128 [salt.utils.jinja :91  ][DEBUG   ][4417] Jinja search path: ['/var/cache/salt/minion/files/base']
2021-06-14 13:19:34,244 [salt.state       :321 ][INFO    ][4417] File changed:
New file
2021-06-14 13:19:34,244 [salt.state       :2063][INFO    ][4417] Completed state [/tmp/cluster.config] at time 13:19:34.244663 (duration_in_ms=123.782)
2021-06-14 13:19:34,246 [salt.state       :1881][INFO    ][4417] Running state [update] at time 13:19:34.245984
2021-06-14 13:19:34,246 [salt.state       :1914][INFO    ][4417] Executing state crm.cluster_configured for [update]
2021-06-14 13:19:34,247 [salt.loaded.int.module.cmdmod:395 ][INFO    ][4417] Executing command '/usr/sbin/crm' in directory '/root'
2021-06-14 13:19:34,751 [salt.loaded.int.module.cmdmod:838 ][DEBUG   ][4417] stdout: Cluster Summary:
  * Stack: corosync
  * Current DC: NONE
  * Last updated: Mon Jun 14 13:19:34 2021
  * Last change:  Mon Jun 14 13:19:34 2021 by root via cibadmin on vmhana01
  * 1 node configured
  * 1 resource instance configured

Node List:
  * Node vmhana01: UNCLEAN (offline)

Full List of Resources:
  * stonith-sbd (stonith:external/sbd):  Stopped

Please request any additional logs privately, as they might contain secrets that I missed removing.

arbulu89 commented 3 years ago

@rtorrero In which of the 2 nodes is this error happening?

rtorrero commented 3 years ago

@rtorrero In which of the 2 nodes is this error happening?

vmhana01

yeoldegrove commented 3 years ago

I can easily reproduce this on a successfully deployed HANA cluster in Azure by doing ...

vmhana01:~ # systemctl stop corosync;systemctl stop hawk;systemctl disable hawk; rm -rf /var/lib/pacemaker/*/*
vmhana02:~ # systemctl stop corosync;systemctl stop hawk;systemctl disable hawk; rm -rf /var/lib/pacemaker/*/*
vmhana01:~ # salt-call --local state.apply cluster
vmhana02:~ # salt-call --local state.apply cluster
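
While the salt-call runs, the cluster state can be followed from a second shell, for example (a suggestion, not part of the original reproduction):

# on either node, in another terminal
watch -n 2 'crm_mon -1'    # refreshes a one-shot cluster status every 2 seconds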

Looking at the cluster state while the salt-call runs, I noticed that the cluster itself is initialized correctly. The issue seems to be that corosync is restarted in my use case.

----------
          ID: corosync_service
    Function: service.running
        Name: corosync
      Result: True
     Comment: Service restarted
     Started: 10:53:45.217711
    Duration: 2671.614 ms
     Changes:
              ----------
              corosync:
                  True
----------

Corosync takes quite a while to come back up, and the configure-sbd-resource, configure-cluster-properties and configure-the-cluster states run before the restart has completed.

The solution in my case was to build in a `crm cluster wait_for_startup` loop after the corosync restart.

You can find a PR fixing this here: https://github.com/SUSE/habootstrap-formula/pull/86
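
The approach in that PR boils down to waiting for the cluster stack to answer again after the corosync restart before running the later configure-* states. A minimal shell sketch of the idea (the actual fix is a Salt state in the formula; a non-zero exit of wait_for_startup on timeout is assumed here):

# after corosync has been restarted by the corosync_service state
systemctl restart corosync
# block until pacemaker/corosync respond again; retry if wait_for_startup times out
until crm cluster wait_for_startup; do
    echo "cluster stack not up yet, retrying..."
    sleep 5
done
# only now run configure-sbd-resource / configure-cluster-properties / configure-the-cluster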

yeoldegrove commented 3 years ago

@rtorrero SUSE/habootstrap-formula#86 was just merged. Could you please try it out by setting this in terraform.tfvars ...

ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:/ha-clustering:/sap-deployments:/devel/SLE_15_SP2/"

And making sure habootstrap-formula-0.4.4 is used?
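
One way to check which formula version ends up installed on the nodes (a suggestion; assumes the package keeps the habootstrap-formula name):

zypper info habootstrap-formula    # should report 0.4.4 or later from the devel repo
# or
rpm -q habootstrap-formula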

rtorrero commented 3 years ago

This issue doesn't seem to be happening anymore, thanks!