SUSE / ha-sap-terraform-deployments

Automated SAP/HA Deployments in Public/Private Clouds
GNU General Public License v3.0

S/4HANA Monitoring Server Failure #802

Closed ab-mohamed closed 2 years ago

ab-mohamed commented 2 years ago

Used cloud platform: GCP

Used SLES4SAP version: SLES4SAP 15 SP2

Used client machine OS: Google Cloud Shell

Expected behavior vs. observed behavior: The Monitoring Server was not able to show the S/4HANA HA cluster status.

How to reproduce

  1. Using the current master branch, start a new S/4HANA deployment.
  2. On the Monitoring Server, open Dashboards -> ClusterLabs HA Cluster details -> netweaver. It shows no output.

Troubleshooting Steps

  1. Checking the S/4HANA HA cluster status. It is up and running:
    
    demo1-netweaver01:~ # crm_mon -rnf1
    Cluster Summary:
    * Stack: corosync
    * Current DC: demo1-netweaver01 (version 2.0.4+20200616.2deceaa3a-3.12.1-2.0.4+20200616.2deceaa3a) - partition with quorum
    * Last updated: Mon Dec  6 14:57:13 2021
    * Last change:  Mon Dec  6 13:04:42 2021 by root via cibadmin on demo1-netweaver02
    * 2 nodes configured
    * 10 resource instances configured

    Node List:

    Inactive Resources:

    Migration Summary:


2. Checking the Prometheus configuration:

demo1-monitoring:~ # cat /etc/prometheus/prometheus.yml [...]

3. Checking whether all the ports mentioned above are listening:
    demo1-netweaver01:~ # ss -tupenl | grep -e State -e 9100 -e 9680 -e 9664
    Netid   State    Recv-Q   Send-Q     Local Address:Port      Peer Address:Port
    tcp     LISTEN   0        128            10.0.1.34:9680           0.0.0.0:*      users:(("sap_host_export",pid=4296,fd=3)) ino:38372 sk:10 <->
    tcp     LISTEN   0        128                    *:9100                 *:*      users:(("node_exporter",pid=1174,fd=3)) uid:472 ino:25341 sk:16 v6only:0 <->
    
    demo1-netweaver01:~ # systemctl status prometheus-node_exporter.service
    ● prometheus-node_exporter.service - Prometheus exporter for machine metrics
    Loaded: loaded (/usr/lib/systemd/system/prometheus-node_exporter.service; enabled; vendor preset: disabled)
    Active: active (running) since Mon 2021-12-06 13:02:04 UTC; 2h 16min ago
     Docs: https://github.com/prometheus/node_exporter
    Main PID: 1174 (node_exporter)
    Tasks: 9
    CGroup: /system.slice/prometheus-node_exporter.service
           └─1174 /usr/bin/node_exporter --collector.systemd --no-collector.mdadm

    Dec 06 13:02:08 demo1-netweaver01 node_exporter[1174]: level=info ts=2021-12-06T13:02:08.287Z caller=node_exporter.go:113 collector=thermal_zone
    Dec 06 13:02:08 demo1-netweaver01 node_exporter[1174]: level=info ts=2021-12-06T13:02:08.287Z caller=node_exporter.go:113 collector=time
    Dec 06 13:02:08 demo1-netweaver01 node_exporter[1174]: level=info ts=2021-12-06T13:02:08.287Z caller=node_exporter.go:113 collector=timex
    Dec 06 13:02:08 demo1-netweaver01 node_exporter[1174]: level=info ts=2021-12-06T13:02:08.287Z caller=node_exporter.go:113 collector=udp_queues
    Dec 06 13:02:08 demo1-netweaver01 node_exporter[1174]: level=info ts=2021-12-06T13:02:08.287Z caller=node_exporter.go:113 collector=uname
    Dec 06 13:02:08 demo1-netweaver01 node_exporter[1174]: level=info ts=2021-12-06T13:02:08.287Z caller=node_exporter.go:113 collector=vmstat
    Dec 06 13:02:08 demo1-netweaver01 node_exporter[1174]: level=info ts=2021-12-06T13:02:08.287Z caller=node_exporter.go:113 collector=xfs
    Dec 06 13:02:08 demo1-netweaver01 node_exporter[1174]: level=info ts=2021-12-06T13:02:08.287Z caller=node_exporter.go:113 collector=zfs
    Dec 06 13:02:08 demo1-netweaver01 node_exporter[1174]: level=info ts=2021-12-06T13:02:08.287Z caller=node_exporter.go:195 msg="Listening on" address=:9100
    Dec 06 13:02:08 demo1-netweaver01 node_exporter[1174]: level=info ts=2021-12-06T13:02:08.287Z caller=tls_config.go:191 msg="TLS is disabled." http2=false

    demo1-netweaver01:~ # systemctl status prometheus-sap_host_exporter@HA1_ASCS00.service
    ● prometheus-sap_host_exporter@HA1_ASCS00.service - Cluster Controlled prometheus-sap_host_exporter@HA1_ASCS00
    Loaded: loaded (/usr/lib/systemd/system/prometheus-sap_host_exporter@.service; disabled; vendor preset: disabled)
    Drop-In: /run/systemd/system/prometheus-sap_host_exporter@HA1_ASCS00.service.d
           └─50-pacemaker.conf
    Active: active (running) since Mon 2021-12-06 13:03:23 UTC; 2h 15min ago
     Docs: https://github.com/SUSE/sap_host_exporter
    Main PID: 4296 (sap_host_export)
    Tasks: 6
    CGroup: /system.slice/system-prometheus\x2dsap_host_exporter.slice/prometheus-sap_host_exporter@HA1_ASCS00.service
           └─4296 /usr/bin/sap_host_exporter --config /etc/sap_host_exporter/HA1_ASCS00.yaml

    Dec 06 13:03:23 demo1-netweaver01 systemd[1]: Started Cluster Controlled prometheus-sap_host_exporter@HA1_ASCS00.
    Dec 06 13:03:24 demo1-netweaver01 sap_host_exporter[4296]: time="2021-12-06T13:03:24Z" level=info msg="Using config file: /etc/sap_host_exporter/HA1_ASCS00.yaml"
    Dec 06 13:03:24 demo1-netweaver01 sap_host_exporter[4296]: time="2021-12-06T13:03:24Z" level=info msg="Monitoring SAP Instance SID: HA1, Name: ASCS00, Number: 0, Hostname: sapha1as"
    Dec 06 13:03:24 demo1-netweaver01 sap_host_exporter[4296]: time="2021-12-06T13:03:24Z" level=info msg="Start Service collector registered"
    Dec 06 13:03:24 demo1-netweaver01 sap_host_exporter[4296]: time="2021-12-06T13:03:24Z" level=info msg="Enqueue Server optional collector registered"
    Dec 06 13:03:24 demo1-netweaver01 sap_host_exporter[4296]: time="2021-12-06T13:03:24Z" level=info msg="Serving metrics on sapha1as:9680"

Port 9664, the default `ha_cluster_exporter` port, is missing from the `ss` output, so the `ha_cluster_exporter` service is not running.

4. Starting `ha_cluster_exporter` manually. It returns immediately with exit status 0:

    demo1-netweaver01:~ # /usr/bin/ha_cluster_exporter
    demo1-netweaver01:~ # echo $?
    0


5. The `ha_cluster_exporter` service is still not listening:

    demo1-netweaver01:~ # ss -tupenl | grep -e State -e 9100 -e 9680 -e 9664
    Netid   State    Recv-Q   Send-Q     Local Address:Port      Peer Address:Port
    tcp     LISTEN   0        128            10.0.1.34:9680           0.0.0.0:*      users:(("sap_host_export",pid=4296,fd=3)) ino:38372 sk:10 <->
    tcp     LISTEN   0        128                    *:9100                 *:*      users:(("node_exporter",pid=1174,fd=3)) uid:472 ino:25341 sk:16 v6only:0 <->

6. I found that the `/usr/bin/ha_cluster_exporter` file is empty (0 bytes):

    demo1-netweaver01:~ # ls -lh /usr/bin/ha_cluster_exporter
    -rwxr-xr-x 1 root root 0 Dec 3 09:16 /usr/bin/ha_cluster_exporter

    demo1-netweaver01:~ # cat /usr/bin/ha_cluster_exporter
    demo1-netweaver01:~ # echo $?
    0
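
The port checks in steps 3 and 5 could be scripted so that a missing listener stands out immediately. The following is a minimal sketch, not part of the deployment code: the function name `missing_ports` is hypothetical, and it assumes the socket listing comes from `ss -tln` (or the `ss -tupenl` call used above), where the local port is followed by whitespace.

```shell
#!/bin/sh
# Print every expected port that has no listening socket in the given listing.
# Hypothetical helper; intended usage on a cluster node:
#   ss -tln | missing_ports 9100 9680 9664
missing_ports() {
    # Read the socket listing from stdin once so it can be searched per port.
    listing=$(cat)
    for port in "$@"; do
        # Look for ":<port>" followed by whitespace (the Local Address:Port column).
        if ! printf '%s\n' "$listing" | grep -Eq ":${port}[[:space:]]"; then
            printf '%s\n' "$port"
        fi
    done
}
```

With the listing shown in this report, only 9664 would be printed, which matches the observation that `ha_cluster_exporter` is the one exporter without a listening socket.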

yeoldegrove commented 2 years ago

I was just able to deploy a complete NW stack with the latest master code and ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:/ha-clustering:/sap-deployments:/v7/". I cannot see any issues with the ha_cluster_exporter binary on my system:

test-netweaver01:~ # ls -la /usr/bin/ha_cluster_exporter
-rwxr-xr-x 1 root root 10313728 Jan  7 00:27 /usr/bin/ha_cluster_exporter
test-netweaver01:~ # file /usr/bin/ha_cluster_exporter
/usr/bin/ha_cluster_exporter: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, stripped
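
The difference between the two systems (a 0-byte file versus a roughly 10 MB ELF binary) can be checked without the `file` utility. The sketch below is an illustration under assumptions: `binary_state` is a hypothetical helper, and it relies on `head -c` and `od` being available, as they are on typical Linux systems. It only looks at the file size and the four-byte ELF magic number.

```shell
#!/bin/sh
# Classify a path as "missing", "empty", "elf", or "other".
# Hypothetical helper; intended usage:
#   binary_state /usr/bin/ha_cluster_exporter
binary_state() {
    path=$1
    if [ ! -e "$path" ]; then
        echo missing
    elif [ ! -s "$path" ]; then
        # The file exists but has size 0, as in the original report.
        echo empty
    elif [ "$(head -c 4 "$path" | od -An -tx1 | tr -d ' \n')" = "7f454c46" ]; then
        # The first four bytes are 0x7f 'E' 'L' 'F', the ELF magic number.
        echo elf
    else
        echo other
    fi
}
```

If the result is `empty`, running `rpm -qf /usr/bin/ha_cluster_exporter` to find the owning package and then `rpm -V` on that package should show whether the installed file differs from the packaged one.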


I used NW750 and will try S/4HANA 2020 next, as #804 also shows issues there.

yeoldegrove commented 2 years ago

@ab-mohamed Could you specify which S/4HANA version you are using?

Version 2020/2021 will only work with code from https://github.com/SUSE/sapnwbootstrap-formula/pull/92 and https://github.com/SUSE/ha-sap-terraform-deployments/pull/808. Please use the latest develop branch of https://github.com/SUSE/ha-sap-terraform-deployments together with ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:/ha-clustering:/sap-deployments:/devel/" and give it a shot. If this works for you, we can forward port things to the master branch.

However, these changes should be unrelated to the monitoring issues you faced, which I could not reproduce.

yeoldegrove commented 2 years ago

@ab-mohamed any comment on the S/4HANA version?

yeoldegrove commented 2 years ago

Please reopen if this still happens with the 8.0.0 release.