ClusterLabs HA Multi-Cluster overview dashboard incorrect results

ab-mohamed commented 3 years ago

Used cloud platform: GCP

Used SLES4SAP version: SLES15SP2

Used client machine OS: Google Cloud Shell

Expected behavior vs observed behavior

Expected behavior: When a NetWeaver 7.5 HA cluster is deployed successfully, the ClusterLabs HA Multi-Cluster Overview dashboard should show 100% Up nodes for the NetWeaver HA cluster nodes.

Observed behavior: It shows 50% Up nodes for NetWeaver HA cluster nodes.

How to reproduce

NetWeaver 7.5 normal deployment.
Here are screenshots from the dashboards:

PAS and AAS are up and running:

ha1adm 2> sapcontrol -nr 01 -function GetProcessList
20.10.2021 10:42:07
GetProcessList
OK
name, description, dispstatus, textstatus, starttime, elapsedtime, pid
disp+work, Dispatcher, GREEN, Running, 2021 10 20 08:58:28, 1:43:39, 22253
igswd_mt, IGS Watchdog, GREEN, Running, 2021 10 20 08:58:28, 1:43:39, 22254
gwrd, Gateway, GREEN, Running, 2021 10 20 08:58:29, 1:43:38, 22258
icman, ICM, GREEN, Running, 2021 10 20 08:58:29, 1:43:38, 22259


ha1adm 4> sapcontrol -nr 02 -function GetProcessList

20.10.2021 10:43:07
GetProcessList
OK
name, description, dispstatus, textstatus, starttime, elapsedtime, pid
disp+work, Dispatcher, GREEN, Running, 2021 10 20 09:01:22, 1:41:45, 17306
igswd_mt, IGS Watchdog, GREEN, Running, 2021 10 20 09:01:22, 1:41:45, 17307
gwrd, Gateway, GREEN, Running, 2021 10 20 09:01:23, 1:41:44, 17311
icman, ICM, GREEN, Running, 2021 10 20 09:01:23, 1:41:44, 17312
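
The two listings above can also be checked mechanically. The following is a minimal Python sketch, not part of the original report, that parses the comma-separated text printed by sapcontrol -function GetProcessList (as pasted above) and reports whether every process has dispstatus GREEN. The helper names parse_process_list and all_green are hypothetical and introduced only for illustration.

```python
from typing import Dict, List

# Sample text copied from the instance 01 listing in this report.
SAMPLE = """\
20.10.2021 10:42:07
GetProcessList
OK
name, description, dispstatus, textstatus, starttime, elapsedtime, pid
disp+work, Dispatcher, GREEN, Running, 2021 10 20 08:58:28, 1:43:39, 22253
igswd_mt, IGS Watchdog, GREEN, Running, 2021 10 20 08:58:28, 1:43:39, 22254
gwrd, Gateway, GREEN, Running, 2021 10 20 08:58:29, 1:43:38, 22258
icman, ICM, GREEN, Running, 2021 10 20 08:58:29, 1:43:38, 22259
"""


def parse_process_list(output: str) -> List[Dict[str, str]]:
    """Turn the pasted GetProcessList text into one dict per process row."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The column header is the line that starts with "name,".
    header_index = next(i for i, line in enumerate(lines) if line.startswith("name,"))
    keys = [key.strip() for key in lines[header_index].split(",")]
    rows = []
    for line in lines[header_index + 1:]:
        values = [value.strip() for value in line.split(",")]
        # Skip anything that does not have one value per column.
        if len(values) == len(keys):
            rows.append(dict(zip(keys, values)))
    return rows


def all_green(rows: List[Dict[str, str]]) -> bool:
    """True only if at least one process is listed and all are GREEN."""
    return bool(rows) and all(row["dispstatus"] == "GREEN" for row in rows)


rows = parse_process_list(SAMPLE)
print(len(rows), all_green(rows))
```

The sketch assumes no field contains a comma, which holds for the output pasted in this issue but is not guaranteed in general.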

stefanotorresi commented 3 years ago

Hmm, there should be 4 NetWeaver nodes up in total, 2 of which are clustered. Is node exporter running on the PAS and AAS nodes?
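
As a rough illustration of the arithmetic only: if the panel simply divided the nodes it sees as up by the nodes it expects (an assumption, since the actual dashboard query is not shown in this thread), then seeing 2 of the 4 NetWeaver nodes would give the reported 50%. The function name percent_up is hypothetical.

```python
def percent_up(up_nodes: int, expected_nodes: int) -> float:
    """Share of expected nodes that are reported as up, in percent."""
    if expected_nodes <= 0:
        raise ValueError("expected_nodes must be positive")
    return 100.0 * up_nodes / expected_nodes


# Assumed scenario: only 2 of the 4 NetWeaver nodes are counted as up.
print(percent_up(2, 4))  # 50.0
# Expected scenario: all 4 nodes are counted as up.
print(percent_up(4, 4))  # 100.0
```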

ab-mohamed commented 3 years ago

@stefanotorresi I did not check it before destroying the environment.

Which command should I use to check the node exporter on the PAS and AAS nodes?

ab-mohamed commented 3 years ago

@stefanotorresi, the exporter systemd service works on PAS and AAS nodes in addition to the ASCS and ERS nodes:

dev-demo1-netweaver01:~ # systemctl status prometheus-node_exporter.service
● prometheus-node_exporter.service - Prometheus exporter for machine metrics
   Loaded: loaded (/usr/lib/systemd/system/prometheus-node_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2021-10-20 11:51:18 UTC; 2h 21min ago
     Docs: https://github.com/prometheus/node_exporter
 Main PID: 3763 (node_exporter)
    Tasks: 9
   CGroup: /system.slice/prometheus-node_exporter.service
           └─3763 /usr/bin/node_exporter --collector.systemd --no-collector.mdadm
Oct 20 11:51:18 dev-demo1-netweaver01 node_exporter[3763]: level=info ts=2021-10-20T11:51:18.816Z caller=node_exporter.go:113 collector=thermal_zone
Oct 20 11:51:18 dev-demo1-netweaver01 node_exporter[3763]: level=info ts=2021-10-20T11:51:18.816Z caller=node_exporter.go:113 collector=time
Oct 20 11:51:18 dev-demo1-netweaver01 node_exporter[3763]: level=info ts=2021-10-20T11:51:18.816Z caller=node_exporter.go:113 collector=timex
Oct 20 11:51:18 dev-demo1-netweaver01 node_exporter[3763]: level=info ts=2021-10-20T11:51:18.816Z caller=node_exporter.go:113 collector=udp_queues
Oct 20 11:51:18 dev-demo1-netweaver01 node_exporter[3763]: level=info ts=2021-10-20T11:51:18.816Z caller=node_exporter.go:113 collector=uname
Oct 20 11:51:18 dev-demo1-netweaver01 node_exporter[3763]: level=info ts=2021-10-20T11:51:18.816Z caller=node_exporter.go:113 collector=vmstat
Oct 20 11:51:18 dev-demo1-netweaver01 node_exporter[3763]: level=info ts=2021-10-20T11:51:18.816Z caller=node_exporter.go:113 collector=xfs
Oct 20 11:51:18 dev-demo1-netweaver01 node_exporter[3763]: level=info ts=2021-10-20T11:51:18.816Z caller=node_exporter.go:113 collector=zfs
Oct 20 11:51:18 dev-demo1-netweaver01 node_exporter[3763]: level=info ts=2021-10-20T11:51:18.816Z caller=node_exporter.go:195 msg="Listening on" address=:9100
Oct 20 11:51:18 dev-demo1-netweaver01 node_exporter[3763]: level=info ts=2021-10-20T11:51:18.817Z caller=tls_config.go:191 msg="TLS is disabled." http2=false

dev-demo1-netweaver03:~ # systemctl status prometheus-node_exporter.service
● prometheus-node_exporter.service - Prometheus exporter for machine metrics
   Loaded: loaded (/usr/lib/systemd/system/prometheus-node_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2021-10-20 11:50:20 UTC; 2h 22min ago
     Docs: https://github.com/prometheus/node_exporter
 Main PID: 3887 (node_exporter)
    Tasks: 7
   CGroup: /system.slice/prometheus-node_exporter.service
           └─3887 /usr/bin/node_exporter --collector.systemd --no-collector.mdadm
Oct 20 11:50:20 dev-demo1-netweaver03 node_exporter[3887]: level=info ts=2021-10-20T11:50:20.792Z caller=node_exporter.go:113 collector=thermal_zone
Oct 20 11:50:20 dev-demo1-netweaver03 node_exporter[3887]: level=info ts=2021-10-20T11:50:20.792Z caller=node_exporter.go:113 collector=time
Oct 20 11:50:20 dev-demo1-netweaver03 node_exporter[3887]: level=info ts=2021-10-20T11:50:20.792Z caller=node_exporter.go:113 collector=timex
Oct 20 11:50:20 dev-demo1-netweaver03 node_exporter[3887]: level=info ts=2021-10-20T11:50:20.792Z caller=node_exporter.go:113 collector=udp_queues
Oct 20 11:50:20 dev-demo1-netweaver03 node_exporter[3887]: level=info ts=2021-10-20T11:50:20.792Z caller=node_exporter.go:113 collector=uname
Oct 20 11:50:20 dev-demo1-netweaver03 node_exporter[3887]: level=info ts=2021-10-20T11:50:20.792Z caller=node_exporter.go:113 collector=vmstat
Oct 20 11:50:20 dev-demo1-netweaver03 node_exporter[3887]: level=info ts=2021-10-20T11:50:20.792Z caller=node_exporter.go:113 collector=xfs
Oct 20 11:50:20 dev-demo1-netweaver03 node_exporter[3887]: level=info ts=2021-10-20T11:50:20.792Z caller=node_exporter.go:113 collector=zfs
Oct 20 11:50:20 dev-demo1-netweaver03 node_exporter[3887]: level=info ts=2021-10-20T11:50:20.792Z caller=node_exporter.go:195 msg="Listening on" address=:9100
Oct 20 11:50:20 dev-demo1-netweaver03 node_exporter[3887]: level=info ts=2021-10-20T11:50:20.792Z caller=tls_config.go:191 msg="TLS is disabled." http2=false

dev-demo1-netweaver01:~ # ps -ef | grep exporter
root      1182  1073  0 14:11 pts/0    00:00:00 grep --color=auto exporter
prometh+  3763     1  2 11:51 ?        00:03:13 /usr/bin/node_exporter --collector.systemd --no-collector.mdadm
root     18659     1  0 12:09 ?        00:00:43 /usr/bin/ha_cluster_exporter
root     19279     1  0 12:09 ?        00:00:11 /usr/bin/sap_host_exporter --config /etc/sap_host_exporter/HA1_ASCS00.yaml

dev-demo1-netweaver03:~ # ps -ef | grep exporter
prometh+  3887     1  0 11:50 ?        00:00:00 /usr/bin/node_exporter --collector.systemd --no-collector.mdadm
root     24663     1  0 12:52 ?        00:00:00 /usr/bin/sap_host_exporter --config /etc/sap_host_exporter/HA1_PAS01.yaml
root     31498 31416  0 14:11 pts/0    00:00:00 grep --color=auto exporter
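
Because the journal lines above show node_exporter listening on :9100, the exporter can also be probed over HTTP from the monitoring host, which tests reachability rather than just the local service state. This is a minimal Python sketch, assuming the exporter serves plain HTTP at the usual /metrics path (the log reports "TLS is disabled.") and that the host name resolves from wherever the script runs. The helper names are hypothetical.

```python
import urllib.request


def metrics_url(host: str, port: int = 9100) -> str:
    """Build the metrics URL; port 9100 is taken from the log lines above."""
    return f"http://{host}:{port}/metrics"


def looks_like_node_metrics(body: str) -> bool:
    """node_exporter metric names start with 'node_'."""
    return any(line.startswith("node_") for line in body.splitlines())


def check_exporter(host: str, port: int = 9100, timeout: float = 5.0) -> bool:
    """Return True if the host answers on /metrics with node_ metrics."""
    try:
        with urllib.request.urlopen(metrics_url(host, port), timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False
    return looks_like_node_metrics(body)
```

Calling check_exporter("dev-demo1-netweaver03") from the monitoring host would then show whether that node's exporter is reachable over the network, which the systemctl and ps output alone cannot confirm.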

yeoldegrove commented 3 years ago

@ab-mohamed Which branch are you running? We had a few changes to the monitoring setup in develop lately.

ab-mohamed commented 3 years ago

@yeoldegrove the current ‘master’ branch.

yeoldegrove commented 3 years ago

@ab-mohamed Can you confirm that this is fixed in develop?

ab-mohamed commented 3 years ago

@yeoldegrove Is this the correct repo for the develop branch? ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:/ha-clustering:/sap-deployments:/devel/"

ab-mohamed commented 3 years ago

@yeoldegrove Using ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:/ha-clustering:/sap-deployments:/devel/", I can see that all NetWeaver nodes are up and running:

Can you please apply this fix to the master branch?

ab-mohamed commented 2 years ago

@yeoldegrove I can't see this fix in the master branch. I am still seeing the same behavior reported in https://github.com/SUSE/ha-sap-terraform-deployments/issues/781#issue-1031259562.
