ClusterLabs / ha_cluster_exporter

Prometheus exporter for Pacemaker based Linux HA clusters
Apache License 2.0

Corosync/Pacemaker metrics issue on one node in cluster #245

Open ivicavujovic opened 6 months ago

ivicavujovic commented 6 months ago

Hi, we have several three-node clusters running Corosync/Pacemaker. ha_cluster_exporter is installed on all of them and works fine, except on one node in one cluster, where opening the metrics URL returns this error:

An error has occurred while serving metrics:

collected metric "ha_cluster_corosync_member_votes" { label:<name:"local" value:"false" > label:<name:"node" value:"NR" > label:<name:"node_id" value:"32566" > gauge:<value:3 > } was collected before with the same name and label values

I checked all nodes in the cluster, and all of them have different IDs:

ha_cluster_corosync_member_votes{local="false",node="xxxx",node_id="2"} 1
ha_cluster_corosync_member_votes{local="false",node="NR",node_id="32636"} 3
ha_cluster_corosync_member_votes{local="true",node="node1.infra.env",node_id="1"} 1
ha_cluster_corosync_member_votes{local="false",node="xxxx",node_id="1"} 1
ha_cluster_corosync_member_votes{local="false",node="xxxx",node_id="3"} 1
ha_cluster_corosync_member_votes{local="false",node="NR",node_id="32652"} 2
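
For anyone who wants to check a saved metrics dump for the condition the error describes (two samples with the same name and the same label values), here is a minimal sketch. It assumes the text is in the Prometheus exposition format shown above; `find_duplicate_series` is a hypothetical helper name and not part of the exporter.

```python
import re
from collections import Counter

# One sample line: metric name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# One label pair inside the braces, e.g. node_id="2".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def find_duplicate_series(text):
    """Return the (name, labels) keys that occur more than once in `text`.

    Hypothetical helper: `labels` is a sorted tuple of (label, value) pairs,
    so the order in which labels are written does not matter.
    """
    seen = Counter()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, _value = match.groups()
        labels = tuple(sorted(LABEL_RE.findall(raw_labels or "")))
        seen[(name, labels)] += 1
    return [key for key, count in seen.items() if count > 1]
```

Run against the six lines above, this returns an empty list, because each line differs in at least one label value. A duplicate only shows up when two lines agree on `local`, `node`, and `node_id` at the same time.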

Logs are very similar on all of the nodes:

level=info msg="Starting ha_cluster_exporter (version=1.3.0+git.1653405719.2a65dfc, branch=HEAD, revision=2a65dfc015e614e53f34effbd0847cc20317b952)"
level=info msg="Build context (go=go1.16.15, user=runner@fv-az341-182, date=20220524-15:44:13)"
level=warn msg="Reading config file failed" err="Config File \"ha_cluster_exporter\" Not Found in \"[/ /root/.config /etc /usr/etc]\""
level=info msg="Default config values will be used"
level=warn msg="Registration failure" err="could not initialize 'sbd' collector: '/usr/sbin/sbd' does not exist"
level=warn msg="Registration failure" err="could not initialize 'drbd' collector: '/sbin/drbdsetup' does not exist"
level=info msg="pacemaker collector registered."
level=info msg="corosync collector registered."
level=info msg="Serving metrics on :9664/metrics"
level=warn msg="Reading web config file failed" err="stat /etc/ha_cluster_exporter.web.yaml: no such file or directory"
level=info msg="Default web config or commandline values will be used"
level=info msg="TLS is disabled." http2=false

All nodes have the same configuration (OS, HDD, RAM, CPU) and are built and provisioned using Puppet configuration management.

The service file is very simple:

[Unit]
Description=Prometheus ha_cluster_exporter
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root

ExecStart=/usr/local/bin/ha_cluster_exporter

ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=always

[Install]
WantedBy=multi-user.target

I cannot figure out what the issue is here. Did we misconfigure something, or did we miss a step? Nothing special is set; we just install the exporter and run it.

The OS is Debian 11 and the exporter version is 1.3.3 (the same issue occurs with older versions too).

maomaoaichirou commented 6 months ago

This is a bug.

stefanotorresi commented 6 months ago

Thanks for your bug report. This is definitely not supposed to happen. Could you please post the output of corosync-quorumtool -p from each node?

ivicavujovic commented 6 months ago

Yes, here is the output from all three nodes:

root@node1:~# corosync-quorumtool -p
Quorum information
------------------
Date:             Tue Mar  5 14:11:15 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          1
Ring ID:          1.149
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR node1.infra.env (local)
         2          1         NR XXXX:YYYY:ZZZZ:QQQQ::62%32695
         3          1         NR XXXX:YYYY:ZZZZ:QQQQ::63%32695
root@node2:~# corosync-quorumtool -p
Quorum information
------------------
Date:             Tue Mar  5 14:11:56 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          2
Ring ID:          1.149
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR XXXX:YYYY:ZZZZ:QQQQ::61%32620
         2          1         NR node2.infra.env (local)
         3          1         NR XXXX:YYYY:ZZZZ:QQQQ::63%32620
root@node3:~# corosync-quorumtool -p
Quorum information
------------------
Date:             Tue Mar  5 14:12:31 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          3
Ring ID:          1.149
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR XXXX:YYYY:ZZZZ:QQQQ::61%32728
         2          1         NR XXXX:YYYY:ZZZZ:QQQQ::62%32728
         3          1         NR node3.infra.env (local)
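
To make the layout of the membership table above easier to reason about, here is a small sketch that reads it one row at a time. It assumes the four-column layout shown in this output (Nodeid, Votes, Qdevice, Name), with an optional "(local)" marker after the name. `parse_members` is a hypothetical helper written for illustration; the exporter's own parsing code is not shown in this thread and may work differently.

```python
import re

# One membership row: node id, votes, Qdevice status, node name, and an
# optional "(local)" marker. The name is matched with \S+ so an IPv6
# address with a "%scope" suffix stays in a single field.
ROW_RE = re.compile(r'^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)(\s+\(local\))?\s*$')


def parse_members(output):
    """Return one dict per row of the "Membership information" table.

    Hypothetical helper: assumes the header line starts with "Nodeid" and
    that the Qdevice column is present, as in the output above.
    """
    members = []
    in_table = False
    for line in output.splitlines():
        if line.strip().startswith("Nodeid"):
            in_table = True  # rows follow the header line
            continue
        if not in_table:
            continue
        match = ROW_RE.match(line)
        if match is None:
            continue  # shell prompts, blank lines, other sections
        node_id, votes, qdevice, name, local = match.groups()
        members.append({
            "node_id": node_id,
            "votes": int(votes),
            "qdevice": qdevice,
            "name": name,
            "local": local is not None,
        })
    return members
```

Each row is matched on its own, so a value such as "NR" or the number after "%" can only ever end up in the field of the row it belongs to.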

stefanotorresi commented 5 months ago

Thanks, I will look into it sometime over the coming weeks and let you know.

ivicavujovic commented 5 months ago

Thanks a lot for the effort.