ClusterLabs / ha_cluster_exporter

Prometheus exporter for Pacemaker based Linux HA clusters
Apache License 2.0

corosync parser error: could not parse node id in corosync-quorumtool output: could not find Node ID line #188

Open jamesyu558 opened 3 years ago

jamesyu558 commented 3 years ago

Hi Support,

The following corosync parser error about "Node ID" occurs on v1.2.0, so I upgraded ha_cluster_exporter from v1.2.0 to the latest version, v1.2.1, on my RHEL7 VM. Unfortunately, the error still occurs on v1.2.1.

The error message is below; note that the field the corosync parser complains about is "Node ID": msg="'corosync' collector scrape failed: corosync parser error: could not parse node id in corosync-quorumtool output: could not find Node ID line"

See below:


# corosync-quorumtool
Quorum information
------------------
Date:             Thu Apr  8 09:55:31 2021
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          2
Ring ID:          1/568
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate WaitForAll

Membership information
----------------------
    Nodeid      Votes Name
         1          1 XXXXXXXXXX
         2          1 XXXXXXXXXX (local)

Can you please help?

stefanotorresi commented 3 years ago

I'm not able to reproduce this issue: your example matches the regex we're using to parse quorumtool output. What's the output of ha_cluster_exporter --version?

jamesyu558 commented 3 years ago

Here it is:

cd /var/lib/pacemaker_exporter/

ls -l

total 18436
-rwxr-xr-x. 1 postgres postgres 9437184 Apr 6 08:37 ha_cluster_exporter-amd64

./ha_cluster_exporter-amd64 --version

version 1.2.1+git.1606912430.4fceb77 built with go1.15.5 linux/amd64 2020-12-02T17:30:26+00:00

jamesyu558 commented 3 years ago

If you have a debug module, I can install it and see exactly what is happening with this parser error. Please let me know if you need more information from me. I really appreciate your help!

jamesyu558 commented 3 years ago

In my environment, I have pacemaker installed as well, together with this Prometheus exporter for Grafana.

stefanotorresi commented 3 years ago

Nope, we don't have a debug module. I guess the best shot you have is to download the sources and run it with a step debugger to inspect what input is being actually fed to the regex here: https://github.com/ClusterLabs/ha_cluster_exporter/blob/4fceb77b3a195bbce12f54e23569a66e20f50bc3/collector/corosync/parser.go#L85-L93

Btw, what corosync version are you using?

jamesyu558 commented 3 years ago

hold on let me check

jamesyu558 commented 3 years ago

corosync -v

Corosync Cluster Engine, version '2.4.3' Copyright (c) 2006-2009 Red Hat, Inc.

jamesyu558 commented 3 years ago

How exactly do I debug this on RHEL7? Do you have specific steps to set it up?

jamesyu558 commented 3 years ago

Or modify the source code to print out the variable "quorumToolOutput" from "parseNodeId" when it gets called?

stefanotorresi commented 3 years ago

You could clone the project and then use https://github.com/go-delve/delve to debug it, but that assumes some familiarity with the Go language and toolkit!

jamesyu558 commented 3 years ago

Thanks, I can figure this out. I'll let you know soon what value of "quorumToolOutput" is passed to this function. Thank you again.

stefanotorresi commented 3 years ago

Or modify the source code to print out the variable "quorumToolOutput" from "parseNodeId" when it gets called?

yes, you could also do that by adding

log.Debug(string(quorumToolOutput)) 

after line 85
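
For anyone trying this, a minimal sketch of a slightly more revealing debug line: formatting the bytes with %q and printing their length makes an empty or whitespace-only output obvious in the log. The helper name describeOutput is hypothetical and not part of ha_cluster_exporter; only the parameter name quorumToolOutput comes from the thread.

```go
package main

import "fmt"

// describeOutput is a hypothetical helper, not part of ha_cluster_exporter.
// %q quotes the bytes, so empty input and invisible characters such as
// trailing newlines or tabs show up explicitly in the log line.
func describeOutput(quorumToolOutput []byte) string {
	return fmt.Sprintf("len=%d content=%q", len(quorumToolOutput), quorumToolOutput)
}

func main() {
	// A normal line from corosync-quorumtool.
	fmt.Println(describeOutput([]byte("Node ID:          2\n")))
	// What the parser would receive if the command printed nothing to stdout.
	fmt.Println(describeOutput(nil))
}
```

The resulting string could be passed to whatever logger the code already uses, for example in place of string(quorumToolOutput) in the log.Debug call above.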

jamesyu558 commented 3 years ago

even better...thx

jamesyu558 commented 3 years ago

Will get back to you tomorrow morning this time....

jamesyu558 commented 3 years ago

We modified that function like this:

func parseNodeId(quorumToolOutput []byte) (string, error) {
	nodeRe := regexp.MustCompile(`(?m)Node ID:\s+(\w+)`)
	matches := nodeRe.FindSubmatch(quorumToolOutput)
	var x = string(quorumToolOutput)
	if matches == nil {
		return "", errors.New("could NOT find Node ID line :" + x)
	}
	return string(matches[1]), nil
}

Then in the log, we see this: could not parse node id in corosync-quorumtool output: could NOT find Node ID line :"

Notice that we changed "not" to "NOT" on purpose, to check that the running binary picks up our changes. It looks like the x variable is empty.

Any more ideas?

borisjacquot commented 1 year ago

Hello, is there any update about this issue?

stefanotorresi commented 1 year ago

I need an example output from corosync-quorumtool to reproduce the issue. That is, an output that doesn't correctly match the (?m)Node ID:\s+(\w+) regular expression. You can verify that yourself at https://regex101.com/r/riyToT/1. As you can see, the example provided by OP matches correctly, so I don't know what's up there.

Until I get an actual example, there is not much I can do.
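
The same check can be done locally in Go instead of regex101. This is a minimal sketch that reuses the regular expression quoted above and the first lines of the OP's corosync-quorumtool output; the function body is modelled on the parseNodeId snippet posted earlier in the thread and may differ from the current source. It also shows that an empty input produces exactly the "could not find Node ID line" error.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// The pattern quoted in the thread.
var nodeRe = regexp.MustCompile(`(?m)Node ID:\s+(\w+)`)

// parseNodeId mirrors the function posted earlier in the thread;
// treat it as an approximation of the exporter's code, not a copy.
func parseNodeId(quorumToolOutput []byte) (string, error) {
	matches := nodeRe.FindSubmatch(quorumToolOutput)
	if matches == nil {
		return "", errors.New("could not find Node ID line")
	}
	return string(matches[1]), nil
}

// First section of the output posted by the OP.
const sample = `Quorum information
------------------
Date:             Thu Apr  8 09:55:31 2021
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          2
Ring ID:          1/568
Quorate:          Yes
`

func main() {
	id, err := parseNodeId([]byte(sample))
	fmt.Printf("sample: id=%q err=%v\n", id, err)

	// Empty input, as when the command writes nothing to stdout.
	id, err = parseNodeId(nil)
	fmt.Printf("empty:  id=%q err=%v\n", id, err)
}
```

If the sample parses but the exporter still logs the error, the input reaching the parser is probably not what the command prints in an interactive shell.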

adr1enb commented 1 year ago

Hello @stefanotorresi, I have the same issue. Here is the output:

ha_cluster_exporter time="2023-05-02T18:37:54Z" level=warning msg="Corosync Collector scrape failed: could not parse ring id and seq number in corosync-quorumtool output: could not find Ring ID line"

Quorum information
------------------
Date:             Tue May  2 18:38:03 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          2
Ring ID:          2.4a46b
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1  
Flags:            2Node Quorate LastManStanding 

Membership information
----------------------
    Nodeid      Votes Name
         2          1 lb-int01.xxx.yyy.zzz (local)
         3          1 lb-int02.xxx.yyy.zzz

Issue on Debian 11

stefanotorresi commented 1 year ago

hmm, ok, that does match the regex, so it's not helping me either: https://regex101.com/r/JuhDCK/1

stefanotorresi commented 1 year ago

oh, by the way, please always report the versions of the exporter and corosync you're using.

adr1enb commented 1 year ago

Here it is :

corosync 3.1.2-2
ha_cluster_exporter-1.0.1

I've just updated to 1.3.2, it seems fixed :thinking:

Frazew commented 1 year ago

tl;dr: in case it helps anyone, make sure you test running corosync-quorumtool as the same user your ha_cluster_exporter process runs under, and that it actually works under that user.


$ ./ha-cluster-exporter --version
ha_cluster_exporter, version 1.3.3+git.1683650163.1000ba6 (branch: HEAD, revision: 1000ba696a5ef85737f70808a12e5a01bee5c281)
  build user:       runner@fv-az1100-952
  build date:       20230529-08:55:18
  go version:       go1.20.4
  platform:         linux/amd64
  tags:             netgo
$ corosync-quorumtool
Cannot initialize CMAP service

In this case (unprivileged user), and I guess in other cases, corosync-quorumtool exits with exit code 1, which is ignored as per this comment. stdout is empty, hence the failure to find a node ID, and stderr contains that error. The fix here was to make sure the user has the proper permissions so that corosync-quorumtool does not fail.

I guess a possible improvement would be ignoring the return code as is currently done but also failing when stdout is empty and stderr is not, since that might indicate failure of the command itself?
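
A rough sketch of what that check could look like, assuming the exporter runs the tool through os/exec. The names runQuorumTool and checkOutput are hypothetical and this is not the exporter's actual code; it only illustrates the idea of still ignoring a non-zero exit code while treating "empty stdout, non-empty stderr" as a failure.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os/exec"
)

// checkOutput (hypothetical) reports an error when the command wrote
// nothing to stdout but did write something to stderr.
func checkOutput(stdout, stderr []byte) error {
	if len(bytes.TrimSpace(stdout)) == 0 && len(bytes.TrimSpace(stderr)) > 0 {
		return fmt.Errorf("command produced no output: %s", bytes.TrimSpace(stderr))
	}
	return nil
}

// runQuorumTool (hypothetical) runs the binary at path and captures both streams.
func runQuorumTool(path string) ([]byte, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.Command(path)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	err := cmd.Run()
	var exitErr *exec.ExitError
	if err != nil && !errors.As(err, &exitErr) {
		// The command could not be started at all (e.g. binary not found).
		return nil, err
	}
	// A non-zero exit code alone is still ignored, as described above.
	if err := checkOutput(stdout.Bytes(), stderr.Bytes()); err != nil {
		return nil, err
	}
	return stdout.Bytes(), nil
}

func main() {
	out, err := runQuorumTool("corosync-quorumtool")
	if err != nil {
		fmt.Println("scrape would fail early:", err)
		return
	}
	fmt.Printf("got %d bytes of output\n", len(out))
}
```

With a check like this, the unprivileged-user case would surface "Cannot initialize CMAP service" in the error instead of a misleading parser message.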

stefanotorresi commented 9 months ago

failing when stdout is empty and stderr is not, since that might indicate failure of the command itself

That's a good suggestion! We'll look into implementing this tweak in the next iteration.