ClusterLabs / crmsh

Command-line interface for High-Availability cluster management on GNU/Linux systems.
GNU General Public License v2.0
131 stars 94 forks source link

history: events not detected #169

Open vvidic opened 7 years ago

vvidic commented 7 years ago

Commands like history events and history transitions do not provide expected info with crmsh 2.3.2 and pacemaker 1.1.15.

It seems the events.txt files created by hb_report are empty so perhaps the event patterns need to be updated?

Also most of the logging now goes into pacemaker.log so maybe this needs to be checked instead of ha-log.txt?

krig commented 7 years ago

Can you provide a bit more detail as to what is not working?

For events.txt, crmsh doesn't actually read that file but recreates the same information based on the logfiles (like ha-log.txt etc.) so if there are problems with collecting those logs, that might be the root cause why there isn't the expected output from events or transitions.

What distribution is it you are on?

vvidic commented 7 years ago

Sure, here is a test.zip example created on a test Debian/unstable cluster with:

/usr/share/crmsh/hb_report/hb_report -f 09:00 test

Some resources were moved and node put into standby but still no events or transitions get reported.

One thing different on Debian syslog is that /var/log/messages only contains kernel info while /var/log/syslog collects info from all services.

krig commented 7 years ago

Aah, that is probably the main reason why this isn't working great on debian. The best solution is probably to detect if it's a Debian system and to use /var/log/syslog instead of /var/log/messages as the source for transition messages there.

dmuhamedagic commented 7 years ago

On Mon, Nov 28, 2016 at 02:06:06AM -0800, Kristoffer Grönlund wrote:

Aah, that is probably the main reason why this isn't working great on debian. The best solution is probably to detect if it's a Debian system and to use /var/log/syslog instead of /var/log/messages as the source for transition messages there.

report doesn't look at any particular syslog file by default, but logs a test message and then finds the file where it went to.

vvidic commented 7 years ago

Yes, the problem with log detection seems to be the grep exits with 2 probably due to subdirs.

# grep -l  -e Mark:HB_REPORT:1480335149 /var/log/*; echo $?
grep: /var/log/apache2: Is a directory
grep: /var/log/apt: Is a directory
grep: /var/log/corosync: Is a directory
/var/log/daemon.log
grep: /var/log/fsck: Is a directory
grep: /var/log/ntpstats: Is a directory
/var/log/syslog
2
vvidic commented 7 years ago

Another problem seems to be that the logger is called after the HA logs are checked:

# /usr/share/crmsh/hb_report/hb_report -v -f 09:00 test2         
crm2: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm2: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm2: DEBUG: this is OpenAIS cluster stack
crm2: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm2: DEBUG: log settings: facility=daemon logfile= debugfile=
crm2: DEBUG: querying CRM for nodes
crm2: DEBUG: nodes: crm1 crm2
crm2: DEBUG: ssh crm1 OK
crm2: DEBUG: journalctl from: '1480323600' until: '0' from_time '2016-11-28 09:00' to_time: '2016-11-28 13:39' > /tmp/.hb_report.workdir.7QNfa1/test2/journal.log
crm2: DEBUG: no HA log found in /var/log /var/logs /var/syslog /var/adm
        /var/log/ha /var/log/cluster /var/log/pacemaker
        /var/log/heartbeat /var/log/crm /var/log/corosync /var/log/openais
crm2: DEBUG: the log file is in the syslog format
crm2: DEBUG: found log /var/log/messages
crm2: DEBUG: including segment [2268-] from /var/log/messages
crm2: DEBUG: running:  hb_report __slave
crm2: DEBUG: running: ssh -o StrictHostKeyChecking=no -o EscapeChar=none -o ConnectTimeout=15 crm1 " /usr/share/crmsh/hb_report/hb_report __slave"
crm2: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm2: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm2: DEBUG: this is OpenAIS cluster stack
crm2: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm2: DEBUG: log settings: facility=daemon logfile= debugfile=
crm2: DEBUG: run: logger -p daemon.info Mark:HB_REPORT:1480340342
crm1: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm1: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm1: DEBUG: this is OpenAIS cluster stack
crm1: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm1: DEBUG: log settings: facility=daemon logfile= debugfile=
crm1: DEBUG: run: logger -p daemon.info Mark:HB_REPORT:1480340342
crm2: DEBUG: journalctl from: '1480323600' until: '0' from_time '2016-11-28 09:00' to_time: '2016-11-28 13:39' > /tmp/.hb_report.workdir.Vjygi7//crm2/journal.log
crm2: DEBUG: the log file is in the syslog format
crm2: DEBUG: found log /var/log/messages
crm1: DEBUG: journalctl from: '1480323600' until: '0' from_time '2016-11-28 09:00' to_time: '2016-11-28 13:39' > /tmp/.hb_report.workdir.SG02tv//crm1/journal.log
crm1: DEBUG: the log file is in the syslog format
crm1: DEBUG: found log /var/log/messages
crm2: DEBUG: including segment [2268-] from /var/log/messages
crm1: DEBUG: including segment [2272-] from /var/log/messages
crm2: DEBUG: looking for PE files in /var/lib/pacemaker/pengine
crm2: DEBUG: found 6 pengine input files in /var/lib/pacemaker/pengine
crm1: DEBUG: looking for PE files in /var/lib/pacemaker/pengine
crm2: DEBUG: the log file is in the syslog format
crm2: DEBUG: found log /var/log/pacemaker.log
crm1: DEBUG: the log file is in the syslog format
crm1: DEBUG: found log /var/log/pacemaker.log
crm2: DEBUG: including segment [4652-] from /var/log/pacemaker.log
crm1: DEBUG: including segment [7182-] from /var/log/pacemaker.log
The report is saved in ./test2.tar.bz2
Report timespan: 11/28/16 09:00:00 - 01/01/70 00:00:00

Thank you for taking time to create this report.
vvidic commented 7 years ago

Adding more verbosity fails if dash is used as /bin/sh:

# /usr/share/crmsh/hb_report/hb_report -v -v -f 09:00 test2
crm2: INFO: high debug level, please read debug.out
/usr/share/crmsh/hb_report/hb_report: 1285: /usr/share/crmsh/hb_report/hb_report: Bad substitution
vvidic commented 7 years ago

Seems that changing the condition from is_node to is_collector in hb_report helps with logfile detection:

#
# part 4: find the logs and cut out the segment for the period
#

is_collector && getlog

Can you confirm what is the correct condition here?

krig commented 7 years ago

Hrm. No, that should really be is_node. There seems to be a host of dash incompatibilities at play. Could you try just changing the shebang to #!/bin/bash and see if that helps at all? Then at least it would be possible to focus on the log collection issues without having to fix dash incompatibility as well.

vvidic commented 7 years ago

Changing to bash does not seem to make a difference. It seems to me logger is only run for __slave calls but log detection is run before that in the master process so it will always fail?

# ./hb_report -f 09:00 -v test
crm2: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm2: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm2: DEBUG: this is OpenAIS cluster stack
crm2: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm2: DEBUG: log settings: facility=daemon logfile= debugfile=
crm2: DEBUG: querying CRM for nodes
crm2: DEBUG: nodes: crm1 crm2
crm2: DEBUG: ssh crm1 OK
crm2: DEBUG: journalctl from: '1480410000' until: '0' from_time '2016-11-29 09:00' to_time: '2016-11-29 09:01' > /tmp/.hb_report.workdir.oHEpP5/test/journal.log
crm2: DEBUG: no HA log found in /var/log /var/logs /var/syslog /var/adm
        /var/log/ha /var/log/cluster /var/log/pacemaker
        /var/log/heartbeat /var/log/crm /var/log/corosync /var/log/openais
crm2: DEBUG: the log file is in the syslog format
crm2: DEBUG: running:  hb_report __slave
crm2: DEBUG: running: ssh -o StrictHostKeyChecking=no -o EscapeChar=none -o ConnectTimeout=15 crm1 " /usr/share/crmsh/hb_report/hb_report __slave"
crm2: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm2: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm2: DEBUG: this is OpenAIS cluster stack
crm2: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm2: DEBUG: log settings: facility=daemon logfile= debugfile=
crm2: DEBUG: run: logger -p daemon.info Mark:HB_REPORT:1480410082
crm1: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm1: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm1: DEBUG: this is OpenAIS cluster stack
crm1: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm1: DEBUG: log settings: facility=daemon logfile= debugfile=
crm1: DEBUG: run: logger -p daemon.info Mark:HB_REPORT:1480410082
crm2: DEBUG: journalctl from: '1480410000' until: '0' from_time '2016-11-29 09:00' to_time: '2016-11-29 09:01' > /tmp/.hb_report.workdir.a9hTwR//crm2/journal.log
crm2: DEBUG: the log file is in the syslog format
crm1: DEBUG: journalctl from: '1480410000' until: '0' from_time '2016-11-29 09:00' to_time: '2016-11-29 09:01' > /tmp/.hb_report.workdir.4Qrhs1//crm1/journal.log
crm1: DEBUG: the log file is in the syslog format
crm2: DEBUG: looking for PE files in /var/lib/pacemaker/pengine
crm1: DEBUG: looking for PE files in /var/lib/pacemaker/pengine
crm2: DEBUG: the log file is in the syslog format
crm2: DEBUG: found log /var/log/pacemaker.log
crm1: WARN: could not figure out the log format of /var/log/pacemaker.log
crm2: DEBUG: including segment [171-] from /var/log/pacemaker.log
The report is saved in ./test.tar.bz2
Report timespan: 11/29/16 09:00:00 - 01/01/70 00:00:00

Thank you for taking time to create this report.
dmuhamedagic commented 7 years ago

On Mon, Nov 28, 2016 at 05:43:33AM -0800, Valentin Vidić wrote:

Adding more verbosity fails if dash is used as /bin/sh:

# /usr/share/crmsh/hb_report/hb_report -v -v -f 09:00 test2
crm2: INFO: high debug level, please read debug.out
/usr/share/crmsh/hb_report/hb_report: 1285: /usr/share/crmsh/hb_report/hb_report: Bad substitution

Apparently because of array use:

1284     if echo "$SHELL" | grep bash > /dev/null &&
1285             [ ${BASH_VERSINFO[0]} = "4" ]; then
1286         exec 3>>$WORKDIR/debug.out
1287         BASH_XTRACEFD=3
1288     else
1289         exec 2>>$WORKDIR/debug.out
1290     fi

Perhaps something like this would help:

is_bash_v4() {
    local v
    v=$(echo $BASH_VERSION | sed 's/[.].*//')
    [ $(expr "$v" : [0-9]) = 1 ] && $v -ge 4
}

if echo "$SHELL" | grep bash > /dev/null &&
        is_bash_v4; then
...

I'll make a pull request for that.

dmuhamedagic commented 7 years ago

On Tue, Nov 29, 2016 at 12:49:57AM -0800, Kristoffer Grönlund wrote:

Hrm. No, that should really be is_node. There seems to be a host of dash incompatibilities at play. Could you try just changing the shebang to #!/bin/bash and see if that helps at all? Then at least it would be possible to focus on the log collection issues without having to fix dash incompatibility as well.

I doubt that there are any more dash vs bash issues.

dmuhamedagic commented 7 years ago

On Tue, Nov 29, 2016 at 01:07:16AM -0800, Valentin Vidić wrote:

Changing to bash does not seem to make a difference. It seems to me logger is only run for __slave calls but log detection is run before that in the master process so it will always fail?

The test report you uploaded seems to contain all information hb_report could have collected. /var/log/pacemaker.log has the messages at the info severity (the new debug level in pacemaker). There are also transitions recorded. Perhaps the output from "crm -d history" could shed some light on the matter.

P.S.1 events.txt in the report is empty because hb_report searches only one of the logs and it prefers the one where it finds its Mark message. This is because in the past there were sometimes duplicate messages due to syslog logging to several logs. Since all interesting logging got moved to pacemaker.log which is not created via syslog, but by simply writing to a file, this log file is not considered.

P.S.2 SUSE has a couple of patches for pacemaker which increase the severity of some important log messages from "info" to "notice". Valentin, perhaps you could consider applying them in Debian. That should help a bit too.

vvidic commented 7 years ago

Thanks for checking the report I sent. Here is what I get by default from crm -d history:

crm(live)history# events
DEBUG: setting source to live
DEBUG: setting report times: <2016-11-30 12:45:01> - <None>
DEBUG: Found description.txt, created 1480513493.0
DEBUG: history: setting year to 2016 (ts: 1480513493.0)
DEBUG: found pacemaker version: 1.1.15-e174ec8 in cib: /var/cache/crm/history-root/live/crm2/cib.xml
DEBUG: parsing /var/cache/crm/history-root/live/crm1/messages
DEBUG: parsing /var/cache/crm/history-root/live/crm2/messages
DEBUG: Transition metadata saved to /var/cache/crm/history-root/live/__meta.json
INFO: Fetching updated logs from cluster nodes. Please wait...
DEBUG: Candidate logs: ['/var/cache/crm/history-root/live/crm1/messages', '/var/cache/crm/history-root/live/crm2/messages']
INFO: No updatable logs found (missing .info for logs)
INFO: Retrieving information from cluster nodes, please wait...
DEBUG: tarball: live.tar.bz2, in dir: /var/cache/crm/history-root
DEBUG: tar -xj < live.tar.bz2
DEBUG: Found description.txt, created 1480513510.0
DEBUG: history: setting year to 2016 (ts: 1480513510.0)
DEBUG: piping string to /usr/bin/less -R

So no events are found as expected in messages log since it only containts kernel lines. Same goes for transitions:

crm(live)history# transitions 
Time                            Name            Node            Tags

However if I change the log file list in history.py to consider pacemaker.log before messages it looks better:

crm(live)history# events
DEBUG: setting source to live
DEBUG: setting report times: <2016-11-30 12:48:51> - <None>
DEBUG: Found description.txt, created 1480513678.0
DEBUG: history: setting year to 2016 (ts: 1480513678.0)
DEBUG: found pacemaker version: 1.1.15-e174ec8 in cib: /var/cache/crm/history-root/live/crm2/cib.xml
DEBUG: parsing /var/cache/crm/history-root/live/crm1/pacemaker.log
DEBUG: {Transition: crm1:pe-input-77
DEBUG: +Event resource: lb
DEBUG: +Event resource: lb
DEBUG: +Event resource: ip
DEBUG: +Event resource: ip
DEBUG: +Event resource: ip
DEBUG: }Transition: crm1:pe-input-77 Complete
DEBUG: parsing /var/cache/crm/history-root/live/crm2/pacemaker.log
DEBUG: +Event node: crm1
DEBUG: +Event node: crm1
DEBUG: Transition metadata saved to /var/cache/crm/history-root/live/__meta.json
INFO: Fetching updated logs from cluster nodes. Please wait...
DEBUG: Candidate logs: ['/var/cache/crm/history-root/live/crm1/pacemaker.log', '/var/cache/crm/history-root/live/crm2/pacemaker.log']
DEBUG: Updating crm1 : /var/cache/crm/history-root/live/crm1/pacemaker.log : /var/log/pacemaker.log : 90900
DEBUG: Updating crm2 : /var/cache/crm/history-root/live/crm2/pacemaker.log : /var/log/pacemaker.log : 157462
DEBUG: updating /var/log/pacemaker.log from crm1 (pos 90900)
DEBUG: updating /var/log/pacemaker.log from crm2 (pos 157462)
DEBUG: parsing /var/cache/crm/history-root/live/crm1/pacemaker.log
DEBUG: {Transition: crm1:pe-input-77
DEBUG: +Event resource: lb
DEBUG: +Event resource: lb
DEBUG: +Event resource: ip
DEBUG: +Event resource: ip
DEBUG: +Event resource: ip
DEBUG: }Transition: crm1:pe-input-77 Complete
DEBUG: parsing /var/cache/crm/history-root/live/crm2/pacemaker.log
DEBUG: +Event node: crm1
DEBUG: +Event node: crm1
DEBUG: Transition metadata saved to /var/cache/crm/history-root/live/__meta.json
DEBUG: piping string to /usr/bin/less -R
Nov 30 13:35:29 [504] crm2       crmd:     info: election_count_vote:   Election 1 (owner: 1) lost: vote from crm1 (Uptime)
Nov 30 13:35:29 [504] crm2       crmd:     info: election_count_vote:   Election 2 (owner: 1) lost: vote from crm1 (Uptime)
Nov 30 13:35:29 [474] crm1       crmd:   notice: te_rsc_command:        Initiating start operation lb_start_0 locally on crm1 | action 6
Nov 30 13:35:29 [471] crm1       lrmd:     info: log_execute:   executing - rsc:lb action:start call_id:10
Nov 30 13:35:31 [474] crm1       crmd:   notice: te_rsc_command:        Initiating start operation ip_start_0 locally on crm1 | action 8
Nov 30 13:35:31 [471] crm1       lrmd:     info: log_execute:   executing - rsc:ip action:start call_id:12
Nov 30 13:35:31 [471] crm1       lrmd:     info: log_finished:  finished - rsc:ip action:start call_id:12 pid:620 exit-code:0 exec-time:22ms queue-time:0ms

One transition is also detected now:

crm(live)history# transitions
Time                            Name            Node            Tags
2016-11-30 13:35:29 - 13:35:31: pe-input-77     crm1            ip lb
dmuhamedagic commented 7 years ago

On Wed, Nov 30, 2016 at 06:01:21AM -0800, Valentin Vidić wrote:

DEBUG: Candidate logs: ['/var/cache/crm/history-root/live/crm1/messages', '/var/cache/crm/history-root/live/crm2/messages']

Hmpf. Perhaps move "pacemaker.log" to the front here:

21 _LOG_FILES = ("messages", "ha-log.txt", "ha-log", "cluster-log.txt", ...

Still, that'd be a kludge and would probably break for some installations.

BTW, "messages" got to be preferred because of bsc#998891 in 99f2056f655850bed60ab9136c4a2b805afd9442. I don't know what where the circumstances in that bug report, but wonder if that's a good decision, since "ha-log.txt" contains the log which got actually picked by hb_report. Normally, it should even be coming from "messages" on the suse platforms.

Alternatively, could we make history use all logs per node. Or have some kind of heuristics to pick the best log.

vvidic commented 7 years ago

Yep, until there is a more general solution I patched the Debian package to prefer pacemaker.log and syslog as can be seen here:

https://anonscm.debian.org/cgit/debian-ha/crmsh.git/tree/debian/patches/0012-Fix-crm-history.patch

Using multiple log files would be best but I guess line duplicates and line ordering are a problem there?