Open vvidic opened 7 years ago
Can you provide a bit more detail as to what is not working?
For events.txt
, crmsh doesn't actually read that file but recreates the same information based on the logfiles (like ha-log.txt etc.) so if there are problems with collecting those logs, that might be the root cause why there isn't the expected output from events
or transitions
.
What distribution is it you are on?
Sure, here is a test.zip example created on a test Debian/unstable cluster with:
/usr/share/crmsh/hb_report/hb_report -f 09:00 test
Some resources were moved and node put into standby but still no events or transitions get reported.
One thing different on Debian syslog is that /var/log/messages
only contains kernel info while /var/log/syslog
collects info from all services.
Aah, that is probably the main reason why this isn't working great on debian. The best solution is probably to detect if it's a Debian system and to use /var/log/syslog instead of /var/log/messages as the source for transition messages there.
On Mon, Nov 28, 2016 at 02:06:06AM -0800, Kristoffer Grönlund wrote:
Aah, that is probably the main reason why this isn't working great on debian. The best solution is probably to detect if it's a Debian system and to use /var/log/syslog instead of /var/log/messages as the source for transition messages there.
report doesn't look at any particular syslog file by default, but logs a test message and then finds the file where it went to.
Yes, the problem with log detection seems to be the grep exits with 2 probably due to subdirs.
# grep -l -e Mark:HB_REPORT:1480335149 /var/log/*; echo $?
grep: /var/log/apache2: Is a directory
grep: /var/log/apt: Is a directory
grep: /var/log/corosync: Is a directory
/var/log/daemon.log
grep: /var/log/fsck: Is a directory
grep: /var/log/ntpstats: Is a directory
/var/log/syslog
2
Another problem seems to be that the logger is called after the HA logs are checked:
# /usr/share/crmsh/hb_report/hb_report -v -f 09:00 test2
crm2: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm2: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm2: DEBUG: this is OpenAIS cluster stack
crm2: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm2: DEBUG: log settings: facility=daemon logfile= debugfile=
crm2: DEBUG: querying CRM for nodes
crm2: DEBUG: nodes: crm1 crm2
crm2: DEBUG: ssh crm1 OK
crm2: DEBUG: journalctl from: '1480323600' until: '0' from_time '2016-11-28 09:00' to_time: '2016-11-28 13:39' > /tmp/.hb_report.workdir.7QNfa1/test2/journal.log
crm2: DEBUG: no HA log found in /var/log /var/logs /var/syslog /var/adm
/var/log/ha /var/log/cluster /var/log/pacemaker
/var/log/heartbeat /var/log/crm /var/log/corosync /var/log/openais
crm2: DEBUG: the log file is in the syslog format
crm2: DEBUG: found log /var/log/messages
crm2: DEBUG: including segment [2268-] from /var/log/messages
crm2: DEBUG: running: hb_report __slave
crm2: DEBUG: running: ssh -o StrictHostKeyChecking=no -o EscapeChar=none -o ConnectTimeout=15 crm1 " /usr/share/crmsh/hb_report/hb_report __slave"
crm2: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm2: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm2: DEBUG: this is OpenAIS cluster stack
crm2: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm2: DEBUG: log settings: facility=daemon logfile= debugfile=
crm2: DEBUG: run: logger -p daemon.info Mark:HB_REPORT:1480340342
crm1: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm1: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm1: DEBUG: this is OpenAIS cluster stack
crm1: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm1: DEBUG: log settings: facility=daemon logfile= debugfile=
crm1: DEBUG: run: logger -p daemon.info Mark:HB_REPORT:1480340342
crm2: DEBUG: journalctl from: '1480323600' until: '0' from_time '2016-11-28 09:00' to_time: '2016-11-28 13:39' > /tmp/.hb_report.workdir.Vjygi7//crm2/journal.log
crm2: DEBUG: the log file is in the syslog format
crm2: DEBUG: found log /var/log/messages
crm1: DEBUG: journalctl from: '1480323600' until: '0' from_time '2016-11-28 09:00' to_time: '2016-11-28 13:39' > /tmp/.hb_report.workdir.SG02tv//crm1/journal.log
crm1: DEBUG: the log file is in the syslog format
crm1: DEBUG: found log /var/log/messages
crm2: DEBUG: including segment [2268-] from /var/log/messages
crm1: DEBUG: including segment [2272-] from /var/log/messages
crm2: DEBUG: looking for PE files in /var/lib/pacemaker/pengine
crm2: DEBUG: found 6 pengine input files in /var/lib/pacemaker/pengine
crm1: DEBUG: looking for PE files in /var/lib/pacemaker/pengine
crm2: DEBUG: the log file is in the syslog format
crm2: DEBUG: found log /var/log/pacemaker.log
crm1: DEBUG: the log file is in the syslog format
crm1: DEBUG: found log /var/log/pacemaker.log
crm2: DEBUG: including segment [4652-] from /var/log/pacemaker.log
crm1: DEBUG: including segment [7182-] from /var/log/pacemaker.log
The report is saved in ./test2.tar.bz2
Report timespan: 11/28/16 09:00:00 - 01/01/70 00:00:00
Thank you for taking time to create this report.
Adding more verbosity fails if dash
is used as /bin/sh
:
# /usr/share/crmsh/hb_report/hb_report -v -v -f 09:00 test2
crm2: INFO: high debug level, please read debug.out
/usr/share/crmsh/hb_report/hb_report: 1285: /usr/share/crmsh/hb_report/hb_report: Bad substitution
Seems that changing the condition from is_node
to is_collector
in hb_report
helps with logfile detection:
#
# part 4: find the logs and cut out the segment for the period
#
is_collector && getlog
Can you confirm what is the correct condition here?
Hrm. No, that should really be is_node
. There seems to be a host of dash incompatibilities at play. Could you try just changing the shebang to #!/bin/bash
and see if that helps at all? Then at least it would be possible to focus on the log collection issues without having to fix dash incompatibility as well.
Changing to bash does not seem to make a difference. It seems to me logger
is only run for __slave
calls but log detection is run before that in the master process so it will always fail?
# ./hb_report -f 09:00 -v test
crm2: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm2: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm2: DEBUG: this is OpenAIS cluster stack
crm2: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm2: DEBUG: log settings: facility=daemon logfile= debugfile=
crm2: DEBUG: querying CRM for nodes
crm2: DEBUG: nodes: crm1 crm2
crm2: DEBUG: ssh crm1 OK
crm2: DEBUG: journalctl from: '1480410000' until: '0' from_time '2016-11-29 09:00' to_time: '2016-11-29 09:01' > /tmp/.hb_report.workdir.oHEpP5/test/journal.log
crm2: DEBUG: no HA log found in /var/log /var/logs /var/syslog /var/adm
/var/log/ha /var/log/cluster /var/log/pacemaker
/var/log/heartbeat /var/log/crm /var/log/corosync /var/log/openais
crm2: DEBUG: the log file is in the syslog format
crm2: DEBUG: running: hb_report __slave
crm2: DEBUG: running: ssh -o StrictHostKeyChecking=no -o EscapeChar=none -o ConnectTimeout=15 crm1 " /usr/share/crmsh/hb_report/hb_report __slave"
crm2: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm2: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm2: DEBUG: this is OpenAIS cluster stack
crm2: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm2: DEBUG: log settings: facility=daemon logfile= debugfile=
crm2: DEBUG: run: logger -p daemon.info Mark:HB_REPORT:1480410082
crm1: DEBUG: setting CRM_DAEMON_DIR to /usr/lib/pacemaker
crm1: DEBUG: setting PCMK_LIB to /var/lib/pacemaker
crm1: DEBUG: this is OpenAIS cluster stack
crm1: DEBUG: reading log settings from /etc/corosync/corosync.conf
crm1: DEBUG: log settings: facility=daemon logfile= debugfile=
crm1: DEBUG: run: logger -p daemon.info Mark:HB_REPORT:1480410082
crm2: DEBUG: journalctl from: '1480410000' until: '0' from_time '2016-11-29 09:00' to_time: '2016-11-29 09:01' > /tmp/.hb_report.workdir.a9hTwR//crm2/journal.log
crm2: DEBUG: the log file is in the syslog format
crm1: DEBUG: journalctl from: '1480410000' until: '0' from_time '2016-11-29 09:00' to_time: '2016-11-29 09:01' > /tmp/.hb_report.workdir.4Qrhs1//crm1/journal.log
crm1: DEBUG: the log file is in the syslog format
crm2: DEBUG: looking for PE files in /var/lib/pacemaker/pengine
crm1: DEBUG: looking for PE files in /var/lib/pacemaker/pengine
crm2: DEBUG: the log file is in the syslog format
crm2: DEBUG: found log /var/log/pacemaker.log
crm1: WARN: could not figure out the log format of /var/log/pacemaker.log
crm2: DEBUG: including segment [171-] from /var/log/pacemaker.log
The report is saved in ./test.tar.bz2
Report timespan: 11/29/16 09:00:00 - 01/01/70 00:00:00
Thank you for taking time to create this report.
On Mon, Nov 28, 2016 at 05:43:33AM -0800, Valentin Vidić wrote:
Adding more verbosity fails if
dash
is used as/bin/sh
:# /usr/share/crmsh/hb_report/hb_report -v -v -f 09:00 test2 crm2: INFO: high debug level, please read debug.out /usr/share/crmsh/hb_report/hb_report: 1285: /usr/share/crmsh/hb_report/hb_report: Bad substitution
Apparently because of array use:
1284 if echo "$SHELL" | grep bash > /dev/null &&
1285 [ ${BASH_VERSINFO[0]} = "4" ]; then
1286 exec 3>>$WORKDIR/debug.out
1287 BASH_XTRACEFD=3
1288 else
1289 exec 2>>$WORKDIR/debug.out
1290 fi
Perhaps something like this would help:
is_bash_v4() {
local v
v=$(echo $BASH_VERSION | sed 's/[.].*//')
[ $(expr "$v" : [0-9]) = 1 ] && $v -ge 4
}
if echo "$SHELL" | grep bash > /dev/null &&
is_bash_v4; then
...
I'll make a pull request for that.
On Tue, Nov 29, 2016 at 12:49:57AM -0800, Kristoffer Grönlund wrote:
Hrm. No, that should really be
is_node
. There seems to be a host of dash incompatibilities at play. Could you try just changing the shebang to#!/bin/bash
and see if that helps at all? Then at least it would be possible to focus on the log collection issues without having to fix dash incompatibility as well.
I doubt that there are any more dash vs bash issues.
On Tue, Nov 29, 2016 at 01:07:16AM -0800, Valentin Vidić wrote:
Changing to bash does not seem to make a difference. It seems to me
logger
is only run for__slave
calls but log detection is run before that in the master process so it will always fail?
The test report you uploaded seems to contain all information hb_report could have collected. /var/log/pacemaker.log has the messages at the info severity (the new debug level in pacemaker). There are also transitions recorded. Perhaps the output from "crm -d history" could shed some light on the matter.
P.S.1 events.txt in the report is empty because hb_report searches only one of the logs and it prefers the one where it finds its Mark message. This is because in the past there were sometimes duplicate messages due to syslog logging to several logs. Since all interesting logging got moved to pacemaker.log which is not created via syslog, but by simply writing to a file, this log file is not considered.
P.S.2 SUSE has a couple of patches for pacemaker which increase the severity of some important log messages from "info" to "notice". Valentin, perhaps you could consider applying them in Debian. That should help a bit too.
Thanks for checking the report I sent. Here is what I get by default from crm -d history
:
crm(live)history# events
DEBUG: setting source to live
DEBUG: setting report times: <2016-11-30 12:45:01> - <None>
DEBUG: Found description.txt, created 1480513493.0
DEBUG: history: setting year to 2016 (ts: 1480513493.0)
DEBUG: found pacemaker version: 1.1.15-e174ec8 in cib: /var/cache/crm/history-root/live/crm2/cib.xml
DEBUG: parsing /var/cache/crm/history-root/live/crm1/messages
DEBUG: parsing /var/cache/crm/history-root/live/crm2/messages
DEBUG: Transition metadata saved to /var/cache/crm/history-root/live/__meta.json
INFO: Fetching updated logs from cluster nodes. Please wait...
DEBUG: Candidate logs: ['/var/cache/crm/history-root/live/crm1/messages', '/var/cache/crm/history-root/live/crm2/messages']
INFO: No updatable logs found (missing .info for logs)
INFO: Retrieving information from cluster nodes, please wait...
DEBUG: tarball: live.tar.bz2, in dir: /var/cache/crm/history-root
DEBUG: tar -xj < live.tar.bz2
DEBUG: Found description.txt, created 1480513510.0
DEBUG: history: setting year to 2016 (ts: 1480513510.0)
DEBUG: piping string to /usr/bin/less -R
So no events are found as expected in messages
log since it only containts kernel lines. Same goes for transitions:
crm(live)history# transitions
Time Name Node Tags
However if I change the log file list in history.py
to consider pacemaker.log
before messages
it looks better:
crm(live)history# events
DEBUG: setting source to live
DEBUG: setting report times: <2016-11-30 12:48:51> - <None>
DEBUG: Found description.txt, created 1480513678.0
DEBUG: history: setting year to 2016 (ts: 1480513678.0)
DEBUG: found pacemaker version: 1.1.15-e174ec8 in cib: /var/cache/crm/history-root/live/crm2/cib.xml
DEBUG: parsing /var/cache/crm/history-root/live/crm1/pacemaker.log
DEBUG: {Transition: crm1:pe-input-77
DEBUG: +Event resource: lb
DEBUG: +Event resource: lb
DEBUG: +Event resource: ip
DEBUG: +Event resource: ip
DEBUG: +Event resource: ip
DEBUG: }Transition: crm1:pe-input-77 Complete
DEBUG: parsing /var/cache/crm/history-root/live/crm2/pacemaker.log
DEBUG: +Event node: crm1
DEBUG: +Event node: crm1
DEBUG: Transition metadata saved to /var/cache/crm/history-root/live/__meta.json
INFO: Fetching updated logs from cluster nodes. Please wait...
DEBUG: Candidate logs: ['/var/cache/crm/history-root/live/crm1/pacemaker.log', '/var/cache/crm/history-root/live/crm2/pacemaker.log']
DEBUG: Updating crm1 : /var/cache/crm/history-root/live/crm1/pacemaker.log : /var/log/pacemaker.log : 90900
DEBUG: Updating crm2 : /var/cache/crm/history-root/live/crm2/pacemaker.log : /var/log/pacemaker.log : 157462
DEBUG: updating /var/log/pacemaker.log from crm1 (pos 90900)
DEBUG: updating /var/log/pacemaker.log from crm2 (pos 157462)
DEBUG: parsing /var/cache/crm/history-root/live/crm1/pacemaker.log
DEBUG: {Transition: crm1:pe-input-77
DEBUG: +Event resource: lb
DEBUG: +Event resource: lb
DEBUG: +Event resource: ip
DEBUG: +Event resource: ip
DEBUG: +Event resource: ip
DEBUG: }Transition: crm1:pe-input-77 Complete
DEBUG: parsing /var/cache/crm/history-root/live/crm2/pacemaker.log
DEBUG: +Event node: crm1
DEBUG: +Event node: crm1
DEBUG: Transition metadata saved to /var/cache/crm/history-root/live/__meta.json
DEBUG: piping string to /usr/bin/less -R
Nov 30 13:35:29 [504] crm2 crmd: info: election_count_vote: Election 1 (owner: 1) lost: vote from crm1 (Uptime)
Nov 30 13:35:29 [504] crm2 crmd: info: election_count_vote: Election 2 (owner: 1) lost: vote from crm1 (Uptime)
Nov 30 13:35:29 [474] crm1 crmd: notice: te_rsc_command: Initiating start operation lb_start_0 locally on crm1 | action 6
Nov 30 13:35:29 [471] crm1 lrmd: info: log_execute: executing - rsc:lb action:start call_id:10
Nov 30 13:35:31 [474] crm1 crmd: notice: te_rsc_command: Initiating start operation ip_start_0 locally on crm1 | action 8
Nov 30 13:35:31 [471] crm1 lrmd: info: log_execute: executing - rsc:ip action:start call_id:12
Nov 30 13:35:31 [471] crm1 lrmd: info: log_finished: finished - rsc:ip action:start call_id:12 pid:620 exit-code:0 exec-time:22ms queue-time:0ms
One transition is also detected now:
crm(live)history# transitions
Time Name Node Tags
2016-11-30 13:35:29 - 13:35:31: pe-input-77 crm1 ip lb
On Wed, Nov 30, 2016 at 06:01:21AM -0800, Valentin Vidić wrote:
DEBUG: Candidate logs: ['/var/cache/crm/history-root/live/crm1/messages', '/var/cache/crm/history-root/live/crm2/messages']
Hmpf. Perhaps move "pacemaker.log" to the front here:
21 _LOG_FILES = ("messages", "ha-log.txt", "ha-log", "cluster-log.txt", ...
Still, that'd be a kludge and would probably break for some installations.
BTW, "messages" got to be preferred because of bsc#998891 in 99f2056f655850bed60ab9136c4a2b805afd9442. I don't know what where the circumstances in that bug report, but wonder if that's a good decision, since "ha-log.txt" contains the log which got actually picked by hb_report. Normally, it should even be coming from "messages" on the suse platforms.
Alternatively, could we make history use all logs per node. Or have some kind of heuristics to pick the best log.
Yep, until there is a more general solution I patched the Debian package to prefer pacemaker.log
and syslog
as can be seen here:
https://anonscm.debian.org/cgit/debian-ha/crmsh.git/tree/debian/patches/0012-Fix-crm-history.patch
Using multiple log files would be best but I guess line duplicates and line ordering are a problem there?
Commands like
history events
andhistory transitions
do not provide expected info with crmsh 2.3.2 and pacemaker 1.1.15.It seems the
events.txt
files created byhb_report
are empty so perhaps the event patterns need to be updated?Also most of the logging now goes into
pacemaker.log
so maybe this needs to be checked instead ofha-log.txt
?