Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Since updating to 2.12.0-1, Icinga Zone Agents occasionally stall with no useful logging. #8173

Closed steaksauce- closed 3 years ago

steaksauce- commented 4 years ago

Describe the bug

Icinga 2 satellites hang indefinitely on reload. The service appears to be running, and FRESHNESS never alerts until the service is stopped.

Systemd shows this: Aug 11 13:36:14 ica02m02n.redacted systemd[1]: icinga2.service reload operation timed out. Stopping.

To Reproduce

This appears to be a race condition that only affects Icinga 2 agents (I have not seen it on the master yet). Eventually Icinga 2 attempts to reload from Icinga Director and then stalls. Because the systemd unit file waits 30 minutes before timing out, we see the following in the logs:

Aug 11 13:36:14 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.
Aug 11 14:06:15 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.
Aug 11 14:36:15 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.
Aug 11 15:06:15 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.
Aug 11 15:36:15 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.
Aug 11 16:06:16 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.
Aug 11 16:36:16 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.
Aug 11 17:06:16 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.
Aug 11 17:36:16 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.
Aug 11 18:51:01 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.
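
To check the timing of these entries against the 30-minute timeout, the gaps between consecutive timeouts can be computed. This is a minimal sketch, not part of the original report: the function names are hypothetical, only the first three journal lines above are used as sample input, and the year 2020 is an assumption because syslog-style timestamps omit it.

```python
from datetime import datetime, timedelta

# First three entries copied from the journal excerpt above.
LINES = [
    "Aug 11 13:36:14 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.",
    "Aug 11 14:06:15 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.",
    "Aug 11 14:36:15 ica02m02n.redacted[1]: icinga2.service reload operation timed out. Stopping.",
]


def parse_timestamp(line: str, year: int = 2020) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp; the year is assumed."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def reload_gaps(lines):
    """Return the time between consecutive reload-timeout entries."""
    times = [
        parse_timestamp(line)
        for line in lines
        if "reload operation timed out" in line
    ]
    return [later - earlier for earlier, later in zip(times, times[1:])]


gaps = reload_gaps(LINES)
for gap in gaps:
    print(gap)
```

Gaps that stay close to 30 minutes would suggest each reload attempt hangs until the unit's timeout fires, and is then retried.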

This did not happen prior to updating to version 2.12.0-1 (previous version that we used was 2.11.3-1)

It seems like the logging mechanism is stopping before the actual crash. The logs are virtually empty prior to crash (I currently have debugging log turned on to try to catch it next time around).

Expected behavior

I would expect that Icinga would reload without issue.

Your Environment

icinga2 - The Icinga 2 network monitoring daemon (version: 2.12.0-1)

Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1127.18.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-hh8q3bz2-project-322-concurrent-0
  OpenSSL version: OpenSSL 1.0.2k-fips  26 Jan 2017

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
Disabled features: compatlog elasticsearch gelf graphite icingadb influxdb livestatus notification opentsdb perfdata statusdata syslog
Enabled features: api checker command debuglog mainlog

Please ignore all of my warnings about unused rules :)

[2020-08-12 14:22:08 -0500] information/cli: Icinga application loader (version: 2.12.0-1)
[2020-08-12 14:22:08 -0500] information/cli: Loading configuration file(s).
[2020-08-12 14:22:10 -0500] information/ConfigItem: Committing config item(s).
[2020-08-12 14:22:10 -0500] information/ApiListener: My API identity: ica02m02n.nsvltn.ena.net
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '5 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1:0-1:29) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MULTI-CIRCUIT' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 11:1-11:29) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PING - VOICE' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 21:1-21:28) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '5 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 31:1-31:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '5 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 41:1-41:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MULTI-CIRCUIT' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 51:1-51:29) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '5 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 61:1-61:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'NPCD' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 71:1-71:20) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'SYMMETRY' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 80:1-80:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '1 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 90:1-90:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'BGPD' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 101:1-101:20) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'DNS LOOKUP NSTEST.ENA.NET' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 112:1-112:41) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PING DNS' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 121:1-121:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'HTTP' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 131:1-131:20) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'HTTP Blocked' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 140:1-140:28) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'HTTPD' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 149:1-149:21) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'HTTPS Blocked' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 160:1-160:29) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'IPTABLES' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 169:1-169:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'LOAD' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 180:1-180:20) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'DISK' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 195:1-195:20) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'ROOT FS' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 204:1-204:23) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'DEFUNCT' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 219:1-219:23) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'SSH' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 234:1-234:19) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'NRPE' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 243:1-243:20) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'NTP' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 252:1-252:19) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'NAMED' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 267:1-267:21) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'NF CONNTRACK CONNS' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 282:1-282:34) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'NTP STRATUM' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 297:1-297:27) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'NTPD' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 312:1-312:20) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PUPPET AGENT' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 327:1-327:28) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'RSYSLOG' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 342:1-342:23) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'NTP PEER' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 357:1-357:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MARIADB' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 387:1-387:23) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MYSQL Stats' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 397:1-397:27) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MARIADB' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 406:1-406:23) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MYSQL Stats' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 416:1-416:27) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'Mariadb Replication' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 425:1-425:35) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PING' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 436:1-436:20) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PING' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 445:1-445:20) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PING' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 454:1-454:20) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'BROCADE CHASSIS' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 463:1-463:31) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '5 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 473:1-473:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'HTTP AUP' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 525:1-525:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'POWER REDUNDANCY' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 828:1-828:32) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PING - VOICE' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 838:1-838:28) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PING' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 847:1-847:20) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CARD HEALTH' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 856:1-856:27) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CHASSIS HEALTH' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 865:1-865:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '1 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 874:1-874:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '1 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 883:1-883:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'FAILOVER' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 892:1-892:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'BROCADE CHASSIS' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 901:1-901:31) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '5 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 910:1-910:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '1 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 919:1-919:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PING-Customer Interface 169.139.1.6' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 928:1-928:51) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PING-Ten4/1 172.23.196.65' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 938:1-938:41) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PING-Ten4/1 172.23.196.66' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 948:1-948:41) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PING-Customer Interface 169.139.9.6' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 958:1-958:51) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CARD HEALTH' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 968:1-968:27) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '1 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 988:1-988:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CARD HEALTH' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 997:1-997:27) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'BROCADE 5 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1006:1-1006:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '1 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1094:1-1094:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'HTTP GA' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1149:1-1149:23) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'FAILOVER' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1170:1-1170:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CHECK KVM GUESTS' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1239:1-1239:32) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'HTTPD' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1502:1-1502:21) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'PUPPET AGENT' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1513:1-1513:28) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'RSYSLOG' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1524:1-1524:23) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1602:1-1602:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '1 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1646:1-1646:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '5 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1694:1-1694:30) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'WLC AP STATUS' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1726:1-1726:29) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'WLC AP USERS' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1735:1-1735:28) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1746:1-1746:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1762:1-1762:22) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1813:1-1813:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'DISK UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1822:1-1822:25) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1831:1-1831:22) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1840:1-1840:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1854:1-1854:22) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'DISK UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1868:1-1868:25) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'HA STATUS' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1882:1-1882:25) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1894:1-1894:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1903:1-1903:22) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'ROOT VDOM CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1912:1-1912:34) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'ROOT VDOM MEMORY' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1921:1-1921:32) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 1 SLOT 1' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1930:1-1930:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 2 SLOT 1' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1940:1-1940:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 1 SLOT 2' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1951:1-1951:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 2 SLOT 2' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1962:1-1962:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 1 SLOT 3' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1973:1-1973:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 2 SLOT 3' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1984:1-1984:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 1 SLOT 4' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1995:1-1995:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 2 SLOT 4' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2006:1-2006:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 1 SLOT 5' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2017:1-2017:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 2 SLOT 5' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2028:1-2028:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 1 SLOT 6' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2039:1-2039:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'CPU UTIL - NODE 2 SLOT 6' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2050:1-2050:40) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 1 SLOT 1' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2061:1-2061:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 1 SLOT 2' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2071:1-2071:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 1 SLOT 3' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2082:1-2082:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 1 SLOT 4' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2093:1-2093:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 1 SLOT 5' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2104:1-2104:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 1 SLOT 6' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2115:1-2115:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 2 SLOT 1' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2126:1-2126:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 2 SLOT 2' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2137:1-2137:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 2 SLOT 3' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2148:1-2148:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 2 SLOT 4' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2159:1-2159:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 2 SLOT 5' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2170:1-2170:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MEMORY - NODE 2 SLOT 6' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2181:1-2181:38) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'SESSIONS' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 2192:1-2192:24) for type 'Service' does not match anywhere!
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 2436 Hosts.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 6 Downtimes.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 1 NotificationCommand.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 2 FileLoggers.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 10000 Comments.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 1 IcingaApplication.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 9 HostGroups.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 1 EventCommand.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 1 CheckerComponent.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 4 Zones.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 3 Endpoints.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 1 ApiUser.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 1 ApiListener.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 298 CheckCommands.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 7 TimePeriods.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 6329 Services.
[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 7 ServiceGroups.
[2020-08-12 14:22:19 -0500] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2020-08-12 14:22:19 -0500] information/cli: Finished validating the configuration file(s).
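
Since the ApplyRule warnings dominate the output above, it can help to separate them from the rest of the validation log when reading it. The following is a rough sketch under the assumption that the log has been saved as a list of lines; the function name and the regular expression are illustrative and only match the warning format shown above.

```python
import re
from collections import Counter

# Matches the rule name in lines such as:
#   warning/ApplyRule: Apply rule 'PING' (in ...) for type 'Service' does not match anywhere!
APPLY_RULE = re.compile(r"warning/ApplyRule: Apply rule '([^']+)'")


def split_apply_rule_warnings(lines):
    """Count ApplyRule warnings per rule name and return the remaining lines."""
    counts = Counter()
    rest = []
    for line in lines:
        match = APPLY_RULE.search(line)
        if match:
            counts[match.group(1)] += 1
        else:
            rest.append(line)
    return counts, rest


# A few lines copied from the validation output above.
SAMPLE = [
    "[2020-08-12 14:22:10 -0500] information/ConfigItem: Committing config item(s).",
    "[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '5 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 1:0-1:29) for type 'Service' does not match anywhere!",
    "[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule 'MULTI-CIRCUIT' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 11:1-11:29) for type 'Service' does not match anywhere!",
    "[2020-08-12 14:22:19 -0500] warning/ApplyRule: Apply rule '5 MIN CPU UTIL' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 31:1-31:30) for type 'Service' does not match anywhere!",
    "[2020-08-12 14:22:19 -0500] information/ConfigItem: Instantiated 2436 Hosts.",
]

counts, rest = split_apply_rule_warnings(SAMPLE)
for name, count in counts.most_common():
    print(f"{count:3d}  {name}")
for line in rest:
    print(line)
```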

Additional context

Maybe related to #8160, but this is a distributed environment that otherwise works fine, except hanging on the occasional config RELOAD.

The issue can be worked around with manual intervention: systemctl stop icinga2, then systemctl start icinga2.

More information to come if the debug log produces something useful.

A similar issue happened with 2.10 with users using graphite, but we are not using graphite.

yoshi314 commented 4 years ago

I have the same issue. Things break when a config reload happens. On the affected node, all checks hang as defunct processes. Can you verify this on your end?

steaksauce- commented 4 years ago

@yoshi314 The FRESHNESS check built into Icinga shows that all checks hang (as well as picking a few services that haven't reported back in a while).

This is odd behavior with the FRESHNESS check. I believe the check looks at the DB to see if service checks have been executed in X amount of time.

steaksauce- commented 4 years ago

Naturally my debug log filled up the drive last night before there was a crash.

steaksauce- commented 4 years ago

This happened a few times throughout the weekend, but the log drive filled up from the debug log :(

steaksauce- commented 4 years ago

Last icinga2.log entry before the request to shutdown:

[2020-08-17 15:55:29 -0500] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 18, rate: 161.933/s (9716/min 55956/5min 171832/15min);
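
For anyone comparing queue statistics across several stalls, the fields in a WorkQueue line like the one above can be pulled out with a regular expression. This is only a sketch: the pattern is derived from this single log line and is an assumption about the format, not a documented Icinga 2 interface, and the function name is hypothetical.

```python
import re

# Pattern inferred from the single WorkQueue line quoted above.
WORK_QUEUE = re.compile(
    r"information/WorkQueue: #(?P<id>\d+) "
    r"\((?P<owner>[^,]+), (?P<name>[^)]+)\) "
    r"items: (?P<items>\d+), rate: (?P<rate>[\d.]+)/s"
)


def parse_work_queue(line: str):
    """Return the queue id, owner, name, pending items and rate, or None."""
    match = WORK_QUEUE.search(line)
    if match is None:
        return None
    return {
        "id": int(match["id"]),
        "owner": match["owner"],
        "name": match["name"],
        "items": int(match["items"]),
        "rate_per_s": float(match["rate"]),
    }


LINE = (
    "[2020-08-17 15:55:29 -0500] information/WorkQueue: #5 (ApiListener, RelayQueue) "
    "items: 18, rate: 161.933/s (9716/min 55956/5min 171832/15min);"
)

stats = parse_work_queue(LINE)
print(stats)
```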

Right before that there was a config reload to this zone:

[2020-08-17 15:55:18 -0500] information/ApiListener: Config validation for stage '/var/lib/icinga2/api/zones-stage/' was OK, replacing into '/var/lib/icinga2/api/zones/' and triggering reload.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//.checksums' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//.timestamp' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//director/hosts.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//director/services.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//.checksums' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//.timestamp' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/001-director-basics.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/command_templates.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/commands.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/host_templates.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/hostgroups.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/service_apply.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/service_templates.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/servicegroups.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/timeperiod_templates.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/timeperiods.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/user_templates.conf' from config sync staging to production zones directory.

steaksauce- commented 4 years ago

I think it's worth noting that the zone has 2 agents, and the reload usually only kills one of the 2 agents (not the same agent every time though), and so far it only affects this zone.

steaksauce- commented 4 years ago

Same thing happened on the other agent in the zone this evening:

[2020-08-17 19:37:08 -0500] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 4, rate: 144.967/s (8698/min 54217/5min 172939/15min);
[2020-08-17 19:37:09 -0500] information/ApiListener: Config validation for stage '/var/lib/icinga2/api/zones-stage/' was OK, replacing into '/var/lib/icinga2/api/zones/' and triggering reload.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//.checksums' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//.timestamp' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//director/hosts.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//director/services.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//.checksums' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//.timestamp' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/001-director-basics.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/command_templates.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/commands.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/host_templates.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/hostgroups.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/service_apply.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/service_templates.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/servicegroups.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/timeperiod_templates.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/timeperiods.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/user_templates.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:23 -0500] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 2, rate: 159/s (9540/min 54348/5min 172795/15min);

yoshi314 commented 4 years ago

I have temporarily worked around my issue by keeping the master node, where I deploy my config, on 2.12, while the nodes that receive config via the API stay on 2.11.x.

steaksauce- commented 4 years ago

@yoshi314 are you using the director to deploy configs?

steaksauce- commented 4 years ago

Finally caught this before systemd timed out. Not much change in the information, but the last thing is that the api is stopped when reloading, then it just hangs.

# tail -f /var/log/icinga2/icinga2.log
[2020-08-18 11:10:22 -0500] information/ApiListener: Copying file 'director-global//director/timeperiod_templates.conf' from config sync staging to production zones directory.
[2020-08-18 11:10:22 -0500] information/ApiListener: Copying file 'director-global//director/timeperiods.conf' from config sync staging to production zones directory.
[2020-08-18 11:10:22 -0500] information/ApiListener: Copying file 'director-global//director/user_templates.conf' from config sync staging to production zones directory.
[2020-08-18 11:10:25 -0500] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 2, rate: 169.75/s (10185/min 56482/5min 170796/15min);
[2020-08-18 11:10:40 -0500] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 62, rate: 168.05/s (10083/min 56997/5min 170351/15min);
[2020-08-18 11:10:40 -0500] information/Application: Received request to shut down.
[2020-08-18 11:10:41 -0500] information/Application: Shutting down...
[2020-08-18 11:10:41 -0500] information/CheckerComponent: 'checker' stopped.
[2020-08-18 11:10:41 -0500] information/ExternalCommandListener: 'command' stopped.
[2020-08-18 11:10:41 -0500] information/ApiListener: 'api' stopped.

Service specifically hangs in the reloading state until the 30 minutes is up and systemd notifies.

yoshi314 commented 4 years ago

@steaksauce- Partially, but mostly I use manual file configs.

The primary master node receives config from Director, and that doesn't seem to bother it. The secondary node in the zone, which receives everything from the primary master, is the one that misbehaves.

steaksauce- commented 4 years ago

Interesting. We are exclusively director (25k+ services, so why not) and run into the issue. Ours is either agent in the zone, but never at the same time (unless one goes unresolved long enough).

Just wondering if it was director related or not. But it's not on every config refresh, and it's only this one zone, as opposed to any of the other 3.

Al2Klimov commented 4 years ago

Please could any of you

  1. catch the stall again
  2. create core dumps [1]
  3. ask for an upload link?

[1]

[root@aklimov8173 ~]# pushd `mktemp -d`
/tmp/tmp.hFWJ0sQLDD ~
[root@aklimov8173 tmp.hFWJ0sQLDD]# for pid in $(pidof icinga2); do gdb -p $pid -batch -ex "generate-core-file" -ex "detach" -ex "q"; done
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fe2c18c5c4d in recvmsg () from /lib64/libpthread.so.0
warning: target file /proc/10006/cmdline contained unexpected null characters
Saved corefile core.10006
[Inferior 1 (process 10006) detached]
[New LWP 10010]
[New LWP 10009]
[New LWP 10008]
[New LWP 10007]
[New LWP 10005]
[New LWP 10004]
[New LWP 9996]
[New LWP 9995]
[New LWP 9994]
[New LWP 9993]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fe2c15af56d in nanosleep () from /lib64/libc.so.6
warning: target file /proc/9992/cmdline contained unexpected null characters
Saved corefile core.9992
[Inferior 1 (process 9992) detached]
[New LWP 10003]
[New LWP 10002]
[New LWP 10001]
[New LWP 10000]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fe2c15af56d in nanosleep () from /lib64/libc.so.6
warning: target file /proc/9971/cmdline contained unexpected null characters
Saved corefile core.9971
[Inferior 1 (process 9971) detached]
[root@aklimov8173 tmp.hFWJ0sQLDD]# ls
core.10006  core.9971  core.9992
[root@aklimov8173 tmp.hFWJ0sQLDD]#

yoshi314 commented 4 years ago

i'll try that on Monday

steaksauce- commented 4 years ago

I'm in dependency hell for gdb on CentOS 7.

Missing separate debuginfos, use: debuginfo-install sssd-client-1.16.4-37.el7_8.4.x86_64 systemd-libs-219-73.el7_8.9.x86_64
(gdb) quit
A debugging session is active.

        Inferior 1 [process 32735] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/lib64/icinga2/sbin/icinga2, process 32735
[Inferior 1 (process 32735) detached]
[root@ica02m02n.nsvltn ~]# debuginfo-install sssd-client-1.16.4-37.el7_8.4.x86_64 systemd-libs-219-73.el7_8.9.x86_64
Loaded plugins: auto-update-debuginfo, fastestmirror, langpacks, versionlock
enabling epel-debuginfo
enabling base-debuginfo
enabling ius-debuginfo
Loading mirror speeds from cached hostfile
 * epel-debuginfo: mirror.nodesdirect.com
Could not find debuginfo for main pkg: sssd-client-1.16.4-37.el7_8.4.x86_64
Package glibc-debuginfo-2.17-307.el7.1.x86_64 already installed and latest version
Package e2fsprogs-debuginfo-1.42.9-17.el7.x86_64 already installed and latest version
Package krb5-debuginfo-1.15.1-46.el7.x86_64 already installed and latest version
Package pam-debuginfo-1.1.8-23.el7.x86_64 already installed and latest version
Could not find debuginfo pkg for dependency package libsss_idmap-1.16.4-37.el7_8.4.x86_64
Could not find debuginfo pkg for dependency package libsss_nss_idmap-1.16.4-37.el7_8.4.x86_64
Could not find debuginfo for main pkg: systemd-libs-219-73.el7_8.9.x86_64
Package libcap-debuginfo-2.22-11.el7.x86_64 already installed and latest version
Package elfutils-debuginfo-0.176-4.el7.x86_64 already installed and latest version
Package gcc-debuginfo-4.8.5-39.el7.x86_64 already installed and latest version
Package libgcrypt-debuginfo-1.5.3-14.el7.x86_64 already installed and latest version
Package libgpg-error-debuginfo-1.12-3.el7.x86_64 already installed and latest version
Package lz4-debuginfo-1.7.5-3.el7.x86_64 already installed and latest version
Package xz-debuginfo-5.2.2-1.el7.x86_64 already installed and latest version
Package libselinux-debuginfo-2.5-15.el7.x86_64 already installed and latest version
No debuginfo packages available to install

Let me see if I can resolve this and I'll try to generate everything after it crashes again

Al2Klimov commented 4 years ago

(gdb) quit

You don't need the gdb "shell". Just use the for loop as I've shown.

[root@aklimov8173 tmp.hFWJ0sQLDD]# for pid in $(pidof icinga2); do gdb -p $pid -batch -ex "generate-core-file" -ex "detach" -ex "q"; done

steaksauce- commented 4 years ago

@Al2Klimov just doing some testing -- I fixed the deps on one box, and I'll go for the other. Out of curiosity do these steps need to be completed BEFORE or AFTER the crash?

Al2Klimov commented 4 years ago

  1. Make sure the command works and generates core files
  2. Wait for an agent to stall
  3. Run the command before systemd times out

steaksauce- commented 4 years ago

Hard part will be catching it before it times out. I'll watch the logs and wait for slow/non-existent activity and try to catch it in the act. I'll keep an eye on the director too to look out for zone updates that would trigger a reload.
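One way to avoid watching the logs by hand is a small watcher that polls the unit state and runs the gdb loop once the service has been stuck in "reloading" for a while. This is a rough sketch under assumptions: the unit name `icinga2`, the 120-second threshold, and the helper names `reload_stalled` and `watch_unit` are all made up for illustration. Only `systemctl is-active` and the gdb loop quoted earlier in this thread are existing commands.

```shell
#!/usr/bin/env bash
UNIT=icinga2   # assumption: name of the systemd unit
LIMIT=120      # assumption: seconds in "reloading" before calling it a stall

# reload_stalled STATE SINCE NOW LIMIT   (hypothetical helper)
# Succeeds when STATE is "reloading" and has lasted more than LIMIT seconds.
reload_stalled() {
    [ "$1" = "reloading" ] && [ $(( $3 - $2 )) -gt "$4" ]
}

# watch_unit   (hypothetical helper)
# Polls the unit every 10 seconds and dumps cores once a stall is detected.
# Core files are written to the current working directory.
watch_unit() {
    local since=0 state now pid
    while true; do
        # is-active prints the unit's active state, e.g. active or reloading
        state=$(systemctl is-active "$UNIT")
        now=$(date +%s)
        if [ "$state" = "reloading" ]; then
            # remember when the reload was first observed
            if [ "$since" -eq 0 ]; then since=$now; fi
        else
            since=0
        fi
        if reload_stalled "$state" "$since" "$now" "$LIMIT"; then
            for pid in $(pidof icinga2); do
                gdb -p "$pid" -batch -ex "generate-core-file" -ex "detach" -ex "q"
            done
            break
        fi
        sleep 10
    done
}
```

Running `watch_unit` from a directory with plenty of free space (for example inside `screen` or `tmux`) would leave the core files there once a stall is seen. The threshold should stay well below the 30-minute systemd timeout mentioned above.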

Al2Klimov commented 4 years ago

Hard part will be catching it before it times out.

Actually I don't care for systemd itself, but it'll likely kill Icinga and that's the problem.

steaksauce- commented 4 years ago

I'm still watching for it. We have a lot of new monitoring changes in August every year, and the changes that cause reloads are starting to die down.

Oddly enough, systemd never kills it. It will stay in reloading and time out every 30 minutes until someone (or monit) goes in and stops it.

steaksauce- commented 4 years ago

I just caught it in action before systemd caught it. Grabbing the dumps now, but looks like we ran out of disk space before the last dump finished. I will get them uploaded before the weekend -- for some reason we have a go-live on a Friday morning (whatever happened to read-only Fridays).
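Since the disk filled up mid-dump, a free-space check before each `gdb` run might help next time. The sketch below is only an illustration: `has_free_kb` is a hypothetical helper name, and the size to ask for is a guess that has to be chosen per system.

```shell
#!/usr/bin/env bash
# has_free_kb DIR NEEDED_KB   (hypothetical helper)
# Succeeds when the filesystem holding DIR has at least NEEDED_KB
# kilobytes available, according to POSIX-format df output.
has_free_kb() {
    local avail
    # -P forces the portable one-line-per-filesystem layout, -k uses 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail=$(df -Pk "$1" | awk 'NR==2 {print $4}')
    [ "$avail" -ge "$2" ]
}
```

For instance, `has_free_kb . 4194304 || echo "less than 4 GiB free, pick another directory"` could be run before the dump loop. The 4 GiB figure is an arbitrary example, not a measured core size.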

Al2Klimov commented 4 years ago

I just caught it in action before systemd caught it.

🚀

https://nextcloud.icinga.com/index.php/s/swXMRA2doJcRnDF

steaksauce- commented 4 years ago

Working on uploading now. It's a couple of GB in size, not sure if it will go through or hit a restriction.

Al2Klimov commented 4 years ago

Oh, and please gzip the core files!
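A minimal sketch of compressing the dumps in place, assuming they are named `core.<pid>` as in the gdb output earlier in the thread (`compress_cores` is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# compress_cores DIR   (hypothetical helper)
# Gzips every core.* file in DIR in place, e.g. core.1234 becomes core.1234.gz.
compress_cores() {
    local f
    for f in "$1"/core.*; do
        [ -f "$f" ] || continue           # glob matched nothing
        case "$f" in *.gz) continue ;; esac   # already compressed
        gzip -9 "$f"
    done
}
```

Calling `compress_cores .` in the dump directory replaces each core file with its `.gz` counterpart. Note that gzip needs temporary space for the compressed copy before it removes the original.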

steaksauce- commented 4 years ago

whoops! Uploaded without that -- I can gzip and reupload if you think it's worth the effort

steaksauce- commented 4 years ago

On the bright side, you now should have a compressed and uncompressed version.

Keep in mind that one of the dumps did not complete before the disk filled up (I forget which one).
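A quick sanity check can at least rule out empty or garbage files before uploading: core dumps on Linux are ELF files, so they should start with the four-byte ELF magic. This is only a sketch with a hypothetical helper name (`looks_like_elf`). A dump that was cut off later in the file would still pass, so it catches only the most obvious failures.

```shell
#!/usr/bin/env bash
# looks_like_elf FILE   (hypothetical helper)
# Succeeds when FILE starts with the ELF magic bytes 0x7f 'E' 'L' 'F'.
looks_like_elf() {
    local magic
    # dump the first four bytes as hex and strip od's spacing
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}
```

For example, `for f in core.*; do looks_like_elf "$f" || echo "$f looks broken"; done` lists files that are not even recognizable as ELF.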

Al2Klimov commented 4 years ago

28233 hangs here:

https://github.com/Icinga/icinga2/blob/338d0aaa8ca17b94cc84048d4811c0c478f363a4/lib/cli/daemoncommand.cpp#L769

(gdb) thread apply all bt

Thread 9 (Thread 0x7f5aa6ac08c0 (LWP 28233)):
#0  0x00007f5aa3d6f1d9 in __libc_wait (stat_loc=0x3031) at ../sysdeps/unix/sysv/linux/wait.c:35
#1  0x0000000000af6446 in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::vector<std::string, std::allocator<std::string> > const&) const (this=<optimized out>, vm=..., ap=...) at ../cli/daemoncommand.cpp:769

... while waiting for 12337...

(gdb) thread 9
[Switching to thread 9 (Thread 0x7f5aa6ac08c0 (LWP 28233))]
#0  0x00007f5aa3d6f1d9 in __libc_wait (stat_loc=0x3031) at ../sysdeps/unix/sysv/linux/wait.c:35
35    pid_t result = INLINE_SYSCALL (wait4, 4, WAIT_ANY, stat_loc, 0,
(gdb) frame 1
#1  0x0000000000af6446 in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::vector<std::string, std::allocator<std::string> > const&) const (this=<optimized out>, vm=..., ap=...) at ../cli/daemoncommand.cpp:769
769 ../cli/daemoncommand.cpp: No such file or directory.
(gdb) p currentWorker
$1 = 12337

... which... seems not to have finished:

"/home/centos/8173/core.12337" is not a core dump: File format not recognized

@steaksauce- Please could you do any of the following to provide non-corrupted cores?

  • turn off debug logging
  • rotate the logs more often (/etc/logrotate.d/icinga2)
  • make /var/log/icinga2/debug.log a pipe (mkfifo) and do e.g. while true; do cat /var/log/icinga2/debug.log; done

steaksauce- commented 4 years ago

We are having some planned outages tonight to bump up the CPU on the master to help process service checks. I'll see if we can bring these two agents down tonight as well to get some dedicated log space.

I'm on vacation starting Wednesday afternoon (Central US time) and will be returning after Labor Day.

Al2Klimov commented 4 years ago

I think I've got it:

Active: reloading (reload) since Thu 2020-09-03 09:19:44 UTC; 1h 32min ago

Al2Klimov commented 4 years ago

@steaksauce- Please could you try #8211?

https://nextcloud.icinga.com/index.php/s/p3qJbcs6mqE8HGG

steaksauce- commented 4 years ago

Working on deploying to one of the afflicted agents now. We also got the expanded logging space set up.

steaksauce- commented 4 years ago

Day 1: the unpatched agent hung on reload. No issues from the other agent yet. We want to let it run the rest of the week since the problem only happened some of the time.

steaksauce- commented 4 years ago

@Al2Klimov -- worth noting that the debug log is only enabled to troubleshoot this issue. We do not normally use the debug log, and this issue occurs on afflicted nodes whether or not the debug log is enabled. I say this because I looked at #8211 and its logic seems to be built around debugging being enabled.

steaksauce- commented 4 years ago

6 days in and no issues on the upgraded node.

Should I be using the version from nextcloud, or will this be introduced in the repos soon?

Al2Klimov commented 4 years ago

You should keep an eye on #8211.

N-o-X commented 3 years ago

Has been fixed with #8211