Closed — steaksauce- closed this issue 3 years ago
I have the same issue: things break when a config reload happens. On the affected node all checks hang with defunct processes. Can you verify this on your end?
@yoshi314 The FRESHNESS check built into Icinga shows that all checks are hanging (as does spot-checking a few services that haven't reported back in a while).
This is odd behavior for the FRESHNESS check; I believe the check looks at the DB to see whether service checks have been executed in the last X amount of time.
Naturally my debug log filled up the drive last night before there was a crash.
This happened a few times throughout the weekend, but the log drive filled up from the debug log :(
Last icinga2.log entry before the request to shut down:
[2020-08-17 15:55:29 -0500] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 18, rate: 161.933/s (9716/min 55956/5min 171832/15min);
Right before that there was a config reload to this zone:
[2020-08-17 15:55:18 -0500] information/ApiListener: Config validation for stage '/var/lib/icinga2/api/zones-stage/' was OK, replacing into '/var/lib/icinga2/api/zones/' and triggering reload.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//.checksums' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//.timestamp' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//director/hosts.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//director/services.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//.checksums' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//.timestamp' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/001-director-basics.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/command_templates.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/commands.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/host_templates.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/hostgroups.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/service_apply.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/service_templates.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/servicegroups.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/timeperiod_templates.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/timeperiods.conf' from config sync staging to production zones directory.
[2020-08-17 15:55:18 -0500] information/ApiListener: Copying file 'director-global//director/user_templates.conf' from config sync staging to production zones directory
I think it's worth noting that the zone has 2 agents, and the reload usually only kills one of the 2 agents (not the same agent every time though), and so far it only affects this zone.
Same thing happened on the other agent in the zone this evening:
[2020-08-17 19:37:08 -0500] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 4, rate: 144.967/s (8698/min 54217/5min 172939/15min);
[2020-08-17 19:37:09 -0500] information/ApiListener: Config validation for stage '/var/lib/icinga2/api/zones-stage/' was OK, replacing into '/var/lib/icinga2/api/zones/' and triggering reload.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//.checksums' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//.timestamp' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//director/hosts.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'Zone-NSVLTN//director/services.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//.checksums' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//.timestamp' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/001-director-basics.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/command_templates.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/commands.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/host_templates.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/hostgroups.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/service_apply.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/service_templates.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/servicegroups.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/timeperiod_templates.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/timeperiods.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:09 -0500] information/ApiListener: Copying file 'director-global//director/user_templates.conf' from config sync staging to production zones directory.
[2020-08-17 19:37:23 -0500] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 2, rate: 159/s (9540/min 54348/5min 172795/15min);
I have temporarily worked around my issue by keeping the master node, where I deploy my config, on 2.12, while the nodes that receive config via the API stay on 2.11.x.
@yoshi314 are you using the director to deploy configs?
Finally caught this before systemd timed out. Not much new information, but the last thing logged is that the API is stopped during the reload; then it just hangs.
# tail -f /var/log/icinga2/icinga2.log
[2020-08-18 11:10:22 -0500] information/ApiListener: Copying file 'director-global//director/timeperiod_templates.conf' from config sync staging to production zones directory.
[2020-08-18 11:10:22 -0500] information/ApiListener: Copying file 'director-global//director/timeperiods.conf' from config sync staging to production zones directory.
[2020-08-18 11:10:22 -0500] information/ApiListener: Copying file 'director-global//director/user_templates.conf' from config sync staging to production zones directory.
[2020-08-18 11:10:25 -0500] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 2, rate: 169.75/s (10185/min 56482/5min 170796/15min);
[2020-08-18 11:10:40 -0500] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 62, rate: 168.05/s (10083/min 56997/5min 170351/15min);
[2020-08-18 11:10:40 -0500] information/Application: Received request to shut down.
[2020-08-18 11:10:41 -0500] information/Application: Shutting down...
[2020-08-18 11:10:41 -0500] information/CheckerComponent: 'checker' stopped.
[2020-08-18 11:10:41 -0500] information/ExternalCommandListener: 'command' stopped.
[2020-08-18 11:10:41 -0500] information/ApiListener: 'api' stopped.
The service specifically hangs in the reloading state until the 30-minute timeout elapses and systemd notices.
@steaksauce- Partially, but mostly I use manual file configs.
The primary master node receives config from the Director, and that doesn't seem to bother it. The secondary node in the zone, which receives everything from the primary master, is the one misbehaving.
Interesting. We are exclusively Director (25k+ services, so why not) and run into the issue. Ours hits either agent in the zone, but never both at the same time (unless one goes unresolved long enough).
Just wondering if it was director related or not. But it's not on every config refresh, and it's only this one zone, as opposed to any of the other 3.
Please could any of you do the following? [1]
[root@aklimov8173 ~]# pushd `mktemp -d`
/tmp/tmp.hFWJ0sQLDD ~
[root@aklimov8173 tmp.hFWJ0sQLDD]# for pid in $(pidof icinga2); do gdb -p $pid -batch -ex "generate-core-file" -ex "detach" -ex "q"; done
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fe2c18c5c4d in recvmsg () from /lib64/libpthread.so.0
warning: target file /proc/10006/cmdline contained unexpected null characters
Saved corefile core.10006
[Inferior 1 (process 10006) detached]
[New LWP 10010]
[New LWP 10009]
[New LWP 10008]
[New LWP 10007]
[New LWP 10005]
[New LWP 10004]
[New LWP 9996]
[New LWP 9995]
[New LWP 9994]
[New LWP 9993]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fe2c15af56d in nanosleep () from /lib64/libc.so.6
warning: target file /proc/9992/cmdline contained unexpected null characters
Saved corefile core.9992
[Inferior 1 (process 9992) detached]
[New LWP 10003]
[New LWP 10002]
[New LWP 10001]
[New LWP 10000]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fe2c15af56d in nanosleep () from /lib64/libc.so.6
warning: target file /proc/9971/cmdline contained unexpected null characters
Saved corefile core.9971
[Inferior 1 (process 9971) detached]
[root@aklimov8173 tmp.hFWJ0sQLDD]# ls
core.10006 core.9971 core.9992
[root@aklimov8173 tmp.hFWJ0sQLDD]#
I'll try that on Monday.
I'm in dependency hell for gdb on CentOS 7.
Missing separate debuginfos, use: debuginfo-install sssd-client-1.16.4-37.el7_8.4.x86_64 systemd-libs-219-73.el7_8.9.x86_64
(gdb) quit
A debugging session is active.
Inferior 1 [process 32735] will be detached.
Quit anyway? (y or n) y
Detaching from program: /usr/lib64/icinga2/sbin/icinga2, process 32735
[Inferior 1 (process 32735) detached]
[root@ica02m02n.nsvltn ~]# debuginfo-install sssd-client-1.16.4-37.el7_8.4.x86_64 systemd-libs-219-73.el7_8.9.x86_64
Loaded plugins: auto-update-debuginfo, fastestmirror, langpacks, versionlock
enabling epel-debuginfo
enabling base-debuginfo
enabling ius-debuginfo
Loading mirror speeds from cached hostfile
* epel-debuginfo: mirror.nodesdirect.com
Could not find debuginfo for main pkg: sssd-client-1.16.4-37.el7_8.4.x86_64
Package glibc-debuginfo-2.17-307.el7.1.x86_64 already installed and latest version
Package e2fsprogs-debuginfo-1.42.9-17.el7.x86_64 already installed and latest version
Package krb5-debuginfo-1.15.1-46.el7.x86_64 already installed and latest version
Package pam-debuginfo-1.1.8-23.el7.x86_64 already installed and latest version
Could not find debuginfo pkg for dependency package libsss_idmap-1.16.4-37.el7_8.4.x86_64
Could not find debuginfo pkg for dependency package libsss_nss_idmap-1.16.4-37.el7_8.4.x86_64
Could not find debuginfo for main pkg: systemd-libs-219-73.el7_8.9.x86_64
Package libcap-debuginfo-2.22-11.el7.x86_64 already installed and latest version
Package elfutils-debuginfo-0.176-4.el7.x86_64 already installed and latest version
Package gcc-debuginfo-4.8.5-39.el7.x86_64 already installed and latest version
Package libgcrypt-debuginfo-1.5.3-14.el7.x86_64 already installed and latest version
Package libgpg-error-debuginfo-1.12-3.el7.x86_64 already installed and latest version
Package lz4-debuginfo-1.7.5-3.el7.x86_64 already installed and latest version
Package xz-debuginfo-5.2.2-1.el7.x86_64 already installed and latest version
Package libselinux-debuginfo-2.5-15.el7.x86_64 already installed and latest version
No debuginfo packages available to install
Let me see if I can resolve this and I'll try to generate everything after it crashes again
> (gdb) quit
You don't need the gdb "shell". Just use the for loop as I've shown.
[root@aklimov8173 tmp.hFWJ0sQLDD]# for pid in $(pidof icinga2); do gdb -p $pid -batch -ex "generate-core-file" -ex "detach" -ex "q"; done
@Al2Klimov just doing some testing -- I fixed the deps on one box, and I'll go for the other. Out of curiosity do these steps need to be completed BEFORE or AFTER the crash?
Hard part will be catching it before it times out. I'll watch the logs and wait for slow/non-existent activity and try to catch it in the act. I'll keep an eye on the director too to look out for zone updates that would trigger a reload.
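To make "catching it in the act" less manual, the watching could be scripted: poll systemd until the unit sticks in the reloading state, then fire the core-dump loop from [1]. This is a hypothetical sketch, not something from the thread; `systemctl` and the gdb loop are stubbed (marked STUB) so it runs anywhere, and on a real agent you would substitute the real commands and a longer poll interval.

```shell
#!/bin/sh
# Hypothetical watcher: wait for icinga2 to enter "reloading", then dump cores.
unit_state() { echo "ActiveState=reloading"; }   # STUB for: systemctl show icinga2 -p ActiveState
grab_cores() {                                   # STUB for: for pid in $(pidof icinga2); do gdb -p $pid -batch -ex "generate-core-file" -ex "detach" -ex "q"; done
    echo grabbed > /tmp/cores.marker
}
while sleep 1; do
    if unit_state | grep -q 'reloading'; then
        grab_cores                               # the hang window is 30 min, so this has time to run
        break
    fi
done
cat /tmp/cores.marker
```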
> Hard part will be catching it before it times out.
Actually I don't care for systemd itself, but it'll likely kill Icinga and that's the problem.
I'm still watching for it. We have a lot of new monitoring changes in August every year, and the changes that cause reloads are starting to die down.
Oddly enough, systemd never kills it. It will stay in reloading and time out every 30 minutes until someone (or monit) goes in and stops it.
I just caught it in action before systemd caught it. Grabbing the dumps now, but looks like we ran out of disk space before the last dump finished. I will get them uploaded before the weekend -- for some reason we have a go-live on a Friday morning (whatever happened to read-only Fridays).
> I just caught it in action before systemd caught it.
Working on uploading now. It's a couple of GB in size, not sure if it will go through or hit a restriction.
Oh, and please gzip the core files!
Whoops! I uploaded without that -- I can gzip and re-upload if you think it's worth the effort.
On the bright side, you should now have a compressed and an uncompressed version.
Keep in mind that one of the dumps did not complete before the disk filled up (I forget which one).
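The requested compression step is a one-liner per core. A minimal sketch, using the core name from the thread with a dummy file standing in for a real dump:

```shell
# Compress a core before upload; gzip -9 trades CPU for the smallest file.
cd "$(mktemp -d)"
head -c 16384 /dev/zero > core.10006   # stand-in for a real core dump
gzip -9 core.10006                     # replaces core.10006 with core.10006.gz
ls                                     # -> core.10006.gz
```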
28233 hangs here:
(gdb) thread apply all bt
Thread 9 (Thread 0x7f5aa6ac08c0 (LWP 28233)):
#0 0x00007f5aa3d6f1d9 in __libc_wait (stat_loc=0x3031) at ../sysdeps/unix/sysv/linux/wait.c:35
#1 0x0000000000af6446 in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::vector<std::string, std::allocator<std::string> > const&) const (this=<optimized out>, vm=..., ap=...) at ../cli/daemoncommand.cpp:769
... while waiting for 12337...
(gdb) thread 9
[Switching to thread 9 (Thread 0x7f5aa6ac08c0 (LWP 28233))]
#0 0x00007f5aa3d6f1d9 in __libc_wait (stat_loc=0x3031) at ../sysdeps/unix/sysv/linux/wait.c:35
35 pid_t result = INLINE_SYSCALL (wait4, 4, WAIT_ANY, stat_loc, 0,
(gdb) frame 1
#1 0x0000000000af6446 in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::vector<std::string, std::allocator<std::string> > const&) const (this=<optimized out>, vm=..., ap=...) at ../cli/daemoncommand.cpp:769
769 ../cli/daemoncommand.cpp: Datei oder Verzeichnis nicht gefunden. (file or directory not found)
(gdb) p currentWorker
$1 = 12337
... which... seems not to have finished:
"/home/centos/8173/core.12337" is not a core dump: File format not recognized
@steaksauce- Please could you do any of the following to provide non-corrupted cores?
- turn off debug logging (/etc/icinga2/features-available/debuglog.conf)
- rotate the logs more often (/etc/logrotate.d/icinga2)
- make /var/log/icinga2/debug.log a pipe (mkfifo) and do e.g. while true; do cat /var/log/icinga2/debug.log; done
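The pipe idea can be demonstrated end to end: if debug.log is a FIFO, debug output is drained by a reader instead of accumulating on disk. The sketch below uses a temp directory so it runs anywhere; on the agent the FIFO would replace /var/log/icinga2/debug.log and the reader would loop forever as suggested above.

```shell
# Demo of logging into a FIFO instead of a regular file.
dir=$(mktemp -d)
mkfifo "$dir/debug.log"                      # the log "file" is now a named pipe
{ echo "debug line 1"; echo "debug line 2"; } > "$dir/debug.log" &   # stand-in for the icinga2 writer
cat "$dir/debug.log" > "$dir/drained.txt"    # one pass of the reader loop
grep -c 'debug line' "$dir/drained.txt"      # -> 2
```

Note the writer blocks until a reader opens the pipe, which is why the real suggestion keeps `cat` running in a `while true` loop.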
We are having some planned outages tonight to bump up the CPU on the master to help process service checks. I'll see if we can bring these two agents down tonight as well to get some dedicated log space.
I'm on vacation starting Wednesday afternoon (Central US time) and will be returning after Labor Day.
I think I've got it:
Active: reloading (reload) since Do 2020-09-03 09:19:44 UTC; 1h 32min ago
@steaksauce- Please could you try #8211?
Working on deploying to one of the afflicted agents now. We also got the expanded logging space setup.
Day 1: the unpatched agent hung on reload. No issues out of the other agent yet. We want to let it run the rest of the week, since the problem only happened some of the time.
@Al2Klimov -- worth noting the debug log is only enabled to troubleshoot this issue. We do not normally use the debug log, and this issue occurs on afflicted nodes whether or not the debug log is enabled. I say this because I looked at #8211 and it seems like its logic is built around debug logging being enabled.
- turn off debug logging
- rotate the logs more often (/etc/logrotate.d/icinga2)
- make /var/log/icinga2/debug.log a pipe (mkfifo) and do e.g. while true; do cat /var/log/icinga2/debug.log; done
6 days in and no issues on the upgraded node.
Should I be using the version from nextcloud, or will this be introduced in the repos soon?
You should keep an eye on #8211.
Has been fixed with #8211
Describe the bug
Icinga 2 satellites hang on reload indefinitely -- the service appears to be running and FRESHNESS never alerts until the service is stopped.
Systemd shows this:
Aug 11 13:36:14 ica02m02n.redacted systemd[1]: icinga2.service reload operation timed out. Stopping.
To Reproduce
This appears to be a race condition, only affecting Icinga 2 agents (I have not seen it on the master yet). Eventually Icinga 2 will attempt to reload from Icinga Director and then stall. Because the systemd unit file waits 30min for timeout, we see the following in the logs:
This did not happen prior to updating to version 2.12.0-1 (the previous version we used was 2.11.3-1).
It seems like the logging mechanism stops before the actual crash. The logs are virtually empty prior to the crash (I currently have the debug log turned on to try to catch it next time around).
Expected behavior
I would expect that Icinga would reload without issue.
Your Environment
Include as many relevant details about the environment you experienced the problem in:
- Enabled features (icinga2 feature list):
- Icinga Web 2 version and modules (System - About): Icinga Web 2 Version 2.7.3, Git commit 06cabfe8ba28cf545a42c92f25484383191a4e51, PHP Version 7.1.33, Git commit date 2019-10-18
  Name | Version
  setup | 2.7.3
  audit | 1.0.1
  businessprocess | 2.2.0
  cube | 1.1.0
  director | 1.7.2
  doc | 2.7.3
  fileshipper | 1.1.0
  grafana | 1.3.6
  incubator | 0.5.0
  ipl | v0.4.0
  monitoring | 2.7.3
  pdfexport | 0.9.1
  reactbundle | 0.7.0
  reporting | 0.9.2
  x509 | 1.0.0
- Config validation (icinga2 daemon -C): please ignore all of my warnings about unused rules :)
Additional context
Maybe related to #8160, but this is a distributed environment that otherwise works fine, except hanging on the occasional config RELOAD.
Issue can be solved with manual intervention:
systemctl stop icinga2
systemctl start icinga2
More information to come if the debug log produces something useful.
A similar issue happened with 2.10 for users using Graphite, but we are not using Graphite.