Open araujorm opened 10 months ago
Happened again. Couldn't get any info. Any advice? Meanwhile, in case anyone else has the same issue and there's no feedback, we have added the following to the systemd unit override as a workaround, but it clearly seems to be a bug, since icinga2 shouldn't (partially) hang.
[Service]
WatchdogSec=90s
Status: "Shutting down old instance..."
This indicates that the process is trying to reload but it hangs for some reason. What does the regular log file say regarding reloads?
Also, if the reload is waiting for something, there's a good chance that one can see something in the stacks of the running threads, so you could do the following to dump that information so that we can have a look at it: https://icinga.com/docs/icinga-2/latest/doc/21-development/#gdb-backtrace-from-running-process
Hello.
Thanks for the reply.
We deactivated the systemd watchdog and after about 1.5 days the issue happened again.
As for the regular icinga2.log, everything seems normal around the time the Director deploy happened. Config was read, endpoints were reconnected, the API is responding, and so on. If you still prefer to check it, please tell us how to safely send the log file, because it contains a lot of customer names and information that we would prefer not to share publicly.
We took the gdb traces before restarting the service, and are attaching them here in a file icinga2-gdb-2023-11-22.zip
Could this be related to https://github.com/Icinga/icinga2/issues/9806 ?
Also, in case it's relevant,
systemctl status icinga2.service
showed:
Status: "Shutting down old instance..."
Is this logged in the icinga2.log in some way? I also believe this is not logged via journalctl -u icinga2 -x.
We cannot verify, as we have not seen this StatusText so far.
We are also kind of experiencing the same error, but we are not triggering anything at the moment: no deployment, no restart, nothing.
We also run RHEL 7 and icinga2 r2.14.0-1, but also had the errors on the previous r2.13.
It appears in the output of systemctl status icinga2.service, such as this:
● icinga2.service - Icinga host/service/network monitoring system
Loaded: loaded (/usr/lib/systemd/system/icinga2.service; enabled; vendor preset: disabled)
(...)
Main PID: 32036 (icinga2)
Status: "Shutting down old instance..."
(...)
We also do run RHEL 7 and icinga2 r2.14.0-1, but also had the errors in the previous r2.13
We've always had this too, but lately it started happening at least twice a week.
However, we can confirm that adding the systemd watchdog along with auto-restart works as a workaround, for example by running systemctl edit icinga2 and adding:
[Service]
Restart=always
RestartSec=10s
WatchdogSec=90s
Hi, I've had a look at the backtraces and couldn't find anything serious, except that you are using a pretty old (~9-13 years) version of jemalloc.
Findings:
The umbrella process is starting to spawn a new worker process and is waiting for the old worker (with pid currentWorker = 32072) to shut down in daemoncommand.cpp:278:
Thread 1 (Thread 0x7f5e3f1cc940 (LWP 32735)):
#0 0x00007f5e3b90f9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007f5e3b9402d4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2 0x0000000000bf0179 in RunWorker (stderrFile=..., closeConsoleLog=false, configs=std::vector of length 1, capacity 1 = {...}) at /usr/src/debug/icinga2-2.14.0/lib/cli/daemoncommand.cpp:278
#3 StartUnixWorker(std::vector<std::string, std::allocator<std::string> > const&, bool, icinga::String const&) () at /usr/src/debug/icinga2-2.14.0/lib/cli/daemoncommand.cpp:545
#4 0x0000000000bf0f92 in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::vector<std::string, std::allocator<std::string> > const&) const () at /opt/rh/devtoolset-11/root/usr/include/c++/11/ext/new_allocator.h:89
nextWorker = <optimized out>
configs = std::vector of length 1, capacity 1 = {"/etc/icinga2/icinga2.conf"}
...
currentWorker = 32072
requestedTermination = false
notifiedTermination = <optimized out>
#5 0x0000000000950928 in Main() () at /usr/include/boost169/boost/smart_ptr/intrusive_ptr.hpp:197
#6 0x000000000093e729 in main () at /usr/src/debug/icinga2-2.14.0/icinga-app/icinga.cpp:946
#7 0x00007f5e3b86c555 in __libc_start_main (main=0x93e620 <main>, argc=6, argv=0x7ffe4b5d5338, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe4b5d5328) at ../csu/libc-start.c:266
#8 0x000000000094e9cb in _start () at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/alloc_traits.h:851
[Inferior 1 (process 32735) detached]
The current worker also starts to shut down as requested and waits for the IdoMysqlConnection
work queue threads to finish normally.
Thread 1 (Thread 0x7f5e3f1cc940 (LWP 32072)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f5e3c1e1aec in __gthread_cond_wait (__mutex=<optimized out>, __cond=__cond@entry=0x7f5e13819368) at /usr/src/debug/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/x86_64-redhat-linux/bits/gthr-default.h:864
#2 std::condition_variable::wait (this=this@entry=0x7f5e13819368, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:52
#3 0x00000000009b4431 in icinga::WorkQueue::Join(bool) () at /usr/src/debug/icinga2-2.14.0/lib/base/workqueue.cpp:113
lock = {_M_device = 0x7f5e138192e0, _M_owns = true}
#4 0x0000000000e16096 in icinga::DbConnection::Pause() () at /usr/src/debug/icinga2-2.14.0/lib/db_ido/dbconnection.cpp:132
#5 0x0000000000e5eae7 in icinga::IdoMysqlConnection::Pause() () at /usr/src/debug/icinga2-2.14.0/lib/db_ido_mysql/idomysqlconnection.cpp:109
#6 0x00000000009762d5 in icinga::ConfigObject::SetAuthority(bool) () at /usr/src/debug/icinga2-2.14.0/lib/base/configobject-ti.cpp:944
#7 0x0000000000987626 in icinga::ConfigObject::Deactivate(bool, icinga::Value const&) () at /usr/src/debug/icinga2-2.14.0/lib/base/configobject.cpp:400
#8 0x00000000009b30a0 in icinga::ConfigObject::StopObjects() () at /usr/include/boost169/boost/smart_ptr/intrusive_ptr.hpp:197
#9 0x00000000009d3c4e in icinga::Application::RunEventLoop() () at /usr/src/debug/icinga2-2.14.0/lib/base/application.cpp:356
#10 0x0000000000cde938 in icinga::IcingaApplication::Main() () at /usr/src/debug/icinga2-2.14.0/lib/icinga/icingaapplication.cpp:112
But one of its worker threads gets stuck somewhere in jemalloc while it is freeing a task from the queue, and so blocks all other threads trying to acquire WorkQueue::m_Mutex.
Thread 6 (Thread 0x7f5e14072700 (LWP 32418)):
#0 0x00007f5e3eda24a0 in je_tcache_get (create=false) at src/jemalloc.c:1309
#1 je_arena_dalloc (try_tcache=true, ptr=0x7f5ddbe7e240, chunk=0x7f5ddbc00000, arena=0x7f5e3698eec0) at include/jemalloc/internal/arena.h:1040
pageind = 638
mapbits = 81
tcache = <optimized out>
#2 je_idalloct (try_tcache=true, ptr=0x7f5ddbe7e240) at include/jemalloc/internal/jemalloc_internal.h:898
chunk = 0x7f5ddbc00000
#3 je_iqalloct (try_tcache=true, ptr=0x7f5ddbe7e240) at include/jemalloc/internal/jemalloc_internal.h:917
#4 je_iqalloc (ptr=0x7f5ddbe7e240) at include/jemalloc/internal/jemalloc_internal.h:924
#5 ifree (ptr=0x7f5ddbe7e240) at src/jemalloc.c:1233
#6 free (ptr=0x7f5ddbe7e240) at src/jemalloc.c:1308
#7 0x0000000000e60ae8 in _M_destroy (__victim=...) at /usr/src/debug/icinga2-2.14.0/lib/db_ido_mysql/idomysqlconnection.cpp:1058
#8 _M_manager (__op=<optimized out>, __source=..., __dest=...) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/std_function.h:203
#9 std::_Function_handler<void (), icinga::IdoMysqlConnection::InternalExecuteQuery(icinga::DbQuery const&, int)::{lambda()#1}>::_M_manager(std::_Any_data&, std::_Function_handler<void (), icinga::IdoMysqlConnection::InternalExecuteQuery(icinga::DbQuery const&, int)::{lambda()#1}> const&, std::_Manager_operation) () at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/std_function.h:282
#10 0x00000000009a4f26 in ~_Function_base (this=0x7f5d9b806340, __in_chrg=<optimized out>) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/std_function.h:244
#11 ~function (this=0x7f5d9b806340, __in_chrg=<optimized out>) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/std_function.h:334
#12 ~Task (this=0x7f5d9b806340, __in_chrg=<optimized out>) at /usr/src/debug/icinga2-2.14.0/lib/base/workqueue.hpp:31
#13 destroy<icinga::Task> (__p=0x7f5d9b806340, this=<optimized out>) at /opt/rh/devtoolset-11/root/usr/include/c++/11/ext/new_allocator.h:168
#14 destroy<icinga::Task> (__p=0x7f5d9b806340, __a=...) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/alloc_traits.h:535
#15 pop_back (this=<optimized out>) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_deque.h:1561
#16 pop (this=<optimized out>) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_queue.h:679
#17 icinga::WorkQueue::WorkerThreadProc() () at /usr/src/debug/icinga2-2.14.0/lib/base/workqueue.cpp:273
lock = {_M_device = 0x7f5e138192e0, _M_owns = true}
#18 0x00007f5e3df33954 in boost::(anonymous namespace)::thread_proxy (param=<optimized out>) at libs/thread/src/pthread/thread.cpp:177
#19 0x00007f5e3dd0fea5 in start_thread (arg=0x7f5e14072700) at pthread_create.c:307
#20 0x00007f5e3b948b0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
I was only able to find include/jemalloc/internal/arena.h in jemalloc in version 3.6.0, released in 2015. So I can't really say what's going wrong there, but you may want to try upgrading your installed jemalloc and see if that helps.
Many thanks for your time @yhabteab.
We had in fact enabled running icinga2 with libjemalloc some time before the issue started to occur more prominently, using the instructions in https://github.com/Icinga/icinga2/blob/master/doc/15-troubleshooting.md#try-swapping-out-the-allocator . However, as you said, the version of that library on CentOS 7 is quite old, which could be the issue here. If that's the root cause, then for us it will be solved in a few months, because an upgrade to a newer OS version is already planned (CentOS 7 is scheduled to reach end of life next year).
For now we'll revert the change and run icinga2 with the standard libc memory allocator again. (We've also cleaned up some unneeded objects, such as old comments that were erroneously created automatically by an old script, which significantly sped up startup time, so this shouldn't be a problem for now.)
We'll keep the systemd watchdog active for the service, check the machine logs later to see whether the frequency of it being triggered changes, and provide feedback then.
Thanks again and happy holidays.
Good afternoon.
The watchdog was triggered once since the change was made about two weeks ago. Seems a bit better, but not yet conclusive.
We've now updated to the 2.14.1 today (we were on 2.14.0) and we'll wait some time before checking again (just in case it's related to something fixed in 2.14.1, such as https://github.com/Icinga/icinga2/pull/9947).
Good morning.
Issue still occurs once every 2 weeks (more or less), but as we have placed the systemd watchdog we don't really notice it anymore. We just know that because of syslog entries such as:
Feb 22 16:58:50 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!
Oddly, in March we have no such entries; however, in April:
Apr 2 11:03:09 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!
Apr 8 18:48:48 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!
Apr 9 09:32:34 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!
Apr 10 20:13:52 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!
Apr 12 10:56:39 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!
Do note that on Apr 9 we updated to version 2.14.2, but that doesn't seem to correlate.
As the machine is still running CentOS 7, and it should take some weeks until we have it on a more recent system, it may be a while before we can bring anything new to this matter. Is there any problem with leaving this open in the meanwhile?
Thanks in advance.
One thing I forgot to mention but have noticed: this seems to happen more often (on this Icinga master) when more satellites are unavailable for long periods of time. Just this month we have had 4 unavailable for longer than usual (due to several reasons that don't matter for this issue), and there could be a correlation there. Would it be of interest to collect any additional data on the master instance related to this? If so, please advise, and thanks in advance.
Issue still occurs once every 2 weeks (more or less), but as we have placed the systemd watchdog we don't really notice it anymore. We just know that because of syslog entries such as:
I'm interested in your systemd watchdog config, as last night one of our masters was OOM killed by the kernel.
Sure. Create an override with systemctl edit icinga2 as root; it should open a text editor with a temporary file where you can add the following:
[Service]
Restart=always
RestartSec=10s
WatchdogSec=90s
Save the file and quit the editor. This should create a file named /etc/systemd/system/icinga2.service.d/override.conf, and if you use systemctl edit it is applied immediately (if you created the file manually, you would need to issue a systemctl daemon-reload, which would also work). It should not be necessary to restart the service, but it won't hurt.
Creating an override this way is always recommended over editing the systemd files in /usr/lib/systemd, because files in /usr are overwritten on updates, whereas the override file lives in /etc.
To verify that it's indeed applied, you can issue systemctl show icinga2 and check that these options are active. Also, when you issue a systemctl status icinga2 command, you should see something like:
Drop-In: /etc/systemd/system/icinga2.service.d
└─override.conf
Some explanation of the options:
- Restart=always makes systemd auto-restart the service every time it stops (except when you explicitly issue a systemctl stop).
- RestartSec=10s is needed because if systemd tries to restart the service and it fails repeatedly in quick succession, systemd stops retrying; this ensures systemd waits 10 seconds before (re)trying. Together with Restart=always, this should prevent your master staying down in situations such as the OOM killer stopping it.
- WatchdogSec=90s is not for the OOM situation but works alongside the options above. It only works with daemons that support it (such as icinga2): the application must periodically issue a special call (sd_notify), and if systemd stops receiving those calls, it terminates the application. By itself that wouldn't be useful, but since we also have the Restart options above, systemd will then restart the service.
Describe the bug
Every few days we are getting this warning in the icingaweb2 interface. Only goes away by restarting icinga2 service.
To Reproduce
Currently we don't know what is triggering this, but we would like some advice on how to collect data. Leaving the debug log enabled is not an option, since it grows by 1 GB every 10 minutes (this instance has a few dozen satellites and indirectly monitors a few hundred hosts and a few thousand services).
This is a master instance with an Icinga Director and automations; sometimes deploys are made every 10 minutes, and that didn't use to be a problem (at least not frequently), but in the last couple of weeks it has happened every 3 or 4 days.
Expected behavior
Icinga2 should not hang.
Screenshots
Not applicable.
Your Environment
Include as many relevant details about the environment you experienced the problem in:
- Version used (icinga2 --version): r2.14.0-1
- Enabled features (icinga2 feature list): api checker command compatlog graphite ido-mysql livestatus mainlog notification
- Config validation (icinga2 daemon -C): This outputs potentially sensitive private information; if you think it is relevant, please tell us where/how we can safely send it.
- The zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes: This has private sensitive information; if you think it is relevant, please tell us where/how we can safely send it.
Additional context
icingaweb2 seems to rely on a query to the MySQL database like the following (this is the result the last time the issue occurred):
... which is curious, since the icinga2.log shows no errors and seems to show normal activity. The API is also responding at that same time: we performed a few API queries via HTTP on port 5665, such as /v1/status and /v1/objects/hosts, and they were successfully answered, indicating that icinga2 was not dead. But maybe only the MySQL IDO threads were somehow stopped? (Note that we validated that the DB still responded and was not hung.)
Also, in case it's relevant, systemctl status icinga2.service showed:
... until we restarted the service using systemctl.
As stated, leaving the debug.log active is not an option, and enabling it only when the error happens isn't effective, because when we restart/reload via systemd to apply a feature enable debuglog, it goes back to normal.
Could you please tell us which additional information we should try to collect when the issue happens again, to help debug this issue? Thanks in advance.