Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Monitoring backend 'icinga2' is not running #9905

Open araujorm opened 10 months ago

araujorm commented 10 months ago

Describe the bug

Every few days we are getting this warning in the icingaweb2 interface, and it only goes away after restarting the icinga2 service.

To Reproduce

Currently we don't know what is triggering this, but we would like some advice on how to collect data. Leaving the debug log enabled permanently is not an option since it grows by about 1 GB every 10 minutes (this instance has a few dozen satellites and indirectly monitors a few hundred hosts and a few thousand services).

This is a master instance with an Icinga Director and automations; sometimes deploys are made every 10 minutes. That didn't use to be a problem (at least not frequently), but in the last couple of weeks it has happened every 3 or 4 days.

Expected behavior

Icinga 2 should not hang.

Screenshots

Not applicable.

Your Environment

Include as many relevant details about the environment you experienced the problem in

Additional context

icingaweb2 seems to rely on a query to the MySQL database like the following (this is the result from the last time the issue occurred):

MariaDB [icinga]> select is_currently_running from icinga_programstatus;
+----------------------+
| is_currently_running |
+----------------------+
|                    0 |
+----------------------+
1 row in set (0.000 sec)

... which is curious, since icinga2.log shows no errors and seems to show normal activity. The API was also responding at that same time: we performed a few API queries via HTTPS on port 5665, such as /v1/status and /v1/objects/hosts, and they were answered successfully, indicating that icinga2 was not dead. Maybe only the MySQL IDO threads were stopped somehow? (Note that we verified that the DB itself still responded and was not hung.)
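
For reference, the kind of check we ran looked roughly like this (a sketch; the credentials are placeholders for an ApiUser, which on a default setup is defined in /etc/icinga2/conf.d/api-users.conf):

# Query the local REST API for its status and a host list; if these answer,
# the main process is still alive even though the IDO flag says otherwise.
curl -k -s -u apiuser:apipass 'https://localhost:5665/v1/status' | head -c 300; echo
curl -k -s -u apiuser:apipass 'https://localhost:5665/v1/objects/hosts?attrs=name' | head -c 300; echo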

Also, in case it's relevant, systemctl status icinga2.service showed:

Status: "Shutting down old instance..."

... until we restarted the service using systemctl.

As stated, leaving debug.log active permanently is not an option, and enabling it only when the error happens isn't effective either, because the restart/reload via systemd that is needed to apply the enabled debuglog feature brings everything back to normal.

Could you please tell us which additional information we should try to collect when the issue happens again, to help debug this issue? Thanks in advance.

araujorm commented 10 months ago

Happened again, and we couldn't get any info. Any advice? Meanwhile, in case anyone else has the same issue and there's no feedback: we have put the following in a systemd unit override as a workaround, but this still clearly looks like a bug, since Icinga shouldn't hang, even partially.

[Service]
WatchdogSec=90s

julianbrost commented 10 months ago

Status: "Shutting down old instance..."

This indicates that the process is trying to reload but it hangs for some reason. What does the regular log file say regarding reloads?

Also, if the reload is waiting for something, there's a good chance that one can see something in the stacks of the running threads, so you could do the following to dump that information so that we can have a look at it: https://icinga.com/docs/icinga-2/latest/doc/21-development/#gdb-backtrace-from-running-process
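
If it helps, a non-interactive way to capture that could look roughly like this (a sketch along the lines of the linked guide; the output file name is just an example, and pidof usually returns both the umbrella and the worker process):

# Requires gdb plus the icinga2 debuginfo packages (see the linked docs).
# Dumps the stacks of all threads of every running icinga2 process into one file.
for pid in $(pidof icinga2); do
    echo "=== PID $pid ===" >> icinga2-backtrace.txt
    gdb -p "$pid" -batch -ex 'thread apply all bt full' >> icinga2-backtrace.txt 2>&1
done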

araujorm commented 10 months ago

Hello.

Thanks for the reply.

We deactivated the systemd watchdog and after about 1.5 days the issue happened again.

As for the regular icinga2.log, everything looked normal when the Director deploy happened: config was read, endpoints were reconnected, the API was responding, and so on. If you'd still prefer to check it, please tell us how to safely send you the log file, because it contains a lot of customer names and other information that we would prefer not to share in public.

We took the gdb traces before restarting the service and are attaching them here in the file icinga2-gdb-2023-11-22.zip.

Could this be related to https://github.com/Icinga/icinga2/issues/9806 ?

Tqnsls commented 9 months ago

Also, in case it's relevant, systemctl status icinga2.service showed:


Status: "Shutting down old instance..."

Is this logged in icinga2.log in some way? I also believe it is not logged via journalctl -u icinga2 -x. We cannot verify this, as we have not seen this status text so far. We are also more or less experiencing the same error, but at the moment we don't trigger anything: no deployment, no restart, nothing.

We also run RHEL 7 and icinga2 r2.14.0-1, but we also had the errors on the previous r2.13.

araujorm commented 9 months ago

Also, in case it's relevant, systemctl status icinga2.service showed:

Status: "Shutting down old instance..."

Is this logged in icinga2.log in some way? I also believe it is not logged via journalctl -u icinga2 -x. We cannot verify this, as we have not seen this status text so far. We are also more or less experiencing the same error, but at the moment we don't trigger anything: no deployment, no restart, nothing.

It appears in the output of systemctl status icinga2.service, like this:

● icinga2.service - Icinga host/service/network monitoring system
   Loaded: loaded (/usr/lib/systemd/system/icinga2.service; enabled; vendor preset: disabled)
(...)
 Main PID: 32036 (icinga2)
   Status: "Shutting down old instance..."
(...)

We also run RHEL 7 and icinga2 r2.14.0-1, but we also had the errors on the previous r2.13.

We've always had this too, but lately it started happening at least twice a week.

However, we can confirm that adding the systemd watchdog along with an auto-restart works as a workaround, for example by running systemctl edit icinga2 and adding:

[Service]
Restart=always
RestartSec=10s
WatchdogSec=90s

yhabteab commented 9 months ago

We took the gdb traces before restarting the service, and are attaching them here in a file icinga2-gdb-2023-11-22.zip

Hi, I've had a look at the backtraces and couldn't find anything serious except that you are using a pretty old (~9-13 years) version of jemalloc.

Findings: the umbrella process is about to spawn a new worker process and is waiting for the old worker (pid currentWorker = 32072) to shut down, in daemoncommand.cpp:278:

Thread 1 (Thread 0x7f5e3f1cc940 (LWP 32735)):
#0  0x00007f5e3b90f9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f5e3b9402d4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2  0x0000000000bf0179 in RunWorker (stderrFile=..., closeConsoleLog=false, configs=std::vector of length 1, capacity 1 = {...}) at /usr/src/debug/icinga2-2.14.0/lib/cli/daemoncommand.cpp:278
#3  StartUnixWorker(std::vector<std::string, std::allocator<std::string> > const&, bool, icinga::String const&) () at /usr/src/debug/icinga2-2.14.0/lib/cli/daemoncommand.cpp:545
#4  0x0000000000bf0f92 in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::vector<std::string, std::allocator<std::string> > const&) const () at /opt/rh/devtoolset-11/root/usr/include/c++/11/ext/new_allocator.h:89
        nextWorker = <optimized out>
        configs = std::vector of length 1, capacity 1 = {"/etc/icinga2/icinga2.conf"}
        ...
        currentWorker = 32072
        requestedTermination = false
        notifiedTermination = <optimized out>
#5  0x0000000000950928 in Main() () at /usr/include/boost169/boost/smart_ptr/intrusive_ptr.hpp:197
#6  0x000000000093e729 in main () at /usr/src/debug/icinga2-2.14.0/icinga-app/icinga.cpp:946
#7  0x00007f5e3b86c555 in __libc_start_main (main=0x93e620 <main>, argc=6, argv=0x7ffe4b5d5338, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe4b5d5328) at ../csu/libc-start.c:266
#8  0x000000000094e9cb in _start () at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/alloc_traits.h:851
[Inferior 1 (process 32735) detached]

The current worker also starts to shut down as requested and waits for the IdoMysqlConnection work queue threads to finish normally.

Thread 1 (Thread 0x7f5e3f1cc940 (LWP 32072)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f5e3c1e1aec in __gthread_cond_wait (__mutex=<optimized out>, __cond=__cond@entry=0x7f5e13819368) at /usr/src/debug/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/x86_64-redhat-linux/bits/gthr-default.h:864
#2  std::condition_variable::wait (this=this@entry=0x7f5e13819368, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:52
#3  0x00000000009b4431 in icinga::WorkQueue::Join(bool) () at /usr/src/debug/icinga2-2.14.0/lib/base/workqueue.cpp:113
        lock = {_M_device = 0x7f5e138192e0, _M_owns = true}
#4  0x0000000000e16096 in icinga::DbConnection::Pause() () at /usr/src/debug/icinga2-2.14.0/lib/db_ido/dbconnection.cpp:132
#5  0x0000000000e5eae7 in icinga::IdoMysqlConnection::Pause() () at /usr/src/debug/icinga2-2.14.0/lib/db_ido_mysql/idomysqlconnection.cpp:109
#6  0x00000000009762d5 in icinga::ConfigObject::SetAuthority(bool) () at /usr/src/debug/icinga2-2.14.0/lib/base/configobject-ti.cpp:944
#7  0x0000000000987626 in icinga::ConfigObject::Deactivate(bool, icinga::Value const&) () at /usr/src/debug/icinga2-2.14.0/lib/base/configobject.cpp:400
#8  0x00000000009b30a0 in icinga::ConfigObject::StopObjects() () at /usr/include/boost169/boost/smart_ptr/intrusive_ptr.hpp:197
#9  0x00000000009d3c4e in icinga::Application::RunEventLoop() () at /usr/src/debug/icinga2-2.14.0/lib/base/application.cpp:356
#10 0x0000000000cde938 in icinga::IcingaApplication::Main() () at /usr/src/debug/icinga2-2.14.0/lib/icinga/icingaapplication.cpp:112

But one of its worker threads gets stuck somewhere inside jemalloc while freeing a task from the queue, and thereby blocks all other threads trying to acquire WorkQueue::m_Mutex.

Thread 6 (Thread 0x7f5e14072700 (LWP 32418)):
#0  0x00007f5e3eda24a0 in je_tcache_get (create=false) at src/jemalloc.c:1309
#1  je_arena_dalloc (try_tcache=true, ptr=0x7f5ddbe7e240, chunk=0x7f5ddbc00000, arena=0x7f5e3698eec0) at include/jemalloc/internal/arena.h:1040
        pageind = 638
        mapbits = 81
        tcache = <optimized out>
#2  je_idalloct (try_tcache=true, ptr=0x7f5ddbe7e240) at include/jemalloc/internal/jemalloc_internal.h:898
        chunk = 0x7f5ddbc00000
#3  je_iqalloct (try_tcache=true, ptr=0x7f5ddbe7e240) at include/jemalloc/internal/jemalloc_internal.h:917
#4  je_iqalloc (ptr=0x7f5ddbe7e240) at include/jemalloc/internal/jemalloc_internal.h:924
#5  ifree (ptr=0x7f5ddbe7e240) at src/jemalloc.c:1233
#6  free (ptr=0x7f5ddbe7e240) at src/jemalloc.c:1308
#7  0x0000000000e60ae8 in _M_destroy (__victim=...) at /usr/src/debug/icinga2-2.14.0/lib/db_ido_mysql/idomysqlconnection.cpp:1058
#8  _M_manager (__op=<optimized out>, __source=..., __dest=...) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/std_function.h:203
#9  std::_Function_handler<void (), icinga::IdoMysqlConnection::InternalExecuteQuery(icinga::DbQuery const&, int)::{lambda()#1}>::_M_manager(std::_Any_data&, std::_Function_handler<void (), icinga::IdoMysqlConnection::InternalExecuteQuery(icinga::DbQuery const&, int)::{lambda()#1}> const&, std::_Manager_operation) () at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/std_function.h:282
#10 0x00000000009a4f26 in ~_Function_base (this=0x7f5d9b806340, __in_chrg=<optimized out>) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/std_function.h:244
#11 ~function (this=0x7f5d9b806340, __in_chrg=<optimized out>) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/std_function.h:334
#12 ~Task (this=0x7f5d9b806340, __in_chrg=<optimized out>) at /usr/src/debug/icinga2-2.14.0/lib/base/workqueue.hpp:31
#13 destroy<icinga::Task> (__p=0x7f5d9b806340, this=<optimized out>) at /opt/rh/devtoolset-11/root/usr/include/c++/11/ext/new_allocator.h:168
#14 destroy<icinga::Task> (__p=0x7f5d9b806340, __a=...) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/alloc_traits.h:535
#15 pop_back (this=<optimized out>) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_deque.h:1561
#16 pop (this=<optimized out>) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_queue.h:679
#17 icinga::WorkQueue::WorkerThreadProc() () at /usr/src/debug/icinga2-2.14.0/lib/base/workqueue.cpp:273
        lock = {_M_device = 0x7f5e138192e0, _M_owns = true}
#18 0x00007f5e3df33954 in boost::(anonymous namespace)::thread_proxy (param=<optimized out>) at libs/thread/src/pthread/thread.cpp:177
#19 0x00007f5e3dd0fea5 in start_thread (arg=0x7f5e14072700) at pthread_create.c:307
#20 0x00007f5e3b948b0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I was only able to find include/jemalloc/internal/arena.h in jemalloc in version 3.6.0 released in 2015. So I can't really say what's going wrong there, but you may want to try upgrading your installed jemalloc and see if that helps.
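
If you want to double-check which jemalloc build the running processes actually have loaded, something like this should do (library paths and the package name may differ on your system):

# Show the jemalloc shared objects mapped into each running icinga2 process ...
for pid in $(pidof icinga2); do
    grep jemalloc /proc/$pid/maps | awk '{print $6}' | sort -u
done
# ... and the installed package version on RHEL/CentOS:
rpm -q jemalloc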

araujorm commented 9 months ago

I was only able to find include/jemalloc/internal/arena.h in jemalloc in version 3.6.0 released in 2015. So I can't really say what's going wrong there, but you may want to try upgrading your installed jemalloc and see if that helps.

Many thanks for your time @yhabteab.

We had in fact enabled running icinga2 with libjemalloc some time before the issue started to occur more frequently, using the instructions in https://github.com/Icinga/icinga2/blob/master/doc/15-troubleshooting.md#try-swapping-out-the-allocator . However, as you said, the version of that library is quite old on CentOS 7, which could be the issue here. If that's the root cause, then for us it will be solved in a few months, because an upgrade to a newer OS version is already planned (CentOS 7 reaches end of life next year).

For now we'll revert that change and run icinga2 with the standard libc memory allocator again. (We've also cleaned up some unneeded objects, such as old comments that an old script had erroneously created, which sped up the startup time significantly, so that shouldn't be a problem for now.)
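
For reference, reverting boils down to removing the preload again and restarting (a sketch; the exact environment file and library path are assumptions for our CentOS 7 setup and depend on how the preload was configured in the first place):

# LD_PRELOAD was set in the service environment, e.g. via /etc/sysconfig/icinga2:
#   LD_PRELOAD=/usr/lib64/libjemalloc.so.1
# Remove or comment out that line, then:
systemctl restart icinga2
# Confirm jemalloc is no longer mapped into the new processes:
for pid in $(pidof icinga2); do grep -c jemalloc /proc/$pid/maps; done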

We'll still keep the systemd watchdog active for the service, check the machine logs later to see whether the frequency of watchdog triggers changes, and then provide feedback.

Thanks again and happy holidays

araujorm commented 8 months ago

Good afternoon.

The watchdog was triggered once since the change was made about 2 weeks ago. It seems to be a bit better, but it's not conclusive yet.

We've now updated to 2.14.1 today (we were on 2.14.0) and we'll wait some time before checking again (just in case this is related to something fixed in 2.14.1, such as https://github.com/Icinga/icinga2/pull/9947).

araujorm commented 5 months ago

Good morning.

The issue still occurs roughly once every 2 weeks, but since we have the systemd watchdog in place we don't really notice it anymore. We only know about it because of syslog entries such as:

Feb 22 16:58:50 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!

Oddly, in March we have no such entries; in April, however:

Apr  2 11:03:09 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!
Apr  8 18:48:48 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!
Apr  9 09:32:34 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!
Apr 10 20:13:52 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!
Apr 12 10:56:39 *** systemd: icinga2.service watchdog timeout (limit 1min 30s)!

Do note that on Apr 9 we updated to version 2.14.2, but that doesn't seem to correlate.
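
(In case it's useful to anyone else: we spot these with a simple grep over the syslog; /var/log/messages is the default location on CentOS 7.)

# Count watchdog timeouts per month, including rotated syslog files.
grep -h 'icinga2.service watchdog timeout' /var/log/messages* | awk '{print $1}' | sort | uniq -c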

As the machine is still running CentOS 7, and it should take some weeks until we have it on a more recent system, it may take a while before we can bring anything new to this matter. Is there any problem with leaving this open in the meantime?

Thanks in advance.

araujorm commented 5 months ago

One thing I forgot to mention but have noticed: this seems to happen more often (on this Icinga master) when we have more satellites that are unavailable for long periods of time. Just this month we have had 4 unavailable for longer than usual (for several reasons that don't matter for this issue), so there could be a correlation there. Would it be of interest to collect any additional data on the master instance related to this? If so, please advise. Thanks in advance.

slalomsk8er commented 5 months ago

The issue still occurs roughly once every 2 weeks, but since we have the systemd watchdog in place we don't really notice it anymore. We only know about it because of syslog entries such as:

I'm interested in your systemd watchdog config, as last night one of our masters was OOM-killed by the kernel.

araujorm commented 5 months ago

The issue still occurs roughly once every 2 weeks, but since we have the systemd watchdog in place we don't really notice it anymore. We only know about it because of syslog entries such as:

I'm interested in your systemd watchdog config, as last night one of our masters was OOM-killed by the kernel.

Sure. Create an override with systemctl edit icinga2 as root; it should open a text editor with a temporary file where you can add the following:

[Service]
Restart=always
RestartSec=10s
WatchdogSec=90s

Save the file and quit the editor. This should create a file named /etc/systemd/system/icinga2.service.d/override.conf, and if you use systemctl edit the change is applied immediately (if you create the file manually instead, you need to run a systemctl daemon-reload, which also works). Restarting the service should not be necessary, but it won't hurt.

Creating an override this way is always recommended over editing systemd files under /usr/lib/systemd, because files under /usr are overwritten on package updates, whereas the override file lives in /etc.

To verify that it's indeed applied, you can run systemctl show icinga2 and check that these options are active. Also, when you run systemctl status icinga2, you should see something like:

  Drop-In: /etc/systemd/system/icinga2.service.d
           └─override.conf
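
For example, a quick way to check the effective values (systemd reports the time-based settings in its own units, so the output should look roughly like this):

systemctl show icinga2 | grep -E '^(Restart|RestartUSec|WatchdogUSec)='
# Restart=always
# RestartUSec=10s
# WatchdogUSec=1min 30s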

Some explanation about the options:

Restart=always - systemd restarts the service whenever it stops, for whatever reason.
RestartSec=10s - systemd waits 10 seconds before restarting it.
WatchdogSec=90s - icinga2 has to notify the systemd watchdog at least every 90 seconds; if it doesn't (for example because it's hung), systemd kills the service and, combined with Restart=always, starts it again.

Edit: typo