Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 579 forks source link

Icinga2 Master Endpoint heavy load CPU usage #7040

Closed redbob365 closed 5 years ago

redbob365 commented 5 years ago

Hi,

My icinga2 Master Endpoint is running into a XCP-ng Host. It's suitably running, except by heavy load of CPU Usage. It's impacting all Host system. We don't have any satellite attached to it.

Look at these graphs from XCP server:

xentop - 13:17:34   Xen 4.7.6-6.3.1.xcp
5 domains: 2 running, 2 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 25165188k total, 19480216k used, 5684972k free    CPUs: 4 @ 2660MHz
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR  VBD_RSECT  VBD_WSECT SSID
Copy of Sr --b---       6909    1.2    2097144    8.3    2098176       8.3     2    0        0        0    1        0   174045   162878    1939028    1738136    0
  Domain-0 -----r      71396   36.4    4194304   16.7    4194304      16.7     4    0        0        0    0        0        0        0          0          0    0
  pacupeva ------      10446   23.3    4194296   16.7    4234240      16.8     4    0        0        0    2        -        -        -          -          -    0
srvicinga- -----r     521610  338.4    4195328   16.7    4195328      16.7     4    0        0        0    2        0   242696 15814316    3367882  298743224    0
srvwiki-mt --b---      11132    0.7    4194296   16.7    4195328      16.7     4    0        0        0    2        0   158230    95202    1725828    3475352    0

That's top from the Master server:

top - 13:39:59 up 2 days,  2:07,  3 users,  load average: 11.95, 10.56, 9.73
Tasks: 190 total,   1 running, 186 sleeping,   0 stopped,   3 zombie
%Cpu(s): 38.2 us, 27.5 sy,  4.1 ni,  0.8 id,  0.0 wa,  0.0 hi,  1.0 si, 28.4 st
KiB Mem :  4025140 total,   878000 free,  1227548 used,  1919592 buff/cache
KiB Swap:  4191228 total,  4191228 free,        0 used.  2649752 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
130672 nagios    20   0 2895268  59900  16740 S 145.5  1.5 447:02.86 icinga2
  1660 _graphi+  20   0  157108  25700   4744 S  26.5  0.6 643:48.61 carbon-cache
123656 www-data  20   0  460260  25032  13572 S  14.2  0.6   0:02.40 apache2
  1373 influxdb  20   0 1503932 363636  26276 S  12.9  9.0 415:36.44 influxd
130693 nagios    20   0  449348   6488   5028 S  11.3  0.2 482:00.48 icinga2
123596 www-data  20   0  460292  25024  13572 S   9.0  0.6   0:02.75 apache2
  1421 mysql     20   0 2357200 238500  19020 S   8.4  5.9 274:07.44 mysqld
118806 www-data  20   0  460804  32588  20776 S   7.4  0.8   0:10.86 apache2
124015 www-data  20   0  460408  25120  13572 S   7.4  0.6   0:01.96 apache2
123624 www-data  20   0  460408  25128  13572 S   7.1  0.6   0:02.36 apache2
123169 www-data  20   0  462736  34668  20808 S   6.1  0.9   0:02.84 apache2
124029 www-data  20   0  460252  25000  13572 S   5.5  0.6   0:01.57 apache2
124350 nagios    25   5    9488    824    748 S   5.2  0.0   0:00.16 check_icmp
123116 www-data  20   0  460680  32552  20712 S   4.2  0.8   0:03.22 apache2
123704 www-data  20   0  460408  25120  13572 S   2.9  0.6   0:01.99 apache2
123692 www-data  20   0  460268  25036  13572 S   2.6  0.6   0:01.48 apache2
124028 root      20   0   41956   3588   2876 R   2.3  0.1   0:03.55 top
122754 www-data  20   0  460736  31648  19816 S   1.9  0.8   0:04.12 apache2
  1964 root      20   0 1998952 173140   8216 S   1.6  4.3 169:57.22 ruby2.3
122256 www-data  20   0  462852  33692  19848 S   1.6  0.8   0:04.69 apache2
   319 root       0 -20       0      0      0 S   1.3  0.0   5:54.12 kworker/3:1H
   628 root      20   0       0      0      0 S   1.3  0.0  19:18.75 jbd2/xvdb1-8
  1400 root      10 -10    5720   3508   2424 S   1.3  0.1   7:15.91 iscsid
123607 www-data  20   0  460268  25052  13572 S   1.0  0.6   0:02.62 apache2
     7 root      20   0       0      0      0 S   0.3  0.0  18:38.11 rcu_sched
    23 root      20   0       0      0      0 S   0.3  0.0   4:31.63 ksoftirqd/3
   326 root       0 -20       0      0      0 S   0.3  0.0   2:59.24 kworker/2:1H
   369 root      20   0   32200   3144   2800 S   0.3  0.1   2:43.67 systemd-journal
 31001 _graphi+  20   0  615708  50072  13684 S   0.3  1.2   0:32.63 apache2
114116 usrici    20   0   92832   4424   3468 S   0.3  0.1   0:01.93 sshd
     1 root      20   0   38152   6220   4028 S   0.0  0.2   5:18.67 systemd
     2 root      20   0       0      0      0 S   0.0  0.0   0:01.48 kthreadd
     3 root      20   0       0      0      0 S   0.0  0.0   2:08.38 ksoftirqd/0
     5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
     9 root      rt   0       0      0      0 S   0.0  0.0   0:25.91 migration/0
    10 root      rt   0       0      0      0 S   0.0  0.0   0:05.97 watchdog/0
    11 root      rt   0       0      0      0 S   0.0  0.0   0:06.08 watchdog/1
    12 root      rt   0       0      0      0 S   0.0  0.0   0:26.38 migration/1
    13 root      20   0       0      0      0 S   0.0  0.0   4:31.86 ksoftirqd/1
    15 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:0H
    16 root      rt   0       0      0      0 S   0.0  0.0   0:06.52 watchdog/2
    17 root      rt   0       0      0      0 S   0.0  0.0   0:28.31 migration/2
    18 root      20   0       0      0      0 S   0.0  0.0   3:08.50 ksoftirqd/2

Is there any solution to it?

My Environment

Module Version
director master
grafana master
graphite 0.0.0
monitoring 2.6.2
nagvis 1.1.1

[2019-03-20 13:28:54 -0400] information/cli: Icinga application loader (version: r2.10.3-1) [2019-03-20 13:28:54 -0400] information/cli: Loading configuration file(s). [2019-03-20 13:28:55 -0400] information/ConfigItem: Committing config item(s). [2019-03-20 13:28:55 -0400] information/ApiListener: My API identity: srvici-mt.mt.trf1.gov.br [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'satellite-host' (in /etc/icinga2/conf.d/satellite.conf: 29:1-29:41) for type 'Dependency' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'mail-icingaadmin' (in /etc/icinga2/conf.d/notifications.conf: 11:1-11:45) for type 'Notification' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'mail-icingaadmin' (in /etc/icinga2/conf.d/notifications.conf: 23:1-23:48) for type 'Notification' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'backup-downtime' (in /etc/icinga2/conf.d/downtimes.conf: 5:1-5:52) for type 'ScheduledDowntime' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'apt' (in /etc/icinga2/conf.d/apt.conf: 1:0-1:18) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'ping6' (in /etc/icinga2/conf.d/services.conf: 35:1-35:21) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'ssh' (in /etc/icinga2/conf.d/services.conf: 48:1-48:19) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule '' (in /etc/icinga2/conf.d/services.conf: 58:1-58:65) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'disk:' (in /etc/icinga2/conf.d/services.conf: 66:1-66:80) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'icinga' (in /etc/icinga2/conf.d/services.conf: 75:1-75:22) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'load' (in /etc/icinga2/conf.d/services.conf: 81:1-81:20) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'procs' (in /etc/icinga2/conf.d/services.conf: 92:1-92:21) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'swap' (in /etc/icinga2/conf.d/services.conf: 100:1-100:20) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'users' (in /etc/icinga2/conf.d/services.conf: 108:1-108:21) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'nrpe_ss' (in /etc/icinga2/conf.d/services.conf: 116:1-116:23) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] warning/ApplyRule: Apply rule 'disk' (in /etc/icinga2/conf.d/services.conf: 125:1-125:20) for type 'Service' does not match anywhere! [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 100 Services. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 InfluxdbWriter. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 LivestatusListener. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 IcingaApplication. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 178 Hosts. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 FileLogger. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 2 NotificationCommands. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 NotificationComponent. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 13 HostGroups. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 ApiListener. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 GraphiteWriter. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 PerfdataWriter. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 CheckerComponent. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 3 Zones. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 StatusDataWriter. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 ExternalCommandListener. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 Endpoint. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 4 ApiUsers. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 CompatLogger. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 2 Users. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 IdoMysqlConnection. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 222 CheckCommands. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 UserGroup. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 1 ServiceGroup. [2019-03-20 13:28:58 -0400] information/ConfigItem: Instantiated 3 TimePeriods. [2019-03-20 13:28:58 -0400] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars' [2019-03-20 13:28:58 -0400] information/cli: Finished validating the configuration file(s).


I must tell you that, besides these settings, this server has enabled `dashing-icinga2` too, and there are two clients browsing Monitor Maps (nagvis) from it.
dnsmichi commented 5 years ago

Please generate a full backtrace of the running process described here: https://icinga.com/docs/icinga2/latest/doc/21-development/#gdb-backtrace-from-running-process

redbob365 commented 5 years ago

OK, @dnsmichi, Here are logs: https://www.dropbox.com/s/1ogqgluwj567v5q/gdb_bt_72870_1553709540.log?dl=0 https://www.dropbox.com/s/u1um8th1gfqzfdf/gdb_bt_72825_1553709548.log?dl=0

dnsmichi commented 5 years ago

Please attach them as a zip file next time, Dropbox with all the popups is barely usable.

gdb_icinga2_7040.zip

dnsmichi commented 5 years ago

I can see that there are a lot of threads executing checks, and their processes run like forever (up until the timeout kills them).

Thread 55 (Thread 0x7fec9f9e7700 (LWP 82963)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
No locals.
#1  0x00007fed0608edbd in __GI___pthread_mutex_lock (mutex=0xea3300) at ../nptl/pthread_mutex_lock.c:80
        __PRETTY_FUNCTION__ = "__pthread_mutex_lock"
        type = 0
        id = <optimized out>
#2  0x0000000000602e18 in ?? ()
No symbol table info available.
#3  0x00000000005fcbbd in icinga::Process::Run(std::function<void (icinga::ProcessResult const&)> const&) ()
No symbol table info available.
#4  0x00000000006cfb43 in icinga::PluginUtility::ExecuteCommand(boost::intrusive_ptr<icinga::Command> const&, boost::intrusive_ptr<icinga::Checkable> const&, boost::intrusive_ptr<icinga::CheckResult> const&, std::vector<std::pair<icinga::String, boost::intrusive_ptr<icinga::Object> >, std::allocator<std::pair<icinga::String, boost::intrusive_ptr<icinga::Object> > > > const&, boost::intrusive_ptr<icinga::Dictionary> const&, bool, int, std::function<void (icinga::Value const&, icinga::ProcessResult const&)> const&) ()
No symbol table info available.
#5  0x00000000007d69f6 in icinga::PluginCheckTask::ScriptFunc(boost::intrusive_ptr<icinga::Checkable> const&, boost::intrusive_ptr<icinga::CheckResult> const&, boost::intrusive_ptr<icinga::Dictionary> const&, bool) ()
No symbol table info available.
#6  0x00000000007d5a03 in std::_Function_handler<icinga::Value (std::vector<icinga::Value, std::allocator<icinga::Value> > const&), std::enable_if<std::is_function<std::remove_pointer<void (*)(boost::intrusive_ptr<icinga::Checkable> const&, boost::intrusive_ptr<icinga::CheckResult> const&, boost::intrusive_ptr<icinga::Dictionary> const&, bool)>::type>::value&&(!std::is_same<void (*)(boost::intrusive_ptr<icinga::Checkable> const&, boost::intrusive_ptr<icinga::CheckResult> const&, boost::intrusive_ptr<icinga::Dictionary> const&, bool), icinga::Value (*)(std::vector<icinga::Value, std::allocator<icinga::Value> > const&)>::value), std::function<icinga::Value (std::vector<icinga::Value, std::allocator<icinga::Value> > const&)> >::type icinga::WrapFunction<void (*)(boost::intrusive_ptr<icinga::Checkable> const&, boost::intrusive_ptr<icinga::CheckResult> const&, boost::intrusive_ptr<icinga::Dictionary> const&, bool)>(void (*)(boost::intrusive_ptr<icinga::Checkable> const&, boost::intrusive_ptr<icinga::CheckResult> const&, boost::intrusive_ptr<icinga::Dictionary> const&, bool))::{lambda(std::vector<icinga::Value, std::allocator<icinga::Value> > const&)#1}>::_M_invoke(std::_Any_data const&, std::vector<icinga::Value, std::allocator<icinga::Value> > const&) ()
No symbol table info available.
#7  0x00000000005d81cf in icinga::Function::Invoke(std::vector<icinga::Value, std::allocator<icinga::Value> > const&) ()
No symbol table info available.
#8  0x00000000006e8dbd in icinga::CheckCommand::Execute(boost::intrusive_ptr<icinga::Checkable> const&, boost::intrusive_ptr<icinga::CheckResult> const&, boost::intrusive_ptr<icinga::Dictionary> const&, bool) ()
No symbol table info available.
#9  0x00000000006fd5e5 in icinga::Checkable::ExecuteCheck() ()
No symbol table info available.
#10 0x00000000007cb55b in icinga::CheckerComponent::ExecuteCheckHelper(boost::intrusive_ptr<icinga::Checkable> const&) ()
No symbol table info available.
#11 0x000000000062f2f6 in icinga::ThreadPool::WorkerThread::ThreadProc(icinga::ThreadPool::Queue&) ()
No symbol table info available.
#12 0x00007fed06a3d5d5 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.58.0
No symbol table info available.
#13 0x00007fed0608c6ba in start_thread (arg=0x7fec9f9e7700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fec9f9e7700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140654266971904, -3246129604114312920, 0, 140656005215279, 262144, 140654676105616, 3254503214567067944, 3254240247518138664}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#14 0x00007fed06f5d41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 21 til 61 are hanging check execution threads. Investigate on why these checks are hanging that long.

redbob365 commented 5 years ago

@dnsmichi, I found several services without defined timeout (blank values), maybe they were causing endless running processes. After that, CPU usage dropped to half load. I tried by all manners to upload a file in github, but I have problems! Look at these: gdb_bt_1670_1553874512.log and gdb_bt_1730_1553874504. These are graphics before and after the fixes. I must say that I disabled grafana and graphite. I'll try to re-enable them to see if it would impact over it.

dnsmichi commented 5 years ago

File upload to GitHub requires a certain format, e.g. a zip file containing all the logs. Since there was no feedback anymore, I'll consider this being resolved with the changed plugin timeout.