Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 578 forks source link

Configuration validation w/ ScheduledDowntimes performance decreased in 2.9 #6486

Closed marcofl closed 6 years ago

marcofl commented 6 years ago

We've a bigger setup with around 800 cluster clients (including satellites) 850 hosts and >52k services. Before icinga 2.9 a restart (or reload, no difference) on the master took around 2 minutes, which I think is already slow. The time is measured from the restart command via systemctl until clients are able to connect again to the master.

Now after 2.9 this takes much longer, up to 5minutes until clients can connect again to the master via the API. This is especially a problem, as we use Puppet for configuration of all objects and we have reloads every 30min, so basically our icinga2 master is now down for 10minutes per hour.

I can only imagine that the way how Downtimes are evaluated in 2.9 during the restart to prevent false notifications from being sent out are the reason for this increase in startup time.

Also a simsple icinga2 daemon -C takes over 2 minutes now (see example output below)

Expected Behavior

Icinga should restart within 1-2minutes or even faster.

Current Behavior

With 2.8 in our setup, 1-2minutes With 2.9.1 in our setup, 4-5minutes

Steps to Reproduce (for bugs)

  1. restart icinga2 on a master with a lot of objects
    time systemctl restart icinga2
  2. check how long it takes until checks are executed again and all clients are connected again to the master

you can also do a config check:

# time icinga2 daemon -C
[2018-07-26 08:33:05 +0200] information/cli: Icinga application loader (version: r2.9.1-1)
[2018-07-26 08:33:05 +0200] information/cli: Loading configuration file(s).
[2018-07-26 08:33:10 +0200] information/ConfigItem: Committing config item(s).
[2018-07-26 08:33:12 +0200] warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-07-26 08:33:12 +0200] warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-07-26 08:33:12 +0200] warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-07-26 08:33:12 +0200] warning/ApiListener: Please read the upgrading documentation for v2.8: https://www.icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
[2018-07-26 08:33:12 +0200] information/ApiListener: My API identity: master1.monitoring.vshn.net
[2018-07-26 08:33:20 +0200] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 6.51667/s (391/min 391/5min 391/15min);
[2018-07-26 08:33:20 +0200] information/WorkQueue: #5 (InfluxdbWriter, influxdb) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-07-26 08:33:22 +0200] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-07-26 08:33:22 +0200] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-07-26 08:33:22 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 52569 Services.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 4 ServiceGroups.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 10057 ScheduledDowntimes.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 36 HostGroups.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 8 NotificationCommands.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 16300 Notifications.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 885 Hosts.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 20646 Downtimes.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 105 Comments.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 816 Zones.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 814 Endpoints.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 5 ApiUsers.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 7 UserGroups.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 319 CheckCommands.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 2 TimePeriods.
[2018-07-26 08:35:16 +0200] information/ConfigItem: Instantiated 13 Users.
[2018-07-26 08:35:16 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-07-26 08:35:17 +0200] information/cli: Finished validating the configuration file(s).

real    2m12.444s
user    2m40.468s
sys 0m6.464s

Context

Master done for up to 10 minutes per hour, because we Puppet restarts at least 2 times per hour.

Your Environment

All the objects are configured in zones.d/. Usually every object is a file. around 40k of the services are individual objects with one file per object.

Also for some strange reason, we currently use "Apply Service" instead of "Service" even for individual services, which only apply to one host (assign_where host.name == <fqdn>).

We're currently reengineering our puppet setup to write all hosts and services into one host and one services conf file per zones.d/.

you see in the config check output, that Instantiated 52569 Services. took 2 minutes. Maybe this is the reason? are apply rules "bad"?

Afaik the config check happens via systemctl before icinga2 restarts. but during restart it also has to instantiate the objects and now maybe some change with downtimes, that added another ~2 minutes?

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/) License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

Application information: Installation root: /usr Sysconf directory: /etc Run directory: /run Local state directory: /var Package data directory: /usr/share/icinga2 State path: /var/lib/icinga2/icinga2.state Modified attributes path: /var/lib/icinga2/modified-attributes.conf Objects path: /var/cache/icinga2/icinga2.debug Vars path: /var/cache/icinga2/icinga2.vars PID path: /run/icinga2/icinga2.pid

Build information: Compiler: GNU 5.3.1 Build host: 1b4c52bdc470

* Operating System and version:

System information: Platform: Ubuntu Platform version: 16.04.5 LTS (Xenial Xerus) Kernel: Linux Kernel version: 4.4.0-131-generic Architecture: x86_64

* Enabled features (`icinga2 feature list`):

Disabled features: compatlog debuglog elasticsearch gelf graphite livestatus opentsdb perfdata statusdata syslog Enabled features: api checker command ido-mysql influxdb mainlog notification

* Icinga Web 2 version and modules (System - About): not relevant here
* Config validation (`icinga2 daemon -C`): see above
* If you run multiple Icinga 2 instances, the `zones.conf` file (or `icinga2 object list --type Endpoint` and `icinga2 object list --type Zone`) from all affected nodes:

icinga2 object list --type Endpoint | grep Object | wc -l

814

icinga2 object list --type Zone | grep Object | wc -l

816

dnsmichi commented 6 years ago

The changes with preventing unwanted notifications definitely slowed down the overall config item compilation. That's the price to pay here.

marcofl commented 6 years ago

How bad is here the fact, that we use Apply Service to Host when we could have used Object Service instead (for around 20-30k services)?

dnsmichi commented 6 years ago

I don't think that services are the problem here, still you can run a benchmark for comparisons if you want to. The ScheduledDowntime objects are highly likely the problem.

marcofl commented 6 years ago

I further reduced the amount of services and ScheduledDowntimes, but it only helped a bit:

[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 46095 Services.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 4 ServiceGroups.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 3470 ScheduledDowntimes.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 35 HostGroups.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 8 NotificationCommands.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 16300 Notifications.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 883 Hosts.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 6978 Downtimes.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 135 Comments.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 813 Zones.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 811 Endpoints.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 5 ApiUsers.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 7 UserGroups.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 319 CheckCommands.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 2 TimePeriods.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 13 Users.

A icinga2 daemon -C still takes 122sec and a restart around 3-5minutes before the API is available again.

the "strange" thing during a restart is, that it absolutely logs nothing for a long time. the systemctl restart itself takes around 7sec, but then it runs one once CPU core only with 100% for over 2 minutes, after that it's on all cpu cores (>600%) and things start working working again after another minute or so.

[2018-07-31 13:19:57 +0200] information/Application: Shutting down...
[2018-07-31 13:19:57 +0200] information/ApiListener: 'api' stopped.
[2018-07-31 13:19:57 +0200] information/CheckerComponent: 'checker' stopped.
[2018-07-31 13:19:57 +0200] information/ExternalCommandListener: 'command' stopped.
[2018-07-31 13:22:12 +0200] information/ApiListener: 'api' started.

What happens in that time?

N-o-X commented 6 years ago

I've tested this issue with your config @marcofl, but got very different results:

Your config:

Those tests where done about 10 times per version and it didn't change much. I'll investigate more later.

marcofl commented 6 years ago

I can confirm, that we now have restart times (without any config change) of around 1min again with 2.9.1+164.gdd59964.2018.09.12+1.xenial-0 snapshot as suggested by Icinga / Netways (netways #570902 ticket).

The main difference I noticed with this version is that the systemctl restart now "waits" until icinga2 is running again (incl. API) compared to 2.9.1 where systemd showed icinga2 already running but actually it was hanging somewhere in startup for another 5min+.

dnsmichi commented 6 years ago

Thanks. I'm waiting for the final updates in the mentioned RT ticket before closing/cherry-picking here.

dnsmichi commented 6 years ago

Based on #570902 I'm closing this. Thanks for sponsoring time to investigate here.