The changes for preventing unwanted notifications definitely slowed down the overall config item compilation. That's the price to pay here.
How bad is it that we use apply Service rules per host when we could have used object Service definitions instead (for around 20-30k services)?
I don't think the services are the problem here, but you can run a benchmark for comparison if you want to. The ScheduledDowntime objects are most likely the problem.
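For a rough comparison, timing the config validation is enough; a minimal sketch, assuming a shell where time is available:

# Validate the config a few times per version and compare the totals.
for i in 1 2 3; do
    time icinga2 daemon -C > /dev/null
done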
I further reduced the number of Services and ScheduledDowntimes, but it only helped a bit:
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 46095 Services.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 4 ServiceGroups.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 3470 ScheduledDowntimes.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 35 HostGroups.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 8 NotificationCommands.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 16300 Notifications.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 883 Hosts.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 6978 Downtimes.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 135 Comments.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 813 Zones.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 811 Endpoints.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 5 ApiUsers.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 7 UserGroups.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 319 CheckCommands.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 2 TimePeriods.
[2018-07-31 13:14:40 +0200] information/ConfigItem: Instantiated 13 Users.
An
icinga2 daemon -C
still takes 122 seconds, and a restart takes around 3-5 minutes before the API is available again.
the "strange" thing during a restart is, that it absolutely logs nothing for a long time. the systemctl restart itself takes around 7sec, but then it runs one once CPU core only with 100% for over 2 minutes, after that it's on all cpu cores (>600%) and things start working working again after another minute or so.
[2018-07-31 13:19:57 +0200] information/Application: Shutting down...
[2018-07-31 13:19:57 +0200] information/ApiListener: 'api' stopped.
[2018-07-31 13:19:57 +0200] information/CheckerComponent: 'checker' stopped.
[2018-07-31 13:19:57 +0200] information/ExternalCommandListener: 'command' stopped.
[2018-07-31 13:22:12 +0200] information/ApiListener: 'api' started.
What happens in that time?
I've tested this issue with your config @marcofl, but got very different results:
Your config:
2.8.4: icinga2 daemon -C 354.99s user 14.09s system 327% cpu 1:52.69 total
Master(e8f0d6da4): icinga2 daemon -C 211.87s user 10.13s system 194% cpu 1:54.06 total
Those tests were done about 10 times per version and the numbers didn't change much. I'll investigate more later.
I can confirm that we now have restart times (without any config change) of around 1 minute again with the 2.9.1+164.gdd59964.2018.09.12+1.xenial-0 snapshot, as suggested by Icinga / Netways (Netways ticket #570902).
The main difference I noticed with this version is that systemctl restart now "waits" until icinga2 is running again (including the API), compared to 2.9.1 where systemd showed icinga2 as already running while it was actually still hanging somewhere in startup for another 5+ minutes.
Thanks. I'm waiting for the final updates in the mentioned RT ticket before closing/cherry-picking here.
Based on #570902 I'm closing this. Thanks for sponsoring time to investigate here.
We have a fairly big setup with around 800 cluster clients (including satellites), 850 hosts, and >52k services. Before Icinga 2.9, a restart (or reload, no difference) on the master took around 2 minutes, which I think is already slow. The time is measured from the restart command via systemctl until clients are able to connect to the master again.
Now, after 2.9, this takes much longer, up to 5 minutes until clients can connect to the master again via the API. This is especially a problem because we use Puppet to configure all objects and reload every 30 minutes, so our icinga2 master is now effectively down for 10 minutes per hour. A sketch of how we measure this is below.
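Roughly, the measurement works like this; a minimal sketch, assuming the API listens on the default port 5665 and that an unauthenticated request is enough to see whether the listener answers at all:

# Restart, then poll the API port until it responds again.
start=$(date +%s)
systemctl restart icinga2
until curl -ks -o /dev/null https://localhost:5665/; do
    sleep 1
done
echo "API reachable again after $(( $(date +%s) - start ))s"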
I can only imagine that the way downtimes are evaluated during restart in 2.9, to prevent false notifications from being sent out, is the reason for this increase in startup time.
Also, a simple
icinga2 daemon -C
takes over 2 minutes now (see example output below).
Expected Behavior
Icinga should restart within 1-2 minutes or even faster.
Current Behavior
With 2.8 in our setup: 1-2 minutes. With 2.9.1 in our setup: 4-5 minutes.
Steps to Reproduce (for bugs)
You can also do a config check:
icinga2 daemon -C
Context
Master down for up to 10 minutes per hour, because Puppet restarts it at least 2 times per hour.
Your Environment
All the objects are configured in zones.d/. Usually every object is in its own file; around 40k of the services are individual objects with one file per object.
Also, for some strange reason, we currently use apply Service rules instead of object Service definitions even for individual services that only apply to one host (assign where host.name == "<fqdn>"; see the sketch below). We're currently reengineering our Puppet setup to write all hosts and services into one hosts and one services conf file per zones.d/ directory.
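For illustration, this is what the two variants look like in the Icinga 2 DSL (the host and check names below are made up):

// Per-host apply rule as currently generated by our Puppet setup:
// the assign filter has to be evaluated against every single host.
apply Service "disk" {
  check_command = "disk"
  assign where host.name == "web01.example.com"
}

// Equivalent direct object definition: the service is bound to its
// host explicitly and no filter evaluation is needed.
object Service "disk" {
  host_name = "web01.example.com"
  check_command = "disk"
}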
You can see in the config check output that
Instantiated 52569 Services.
took 2 minutes. Maybe this is the reason? Are apply rules "bad"? AFAIK the config check happens via systemctl before icinga2 restarts, but during the restart it also has to instantiate the objects, and maybe some change to the downtime handling now adds another ~2 minutes?
Version used (icinga2 --version):
Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Build information:
  Compiler: GNU 5.3.1
  Build host: 1b4c52bdc470

System information:
  Platform: Ubuntu
  Platform version: 16.04.5 LTS (Xenial Xerus)
  Kernel: Linux
  Kernel version: 4.4.0-131-generic
  Architecture: x86_64

Disabled features: compatlog debuglog elasticsearch gelf graphite livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker command ido-mysql influxdb mainlog notification
icinga2 object list --type Endpoint | grep Object | wc -l
814
icinga2 object list --type Zone | grep Object | wc -l
816