Icinga / icinga2


API Crashes/Hangs after many requests (deadlock/race condition in ApiListener); solved by splitting locks #5408

Closed Stefar77 closed 6 years ago

Stefar77 commented 7 years ago

We have many passive services that get updated by a single active check on a host. When we switch from sockets to the API for sending passive results, the API crashes within minutes.

Expected Behavior

Pushing a large number of events via the API should not make it hang.

Current Behavior

Pushing many passive results to Icinga2 via the API makes the process hang; the Icinga2 process itself may crash a while later without leaving a crash log. The only way to get the API running again is `kill -KILL {PID}` or `killall -KILL icinga2`, followed by a restart, and then either not stressing the API or going back to sockets for sending passive check results.

Possible Solution

Tried 66c0746. It seemed to make the API a lot faster, but it still crashes. Edit: it was only faster because Icinga had just restarted.

#5419 solves most lockups in the API and the known remote thread leaks.

PS: Sorry for the commit mess; getting used to git takes a bit.

Steps to Reproduce (for bugs)

  1. Add ~20 hosts, each with 10 passive services and 1 active service.
  2. Make the active service's script send passive results to the host's passive services through the API.
  3. Select the 20 active services in the GUI and press Check Now.
  4. You have killed the API (to fix it you need to `kill -KILL` Icinga).
  5. At this point the thread count builds up until it reaches the OS limit and the OS kills the process.
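
The steps above can be approximated without the GUI by a small load generator. This is only a sketch, not the plugin from this report: the host and service names (`mitel01`, `alarm-1`), the credentials, and the worker count are placeholders, and it assumes the `process-check-result` action accepts its target as a `service=host!service` query parameter and that the certificate is self-signed (hence no verification, like `curl -k`).

```python
import base64
import http.client
import json
import ssl
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions for illustration only: URL, credentials and object names.
API_URL = "https://localhost:5665"
API_USER = "login"
API_PASS = "pass"


def build_request(host, service, exit_status, output):
    """Build one process-check-result request for a passive service."""
    target = urllib.parse.quote(f"{host}!{service}", safe="")
    url = f"{API_URL}/v1/actions/process-check-result?service={target}"
    body = json.dumps({"exit_status": exit_status, "plugin_output": output})
    token = base64.b64encode(f"{API_USER}:{API_PASS}".encode()).decode()
    return urllib.request.Request(
        url,
        data=body.encode(),
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def build_batch(hosts, services_per_host):
    """One request per (hypothetical) passive service on every host."""
    return [
        build_request(host, f"alarm-{n}", 0, "OK - no change")
        for host in hosts
        for n in range(1, services_per_host + 1)
    ]


def send(request, timeout=10.0):
    """Send one request; return the HTTP status, or the error as text."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE  # same effect as curl -k
    try:
        with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
            return resp.status
    except (OSError, http.client.HTTPException) as exc:
        return str(exc)


def stress(hosts, services_per_host=10, workers=50):
    """Fire all passive results at roughly the same time."""
    batch = build_batch(hosts, services_per_host)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, batch))
```

Calling `stress([f"mitel{i:02d}" for i in range(1, 21)])` would submit 200 results at once. If the API hangs as described, the returned list should contain timeout messages instead of HTTP status codes.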

Context

I'm trying to upgrade our pollers to use the API instead of sockets so they get direct feedback and can report when a service is missing, but switching just one poller from sockets to the API kills Icinga within seconds. I use one active service, 'Check Mitel', that gets the alarm state(s) and updates the passive services accordingly. It's hard to trigger any other way: running the poller from the CLI many times does not seem to kill Icinga, but 'Check Now' fires them all at once (faster than using a shell to spawn curl many times at once). :-)

I think this is related to, or the same problem as, #5148.

Your Environment

    Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/)
    License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.

    Application information:
      Installation root: /usr/local
      Sysconf directory: /usr/local/etc
      Run directory: /var/run
      Local state directory: /var
      Package data directory: /usr/local/share/icinga2
      State path: /var/lib/icinga2/icinga2.state
      Modified attributes path: /var/lib/icinga2/modified-attributes.conf
      Objects path: /var/cache/icinga2/icinga2.debug
      Vars path: /var/cache/icinga2/icinga2.vars
      PID path: /var/run/icinga2/icinga2.pid

    System information:
      Platform: Unknown
      Platform version: Unknown
      Kernel: FreeBSD
      Kernel version: 11.0-RELEASE-p9
      Architecture: amd64

    Build information:
      Compiler: Clang 3.8.0

* Operating System and version:

FreeBSD 11.0-RELEASE-p9 #0: Tue Apr 11 08:48:40 UTC 2017 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC

* Enabled features (`icinga2 feature list`):

api checker command graphite ido-mysql livestatus mainlog notification syslog

* Icinga Web 2 version and modules (System - About):
* Config validation (`icinga2 daemon -C`):

    information/cli: Icinga application loader (version: r2.6.3-1)
    information/cli: Loading configuration file(s).
    information/ConfigItem: Committing config item(s).
    information/ApiListener: My API identity: *.***
    warning/ApplyRule: Apply rule 'satellite-host' (in /usr/local/etc/icinga2/conf.d/satellite.conf: 29:1-29:41) for type 'Dependency' does not match anywhere!
    warning/ApplyRule: Apply rule 'mail-icingaadmin' (in /usr/local/etc/icinga2/conf.d/notifications.conf: 11:1-11:45) for type 'Notification' does not match anywhere!
    warning/ApplyRule: Apply rule 'mail-icingaadmin' (in /usr/local/etc/icinga2/conf.d/notifications.conf: 20:1-20:48) for type 'Notification' does not match anywhere!
    warning/ApplyRule: Apply rule 'backup-downtime' (in /usr/local/etc/icinga2/conf.d/downtimes.conf: 5:1-5:52) for type 'ScheduledDowntime' does not match anywhere!
    information/ConfigItem: Instantiated 1 FileLogger.
    information/ConfigItem: Instantiated 9 Endpoints.
    information/ConfigItem: Instantiated 10 Zones.
    information/ConfigItem: Instantiated 1 SyslogLogger.
    information/ConfigItem: Instantiated 1 ApiListener.
    information/ConfigItem: Instantiated 2 ApiUsers.
    information/ConfigItem: Instantiated 10747 Services.
    information/ConfigItem: Instantiated 239 Comments.
    information/ConfigItem: Instantiated 740 Dependencies.
    information/ConfigItem: Instantiated 1236 Notifications.
    information/ConfigItem: Instantiated 239 CheckCommands.
    information/ConfigItem: Instantiated 4 ServiceGroups.
    information/ConfigItem: Instantiated 5 TimePeriods.
    information/ConfigItem: Instantiated 3 Users.
    information/ConfigItem: Instantiated 2 UserGroups.
    information/ConfigItem: Instantiated 1236 Hosts.
    information/ConfigItem: Instantiated 21 HostGroups.
    information/ConfigItem: Instantiated 1 IcingaApplication.
    information/ConfigItem: Instantiated 3 NotificationCommands.
    information/ConfigItem: Instantiated 1 CheckerComponent.
    information/ConfigItem: Instantiated 1 ExternalCommandListener.
    information/ConfigItem: Instantiated 1 GraphiteWriter.
    information/ConfigItem: Instantiated 1 IdoMysqlConnection.
    information/ConfigItem: Instantiated 1 NotificationComponent.
    information/ConfigItem: Instantiated 1 LivestatusListener.
    information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
    information/cli: Finished validating the configuration file(s).


* Single zone

yoshi314 commented 7 years ago

I have the exact same problem when running the Dashing dashboard for Icinga 2. After the latest updates it works fine, but Icinga crashed in the middle of the night, so I had a monitoring outage of several hours.

Same as in your case, no interesting logs.

This might be an easy-to-reproduce case.

Crunsher commented 7 years ago

I never had this problem when creating or querying objects.

Icinga Web 2 can use either external commands or the API to run 'Check Now'. Can you verify that you are using the API as the command transport?

Stefar77 commented 7 years ago

Crunsher, 'Check Now' is only a way to make it go down fast; it's the API call in the plugin that is killing my Icinga. My plugin updates the Mitel Controller alert services (~10 passive services per controller), and I had a bug that ignored memcached and always pushed the status, even when there was no change. When I change the poller to use API calls instead of sockets, the API crashes. 'Check Now' on 40 Mitel Controllers sends ~400 passive updates to the API at about the same time, which seems to kill it.

I fixed my poller and switched back to sockets to prevent the API from overloading until it's fixed. :-)

PS: Normally Icinga has 46 threads. The count goes up for a second when I refresh in aNag or when some host pushes an event, but it usually drops back to 46. When I stress the API, it stays at 128 threads and the API hangs. This also generates lots of timeouts in the Mitel pollers because API requests seem to hang forever. Eventually, I think, the Icinga process disappears without any notice in crash/ or the log.

For now: `$use_api=false; // Set to true on a busy poller to kill the Icinga API`
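
A toggle like this might look as follows on the fallback side. This is a hedged sketch rather than the poller's actual code: it assumes "sockets" refers to the external command interface (an ExternalCommandListener is present in the config above), the pipe path is an assumed default, and the host and service names in the usage note are made up.

```python
import time

USE_API = False  # mirrors the $use_api switch in the poller
# Assumption: default command pipe location; adjust to the local install.
COMMAND_PIPE = "/var/run/icinga2/cmd/icinga2.cmd"


def format_external_command(host, service, exit_status, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{exit_status};{output}\n"
    )


def submit_via_pipe(line, path=COMMAND_PIPE):
    """Write one command line to the command pipe (fire and forget)."""
    with open(path, "w") as pipe:
        pipe.write(line)
```

With `USE_API` left at `False`, the poller would call `submit_via_pipe(format_external_command("mitel01", "alarm-1", 0, "OK"))` for each passive service. The trade-off mentioned earlier in the thread still applies: the pipe gives no feedback when a service is missing.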

We are still migrating from Nagios to Icinga2 and it's not in production yet. If you want me to test stuff, I'd be glad to do so.

Stefar77 commented 7 years ago

Just noticed: `curl -k -s -u login:pass -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/events?queue=debugnotifications&types=Notification'`

This creates a new thread that doesn't seem to end. (I don't use events in my pollers, but it may be related to threads not always ending.) I fixed it with a nasty thread sleep in `HttpServerConnection::Disconnect` when there are `m_PendingRequests`.

Also fixed in my pull-request #5419
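
To check whether an event stream thread lingers after the client goes away, the curl call above can be replaced by a client that disconnects on purpose. This is a rough sketch under assumptions: the credentials are placeholders, certificate verification is disabled like `curl -k`, and the stream is assumed to deliver one JSON object per line.

```python
import base64
import json
import ssl
import urllib.parse
import urllib.request

API_URL = "https://localhost:5665"  # taken from the curl call above


def build_events_url(queue, types):
    """Build the /v1/events URL for a queue name and a list of event types."""
    query = urllib.parse.urlencode({"queue": queue, "types": types}, doseq=True)
    return f"{API_URL}/v1/events?{query}"


def read_events(queue, types, user, password, max_events=5, timeout=30.0):
    """Read up to max_events from the stream, then close the connection.

    Raises a timeout error if no data arrives within `timeout` seconds.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        build_events_url(queue, types),
        method="POST",
        headers={"Accept": "application/json", "Authorization": f"Basic {token}"},
    )
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE  # same effect as curl -k
    events = []
    # Leaving the with-block closes the socket, i.e. the client disconnects.
    with urllib.request.urlopen(request, timeout=timeout, context=context) as stream:
        for line in stream:
            if line.strip():
                events.append(json.loads(line))
            if len(events) >= max_events:
                break
    return events
```

After `read_events("debugnotifications", ["Notification"], "login", "pass")` returns, the Icinga thread count can be compared with the value before the call to see whether the thread serving the stream has ended.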

dnsmichi commented 6 years ago

Closing this in favour of #6361.