Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
1.99k stars 573 forks source link

very large plugin output causes services to look overdue & time out database connection #6549

Closed dbu1986 closed 4 years ago

dbu1986 commented 6 years ago

Expected Behavior

The Output of a check plugin should not affect Icinga2's database connection health, which results in all services to look overdue in Icingaweb2. Icinga2 should take care of submitting database-compatible sql queries regarding the max length of table columns.

Current Behavior

As soon as Icinga2 is beginning to handle a very large plugin output in one service, the list of overdue check results in Icingaweb2 gets longer, until all active services seem to be overdue in Icingaweb2. The debuglog shows database exceptions e.g. as follows:

[2018-08-16 10:41:46 +0200] critical/IdoMysqlConnection: Error "Query was empty" when executing query ""
[2018-08-16 10:41:46 +0200] critical/IdoMysqlConnection: Exception during database operation: Verify that your database is operational!
[2018-08-16 10:41:46 +0200] debug/IdoMysqlConnection: Exception during database operation: Error: std::exception
[2018-08-16 10:41:46 +0200] critical/IdoMysqlConnection: Error "MySQL server has gone away" when executing query "DELETE FROM icinga_comments WHERE instance_id = 1 AND session_token <> 1534408869"

It looks like the executed UPDATE query into icinga_servicestatus (with the very long output) causes Icinga2 to time out and close it's database connection.

Possible Solution

Icinga2 makes sure that the plugin output will fit into the column 'long_output' in table icinga_servicestatus (column type = TEXT => 65535 Chars), e.g. by cutting off the plugin output string while building the sql query. Same might be relevant to column 'perfdata' as well.

Steps to Reproduce (for bugs)

  1. Create a service using a CheckCommand with very long plugin output (e.g. 945635 characters )
  2. Let Icinga2 check the service and handle it's plugin output
  3. Watch the increasing time of 'last check' on all services in Icingaweb2. You can also watch the increasing list of late check results at Icingaweb2 --> Dashboard -->Overdue.
  4. Consulting your debuglog should also make you aware of database exceptions (see above).

Context

I am deploying services executing a new check plugin (custom made perl script). Due to different reasons the plugin returns 945635 characters of output, including lots of host-based errors. I know, it might not be Icinga2's job to take care of buggy outputs of custom plugins, but it seems to allow pretty extensive impacts, which should IMHO get limited.

As Icinga2 begins to execute checks, I notice that all services are getting high last check times. The list of overdue check results is increasing until all ~8800 services are affected. Hitting the service's "Check now" link does not change this behaviour. The debuglog shows exceptions regarding the database connection (see above). I think that's the reason why Icingaweb2 cannot retreive the latest check results from the database. The checks are definitely executed freshly and in time, as traced via debuglog.

I assume that the large sql query is responsible for the situation, so limiting the input values to the maximum column size of the destination table might be an appropriate way to avoid it.

Your Environment

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/) License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

Application information: Installation root: /usr Sysconf directory: /etc Run directory: /run Local state directory: /var Package data directory: /usr/share/icinga2 State path: /var/lib/icinga2/icinga2.state Modified attributes path: /var/lib/icinga2/modified-attributes.conf Objects path: /var/cache/icinga2/icinga2.debug Vars path: /var/cache/icinga2/icinga2.vars PID path: /run/icinga2/icinga2.pid

System information: Platform: Red Hat Enterprise Linux Server Platform version: 7.5 (Maipo) Kernel: Linux Kernel version: 3.10.0-862.9.1.el7.x86_64 Architecture: x86_64

Build information: Compiler: GNU 4.8.5 Build host: unknown

* Enabled features (`icinga2 feature list`): 

Enabled features: api checker command debuglog graphite ido-mysql mainlog notification

* Icinga Web 2 version and modules (System - About):  2.6.1
* Config validation (`icinga2 daemon -C`):

[2018-08-16 14:38:08 +0200] information/cli: Icinga application loader (version: r2.9.1-1) [2018-08-16 14:38:08 +0200] information/cli: Loading configuration file(s). [2018-08-16 14:38:08 +0200] information/ConfigItem: Committing config item(s). [2018-08-16 14:38:09 +0200] information/ApiListener: My API identity: fravm007010.os-fra.local [2018-08-16 14:38:18 +0200] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 3.25/s (195/min 195/5min 195/15min); [2018-08-16 14:38:19 +0200] information/WorkQueue: #5 (IdoMysqlConnection, ido-mysql) items: 0, rate: 0/s (0/min 0/5min 0/15min); [2018-08-16 14:38:19 +0200] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min); [2018-08-16 14:38:19 +0200] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min); [2018-08-16 14:38:19 +0200] information/WorkQueue: #8 (GraphiteWriter, graphite) items: 0, rate: 0/s (0/min 0/5min 0/15min); [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 8809 Services. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 37 ServiceGroups. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 70 HostGroups. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 1 EventCommand. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 2 FileLoggers. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 1 NotificationComponent. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 9 NotificationCommands. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 9421 Notifications. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 1 IcingaApplication. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 612 Hosts. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 1 ApiListener. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 407 Downtimes. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 1 GraphiteWriter. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 140 Comments. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 1 CheckerComponent. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 499 Zones. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 501 Endpoints. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 7 ApiUsers. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 1 UserGroup. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 298 CheckCommands. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 19 TimePeriods. [2018-08-16 14:38:25 +0200] information/ConfigItem: Instantiated 1 User. [2018-08-16 14:38:26 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars' [2018-08-16 14:38:26 +0200] information/cli: Finished validating the configuration file(s).

Crunsher commented 6 years ago

I'm down with cutting off the plugin output, but cutting off the perfdata will lead to problems with other tools trying to read it. Maybe an all or nothing solution is better in this case: If the perfdata is too long, we log that as a Warning and write nothing into the DB.

dbu1986 commented 6 years ago

Sounds good. I would consider applying this to both columns, long_output & perfdata. I've had ~936553 characters to be saved into long_output, and ~14629 characters to be saved into perfdata. In this case the non-fitting problem applies to long_output.

dnsmichi commented 6 years ago

I would fix the plugin in the first place. 900K characters is nothing you would like to render in Icinga Web 2 nor read in a notification email in case of emergency. I suppose that's a custom written plugin, which can be modified. If you need inspiration for good formatted plugin outputs, here's a talk from a past Icinga Camp: https://www.youtube.com/watch?v=Ey_APqSCoFQ

dbu1986 commented 6 years ago

Thanks for linking this valuable video! Sure, already did so. My point is to harden Icinga2 against such an impact. It took a while for me to identify the dependency between a crappy plugin output and seemingly system-wide overdue checks in Icingaweb2.

dnsmichi commented 4 years ago

This is something taken into account with the new IcingaDB backend. For the IDO schema, there won't be any efforts to fix this unfortunately.