Some clients/service checks are unable to connect to the master after 2.9 upgrade.

cah-jeremykuhn commented 6 years ago

Since upgrading from 2.8.4 to 2.9.0 two days ago, roughly 250 of the 1110 service checks across 223 hosts are throwing alerts that they are unable to connect to the client endpoint. And it's not even entire hosts. I have several hosts where Disk and CPU checks will work, but the CPU usage check will say the host is not connected. The weird part is that some checks recover while others go down as host not connected. I've tried increasing and decreasing both max_concurrent_checks and the ulimit, but neither has made the failed checks go away. I have a screenshot as an example. The clients that connect to the master are a mix of Windows Server 2016 and Ubuntu 14.04 hosts; however, none of the Windows hosts are experiencing any issues.

Current max_concurrent_checks = 2048
Current ulimits:
hard nofile 650000
soft nofile 100000
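
The nofile values configured for a user are not always the ones the running daemon ends up with, for example when it is started by an init script. As a minimal sketch, the effective limits of the running process can be read from its /proc limits file. The function name open_files_limits is hypothetical, and the PID path is taken from the `icinga2 --version` output in this report.

```shell
#!/bin/sh
# Print the soft and hard "Max open files" limits found in a /proc/<pid>/limits file.
# open_files_limits is a hypothetical helper name used only for this sketch.
open_files_limits() {
    # A matching line looks like: "Max open files   <soft>   <hard>   files"
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' "$1"
}

# Usage on the master (assumes the PID path reported by `icinga2 --version`):
#   open_files_limits "/proc/$(cat /run/icinga2/icinga2.pid)/limits"
```

If the values printed here are lower than the configured 100000/650000, the limits were probably not applied to the daemon itself.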

Expected Behavior

All service checks should be able to connect.

Current Behavior

Roughly 20% of service checks are failing with a "Remote instance is not connected to" error. Here is a screenshot of a host with 3 services that are able to connect and one that is not:

And here is the code for those checks (instance memory is the one not working):

apply Service "nonprod_instance_disk" {
  display_name = "Instance Disk Usage"
  check_command = "disk"
  max_check_attempts = 3
  check_interval = 10m
  retry_interval = 5m
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint && host.vars.os == "Linux" && host.vars.env == "nonprod"
  vars.disk_wfree = "10%"
  vars.local_disks["basic_partitions"] = {
    disk_partitions = "/"
  }
  vars.slack_notifications = "enabled"
}
apply Service "nonprod_instance_load" {
  display_name = "CPU Average Load"
  check_command = "load"
  max_check_attempts = 3
  check_interval = 10m
  retry_interval = 5m
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint && host.vars.os == "Linux" && host.vars.env == "nonprod"
  vars.load_percpu = true
  vars.load_wload15 = "2"
  vars.slack_notifications = "enabled"
}
apply Service "nonprod_instance_memory" {
  display_name = "Instance Memory Usage"
  check_command = "memory"
  max_check_attempts = 3
  check_interval = 10m
  retry_interval = 5m
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint && host.vars.os == "Linux" && host.vars.env == "nonprod"
  vars.mem_swap_critical = "100,50"
  vars.mem_swap_warning = "90,25"
  vars.slack_notifications = "enabled"
}
apply Service "instance_ntp" {
  display_name = "NTP process status"
  check_command = "procs"
  max_check_attempts = 3
  check_interval = 1h
  retry_interval = 5m
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint && host.vars.os == "Linux"
  vars.procs_command = "ntpd"
  vars.procs_critical = "1:1"
  vars.slack_notifications = "enabled"
}

Steps to Reproduce (for bugs)

Upgraded the Icinga master and clients from 2.8.4 to 2.9 simultaneously.
Restarted Icinga everywhere.
Some client checks never come back.

Your Environment

# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.0-1)

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Ubuntu
  Platform version: 14.04.5 LTS, Trusty Tahr
  Kernel: Linux
  Kernel version: 4.4.0-111-generic
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.2
  Build host: 057266a66422

Enabled features (icinga2 feature list):

# icinga2 feature list
Disabled features: compatlog debuglog elasticsearch gelf graphite ido-mysql influxdb livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker command mainlog notification

Icinga Web 2 version and modules (System - About): icingaweb 2.4.2 modules: setup 2.4.2 monitor 2.4.2

Config validation (icinga2 daemon -C):

# icinga2 daemon -C
[2018-07-19 19:45:33 +0000] information/cli: Icinga application loader (version: r2.9.0-1)
[2018-07-19 19:45:33 +0000] information/cli: Loading configuration file(s).
[2018-07-19 19:45:35 +0000] information/ConfigItem: Committing config item(s).
[2018-07-19 19:45:35 +0000] information/ApiListener: My API identity: <master_dns>
[2018-07-19 19:45:40 +0000] warning/ApplyRule: Apply rule 'slack-notifications-notification-services-alfred' (in /etc/icinga2/objects.d/applynotification.conf: 15:1-15:80) for type 'Notification' does not match anywhere!
[2018-07-19 19:45:40 +0000] warning/ApplyRule: Apply rule 'windows_service_ntds' (in /etc/icinga2/objects.d/applyservice.conf: 254:1-254:36) for type 'Service' does not match anywhere!
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 1117 Services.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 3 ServiceGroups.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 1531 ScheduledDowntimes.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 8 HostGroups.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 1 FileLogger.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 3 NotificationCommands.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 2233 Notifications.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 223 Hosts.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 1 ApiListener.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 555 Downtimes.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 9 Comments.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 187 Zones.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 187 Endpoints.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 1 ApiUser.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 2 UserGroups.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 81 CheckCommands.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 3 TimePeriods.
[2018-07-19 19:45:40 +0000] information/ConfigItem: Instantiated 2 Users.
[2018-07-19 19:45:40 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-07-19 19:45:40 +0000] information/cli: Finished validating the configuration file(s).

If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.

Icebird2000 commented 6 years ago

Did you see something like this in your logfile?

[2018-07-18 15:18:37 +0200] critical/ApiListener: Client TLS handshake failed (from [21X.XX.XX.9]:39242)
Context:
(0) Handling new API client connection

If yes I think it has something to do with #6445 and I already mentioned it there.
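
To see whether the handshake failures line up with the hosts whose checks are failing, the log can be summarized per source address. This is only a rough sketch: the function name tls_failures_by_host is made up for illustration, and /var/log/icinga2/icinga2.log is assumed to be the mainlog path on the master.

```shell
#!/bin/sh
# Count "Client TLS handshake failed" log lines per source address.
# tls_failures_by_host is a hypothetical helper name used only for this sketch.
tls_failures_by_host() {
    # Extract the address between "[" and "]" in "(from [addr]:port)",
    # then print "<address> <count>" sorted by address.
    sed -n 's/.*Client TLS handshake failed (from \[\(.*\)\]:[0-9]*).*/\1/p' "$1" \
        | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on the master (log path is an assumption, adjust to your setup):
#   tls_failures_by_host /var/log/icinga2/icinga2.log
```

Addresses that show up here with a high count and also belong to clients reporting "not connected" would point towards the same problem.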

cah-jeremykuhn commented 6 years ago

Yeah, I didn't read too far down into that issue (my mistake), but I am getting the exact same log message errors, so it's most likely the same issue. I'll close this. Thanks for the quick response!
