Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Some clients/service checks are unable to connect to the master after 2.9 upgrade. #6463

Closed cah-jeremykuhn closed 6 years ago

cah-jeremykuhn commented 6 years ago

Since upgrading from 2.8.4 to 2.9.0 two days ago, roughly 250 of the 1110 service checks across 223 hosts are alerting that they are unable to connect to the client endpoint. And it's not even entire hosts: I have several hosts where Disk and CPU checks will work, but the CPU usage check will say the host is not connected. The weird part is that some checks recover while others go down as host not connected. I've tried increasing and decreasing both max_concurrent_checks and the ulimit, but neither has made the failed checks go away. I have a screenshot as an example. The clients that connect to the master are a mix of Windows Server 2016 and Ubuntu 14.04 hosts; however, none of the Windows hosts are experiencing any issues.

Current max_concurrent_checks = 2048
Current ulimits:
hard nofile 650000
soft nofile 100000

Expected Behavior

All service checks should be able to connect.

Current Behavior

20% of service checks are failing with "Remote instance is not connected to".

Here is a screenshot of a host with 3 services that are able to connect and one that is not:

image

And here is the code for those checks (instance memory is the one not working):

apply Service "nonprod_instance_disk" {
  display_name = "Instance Disk Usage"
  check_command = "disk"
  max_check_attempts = 3
  check_interval = 10m
  retry_interval = 5m
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint && host.vars.os == "Linux" && host.vars.env == "nonprod"
  vars.disk_wfree = "10%"
  vars.local_disks["basic_partitions"] = {
    disk_partitions = "/"
  }
  vars.slack_notifications = "enabled"
}
apply Service "nonprod_instance_load" {
  display_name = "CPU Average Load"
  check_command = "load"
  max_check_attempts = 3
  check_interval = 10m
  retry_interval = 5m
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint && host.vars.os == "Linux" && host.vars.env == "nonprod"
  vars.load_percpu = true
  vars.load_wload15 = "2"
  vars.slack_notifications = "enabled"
}
apply Service "nonprod_instance_memory" {
  display_name = "Instance Memory Usage"
  check_command = "memory"
  max_check_attempts = 3
  check_interval = 10m
  retry_interval = 5m
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint && host.vars.os == "Linux" && host.vars.env == "nonprod"
  vars.mem_swap_critical = "100,50"
  vars.mem_swap_warning = "90,25"
  vars.slack_notifications = "enabled"
}
apply Service "instance_ntp" {
  display_name = "NTP process status"
  check_command = "procs"
  max_check_attempts = 3
  check_interval = 1h
  retry_interval = 5m
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint && host.vars.os == "Linux"
  vars.procs_command = "ntpd"
  vars.procs_critical = "1:1"
  vars.slack_notifications = "enabled"
}
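
To see exactly which services are in the "not connected" state, one option is to ask the Icinga 2 REST API rather than clicking through the web UI. The following is a minimal sketch, not a tested procedure for this setup: it only builds the request without sending it. The base URL `https://localhost:5665` and the credentials `root` / `icinga` are placeholder assumptions, and the filter assumes the failing checks carry the "is not connected" text in their plugin output.

```python
import base64
import json
import urllib.request


def build_not_connected_query(base_url, user, password):
    """Build (but do not send) an API request listing affected services."""
    body = {
        # Icinga DSL filter: skip services without a check result, then
        # wildcard-match the plugin output text.
        "filter": (
            "service.last_check_result && "
            'match("*is not connected*", service.last_check_result.output)'
        ),
        # Only return the attributes needed to identify the service.
        "attrs": ["name", "host_name", "state"],
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/v1/objects/services",
        data=json.dumps(body).encode(),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
            # Lets a POST with a JSON body be treated as a GET query.
            "X-HTTP-Method-Override": "GET",
        },
        method="POST",
    )


# Placeholder endpoint and ApiUser credentials (assumptions, not from this setup).
req = build_not_connected_query("https://localhost:5665", "root", "icinga")
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen` with an SSL context that trusts the master's CA certificate.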

Steps to Reproduce (for bugs)

  1. Upgraded the Icinga master and clients from 2.8.4 to 2.9 simultaneously.
  2. Icinga restarted everywhere.
  3. Some client checks never come back.

Your Environment

# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.0-1)

Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Ubuntu
  Platform version: 14.04.5 LTS, Trusty Tahr
  Kernel: Linux
  Kernel version: 4.4.0-111-generic
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.2
  Build host: 057266a66422

Icebird2000 commented 6 years ago

Did you see something like this in your logfile?

[2018-07-18 15:18:37 +0200] critical/ApiListener: Client TLS handshake failed (from [21X.XX.XX.9]:39242)
Context:
(0) Handling new API client connection

If yes, I think it has something to do with #6445, and I already mentioned it there.
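
A quick way to check for that message is to scan the log and count the handshake failures per source address. This is a rough sketch that assumes the log lines look exactly like the sample above; the function name and the regular expression are illustrative, not part of Icinga.

```python
import re
from collections import Counter

# Matches the ApiListener handshake failure line shown above and captures
# the bracketed source address (the port after the colon is ignored).
HANDSHAKE_RE = re.compile(
    r"critical/ApiListener: Client TLS handshake failed "
    r"\(from \[(?P<addr>[^\]]+)\]:\d+\)"
)


def count_handshake_failures(lines):
    """Return a Counter mapping source address -> number of failed handshakes."""
    counts = Counter()
    for line in lines:
        match = HANDSHAKE_RE.search(line)
        if match:
            counts[match.group("addr")] += 1
    return counts


# Sample input copied from the log excerpt above.
sample = [
    "[2018-07-18 15:18:37 +0200] critical/ApiListener: "
    "Client TLS handshake failed (from [21X.XX.XX.9]:39242)",
    "Context:",
    "(0) Handling new API client connection",
]
print(count_handshake_failures(sample))
```

For a real run, pass an open file object instead of `sample`, for example the master's main log (typically `/var/log/icinga2/icinga2.log` on Ubuntu packages, though that path is an assumption here).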

cah-jeremykuhn commented 6 years ago

Yeah, I didn't read too far down into that issue (my mistake), but I am getting the exact same log message errors, so it's most likely the same issue. I'll close this. Thanks for the quick response!