Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Icinga 2 Cluster Problem: endpoints are not connected while checks still work, version 2.10.4-1 #7075

Closed rozhao2 closed 5 years ago

rozhao2 commented 5 years ago

Describe the bug

I have two nodes in the master zone and 4 satellite zones; each satellite zone contains 2 satellites.

Master nodes: cu-icinga-01, cu-icinga-02. Satellites in one example zone: njg-icinga-01, njg-icinga-02.

Detailed configuration files are attached below.

After making a configuration change on cu-icinga-01 and reloading Icinga 2 ("systemctl reload icinga2"), there is a cluster problem:

Icinga 2 Cluster Problem: 2 endpoints are not connected. (njg-icinga-02, njg-icinga-01)

The affected endpoints are random.

On the satellites, the debug-level logs show that when the master node reloads, the API clients leave; afterwards the satellite keeps logging "Not connecting to Endpoint 'cu-icinga-01' because we're already trying to connect to it."

[2019-04-02 23:27:12 -0700] warning/TlsStream: TLS stream was disconnected.
[2019-04-02 23:27:12 -0700] warning/JsonRpcConnection: API client disconnected for identity 'cu-icinga-01'
[2019-04-02 23:27:12 -0700] warning/ApiListener: Removing API client for endpoint 'cu-icinga-01'. 0 API clients left.
[2019-04-02 23:27:12 -0700] debug/EndpointDbObject: update is_connected=0 for endpoint 'cu-icinga-01'

[2019-04-02 23:27:12 -0700] warning/TlsStream: TLS stream was disconnected.
[2019-04-02 23:27:31 -0700] warning/JsonRpcConnection: API client disconnected for identity 'cu-icinga-02'
[2019-04-02 23:27:31 -0700] warning/ApiListener: Removing API client for endpoint 'cu-icinga-02'. 0 API clients left.
[2019-04-02 23:27:31 -0700] debug/EndpointDbObject: update is_connected=0 for endpoint 'cu-icinga-02'

[2019-04-02 23:27:23 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-01' because we're already trying to connect to it.
[2019-04-02 23:27:33 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-01' because we're already trying to connect to it.
[2019-04-02 23:27:43 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-01' because we're already trying to connect to it.
[2019-04-02 23:27:43 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-02' because we're already trying to connect to it.
[2019-04-02 23:27:53 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-01' because we're already trying to connect to it.
[2019-04-02 23:27:53 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-02' because we're already trying to connect to it.
[2019-04-02 23:28:03 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-01' because we're already trying to connect to it.
[2019-04-02 23:28:03 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-02' because we're already trying to connect to it.
[2019-04-02 23:28:13 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-01' because we're already trying to connect to it.
[2019-04-02 23:28:13 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-02' because we're already trying to connect to it.
[2019-04-02 23:28:23 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-01' because we're already trying to connect to it.
[2019-04-02 23:28:23 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-02' because we're already trying to connect to it.
[2019-04-02 23:28:33 -0700] debug/ApiListener: Not connecting to Endpoint 'cu-icinga-01' because we're already trying to connect to it.

To Reproduce

  1. systemctl reload icinga2
  2. It is easier to reproduce after 24 hours have passed; when I retried the reload several times in a row, it was OK

Expected behavior

There should not be alerts of this type.

Screenshots

(screenshot attached)

Your Environment


cu-icinga-01:/etc/icinga2/zones.d/master# icinga2 --version 
icinga2 - The Icinga 2 network monitoring daemon (version: r2.10.4-1)

Copyright (c) 2012-2019 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
icingaweb2-common-2.5.3-1.el7.icinga.noarch
icingaweb2-vendor-HTMLPurifier-2.5.3-1.el7.icinga.noarch
icingaweb2-vendor-dompdf-2.5.3-1.el7.icinga.noarch
icingaweb2-vendor-Parsedown-2.5.3-1.el7.icinga.noarch
icingaweb2-vendor-JShrink-2.5.3-1.el7.icinga.noarch

.... some apply rule not match messages....

[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-EE' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 121:1-121:34) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-Luxembourg' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 136:1-136:42) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-jasper_hp_itu2' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 151:1-151:46) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-jasper_itu1' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 166:1-166:43) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-Etisalat' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 181:1-181:40) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-Mobily' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 196:1-196:38) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-QAHPSP1' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 211:1-211:39) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-Telkomsel' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 226:1-226:41) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-jasper_itu1' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 241:1-241:43) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-jasper_hp_itu2' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 256:1-256:46) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-Telus' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 271:1-271:37) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 13324 Services.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 2 LivestatusListeners.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 IcingaApplication.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 2566 Hosts.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 2 FileLoggers.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 6 NotificationCommands.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 52673 Notifications.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 NotificationComponent.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 52 HostGroups.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 ApiListener.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 78 Downtimes.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 2 Comments.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 CheckerComponent.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 7 Zones.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 10 Endpoints.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 3 ApiUsers.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 109 Users.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 279 CheckCommands.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 99 UserGroups.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 29 ServiceGroups.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 6 TimePeriods.
[2019-04-03 00:08:35 -0700] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2019-04-03 00:08:35 -0700] information/cli: Finished validating the configuration file(s).



### zones.conf on master nodes:

object Endpoint "cu-icinga-01" { host = "cu-icinga-01.i.jaspersystems.com" } object Endpoint "cu-icinga-02" { host = "cu-icinga-02.i.jaspersystems.com" } object Zone "master" { endpoints = [ "cu-icinga-01", "cu-icinga-02" ] } object Endpoint "nkg-icinga-01" { host = "nkg-icinga-01.i.jaspersystems.com" } object Endpoint "nkg-icinga-02" { host = "nkg-icinga-02.i.jaspersystems.com" } object Zone "DC16" { endpoints = [ "nkg-icinga-01", "nkg-icinga-02" ] parent = "master" } object Endpoint "can-icinga-01" { host = "can-icinga-01.i.jaspersystems.com" } object Endpoint "can-icinga-02" { host = "can-icinga-02.i.jaspersystems.com" } object Zone "DC17" { endpoints = [ "can-icinga-01", "can-icinga-02" ] parent = "master" } object Zone "global-templates" { global = true } object Zone "director-global" { global = true } object Endpoint "fuo-icinga-01" { host = "fuo-icinga-01.i.jaspersystems.com" } object Endpoint "fuo-icinga-02" { host = "fuo-icinga-02.i.jaspersystems.com" } object Zone "DC19" { endpoints = [ "fuo-icinga-01", "fuo-icinga-02" ] parent = "master" } object Endpoint "njg-icinga-01" { host = "njg-icinga-01.i.jaspersystems.com" } object Endpoint "njg-icinga-02" { host = "njg-icinga-02.i.jaspersystems.com" } object Zone "DC18" { endpoints = [ "njg-icinga-01", "njg-icinga-02" ] parent = "master" }


The zones.conf on the satellite nodes is the same as on the master nodes; I copied it from the master nodes:

object Endpoint "cu-icinga-01" { host = "cu-icinga-01.i.jaspersystems.com" } object Endpoint "cu-icinga-02" { host = "cu-icinga-02.i.jaspersystems.com" } object Zone "master" { endpoints = [ "cu-icinga-01", "cu-icinga-02" ] } object Endpoint "nkg-icinga-01" { host = "nkg-icinga-01.i.jaspersystems.com" } object Endpoint "nkg-icinga-02" { host = "nkg-icinga-02.i.jaspersystems.com" } object Zone "DC16" { endpoints = [ "nkg-icinga-01", "nkg-icinga-02" ] parent = "master" } object Endpoint "can-icinga-01" { host = "can-icinga-01.i.jaspersystems.com" } object Endpoint "can-icinga-02" { host = "can-icinga-02.i.jaspersystems.com" } object Zone "DC17" { endpoints = [ "can-icinga-01", "can-icinga-02" ] parent = "master" } object Zone "global-templates" { global = true } object Zone "director-global" { global = true } object Endpoint "fuo-icinga-01" { host = "fuo-icinga-01.i.jaspersystems.com" } object Endpoint "fuo-icinga-02" { host = "fuo-icinga-02.i.jaspersystems.com" } object Zone "DC19" { endpoints = [ "fuo-icinga-01", "fuo-icinga-02" ] parent = "master" } object Endpoint "njg-icinga-01" { host = "njg-icinga-01.i.jaspersystems.com" } object Endpoint "njg-icinga-02" { host = "njg-icinga-02.i.jaspersystems.com" } object Zone "DC18" { endpoints = [ "njg-icinga-01", "njg-icinga-02" ] parent = "master" }



## Additional context
I need to restart icinga2 on all endpoints that are not connected; then the cluster recovers.

The cluster was built with version 2.8.4 as a new cluster. I hit the "endpoints not connected" issue after reloading a master node, together with TLS handshake failure messages, then updated to 2.10.4-1 as suggested by the community; the logs now look as shown above.

Am I missing anything?

Thanks
rozhao2 commented 5 years ago

Today, just a few minutes ago, cu-icinga-01 was reloaded; this time it lost the connection to nkg-icinga-01. Error messages on cu-icinga-01:

[2019-04-03 18:25:40 -0700] information/ApiListener: Finished reconnecting to endpoint 'nkg-icinga-02' via host 'nkg-icinga-02.i.jaspersystems.com' and port '5665'
[2019-04-03 18:25:50 -0700] information/ApiListener: Reconnecting to endpoint 'nkg-icinga-02' via host 'nkg-icinga-02.i.jaspersystems.com' and port '5665'
[2019-04-03 18:26:00 -0700] critical/ApiListener: Client TLS handshake failed (to [10.35.60.238]:5665): Error: Timeout was reached (10) during TLS handshake.

        (0) Handling new API client connection

Context:
        (0) Handling new API client connection

[2019-04-03 18:26:00 -0700] information/ApiListener: Finished reconnecting to endpoint 'nkg-icinga-02' via host 'nkg-icinga-02.i.jaspersystems.com' and port '5665'
[2019-04-03 18:26:04 -0700] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2019-04-03 18:26:09 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate: 219.75/s (13185/min 68382/5min 141994/15min);
[2019-04-03 18:26:10 -0700] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 8/15min);
[2019-04-03 18:26:10 -0700] information/ApiListener: Reconnecting to endpoint 'nkg-icinga-02' via host 'nkg-icinga-02.i.jaspersystems.com' and port '5665'
[2019-04-03 18:26:10 -0700] information/WorkQueue: #7 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2019-04-03 18:26:20 -0700] critical/ApiListener: Client TLS handshake failed (to [10.35.60.238]:5665): Error: Timeout was reached (10) during TLS handshake.

        (0) Handling new API client connection

Context:
        (0) Handling new API client connection

[2019-04-03 18:26:20 -0700] information/ApiListener: Finished reconnecting to endpoint 'nkg-icinga-02' via host 'nkg-icinga-02.i.jaspersystems.com' and port '5665'

On the client node, the debug log was completely stuck until I restarted icinga2 on it.

There are no log entries between 2019-04-03 18:23:12 -0700 and 2019-04-03 18:42:09 -0700:

[2019-04-03 18:23:12 -0700] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from 'nkg-icinga-01'
[2019-04-03 18:23:12 -0700] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-04-03 18:23:12 -0700] notice/Process: PID 28305 ('/usr/lib64/nagios/plugins/check_nrpe' '-H' '10.35.56.38' '-c' 'check_oom' '-t' '60') terminated with exit code 0
[2019-04-03 18:42:09 -0700] information/FileLogger: 'debug-file' started.
[2019-04-03 18:42:09 -0700] information/FileLogger: 'main-log' started.
[2019-04-03 18:42:09 -0700] notice/WorkQueue: Spawning WorkQueue threads for 'ApiListener, RelayQueue'
[2019-04-03 18:42:09 -0700] notice/ApiListener: Relaying 'config::UpdateObject' message
[2019-04-03 18:42:09 -0700] notice/ApiListener: Relaying 'config::UpdateObject' message
[2019-04-03 18:42:09 -0700] notice/Downtime: Checkable 'pd9-nkgdb15!check-ucs_hardware_alert' already in a NOT-OK state. Triggering downtime now.
[2019-04-03 18:42:09 -0700] notice/ApiListener: Relaying 'config::UpdateObject' message

Actually, I am really confused... these logs look different from last time. Any suggestion is welcome.

thank you very much.

dnsmichi commented 5 years ago

One note for the configuration inside zones.conf - since you've copied it, each side (master and satellites) will now connect to each other.

The connection direction is controlled by the host attribute in the endpoint objects and should be configured in one direction only: either the satellite can reach the parent zone, in which case you set the host attribute for the master endpoints in the satellite's zones.conf, or vice versa.

Looks like the master is not able to connect to the satellites from your logs, and always runs into TLS handshake timeouts. This essentially blocks and eats quite a few CPU resources.

Instead, the satellites establish the connections already, and the check results run through. Your master(s) are just overloaded, fix the connection direction issue first before looking into anything else.

Cheers, Michael

rozhao2 commented 5 years ago

Thank you Michael,

Let me confirm that my understanding is correct.

The correct cluster behavior is that either the masters connect to the satellites or the satellites connect to the masters; they should not both connect to each other at the same time.

To disable bidirectional connecting, I should remove the host attribute from the satellites' endpoint objects on the master nodes, so that only the satellites connect to the master nodes.
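
For illustration, a minimal sketch of what that one-way setup could look like in zones.conf, reusing the endpoint names above (this only illustrates the direction just described and is not the exact configuration of this cluster):

object Endpoint "cu-icinga-01" { host = "cu-icinga-01.i.jaspersystems.com" }  // masters keep the host attribute, so the satellites dial them
object Endpoint "cu-icinga-02" { host = "cu-icinga-02.i.jaspersystems.com" }
object Zone "master" { endpoints = [ "cu-icinga-01", "cu-icinga-02" ] }

object Endpoint "njg-icinga-01" { }  // no host attribute: the masters never dial out to the satellites
object Endpoint "njg-icinga-02" { }
object Zone "DC18" {
  endpoints = [ "njg-icinga-01", "njg-icinga-02" ]
  parent = "master"
}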

Thanks.

dnsmichi commented 5 years ago

Here's the docs with some more details: https://icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/#endpoint-connection-direction

rozhao2 commented 5 years ago

Thank you Michael. I used v2.5.4 for a long time (about two years) and didn't notice this detail before.

I have corrected the configuration; let me watch it for one week. The master nodes now connect to the satellites, and the satellites do not connect to the master nodes.

Thanks.

rozhao2 commented 5 years ago

Hi Michael @dnsmichi, I have disabled the connection from the satellites to the masters, but the issue is still here. Let me try to describe it more clearly.

Describe the Issue

I reloaded the master node cu-icinga-01 with "systemctl reload icinga2". Then, on satellite njg-icinga-01, I saw cu-icinga-01 leave as an API client:

[2019-04-09 19:48:56 -0700] notice/ApiListener: Sending message 'event::CheckResult' to 'njg-icinga-02'
[2019-04-09 19:48:56 -0700] notice/ApiListener: Sending message 'event::CheckResult' to 'cu-icinga-01'
[2019-04-09 19:48:56 -0700] warning/TlsStream: TLS stream was disconnected.
[2019-04-09 19:48:56 -0700] warning/JsonRpcConnection: API client disconnected for identity 'cu-icinga-01'
[2019-04-09 19:48:56 -0700] warning/ApiListener: Removing API client for endpoint 'cu-icinga-01'. 0 API clients left.
[2019-04-09 19:48:56 -0700] debug/EndpointDbObject: update is_connected=0 for endpoint 'cu-icinga-01'

And with netstat I can see the master node attempting to connect to njg-icinga-01; 10.35.60.236 is the disconnected master node. It connects to the satellite's port 5665, then disconnects, then connects again, and keeps doing this.

 njg-icinga-01:/var/log/icinga2# netstat -anp | grep 5665 | grep EST                                                                    
tcp      289      0 10.60.89.57:5665        10.35.60.236:50338      ESTABLISHED 3747/icinga2        
tcp        0      0 10.60.89.57:5665        10.37.60.236:36464      ESTABLISHED 3747/icinga2        
tcp        0      0 10.60.89.57:5665        10.60.89.72:59548       ESTABLISHED 3747/icinga2                           
njg-icinga-01:/var/log/icinga2# netstat -anp | grep 5665 | grep EST
tcp        0      0 10.60.89.57:5665        10.37.60.236:36464      ESTABLISHED 3747/icinga2        
tcp        0      0 10.60.89.57:5665        10.60.89.72:59548       ESTABLISHED 3747/icinga2                          
njg-icinga-01:/var/log/icinga2# netstat -anp | grep 5665 | grep EST
tcp      289      0 10.60.89.57:5665        10.35.60.236:50342      ESTABLISHED 3747/icinga2        
tcp        0      0 10.60.89.57:5665        10.37.60.236:36464      ESTABLISHED 3747/icinga2        
tcp        0      0 10.60.89.57:5665        10.60.89.72:59548       ESTABLISHED 3747/icinga2    

On the master node, it keeps logging that it is trying to reconnect to the satellite:

[2019-04-09 19:49:08 -0700] debug/ApiListener: Not connecting to Endpoint 'njg-icinga-01' because we're already trying to connect to it.
[2019-04-09 19:49:08 -0700] information/ApiListener: Finished reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'
[2019-04-09 19:49:18 -0700] information/ApiListener: Reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'
[2019-04-09 19:49:28 -0700] debug/ApiListener: Not connecting to Endpoint 'njg-icinga-01' because we're already trying to connect to it.
[2019-04-09 19:49:28 -0700] information/ApiListener: Finished reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'
[2019-04-09 19:49:38 -0700] information/ApiListener: Reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'
[2019-04-09 19:49:48 -0700] debug/ApiListener: Not connecting to Endpoint 'njg-icinga-01' because we're already trying to connect to it.
[2019-04-09 19:49:48 -0700] information/ApiListener: Finished reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'
[2019-04-09 19:49:58 -0700] information/ApiListener: Reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'

From tcpdump, it looks like the master node disconnects the connection by itself because the satellite didn't send certificate information after the Hello (see attached screenshot).

The packet capture file is also attached: njg-icinga-01.pcap.zip

We can see that after the TCP connection was established, the master node sent a "Client Hello" and the satellite acknowledged it, but no certificate packets followed, so after 10 seconds the master node disconnected.

Could there be a problem in my environment?

Thanks.

rozhao2 commented 5 years ago

After restarting icinga2 on the satellite, I captured the correct behavior: the satellite sends a Server Hello to the master node after the ack packet (see attached screenshot).

The capture file is also attached; from packet 163 on, the handshake looks good: njg-icinga-01.good.pcap.zip

It looks like the issue is on the satellite node: it does not respond, which causes the timeout, and the master node disconnects the connection.

I am checking further on the satellite to see whether there is any issue at the system level.

dnsmichi commented 5 years ago

This is likely also related to the cluster and API overloads we've seen with 2.9 and 2.10, which we are aiming to solve with 2.11. If the problem persists, you can test the snapshot packages - all details are collected in #7041.

rozhao2 commented 5 years ago

Hm... not sure if this helps or not, but I found that satellites with too many open file handles fail to connect more easily, especially with many TCP sockets in CLOSE_WAIT.

I have set up a cronjob to collect "lsof | grep icinga2 | grep CLOSE | wc -l" every minute. At present, all my satellites have fewer than 11,000 open file handles; I tried reloading the master node several times, but the issue did not happen.
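
For reference, a minimal sketch of how such a cronjob could look (the cron file name and log path are placeholders, not the exact job in use):

# /etc/cron.d/icinga2-close-wait -- record the CLOSE_WAIT count once per minute
* * * * * root echo "$(date -R) $(lsof | grep icinga2 | grep CLOSE | wc -l)" >> /var/log/icinga2-close-wait.log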

I will try again tomorrow and check whether it is related to the number of open file descriptors; if it is, I think there might be a defect in closing connections.

icinga2   31848 31935  icinga   55u     IPv4          628258809       0t0        TCP njg-icinga-02:5665->10.60.89.10:44612 (CLOSE_WAIT)
icinga2   31848 31935  icinga   56u     IPv4          628284049       0t0        TCP njg-icinga-02:5665->10.60.89.10:50072 (CLOSE_WAIT)
icinga2   31848 31935  icinga   57u     IPv4          628280004       0t0        TCP njg-icinga-02:5665->10.60.89.10:35834 (CLOSE_WAIT)
icinga2   31848 31935  icinga   58u     IPv4          628283858       0t0        TCP njg-icinga-02:5665->10.60.89.10:45992 (CLOSE_WAIT)
icinga2   31848 31935  icinga   60u     IPv4          628299751       0t0        TCP njg-icinga-02:5665->10.60.89.10:57492 (CLOSE_WAIT)
icinga2   31848 31935  icinga   61u     IPv4          628300944       0t0        TCP njg-icinga-02:5665->10.60.89.10:59732 (CLOSE_WAIT)
icinga2   31848 31936  icinga   13u     IPv4          628265592       0t0        TCP njg-icinga-02:5665->10.60.89.10:56144 (CLOSE_WAIT)
icinga2   31848 31936  icinga   16u     IPv4          629206498       0t0        TCP njg-icinga-02:5665->cu-icinga-01:59512 (CLOSE_WAIT)
icinga2   31848 31936  icinga   17u     IPv4          628276666       0t0        TCP njg-icinga-02:5665->10.60.89.10:58512 (CLOSE_WAIT)
icinga2   31848 31936  icinga   43u     IPv4          626460544       0t0        TCP njg-icinga-02:5665->cu-icinga-02:57644 (CLOSE_WAIT)
icinga2   31848 31936  icinga   47u     IPv4          628261643       0t0        TCP njg-icinga-02:5665->10.60.89.10:50418 (CLOSE_WAIT)
icinga2   31848 31936  icinga   48u     IPv4          628265415       0t0        TCP njg-icinga-02:5665->10.60.89.10:42072 (CLOSE_WAIT)
icinga2   31848 31936  icinga   49u     IPv4          628260879       0t0        TCP njg-icinga-02:5665->10.60.89.10:48544 (CLOSE_WAIT)
icinga2   31848 31936  icinga   50u     IPv4          628272743       0t0        TCP njg-icinga-02:5665->10.60.89.10:33506 (CLOSE_WAIT)
icinga2   31848 31936  icinga   51u     IPv4          628269858       0t0        TCP njg-icinga-02:5665->10.60.89.10:36734 (CLOSE_WAIT)
icinga2   31848 31936  icinga   52u     IPv4          628283441       0t0        TCP njg-icinga-02:5665->10.60.89.10:39418 (CLOSE_WAIT)
icinga2   31848 31936  icinga   53u     IPv4          628283745       0t0        TCP njg-icinga-02:5665->10.60.89.10:45306 (CLOSE_WAIT)
icinga2   31848 31936  icinga   54u     IPv4          628279878       0t0        TCP njg-icinga-02:5665->10.60.89.10:60214 (CLOSE_WAIT)
icinga2   31848 31936  icinga   55u     IPv4          628258809       0t0        TCP njg-icinga-02:5665->10.60.89.10:44612 (CLOSE_WAIT)
icinga2   31848 31936  icinga   56u     IPv4          628284049       0t0        TCP njg-icinga-02:5665->10.60.89.10:50072 (CLOSE_WAIT)
icinga2   31848 31936  icinga   57u     IPv4          628280004       0t0        TCP njg-icinga-02:5665->10.60.89.10:35834 (CLOSE_WAIT)
icinga2   31848 31936  icinga   58u     IPv4          628283858       0t0        TCP njg-icinga-02:5665->10.60.89.10:45992 (CLOSE_WAIT)

Thanks.

rozhao2 commented 5 years ago

Today the issue happened again; the file descriptor count grew to more than 20,000.

I reloaded the master node at 16:31:01. There was no cluster alert at that time, but the number of open fds kept growing; at 18:01 the master node was reloaded again, and this satellite did not reconnect...

On the other satellites, the fd count is normal (i.e. similar to this one before 16:31).

Actually, I set file-max = 65535, but it seems icinga2 cannot handle TLS connections anymore.
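
As a rough sketch, the limits involved can be inspected like this (assuming a single icinga2 process; the per-process limit is an additional thing to check besides file-max):

# system-wide file descriptor limit (set to 65535 here)
sysctl fs.file-max
# per-process limit of the running icinga2 daemon (first PID only, for simplicity)
grep 'Max open files' /proc/$(pidof icinga2 | awk '{print $1}')/limits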

I downgraded half of the satellites to 2.8.1 to watch for one week.


Time | total fd | CLOSE_WAIT fd | icinga2 fd
---- | -------- | ------------- | ----------
Wed, 10 Apr 2019 16:18:01 -0700 | 7992          | 0       | 4006
Wed, 10 Apr 2019 16:19:01 -0700 | 7768          | 0       | 3852
Wed, 10 Apr 2019 16:20:02 -0700 | 7577          | 0       | 3551
Wed, 10 Apr 2019 16:21:01 -0700 | 7389          | 0       | 3656
Wed, 10 Apr 2019 16:22:01 -0700 | 7680          | 0       | 3747
Wed, 10 Apr 2019 16:23:01 -0700 | 7645          | 0       | 3815
Wed, 10 Apr 2019 16:24:01 -0700 | 7590          | 0       | 3743
Wed, 10 Apr 2019 16:25:01 -0700 | 7375          | 0       | 3578
Wed, 10 Apr 2019 16:26:01 -0700 | 7731          | 0       | 3656
Wed, 10 Apr 2019 16:27:01 -0700 | 7685          | 0       | 3756
Wed, 10 Apr 2019 16:28:01 -0700 | 8329          | 0       | 3971
Wed, 10 Apr 2019 16:29:01 -0700 | 8728          | 0       | 4410
Wed, 10 Apr 2019 16:30:01 -0700 | 7800          | 0       | 3751
Wed, 10 Apr 2019 16:31:01 -0700 | 8051          | 76          | 4003
Wed, 10 Apr 2019 16:32:01 -0700 | 8167          | 76          | 3943
Wed, 10 Apr 2019 16:33:01 -0700 | 9161          | 168         | 5049
Wed, 10 Apr 2019 16:34:01 -0700 | 9778          | 276         | 5523
Wed, 10 Apr 2019 16:35:01 -0700 | 10366         | 572         | 5927
Wed, 10 Apr 2019 16:36:01 -0700 | 10615         | 702         | 6477
Wed, 10 Apr 2019 16:37:01 -0700 | 10649         | 756         | 6415
Wed, 10 Apr 2019 16:38:01 -0700 | 10948         | 880         | 7121
Wed, 10 Apr 2019 16:39:01 -0700 | 11298         | 1026        | 7500
Wed, 10 Apr 2019 16:40:01 -0700 | 10921         | 1026        | 6912
Wed, 10 Apr 2019 16:41:01 -0700 | 11495         | 1121        | 7308
Wed, 10 Apr 2019 16:42:01 -0700 | 11525         | 1220        | 7493
Wed, 10 Apr 2019 16:43:01 -0700 | 12352         | 1220        | 8243
Wed, 10 Apr 2019 16:44:01 -0700 | 12195         | 1323        | 8040
Wed, 10 Apr 2019 16:45:02 -0700 | 11847         | 1386        | 7905
Wed, 10 Apr 2019 16:46:01 -0700 | 12064         | 1386        | 8000
Wed, 10 Apr 2019 16:47:01 -0700 | 12551         | 1495        | 8249
Wed, 10 Apr 2019 16:48:01 -0700 | 13068         | 1560        | 9050
Wed, 10 Apr 2019 16:49:01 -0700 | 12696         | 1560        | 8760
Wed, 10 Apr 2019 16:50:01 -0700 | 12508         | 1675        | 8368
Wed, 10 Apr 2019 16:51:01 -0700 | 15245         | 2054        | 11144
Wed, 10 Apr 2019 16:52:01 -0700 | 15326         | 2080        | 11042
Wed, 10 Apr 2019 16:53:01 -0700 | 17963         | 3312        | 14771
Wed, 10 Apr 2019 16:54:01 -0700 | 19509         | 4370        | 16145
Wed, 10 Apr 2019 16:55:01 -0700 | 19254         | 4370        | 14937
Wed, 10 Apr 2019 16:56:01 -0700 | 20914         | 5320        | 16187
Wed, 10 Apr 2019 16:57:01 -0700 | 20868         | 6175        | 16878
Wed, 10 Apr 2019 16:58:01 -0700 | 21628         | 6270        | 18496
Wed, 10 Apr 2019 16:59:01 -0700 | 21877         | 7220        | 18670
Wed, 10 Apr 2019 17:00:01 -0700 | 21803         | 7980        | 18326
Wed, 10 Apr 2019 17:01:01 -0700 | 23066         | 8170        | 19053
Wed, 10 Apr 2019 17:02:02 -0700 | 23879         | 9025        | 19499
...      
Wed, 10 Apr 2019 17:59:01 -0700 | 25831         | 11970       | 22198
Wed, 10 Apr 2019 18:00:01 -0700 | 25628         | 11970       | 21801
Wed, 10 Apr 2019 18:01:01 -0700 | 25967         | 11970       | 21991
Wed, 10 Apr 2019 18:02:01 -0700 | 26057         | 12160       | 21984
Wed, 10 Apr 2019 18:03:01 -0700 | 27289         | 12445       | 23701
Wed, 10 Apr 2019 18:04:01 -0700 | 26427         | 12730       | 23038
Wed, 10 Apr 2019 18:05:01 -0700 | 26655         | 13015       | 22800
Wed, 10 Apr 2019 18:06:01 -0700 | 28111         | 13300       | 23757
Wed, 10 Apr 2019 18:07:01 -0700 | 26177         | 13585       | 22275
Wed, 10 Apr 2019 18:08:01 -0700 | 27950         | 13870       | 25156
Wed, 10 Apr 2019 18:09:01 -0700 | 27802         | 14155       | 24151
Wed, 10 Apr 2019 18:10:01 -0700 | 28153         | 14440       | 24311
Wed, 10 Apr 2019 18:11:01 -0700 | 29163         | 14725       | 24977
Wed, 10 Apr 2019 18:12:01 -0700 | 30453         | 15010       | 26623
Wed, 10 Apr 2019 18:13:01 -0700 | 30063         | 15295       | 26339
Wed, 10 Apr 2019 18:14:01 -0700 | 29489         | 15580       | 25542
Wed, 10 Apr 2019 18:15:01 -0700 | 29926         | 15865       | 25880
Wed, 10 Apr 2019 18:16:01 -0700 | 30167         | 16150       | 26419
Wed, 10 Apr 2019 18:17:01 -0700 | 31427         | 16435       | 26824
Wed, 10 Apr 2019 18:18:01 -0700 | 31456         | 16720       | 28099
Wed, 10 Apr 2019 18:19:01 -0700 | 30882         | 17005       | 26911
Wed, 10 Apr 2019 18:20:01 -0700 | 31005         | 17290       | 27028
Wed, 10 Apr 2019 18:21:01 -0700 | 31831         | 17575       | 28014
Wed, 10 Apr 2019 18:22:01 -0700 | 31896         | 17860       | 28174
Wed, 10 Apr 2019 18:23:01 -0700 | 32776         | 18145       | 29408
Wed, 10 Apr 2019 18:24:01 -0700 | 32473         | 18430       | 28737
Wed, 10 Apr 2019 18:25:01 -0700 | 33513         | 18715       | 28505
Wed, 10 Apr 2019 18:26:01 -0700 | 33519         | 19000       | 29393
Wed, 10 Apr 2019 18:27:01 -0700 | 34358         | 19285       | 29559
Wed, 10 Apr 2019 18:28:01 -0700 | 35282         | 19570       | 30113
Wed, 10 Apr 2019 18:29:01 -0700 | 34132         | 19855       | 30028
Wed, 10 Apr 2019 18:30:01 -0700 | 34365         | 20140       | 29913
Wed, 10 Apr 2019 18:31:01 -0700 | 35145         | 20425       | 30809
Wed, 10 Apr 2019 18:32:01 -0700 | 35542         | 20710       | 30887
dnsmichi commented 5 years ago

So the master connects to the satellites, and leaves the connection open.

TCP njg-icinga-02:5665->10.60.89.10:44612 (CLOSE_WAIT)

Just curious, which FQDN points to 10.60.89.10? (dig -x 10.60.89.10).

Seems that sometimes this is the first master too.

TCP njg-icinga-02:5665->cu-icinga-01:59512 (CLOSE_WAIT)

Since you're using FQDNs for the host attribute as well, it may be worth a shot to eliminate DNS resolution problems and use the direct IPv4 addresses to connect to the satellites.

If you repeat this check on the master, how many stale sockets are open over there? This is likely not only the case for the satellites, but for the masters too.

And last but not least, if you can, please test the snapshot packages (master and satellite in this case). The parts with TLS connection handling have been rewritten from scratch.

rozhao2 commented 5 years ago

I checked DNS too, and it is OK. 10.60.89.10 is a security scanning server in our internal network; it probes all listening ports on all hosts, so it keeps connecting to and disconnecting from port 5665, and that server has no DNS record, which is why we only see its IP.

This indicates there is some problem with TCP connection handling, and I do think the master nodes have the same issue too, but I didn't collect fd counts on the master nodes; I can do that as a next step.

Hm... because this is our production monitoring system, it took a big effort last time to convince our team and get approval to upgrade from 2.5.4 to 2.8.4 to solve a segmentation fault crash, and now we have met this issue; I don't think I can get approval to use a snapshot version :(

Anyway, thank you very much. I will watch for one week to see whether version 2.8.1 has this issue or not; if not, I will downgrade to 2.8.1, as there is no other option...

dnsmichi commented 5 years ago

Uh.

> I checked DNS too, and it is OK. 10.60.89.10 is a security scanning server in our internal network; it probes all listening ports on all hosts, so it keeps connecting to and disconnecting from port 5665, and that server has no DNS record, which is why we only see its IP.

And we have a winner: this is likely the same issue as #6559 and #6562, where an external tool scans the API via TCP and HTTP and somehow the connections are not closed properly. This leaves stale connections all around, unrelated to the Icinga cluster connections themselves.

This is something which was hidden in 2.8.4 with corked connections, more visible with 2.9 and 2.10. A fix is in the making, that's why I've asked whether to test the snapshot packages. This really would help to know whether everything is resolved prior to any release :)

rozhao2 commented 5 years ago

Thank you Michael for confirming. Yes, after investigation I don't think it is a cluster connection issue.

It is stale connections that caused the cluster connection issue.

I think we can test this in a lab environment next week; I will do it like this:

  1. setup a test icinga2 node, 2.10.4 version
  2. setup a test icinga2 node, snapshot package

Then use an automation script that just connects to port 5665, disconnects, and repeats quickly, to see whether only 2.10.4 shows this phenomenon (too many CLOSE_WAIT fds).
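
As a rough sketch, the automation script could look something like this (the target host name is a placeholder for the test node, not a final script):

#!/usr/bin/env bash
# Open a TCP connection to the Icinga 2 API port and close it again immediately,
# over and over, mimicking the behaviour of the port scanner.
TARGET="icinga2-test-node"   # placeholder: the 2.10.4 or snapshot test node
PORT=5665
for i in $(seq 1 10000); do
    timeout 2 nc -z "$TARGET" "$PORT" >/dev/null 2>&1
done
# Afterwards, count the leftover sockets on the test node, as before:
#   lsof | grep icinga2 | grep CLOSE | wc -l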

By the way, so far I see that one of my 2.10.4 satellites has the same issue while all 2.8.1 satellites are OK. I will test again to see whether only this node fails to reconnect after a master node reload; this needs time to verify :(

Thanks.

rozhao2 commented 5 years ago

Update: two weeks have passed since downgrading to 2.8.1, and everything is OK:

  1. no cluster alerts anymore
  2. when I restart satellite nodes, there are no "Terminated" alerts anymore; I think that is good enough.

For the above two problems, I have seen that there are issues tracking them, so I want to close this issue to reduce duplicated topics.

Thanks Michael (@dnsmichi) for discussing this issue. I suggest icinga2 should provide an LTS version with no new features, just bug and security fixes; then users like me will feel safer :)

If providing an LTS requires extra effort, I would like to contribute where I can.

Thanks.

dnsmichi commented 5 years ago

LTS is on the roadmap, but nothing that directly affects the Icinga 2 core at the moment. 2.8.x won't get that anymore - if it runs for you, good, but you may encounter different bugs with it (notification handling, etc.) which will always require updated versions. 2.11 is a good starting point where many of the mentioned problems are being fixed. That is why I've asked you to test the snapshot packages, to be sure your ongoing problems are fixed :)

Cheers, Michael