Today, just a few minutes ago, cu-icinga-01 reloaded and this time it lost the connection with nkg-icinga-01. Error messages on cu-icinga-01:
[2019-04-03 18:25:40 -0700] information/ApiListener: Finished reconnecting to endpoint 'nkg-icinga-02' via host 'nkg-icinga-02.i.jaspersystems.com' and port '5665'
[2019-04-03 18:25:50 -0700] information/ApiListener: Reconnecting to endpoint 'nkg-icinga-02' via host 'nkg-icinga-02.i.jaspersystems.com' and port '5665'
[2019-04-03 18:26:00 -0700] critical/ApiListener: Client TLS handshake failed (to [10.35.60.238]:5665): Error: Timeout was reached (10) during TLS handshake.
(0) Handling new API client connection
Context:
(0) Handling new API client connection
[2019-04-03 18:26:00 -0700] information/ApiListener: Finished reconnecting to endpoint 'nkg-icinga-02' via host 'nkg-icinga-02.i.jaspersystems.com' and port '5665'
[2019-04-03 18:26:04 -0700] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2019-04-03 18:26:09 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate: 219.75/s (13185/min 68382/5min 141994/15min);
[2019-04-03 18:26:10 -0700] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 8/15min);
[2019-04-03 18:26:10 -0700] information/ApiListener: Reconnecting to endpoint 'nkg-icinga-02' via host 'nkg-icinga-02.i.jaspersystems.com' and port '5665'
[2019-04-03 18:26:10 -0700] information/WorkQueue: #7 (IdoMysqlConnection, ido-mysql) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2019-04-03 18:26:20 -0700] critical/ApiListener: Client TLS handshake failed (to [10.35.60.238]:5665): Error: Timeout was reached (10) during TLS handshake.
(0) Handling new API client connection
Context:
(0) Handling new API client connection
[2019-04-03 18:26:20 -0700] information/ApiListener: Finished reconnecting to endpoint 'nkg-icinga-02' via host 'nkg-icinga-02.i.jaspersystems.com' and port '5665'
On the client node, the debug log was completely stuck until I restarted icinga2 on it.
There are no logs between 2019-04-03 18:23:12 -0700 and 2019-04-03 18:42:09 -0700.
[2019-04-03 18:23:12 -0700] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from 'nkg-icinga-01'
[2019-04-03 18:23:12 -0700] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-04-03 18:23:12 -0700] notice/Process: PID 28305 ('/usr/lib64/nagios/plugins/check_nrpe' '-H' '10.35.56.38' '-c' 'check_oom' '-t' '60') terminated with exit code 0
[2019-04-03 18:42:09 -0700] information/FileLogger: 'debug-file' started.
[2019-04-03 18:42:09 -0700] information/FileLogger: 'main-log' started.
[2019-04-03 18:42:09 -0700] notice/WorkQueue: Spawning WorkQueue threads for 'ApiListener, RelayQueue'
[2019-04-03 18:42:09 -0700] notice/ApiListener: Relaying 'config::UpdateObject' message
[2019-04-03 18:42:09 -0700] notice/ApiListener: Relaying 'config::UpdateObject' message
[2019-04-03 18:42:09 -0700] notice/Downtime: Checkable 'pd9-nkgdb15!check-ucs_hardware_alert' already in a NOT-OK state. Triggering downtime now.
[2019-04-03 18:42:09 -0700] notice/ApiListener: Relaying 'config::UpdateObject' message
Actually, I am really confused; these logs look different from last time. Any suggestion is welcome.
Thank you very much.
One note on the configuration inside zones.conf - since you've copied it, each side (master and satellites) will now connect to each other. This is done by the host attribute in the endpoint objects, and it should be configured in one direction only. Either the satellite can reach the parent zone, in which case you configure the satellite's zones.conf with the host attribute on the master endpoints, or vice versa.
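For illustration only (reusing your endpoint names, and assuming you choose the satellite-to-master direction - pick whichever suits your network):

// satellite's zones.conf: the satellite actively connects to the masters
object Endpoint "cu-icinga-01" { host = "cu-icinga-01.i.jaspersystems.com" }
object Endpoint "cu-icinga-02" { host = "cu-icinga-02.i.jaspersystems.com" }

// master's zones.conf: no host attribute on the satellite endpoints,
// so the master only waits for incoming connections from them
object Endpoint "njg-icinga-01" { }
object Endpoint "njg-icinga-02" { }

The Zone objects stay identical on both sides; only the host attribute decides who dials out.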
Looks like the master is not able to connect to the satellites from your logs, and always runs into TLS handshake timeouts. This essentially blocks and eats quite a few CPU resources.
Instead, the satellites already establish the connections, and the check results run through. Your master(s) are simply overloaded; fix the connection direction issue first before looking into anything else.
Cheers, Michael
Thank you Michael,
Let me confirm that I have the correct understanding.
The correct cluster behavior is that either the masters connect to the satellites or the satellites connect to the masters; they should not both connect to each other at the same time.
To disable the bidirectional connections, I should remove the host attribute from the satellites' endpoint objects on the master nodes, and just keep the satellites connecting to the master nodes.
Thanks.
Here's the docs with some more details: https://icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/#endpoint-connection-direction
Thank you Michael. I used v2.5.4 for a long time (about two years) and didn't notice this detail before.
I have corrected the configuration; let me watch it for one week. Now the master nodes connect to the satellites, and the satellites do not connect to the master nodes.
Thanks.
Hi Michael @dnsmichi, I have disabled the connections from the satellites to the masters, but the issue is still here. Let me describe it more clearly.
I reloaded the master node cu-icinga-01 with "systemctl reload icinga2". Then, on satellite njg-icinga-01, I saw cu-icinga-01 leave as an API client:
[2019-04-09 19:48:56 -0700] notice/ApiListener: Sending message 'event::CheckResult' to 'njg-icinga-02'
[2019-04-09 19:48:56 -0700] notice/ApiListener: Sending message 'event::CheckResult' to 'cu-icinga-01'
[2019-04-09 19:48:56 -0700] warning/TlsStream: TLS stream was disconnected.
[2019-04-09 19:48:56 -0700] warning/JsonRpcConnection: API client disconnected for identity 'cu-icinga-01'
[2019-04-09 19:48:56 -0700] warning/ApiListener: Removing API client for endpoint 'cu-icinga-01'. 0 API clients left.
[2019-04-09 19:48:56 -0700] debug/EndpointDbObject: update is_connected=0 for endpoint 'cu-icinga-01'
And with netstat I can see the master node attempting to connect to njg-icinga-01. 10.35.60.236 is the lost master node; it tries to connect to the satellite's port 5665, then disconnects, then connects again, and keeps doing this.
njg-icinga-01:/var/log/icinga2# netstat -anp | grep 5665 | grep EST
tcp 289 0 10.60.89.57:5665 10.35.60.236:50338 ESTABLISHED 3747/icinga2
tcp 0 0 10.60.89.57:5665 10.37.60.236:36464 ESTABLISHED 3747/icinga2
tcp 0 0 10.60.89.57:5665 10.60.89.72:59548 ESTABLISHED 3747/icinga2
njg-icinga-01:/var/log/icinga2# netstat -anp | grep 5665 | grep EST
tcp 0 0 10.60.89.57:5665 10.37.60.236:36464 ESTABLISHED 3747/icinga2
tcp 0 0 10.60.89.57:5665 10.60.89.72:59548 ESTABLISHED 3747/icinga2
njg-icinga-01:/var/log/icinga2# netstat -anp | grep 5665 | grep EST
tcp 289 0 10.60.89.57:5665 10.35.60.236:50342 ESTABLISHED 3747/icinga2
tcp 0 0 10.60.89.57:5665 10.37.60.236:36464 ESTABLISHED 3747/icinga2
tcp 0 0 10.60.89.57:5665 10.60.89.72:59548 ESTABLISHED 3747/icinga2
On the master node, it keeps printing that it is trying to reconnect to the satellite:
[2019-04-09 19:49:08 -0700] debug/ApiListener: Not connecting to Endpoint 'njg-icinga-01' because we're already trying to connect to it.
[2019-04-09 19:49:08 -0700] information/ApiListener: Finished reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'
[2019-04-09 19:49:18 -0700] information/ApiListener: Reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'
[2019-04-09 19:49:28 -0700] debug/ApiListener: Not connecting to Endpoint 'njg-icinga-01' because we're already trying to connect to it.
[2019-04-09 19:49:28 -0700] information/ApiListener: Finished reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'
[2019-04-09 19:49:38 -0700] information/ApiListener: Reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'
[2019-04-09 19:49:48 -0700] debug/ApiListener: Not connecting to Endpoint 'njg-icinga-01' because we're already trying to connect to it.
[2019-04-09 19:49:48 -0700] information/ApiListener: Finished reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'
[2019-04-09 19:49:58 -0700] information/ApiListener: Reconnecting to endpoint 'njg-icinga-01' via host 'njg-icinga-01.i.jaspersystems.com' and port '5665'
From tcpdump, it looks like the master node disconnects the connection by itself because the satellite didn't send certificate information after the Hello:
Also attached the packet capture file: njg-icinga-01.pcap.zip
We can see that after the TCP connection was established, the master node sent a "Client Hello" and the satellite only acknowledged it, but no certificate packets followed, so after 10 seconds the master node disconnected.
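For reference, the capture was taken with something along these lines (the interface name and host filter are from my environment, adjust as needed):

# capture the cluster traffic between this satellite and the lost master on port 5665
tcpdump -i eth0 -s 0 -w /var/tmp/njg-icinga-01.pcap 'tcp port 5665 and host 10.35.60.236'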
Is there any possible problem in my environment?
Thanks.
After restarting icinga2 on the satellite, I captured the correct behavior; the satellite should send a Server Hello to the master node after the ACK packet:
Also attached the capture file; from packet 163 onward it looks good: njg-icinga-01.good.pcap.zip
It looks like the issue is on the satellite node: it does not respond, which causes the timeout, and the master node then disconnects.
I am checking the satellite further to see whether there is any issue at the system level.
Likely this is also related to the cluster and API overloads we've seen with 2.9 and 2.10, which we are aiming to solve with 2.11. You can test the snapshot packages if the problem persists - all details are collected in #7041.
Hm.. not sure if this helps or not, but I found that satellites with too many open file handles fail to connect easily, especially when there are many TCP connections in CLOSE_WAIT.
I have set up a cron job to collect "lsof | grep icinga2 | grep CLOSE | wc -l" every minute. At present, all my satellites have fewer than 11,000 open file handles; I tried reloading the master node several times, but the issue did not happen.
I will try again tomorrow and check whether it is related to the number of open file descriptors; if it is, I think there might be a defect in closing connections.
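Roughly, the cron entry looks like this (the output path is just where I chose to write it):

# root's crontab: log the CLOSE_WAIT count for icinga2 once per minute
* * * * * echo "$(date -R) | $(lsof -n 2>/dev/null | grep icinga2 | grep CLOSE | wc -l)" >> /var/log/icinga2-close-wait.log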
icinga2 31848 31935 icinga 55u IPv4 628258809 0t0 TCP njg-icinga-02:5665->10.60.89.10:44612 (CLOSE_WAIT)
icinga2 31848 31935 icinga 56u IPv4 628284049 0t0 TCP njg-icinga-02:5665->10.60.89.10:50072 (CLOSE_WAIT)
icinga2 31848 31935 icinga 57u IPv4 628280004 0t0 TCP njg-icinga-02:5665->10.60.89.10:35834 (CLOSE_WAIT)
icinga2 31848 31935 icinga 58u IPv4 628283858 0t0 TCP njg-icinga-02:5665->10.60.89.10:45992 (CLOSE_WAIT)
icinga2 31848 31935 icinga 60u IPv4 628299751 0t0 TCP njg-icinga-02:5665->10.60.89.10:57492 (CLOSE_WAIT)
icinga2 31848 31935 icinga 61u IPv4 628300944 0t0 TCP njg-icinga-02:5665->10.60.89.10:59732 (CLOSE_WAIT)
icinga2 31848 31936 icinga 13u IPv4 628265592 0t0 TCP njg-icinga-02:5665->10.60.89.10:56144 (CLOSE_WAIT)
icinga2 31848 31936 icinga 16u IPv4 629206498 0t0 TCP njg-icinga-02:5665->cu-icinga-01:59512 (CLOSE_WAIT)
icinga2 31848 31936 icinga 17u IPv4 628276666 0t0 TCP njg-icinga-02:5665->10.60.89.10:58512 (CLOSE_WAIT)
icinga2 31848 31936 icinga 43u IPv4 626460544 0t0 TCP njg-icinga-02:5665->cu-icinga-02:57644 (CLOSE_WAIT)
icinga2 31848 31936 icinga 47u IPv4 628261643 0t0 TCP njg-icinga-02:5665->10.60.89.10:50418 (CLOSE_WAIT)
icinga2 31848 31936 icinga 48u IPv4 628265415 0t0 TCP njg-icinga-02:5665->10.60.89.10:42072 (CLOSE_WAIT)
icinga2 31848 31936 icinga 49u IPv4 628260879 0t0 TCP njg-icinga-02:5665->10.60.89.10:48544 (CLOSE_WAIT)
icinga2 31848 31936 icinga 50u IPv4 628272743 0t0 TCP njg-icinga-02:5665->10.60.89.10:33506 (CLOSE_WAIT)
icinga2 31848 31936 icinga 51u IPv4 628269858 0t0 TCP njg-icinga-02:5665->10.60.89.10:36734 (CLOSE_WAIT)
icinga2 31848 31936 icinga 52u IPv4 628283441 0t0 TCP njg-icinga-02:5665->10.60.89.10:39418 (CLOSE_WAIT)
icinga2 31848 31936 icinga 53u IPv4 628283745 0t0 TCP njg-icinga-02:5665->10.60.89.10:45306 (CLOSE_WAIT)
icinga2 31848 31936 icinga 54u IPv4 628279878 0t0 TCP njg-icinga-02:5665->10.60.89.10:60214 (CLOSE_WAIT)
icinga2 31848 31936 icinga 55u IPv4 628258809 0t0 TCP njg-icinga-02:5665->10.60.89.10:44612 (CLOSE_WAIT)
icinga2 31848 31936 icinga 56u IPv4 628284049 0t0 TCP njg-icinga-02:5665->10.60.89.10:50072 (CLOSE_WAIT)
icinga2 31848 31936 icinga 57u IPv4 628280004 0t0 TCP njg-icinga-02:5665->10.60.89.10:35834 (CLOSE_WAIT)
icinga2 31848 31936 icinga 58u IPv4 628283858 0t0 TCP njg-icinga-02:5665->10.60.89.10:45992 (CLOSE_WAIT)
Thanks.
Today the issue happened again; the file descriptor count grew to more than 20,000.
I reloaded the master node at 16:31:01. There was no cluster alert at that time, but the number of open fds kept growing; at 18:01, when the master node was reloaded again, this satellite did not reconnect...
For the other satellites, the fd count is normal (similar to this one before 16:31).
Actually, I set file-max = 65535, but it seems icinga2 cannot handle TLS connections anymore.
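For reference, this is roughly how I check the limits that actually apply to the running icinga2 process (assuming systemd; the pid lookup may differ on your setup):

sysctl fs.file-max                                  # system-wide file handle limit
MAIN_PID=$(systemctl show icinga2 -p MainPID | cut -d= -f2)
grep 'Max open files' /proc/${MAIN_PID}/limits      # per-process limit icinga2 actually runs with
ls /proc/${MAIN_PID}/fd | wc -l                     # fds currently open by the main process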
I downgraded half of the satellites to 2.8.1 to watch for one week.
Time | total fd | CLOSE_WAIT fd | icinga2 fd
-----------------------------------------------------------------------------------------------------
Wed, 10 Apr 2019 16:18:01 -0700 | 7992 | 0 | 4006
Wed, 10 Apr 2019 16:19:01 -0700 | 7768 | 0 | 3852
Wed, 10 Apr 2019 16:20:02 -0700 | 7577 | 0 | 3551
Wed, 10 Apr 2019 16:21:01 -0700 | 7389 | 0 | 3656
Wed, 10 Apr 2019 16:22:01 -0700 | 7680 | 0 | 3747
Wed, 10 Apr 2019 16:23:01 -0700 | 7645 | 0 | 3815
Wed, 10 Apr 2019 16:24:01 -0700 | 7590 | 0 | 3743
Wed, 10 Apr 2019 16:25:01 -0700 | 7375 | 0 | 3578
Wed, 10 Apr 2019 16:26:01 -0700 | 7731 | 0 | 3656
Wed, 10 Apr 2019 16:27:01 -0700 | 7685 | 0 | 3756
Wed, 10 Apr 2019 16:28:01 -0700 | 8329 | 0 | 3971
Wed, 10 Apr 2019 16:29:01 -0700 | 8728 | 0 | 4410
Wed, 10 Apr 2019 16:30:01 -0700 | 7800 | 0 | 3751
Wed, 10 Apr 2019 16:31:01 -0700 | 8051 | 76 | 4003
Wed, 10 Apr 2019 16:32:01 -0700 | 8167 | 76 | 3943
Wed, 10 Apr 2019 16:33:01 -0700 | 9161 | 168 | 5049
Wed, 10 Apr 2019 16:34:01 -0700 | 9778 | 276 | 5523
Wed, 10 Apr 2019 16:35:01 -0700 | 10366 | 572 | 5927
Wed, 10 Apr 2019 16:36:01 -0700 | 10615 | 702 | 6477
Wed, 10 Apr 2019 16:37:01 -0700 | 10649 | 756 | 6415
Wed, 10 Apr 2019 16:38:01 -0700 | 10948 | 880 | 7121
Wed, 10 Apr 2019 16:39:01 -0700 | 11298 | 1026 | 7500
Wed, 10 Apr 2019 16:40:01 -0700 | 10921 | 1026 | 6912
Wed, 10 Apr 2019 16:41:01 -0700 | 11495 | 1121 | 7308
Wed, 10 Apr 2019 16:42:01 -0700 | 11525 | 1220 | 7493
Wed, 10 Apr 2019 16:43:01 -0700 | 12352 | 1220 | 8243
Wed, 10 Apr 2019 16:44:01 -0700 | 12195 | 1323 | 8040
Wed, 10 Apr 2019 16:45:02 -0700 | 11847 | 1386 | 7905
Wed, 10 Apr 2019 16:46:01 -0700 | 12064 | 1386 | 8000
Wed, 10 Apr 2019 16:47:01 -0700 | 12551 | 1495 | 8249
Wed, 10 Apr 2019 16:48:01 -0700 | 13068 | 1560 | 9050
Wed, 10 Apr 2019 16:49:01 -0700 | 12696 | 1560 | 8760
Wed, 10 Apr 2019 16:50:01 -0700 | 12508 | 1675 | 8368
Wed, 10 Apr 2019 16:51:01 -0700 | 15245 | 2054 | 11144
Wed, 10 Apr 2019 16:52:01 -0700 | 15326 | 2080 | 11042
Wed, 10 Apr 2019 16:53:01 -0700 | 17963 | 3312 | 14771
Wed, 10 Apr 2019 16:54:01 -0700 | 19509 | 4370 | 16145
Wed, 10 Apr 2019 16:55:01 -0700 | 19254 | 4370 | 14937
Wed, 10 Apr 2019 16:56:01 -0700 | 20914 | 5320 | 16187
Wed, 10 Apr 2019 16:57:01 -0700 | 20868 | 6175 | 16878
Wed, 10 Apr 2019 16:58:01 -0700 | 21628 | 6270 | 18496
Wed, 10 Apr 2019 16:59:01 -0700 | 21877 | 7220 | 18670
Wed, 10 Apr 2019 17:00:01 -0700 | 21803 | 7980 | 18326
Wed, 10 Apr 2019 17:01:01 -0700 | 23066 | 8170 | 19053
Wed, 10 Apr 2019 17:02:02 -0700 | 23879 | 9025 | 19499
...
Wed, 10 Apr 2019 17:59:01 -0700 | 25831 | 11970 | 22198
Wed, 10 Apr 2019 18:00:01 -0700 | 25628 | 11970 | 21801
Wed, 10 Apr 2019 18:01:01 -0700 | 25967 | 11970 | 21991
Wed, 10 Apr 2019 18:02:01 -0700 | 26057 | 12160 | 21984
Wed, 10 Apr 2019 18:03:01 -0700 | 27289 | 12445 | 23701
Wed, 10 Apr 2019 18:04:01 -0700 | 26427 | 12730 | 23038
Wed, 10 Apr 2019 18:05:01 -0700 | 26655 | 13015 | 22800
Wed, 10 Apr 2019 18:06:01 -0700 | 28111 | 13300 | 23757
Wed, 10 Apr 2019 18:07:01 -0700 | 26177 | 13585 | 22275
Wed, 10 Apr 2019 18:08:01 -0700 | 27950 | 13870 | 25156
Wed, 10 Apr 2019 18:09:01 -0700 | 27802 | 14155 | 24151
Wed, 10 Apr 2019 18:10:01 -0700 | 28153 | 14440 | 24311
Wed, 10 Apr 2019 18:11:01 -0700 | 29163 | 14725 | 24977
Wed, 10 Apr 2019 18:12:01 -0700 | 30453 | 15010 | 26623
Wed, 10 Apr 2019 18:13:01 -0700 | 30063 | 15295 | 26339
Wed, 10 Apr 2019 18:14:01 -0700 | 29489 | 15580 | 25542
Wed, 10 Apr 2019 18:15:01 -0700 | 29926 | 15865 | 25880
Wed, 10 Apr 2019 18:16:01 -0700 | 30167 | 16150 | 26419
Wed, 10 Apr 2019 18:17:01 -0700 | 31427 | 16435 | 26824
Wed, 10 Apr 2019 18:18:01 -0700 | 31456 | 16720 | 28099
Wed, 10 Apr 2019 18:19:01 -0700 | 30882 | 17005 | 26911
Wed, 10 Apr 2019 18:20:01 -0700 | 31005 | 17290 | 27028
Wed, 10 Apr 2019 18:21:01 -0700 | 31831 | 17575 | 28014
Wed, 10 Apr 2019 18:22:01 -0700 | 31896 | 17860 | 28174
Wed, 10 Apr 2019 18:23:01 -0700 | 32776 | 18145 | 29408
Wed, 10 Apr 2019 18:24:01 -0700 | 32473 | 18430 | 28737
Wed, 10 Apr 2019 18:25:01 -0700 | 33513 | 18715 | 28505
Wed, 10 Apr 2019 18:26:01 -0700 | 33519 | 19000 | 29393
Wed, 10 Apr 2019 18:27:01 -0700 | 34358 | 19285 | 29559
Wed, 10 Apr 2019 18:28:01 -0700 | 35282 | 19570 | 30113
Wed, 10 Apr 2019 18:29:01 -0700 | 34132 | 19855 | 30028
Wed, 10 Apr 2019 18:30:01 -0700 | 34365 | 20140 | 29913
Wed, 10 Apr 2019 18:31:01 -0700 | 35145 | 20425 | 30809
Wed, 10 Apr 2019 18:32:01 -0700 | 35542 | 20710 | 30887
So the master connects to the satellites, and leaves the connection open.
TCP njg-icinga-02:5665->10.60.89.10:44612 (CLOSE_WAIT)
Just curious, which FQDN points to 10.60.89.10? (dig -x 10.60.89.10).
Seems that sometimes this is the first master too.
TCP njg-icinga-02:5665->cu-icinga-01:59512 (CLOSE_WAIT)
Since you're using FQDNs for the host attribute as well, it may be worth a shot to eliminate DNS resolution problems and use the direct IPv4 addresses to connect to the satellites.
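For example (taking the satellite address from your netstat output above - please double-check it), the master's zones.conf could use:

object Endpoint "njg-icinga-01" { host = "10.60.89.57" }   // IPv4 address instead of the FQDN, bypassing DNS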
If you repeat this check on the master, how many stale sockets are open over there? This is likely not only the case for the satellites, but for the masters too.
And last but not least, if you can, please test the snapshot packages (master and satellite in this case). The parts with TLS connection handling have been rewritten from scratch.
I checked DNS too; it is OK. 10.60.89.10 is a security scanning server in our internal network which probes all listening ports on all hosts, so it keeps connecting to and disconnecting from port 5665. That server has no DNS records, so we can only see its IP.
This indicates there is some problem with TCP connection handling, and I do think the master nodes should have the same issue too, but I didn't collect fd counts on the master nodes; I can do that as a next step.
Hm.. because this is our production monitoring system, last time it took a big effort to convince our team and get approval to upgrade from 2.5.4 to 2.8.4 to solve a segmentation fault crash issue. Having hit this issue, I don't think I can get approval to use a snapshot version :(
Anyway, thank you very much. I will watch for one week to see whether 2.8.1 has this issue or not; if it does not, I will downgrade to 2.8.1 - no other choice...
Uh.
I checked DNS too; it is OK. 10.60.89.10 is a security scanning server in our internal network which probes all listening ports on all hosts, so it keeps connecting to and disconnecting from port 5665. That server has no DNS records, so we can only see its IP.
And we have a winner, this is likely the same issue as with #6559 and #6562 where an external tool scans the API via TCP and HTTP, and somehow connections are not closed properly. This leaves stale connections all around, nothing related to Icinga cluster connections on their own.
This is something which was hidden in 2.8.4 with corked connections and became more visible with 2.9 and 2.10. A fix is in the making; that's why I've asked whether you could test the snapshot packages. It would really help to know whether everything is resolved prior to any release :)
Thank you Michael for confirming. Yes, after investigation I don't think it is a cluster connection issue in itself;
it is the stale connections that caused the cluster connection issue.
I think we can test this in a lab environment next week. I will do it like this:
use an automation script that just connects to port 5665, disconnects, and repeats quickly, to see whether only 2.10.4 shows this phenomenon (too many CLOSE_WAIT fds).
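Something like this quick loop should be enough to simulate the scanner (the hostname is a placeholder, and it assumes nc is installed):

# open a TCP connection to the API port, close it immediately, and repeat as fast as possible
while true; do
    nc -w 1 satellite.example.com 5665 < /dev/null > /dev/null
done

Then I can watch the CLOSE_WAIT count with the same lsof pipeline as before.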
By the way, so far I see that one of my 2.10.4 satellites has the same issue while all 2.8.1 satellites are OK. I will test again and check whether only this node fails to reconnect after a master node reload; this needs time to verify :(
Thanks.
Update: two weeks have passed since the downgrade to 2.8.1 and everything is OK:
when I restart satellite nodes, there are no "Terminated" alerts anymore. I think that is good enough.
For the two issues above, I have seen that there are already issues tracking them, so I want to close this one to reduce duplicated topics.
Thanks Michael (@dnsmichi) for discussing this issue. I suggest icinga2 should provide an LTS version: no new features, just bug and security fixes. Users like me would feel safer then :)
If providing an LTS takes extra effort, I would like to contribute where I can.
Thanks.
LTS is on the roadmap, but it is nothing which directly affects Icinga 2 as the core at the moment. 2.8.x won't get that anymore - if it runs for you, good, but you may encounter different bugs with it (notification handling, etc.) which will always require updated versions. 2.11 is a good starting point where many of the mentioned problems are being fixed. That is why I've asked you to test the snapshot packages, to be sure your ongoing problems are fixed :)
Cheers, Michael
Describe the bug
I have two nodes in the master zone and 4 satellite zones; in each zone there are 2 satellites.
Master nodes: cu-icinga-01, cu-icinga-02
2 satellites in the sample zone: njg-icinga-01, njg-icinga-02
Detailed configuration files are attached below.
After making a configuration change on cu-icinga-01 and reloading icinga2 with "systemctl reload icinga2", there will be a cluster problem:
The affected endpoints are random.
On the satellites, the debug logs show that when the master node reloads, the API client leaves, and then it keeps logging "Not connecting to Endpoint 'cu-icinga-01' because we're already trying to connect to it."
To Reproduce
Expected behavior
It should not produce this type of alert.
Screenshots
Your Environment
Include as many relevant details about the environment you experienced the problem in
Version used (icinga2 --version):
Operating System and version:
Enabled features (icinga2 feature list):
Icinga Web 2 version and modules (System - About):
Config validation (icinga2 daemon -C):
.... some apply rule not match messages ....
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-EE' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 121:1-121:34) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-Luxembourg' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 136:1-136:42) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-jasper_hp_itu2' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 151:1-151:46) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-jasper_itu1' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 166:1-166:43) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-Etisalat' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 181:1-181:40) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-Mobily' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 196:1-196:38) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-QAHPSP1' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 211:1-211:39) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-Telkomsel' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 226:1-226:41) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-jasper_itu1' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 241:1-241:43) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-jasper_hp_itu2' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 256:1-256:46) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] warning/ApplyRule: Apply rule 'check-v2hlrlink-Telus' (in /etc/icinga2/zones.d/global-templates/services-v2hlrlink.conf: 271:1-271:37) for type 'Service' does not match anywhere!
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 13324 Services.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 2 LivestatusListeners.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 IcingaApplication.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 2566 Hosts.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 2 FileLoggers.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 6 NotificationCommands.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 52673 Notifications.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 NotificationComponent.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 52 HostGroups.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 ApiListener.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 78 Downtimes.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 2 Comments.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 CheckerComponent.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 7 Zones.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 10 Endpoints.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 3 ApiUsers.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 109 Users.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 279 CheckCommands.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 99 UserGroups.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 29 ServiceGroups.
[2019-04-03 00:08:35 -0700] information/ConfigItem: Instantiated 6 TimePeriods.
[2019-04-03 00:08:35 -0700] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2019-04-03 00:08:35 -0700] information/cli: Finished validating the configuration file(s).
object Endpoint "cu-icinga-01" { host = "cu-icinga-01.i.jaspersystems.com" } object Endpoint "cu-icinga-02" { host = "cu-icinga-02.i.jaspersystems.com" } object Zone "master" { endpoints = [ "cu-icinga-01", "cu-icinga-02" ] } object Endpoint "nkg-icinga-01" { host = "nkg-icinga-01.i.jaspersystems.com" } object Endpoint "nkg-icinga-02" { host = "nkg-icinga-02.i.jaspersystems.com" } object Zone "DC16" { endpoints = [ "nkg-icinga-01", "nkg-icinga-02" ] parent = "master" } object Endpoint "can-icinga-01" { host = "can-icinga-01.i.jaspersystems.com" } object Endpoint "can-icinga-02" { host = "can-icinga-02.i.jaspersystems.com" } object Zone "DC17" { endpoints = [ "can-icinga-01", "can-icinga-02" ] parent = "master" } object Zone "global-templates" { global = true } object Zone "director-global" { global = true } object Endpoint "fuo-icinga-01" { host = "fuo-icinga-01.i.jaspersystems.com" } object Endpoint "fuo-icinga-02" { host = "fuo-icinga-02.i.jaspersystems.com" } object Zone "DC19" { endpoints = [ "fuo-icinga-01", "fuo-icinga-02" ] parent = "master" } object Endpoint "njg-icinga-01" { host = "njg-icinga-01.i.jaspersystems.com" } object Endpoint "njg-icinga-02" { host = "njg-icinga-02.i.jaspersystems.com" } object Zone "DC18" { endpoints = [ "njg-icinga-01", "njg-icinga-02" ] parent = "master" }
object Endpoint "cu-icinga-01" { host = "cu-icinga-01.i.jaspersystems.com" } object Endpoint "cu-icinga-02" { host = "cu-icinga-02.i.jaspersystems.com" } object Zone "master" { endpoints = [ "cu-icinga-01", "cu-icinga-02" ] } object Endpoint "nkg-icinga-01" { host = "nkg-icinga-01.i.jaspersystems.com" } object Endpoint "nkg-icinga-02" { host = "nkg-icinga-02.i.jaspersystems.com" } object Zone "DC16" { endpoints = [ "nkg-icinga-01", "nkg-icinga-02" ] parent = "master" } object Endpoint "can-icinga-01" { host = "can-icinga-01.i.jaspersystems.com" } object Endpoint "can-icinga-02" { host = "can-icinga-02.i.jaspersystems.com" } object Zone "DC17" { endpoints = [ "can-icinga-01", "can-icinga-02" ] parent = "master" } object Zone "global-templates" { global = true } object Zone "director-global" { global = true } object Endpoint "fuo-icinga-01" { host = "fuo-icinga-01.i.jaspersystems.com" } object Endpoint "fuo-icinga-02" { host = "fuo-icinga-02.i.jaspersystems.com" } object Zone "DC19" { endpoints = [ "fuo-icinga-01", "fuo-icinga-02" ] parent = "master" } object Endpoint "njg-icinga-01" { host = "njg-icinga-01.i.jaspersystems.com" } object Endpoint "njg-icinga-02" { host = "njg-icinga-02.i.jaspersystems.com" } object Zone "DC18" { endpoints = [ "njg-icinga-01", "njg-icinga-02" ] parent = "master" }