Closed Tontonitch closed 6 years ago
It seems that after some time (about 2 hours in the occurrence) the connection is back.
Trying to reproduce the issue...
Hi,
I faced a similar issue in the last week in our environment at work, unfortunately I didn't had the time to investigate on the issue in deep.
For further tests can you please share more details about your setup:
Thanks!
Kind regards Michael
Hello Michael,
/* -------------------------------
Master Endpoints & Zone
--------------------------------*/
object Endpoint "monitorsrv1.communication1.ntld" {
host = "monitorsrv1.communication1.ntld"
}
object Zone "master" {
endpoints = [ "monitorsrv1.communication1.ntld" ]
}
object Zone "global-templates" {
global = true
}
object Zone "director-global" {
global = true
}
/* -------------------------------
Client Endpoints
--------------------------------*/
include "local.d/zones/*.conf"
/etc/icinga2/local.d/zones/xxxxhostxxxx.conf
object Endpoint "xxxxhostxxxx" {
host = "xxx.xxx.xxx.xxx"
log_duration = 0
}
object Zone "xxxxhostxxxx" {
parent = "master"
endpoints = [ "xxxxhostxxxx" ]
}
on client:
object Endpoint NodeName {
host = NodeName
}
object Zone ZoneName {
endpoints = [ NodeName ]
}
Reproduced by a simple Windows server (icinga2 client) restart. I'm waiting to see when it will be re-connected to the icinga2 server.
Exactly the same delay to re-connect to the master (about 2 hours).
Thank you for the additional details.
Your zones.conf on your clients is missing the master endpoint & zone. See the example in the documentation, the client must know about the master.
How often does the check run?
Reproduced by a simple Windows server (icinga2 client) restart.
Did you restart the entire system or did you stopped the icinga2 service on the windows machine and just started it again?
Your zones.conf on your clients is missing the master endpoint & zone. See the example in the documentation, the client must know about the master.
on the client, master endpoint & zone are located in the icinga2.conf file (I use a old adapted version of the powershell script https://github.com/Icinga/icinga2-powershell-module to install/update icinga2 client). The file zones.conf is not used in fact (not called from icinga2.conf).
Hereunder the icinga2.conf config part related to zones and endpoints:
/* --------------------------------
Endpoints
-------------------------------- */
object Endpoint "xxxxhostxxxx" {}
object Endpoint "monitorsrv1.communication1.ntld" {}
/* --------------------------------
Zones
-------------------------------- */
object Zone "master" {
endpoints = [ "monitorsrv1.communication1.ntld" ]
}
object Zone "xxxxhostxxxx" {
parent = "master"
endpoints = [ "xxxxhostxxxx" ]
}
object Zone "director-global" {
global = true
}
object Zone "global-templates" {
global = true
}
/* --------------------------------
API
-------------------------------- */
object ApiListener "api" {
cert_path = SysconfDir + "/icinga2/pki/xxxxhostxxxx.crt"
key_path = SysconfDir + "/icinga2/pki/xxxxhostxxxx.key"
ca_path = SysconfDir + "/icinga2/pki/ca.crt"
accept_commands = true
accept_config = true
}
How often does the check run?
each minute, 9 services through the icinga2 client
Did you restart the entire system or did you stopped the icinga2 service on the windows machine and just started it again?
I tried a simple restart of the service, reconnection was quick. Multiple tries. No issues. I tried a restart of the server, and faced the issue. Yesterday, 3 restarts, 3 times the issue. This morning, 2 restarts but 0 times the issue... I tried on 3 different servers, faced the issue only on 1 of them. But I'm sure it happened on other servers restarted after some software/system updates.
So, this morning, after a config reload on the icinga2 master, I cannot reproduce the issue. All my restart test of the server runing the icinga2 client were ok: connectivity between icinga2 master and client went back quite quickly. I also did some longer stop start to check if it's related to long server start, but the tests were still ok: quick connectivity back.
So I guess the issue is related to some cache on the icinga2 master.
I'm sure that after some time without reloading the master the issue will come back.
I was not able to reproduce this in my test environment. How much endpoints do you have in your setup @Tontonitch ?
about 110
May you need me to check something else, debug something?
Look into the log for messages like the bottom one. Maybe this indicates what happens and if the master tries to reconnect to the endpoint at all.
[2018-03-22 21:16:33 +0100] information/ApiListener: Reconnecting to endpoint 'test-win-icinga2' via host '172.16.21.77' and port '5665'
[2018-03-22 21:16:36 +0100] critical/TcpSocket: Invalid socket: No route to host
[2018-03-22 21:16:36 +0100] critical/ApiListener: Cannot connect to host '172.16.21.77' on port '5665'
[2018-03-22 21:16:36 +0100] information/ApiListener: Finished reconnecting to endpoint 'test-win-icinga2' via host '172.16.21.77' and port '5665'
Are you using a firewall (windows or third party) on your windows machines? If so have you configured exceptions for Icinga 2 in it?
Are you using a firewall (windows or third party) on your windows machines? If so have you configured exceptions for Icinga 2 in it?
No
Look into the log for messages
I will check, but last time I couldn't see anything. Is there a way to log information/ApiListener messages, without the need of enabling debuglog?
Yesterday I faced this issue again, after a reboot of one Windows host Icinga didn't reconnect to the agent. In my environment the master should build up the connection to the client (master -> client).
Unfortunately I don't have a debug log, only the normal one.
Master Linux Debian 8 jessie Icinga 2 v2.8.1
Client Windows Server 2012 R2 Icinga 2 v2.8.1
Master Log:
[...]
[2018-04-07 14:06:33 +0200] warning/TlsStream: TLS stream was disconnected.
[2018-04-07 14:06:33 +0200] warning/JsonRpcConnection: API client disconnected for identity 'client'
[2018-04-07 14:06:33 +0200] warning/ApiListener: Removing API client for endpoint 'client'. 0 API clients left.
[2018-04-07 14:06:49 +0200] information/ApiListener: Reconnecting to endpoint 'client' via host '<IP>' and port '5665'
[...]
After the last log message there is no further message related to the connection to client
.
Client Log:
[...]
[2018-04-07 14:06:32 +0200] information/Application: Received request to shut down.
[2018-04-07 14:06:33 +0200] information/Application: Shutting down...
[2018-04-07 14:06:33 +0200] information/ApiListener: 'api' stopped.
[2018-04-07 14:07:29 +0200] information/ApiListener: Adding new listener on port '5665'
[2018-04-07 14:07:29 +0200] information/ConfigItem: Activated all objects.
[...]
Only stop and start messages in the client log, no message related to a connection.
Test: manually connect to client:
# openssl s_client -CAfile /var/lib/icinga2/certs/ca.crt -cert /var/lib/icinga2/certs/master.crt -key /var/lib/icinga2/certs/master.key -connect client:5665
CONNECTED(00000003)
[...]
I tried to connect manually to the client which worked.
API output for client zone object:
Call https://master:5665/v1/objects/zones/client
API output for client endpoint object:
Call https://master:5665/v1/objects/endpoints/client
You can see the attribute connecting
is true
- still after ~1 hour this state didn't change
I didn't had much time yesterday and after round about 1 hour with no change to the connection status I decided to restart the Icinga 2 daemon. Next time I face this issue I'll create a gdb backtrace.
What does netstat say, is the socket opened? Anything on the debug log on the master side itself? Which distributed/version is that?
Just because I faced this issue on a customer Environment this week: Have you ensured that the Icinga 2 Master / Satellite are accessing the same time Server and no time skip / jump is happening?
If the time doesnt match, the connection cannot be established.
Are you connecting From Agent to Satellite/Master, vice versa or both?
Please test the current snapshot packages.
We've seen and fixed that with the upcoming 2.9 release.
Expected Behavior
Current Behavior
It happens some time (1 time out of 10 more or less) that after the restart of an Icinga2 client on a Windows server (at least during a server reboot) the Icinga2 master cannot connect anymore to it. And I receive for all the service checks the message: Remote Icinga instance 'aaaaa' is not connected to 'monitorsrv1'
Then, restarting the client doesn't change anything. To fix the situation I need to restart (reload) the icinga2 master.
Below an exemple:
icinga2.log
nothing reported in the log file
[root@monitorsrv1 icinga2]# tail -f icinga2.log | grep -v 'event::UpdateRepository'
Note that I didn't restart the icinga2 master yet, to troubleshoot the issue.
Possible Solution
Steps to Reproduce (for bugs)
1. 2. 3. 4.
Context
Your Environment
icinga2 --version
): icinga2 client: v2.8.0-2-g35f0df3, icinga2 master: r2.8.0-1icinga2 feature list
):icinga2 daemon -C
):icinga2 client:
icinga2 master
zones.conf
file (oricinga2 object list --type Endpoint
andicinga2 object list --type Zone
) from all affected nodes.