Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 578 forks source link

Icinga2 master cannot connect anymore to Icinga2 client on Windows after client restart #6154

Closed Tontonitch closed 6 years ago

Tontonitch commented 6 years ago

Expected Behavior

Current Behavior

It happens some time (1 time out of 10 more or less) that after the restart of an Icinga2 client on a Windows server (at least during a server reboot) the Icinga2 master cannot connect anymore to it. And I receive for all the service checks the message: Remote Icinga instance 'aaaaa' is not connected to 'monitorsrv1'

Then, restarting the client doesn't change anything. To fix the situation I need to restart (reload) the icinga2 master.

Below an exemple:

icinga2.log

[2018-03-07 11:59:04 +0100] information/ConfigItem: Activated all objects.
[2018-03-07 11:59:12 +0100] information/WorkQueue: #4 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-03-07 11:59:12 +0100] information/WorkQueue: #5 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);

nothing reported in the log file [root@monitorsrv1 icinga2]# tail -f icinga2.log | grep -v 'event::UpdateRepository'

Note that I didn't restart the icinga2 master yet, to troubleshoot the issue.

Possible Solution

Steps to Reproduce (for bugs)

1. 2. 3. 4.

Context

Your Environment

icinga2 client:

C:\Users\xxxxx>"C:\Program Files\ICINGA2\sbin\icinga2.exe" daemon -C
information/cli: Icinga application loader (version: v2.8.0-2-g35f0df3)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener
' is deprecated and should not be used.
warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener'
 is deprecated and should not be used.
warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener'
is deprecated and should not be used.
warning/ApiListener: Please read the upgrading documentation for v2.8: https://w
ww.icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
information/ApiListener: My API identity: sccmtstinfra
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 2 Endpoints.
information/ConfigItem: Instantiated 4 Zones.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 4 NotificationCommands.
information/ConfigItem: Instantiated 1 EventCommand.
information/ConfigItem: Instantiated 174 CheckCommands.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ScriptGlobal: Dumping variables to file 'C:\ProgramData\icinga2\var/
cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).
Press any key to continue . . .

icinga2 master

[root@monitorsrv1 icinga2]# icinga2 daemon -C
information/cli: Icinga application loader (version: r2.8.0-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
warning/ApiListener: Please read the upgrading documentation for v2.8: https://www.icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
information/ApiListener: My API identity: monitorsrv1.communication1.ntld
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 113 Zones.
information/ConfigItem: Instantiated 111 Endpoints.
information/ConfigItem: Instantiated 4 ApiUsers.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ConfigItem: Instantiated 5 UserGroups.
information/ConfigItem: Instantiated 4 NotificationCommands.
information/ConfigItem: Instantiated 2286 Notifications.
information/ConfigItem: Instantiated 289 CheckCommands.
information/ConfigItem: Instantiated 1 EventCommand.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 378 Hosts.
information/ConfigItem: Instantiated 23 HostGroups.
information/ConfigItem: Instantiated 4630 Dependencies.
information/ConfigItem: Instantiated 13 TimePeriods.
information/ConfigItem: Instantiated 10 Users.
information/ConfigItem: Instantiated 5 Comments.
information/ConfigItem: Instantiated 8 ServiceGroups.
information/ConfigItem: Instantiated 4630 Services.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 1 IdoMysqlConnection.
information/ConfigItem: Instantiated 1 LivestatusListener.
information/ConfigItem: Instantiated 1 PerfdataWriter.
information/ConfigItem: Instantiated 1 CompatLogger.
information/ConfigItem: Instantiated 1 StatusDataWriter.
information/ConfigItem: Instantiated 1 ExternalCommandListener.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).
[root@monitorsrv1 icinga2]#
Tontonitch commented 6 years ago

It seems that after some time (about 2 hours in the occurrence) the connection is back. image

Trying to reproduce the issue...

mcktr commented 6 years ago

Hi,

I faced a similar issue in the last week in our environment at work, unfortunately I didn't had the time to investigate on the issue in deep.

For further tests can you please share more details about your setup:

Thanks!

Kind regards Michael

Tontonitch commented 6 years ago

Hello Michael,

on client:

object Endpoint NodeName {
  host = NodeName
}
object Zone ZoneName {
  endpoints = [ NodeName ]
}
Tontonitch commented 6 years ago

Reproduced by a simple Windows server (icinga2 client) restart. I'm waiting to see when it will be re-connected to the icinga2 server.

Tontonitch commented 6 years ago

Exactly the same delay to re-connect to the master (about 2 hours). image

mcktr commented 6 years ago

Thank you for the additional details.

Your zones.conf on your clients is missing the master endpoint & zone. See the example in the documentation, the client must know about the master.

How often does the check run?

Reproduced by a simple Windows server (icinga2 client) restart.

Did you restart the entire system or did you stopped the icinga2 service on the windows machine and just started it again?

Tontonitch commented 6 years ago

Your zones.conf on your clients is missing the master endpoint & zone. See the example in the documentation, the client must know about the master.

on the client, master endpoint & zone are located in the icinga2.conf file (I use a old adapted version of the powershell script https://github.com/Icinga/icinga2-powershell-module to install/update icinga2 client). The file zones.conf is not used in fact (not called from icinga2.conf).

Hereunder the icinga2.conf config part related to zones and endpoints:

/* --------------------------------
 Endpoints
-------------------------------- */
object Endpoint "xxxxhostxxxx" {}
object Endpoint "monitorsrv1.communication1.ntld" {}

/* --------------------------------
 Zones
-------------------------------- */
object Zone "master" {
  endpoints = [ "monitorsrv1.communication1.ntld" ]
}
object Zone "xxxxhostxxxx" {
  parent = "master"
  endpoints = [ "xxxxhostxxxx" ]
}
object Zone "director-global" { 
    global = true
}
object Zone "global-templates" {
    global = true
}

/* --------------------------------
 API
-------------------------------- */
object ApiListener "api" {
  cert_path = SysconfDir + "/icinga2/pki/xxxxhostxxxx.crt"
  key_path = SysconfDir + "/icinga2/pki/xxxxhostxxxx.key"
  ca_path = SysconfDir + "/icinga2/pki/ca.crt"
  accept_commands = true
  accept_config = true
}
Tontonitch commented 6 years ago

How often does the check run?

each minute, 9 services through the icinga2 client

Did you restart the entire system or did you stopped the icinga2 service on the windows machine and just started it again?

I tried a simple restart of the service, reconnection was quick. Multiple tries. No issues. I tried a restart of the server, and faced the issue. Yesterday, 3 restarts, 3 times the issue. This morning, 2 restarts but 0 times the issue... I tried on 3 different servers, faced the issue only on 1 of them. But I'm sure it happened on other servers restarted after some software/system updates.

Tontonitch commented 6 years ago

So, this morning, after a config reload on the icinga2 master, I cannot reproduce the issue. All my restart test of the server runing the icinga2 client were ok: connectivity between icinga2 master and client went back quite quickly. I also did some longer stop start to check if it's related to long server start, but the tests were still ok: quick connectivity back.

So I guess the issue is related to some cache on the icinga2 master.

I'm sure that after some time without reloading the master the issue will come back.

mcktr commented 6 years ago

I was not able to reproduce this in my test environment. How much endpoints do you have in your setup @Tontonitch ?

Tontonitch commented 6 years ago

about 110

Tontonitch commented 6 years ago

May you need me to check something else, debug something?

mcktr commented 6 years ago

Look into the log for messages like the bottom one. Maybe this indicates what happens and if the master tries to reconnect to the endpoint at all.

[2018-03-22 21:16:33 +0100] information/ApiListener: Reconnecting to endpoint 'test-win-icinga2' via host '172.16.21.77' and port '5665'
[2018-03-22 21:16:36 +0100] critical/TcpSocket: Invalid socket: No route to host
[2018-03-22 21:16:36 +0100] critical/ApiListener: Cannot connect to host '172.16.21.77' on port '5665'
[2018-03-22 21:16:36 +0100] information/ApiListener: Finished reconnecting to endpoint 'test-win-icinga2' via host '172.16.21.77' and port '5665'

Are you using a firewall (windows or third party) on your windows machines? If so have you configured exceptions for Icinga 2 in it?

Tontonitch commented 6 years ago

Are you using a firewall (windows or third party) on your windows machines? If so have you configured exceptions for Icinga 2 in it?

No

Look into the log for messages

I will check, but last time I couldn't see anything. Is there a way to log information/ApiListener messages, without the need of enabling debuglog?

mcktr commented 6 years ago

Yesterday I faced this issue again, after a reboot of one Windows host Icinga didn't reconnect to the agent. In my environment the master should build up the connection to the client (master -> client).

Unfortunately I don't have a debug log, only the normal one.

Master Linux Debian 8 jessie Icinga 2 v2.8.1

Client Windows Server 2012 R2 Icinga 2 v2.8.1

Master Log:

[...]
[2018-04-07 14:06:33 +0200] warning/TlsStream: TLS stream was disconnected.
[2018-04-07 14:06:33 +0200] warning/JsonRpcConnection: API client disconnected for identity 'client'
[2018-04-07 14:06:33 +0200] warning/ApiListener: Removing API client for endpoint 'client'. 0 API clients left.
[2018-04-07 14:06:49 +0200] information/ApiListener: Reconnecting to endpoint 'client' via host '<IP>' and port '5665'
[...]

After the last log message there is no further message related to the connection to client.

Client Log:

[...]
[2018-04-07 14:06:32 +0200] information/Application: Received request to shut down.
[2018-04-07 14:06:33 +0200] information/Application: Shutting down...
[2018-04-07 14:06:33 +0200] information/ApiListener: 'api' stopped.
[2018-04-07 14:07:29 +0200] information/ApiListener: Adding new listener on port '5665'
[2018-04-07 14:07:29 +0200] information/ConfigItem: Activated all objects.
[...]

Only stop and start messages in the client log, no message related to a connection.

Test: manually connect to client:

# openssl s_client -CAfile /var/lib/icinga2/certs/ca.crt -cert /var/lib/icinga2/certs/master.crt -key /var/lib/icinga2/certs/master.key -connect client:5665
CONNECTED(00000003)
[...]

I tried to connect manually to the client which worked.

API output for client zone object:

Call https://master:5665/v1/objects/zones/client

6154-api-result-client-zone

API output for client endpoint object:

Call https://master:5665/v1/objects/endpoints/client

6154-api-result-client-endpoint

You can see the attribute connecting is true - still after ~1 hour this state didn't change

I didn't had much time yesterday and after round about 1 hour with no change to the connection status I decided to restart the Icinga 2 daemon. Next time I face this issue I'll create a gdb backtrace.

dnsmichi commented 6 years ago

What does netstat say, is the socket opened? Anything on the debug log on the master side itself? Which distributed/version is that?

LordHepipud commented 6 years ago

Just because I faced this issue on a customer Environment this week: Have you ensured that the Icinga 2 Master / Satellite are accessing the same time Server and no time skip / jump is happening?

If the time doesnt match, the connection cannot be established.

Are you connecting From Agent to Satellite/Master, vice versa or both?

dnsmichi commented 6 years ago

Please test the current snapshot packages.

dnsmichi commented 6 years ago

We've seen and fixed that with the upcoming 2.9 release.