Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Remote Icinga instance xxx is not connected to xxx since 2.11 #7533

Closed: sharkyzz closed this issue 5 years ago

sharkyzz commented 5 years ago

Describe the bug

Since we upgraded from 2.10-5 to 2.11 we experience tons of 'Remote Icinga instance xxx is not connected to xxx' messages while deploying a new configuration with Icinga Director. Before (with 2.10) we saw this happen occasionally (only a few hosts/services); now there are literally hundreds of them (see screenshot).

Screenshots

(screenshot attached)

Your Environment

Copyright (c) 2012-2019 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Debian GNU/Linux
  Platform version: 9 (stretch)
  Kernel: Linux
  Kernel version: 4.9.0-8-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 6.3.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Enabled features: api checker command ido-mysql influxdb mainlog notification

Icingaweb 2.7.1
Businessprocess 2.1.0
Director 1.6.1
Elasticsearch 1.0.0
Monitoring 2.7.1

[2019-09-24 14:22:56 +0200] information/cli: Icinga application loader (version: r2.11.0-1)
[2019-09-24 14:22:56 +0200] information/cli: Loading configuration file(s).
[2019-09-24 14:22:56 +0200] information/ConfigItem: Committing config item(s).
[2019-09-24 14:22:56 +0200] information/ApiListener: My API identity: xxx
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 2 NotificationCommands.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 1492 Notifications.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 8 HostGroups.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 100 Hosts.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 1 EventCommand.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 2 Comments.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 58 Zones.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 61 Endpoints.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 6 ApiUsers.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 7 UserGroups.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 256 CheckCommands.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 2 TimePeriods.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 7 Users.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 1283 Services.
[2019-09-24 14:22:56 +0200] information/ConfigItem: Instantiated 8 ServiceGroups.
[2019-09-24 14:22:56 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2019-09-24 14:22:56 +0200] information/cli: Finished validating the configuration file(s).
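
For reference, output like the above can be gathered with the stock CLI; a sketch assuming a default package install:

icinga2 --version      # version, build information and paths as shown above
icinga2 feature list   # enabled/disabled features
icinga2 daemon -C      # validate the configuration, producing output like the log above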

dnsmichi commented 5 years ago

Please add more details.

sharkyzz commented 5 years ago

I cannot add the zones.conf of all the involved nodes (there are just too many), but I will add a few.

Masters:

Master1

object Endpoint "master01" {
  host = "x.x.x.x"
}

object Endpoint "master02" {
  host = "x.x.x.x"
}

object Zone "master" {
        endpoints = [ "master01", "master02" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

Master2

object Endpoint "master01" {
  host = "x.x.x.x"
}

object Endpoint "master02" {
  host = "x.x.x.x"
}

object Zone "master" {
        endpoints = [ "master01", "master02" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

Sat01-ZoneA

object Endpoint "master01" {
}

object Endpoint "master02" {
}

object Endpoint "sat02-ZoneA" {
        host = "x.x.x.x"
        port = "5665"
}

object Zone "master" {
        endpoints = [ "master01", "master02" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

object Endpoint NodeName {
}

object Zone "ws-int" {
        endpoints = [ NodeName, "sat02-ZoneA" ]
        parent = "master"
}

Sat02-ZoneA

object Endpoint "master01" {
}

object Endpoint "Master02" {
}

object Zone "master" {
        endpoints = [ "master01", "master02" ]
}

object Endpoint "Sat01-ZoneA" {
        host = "x.x.x.x"
        port = "5665"
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

object Endpoint NodeName {
}

object Zone "ws-int" {
        endpoints = [ NodeName, "Sat02-ZoneA" ]
        parent = "master"
}

And 6 more satellites (3 zones) with the same config as above

The log of a satellite:

icinga_satellite_log.TXT

The log of an agent (which went into the unknown 'Remote Icinga instance xxx is not connected to xxx' status):

icinga_agent.TXT

The masters and all of the satellites are running 2.11.0-1. The agents: all of them 2.11.0-1, except for one Windows host on 2.10.4.

Hope you can help us out! Thanks in advance!

dnsmichi commented 5 years ago

Looking at the satellite nodes specifically, I can see that they do not attempt to connect to the master. So the master nodes need to connect to the satellites.

Why are the satellite zone and endpoints not configured in the master's zones.conf as suggested by the documentation?

Unfortunately I cannot read much in the satellite log since every endpoint and zone name is replaced with 'xxx'. That makes it hard to tell whether it is always the same one or different ones. Better to use placeholders like in the docs: icinga2-satellite1, icinga2-agent1, etc.

On the agent, this looks really odd. Was that a manual restart, or triggered via script? The config sync doesn't trigger the reload.

[2019-09-25 16:09:50 +0200] information/FileLogger: 'main-log' started.
[2019-09-25 16:09:43 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/director-global' (42315 Bytes).
[2019-09-25 16:09:43 +0200] information/ApiListener: Received configuration updates (1) from endpoint 'xxx' are equal to production, not triggering reload.
[2019-09-25 16:09:44 +0200] information/Application: Received request to shut down.
[2019-09-25 16:09:44 +0200] information/Application: Shutting down...
[2019-09-25 16:09:44 +0200] information/RemoteCheckQueue: items: 0, rate: 0/s (12/min 60/5min 180/15min);
[2019-09-25 16:09:50 +0200] information/CheckerComponent: 'checker' stopped.
[2019-09-25 16:09:50 +0200] information/ApiListener: 'api' stopped.
[2019-09-25 16:09:50 +0200] information/FileLogger: 'main-log' started.
[2019-09-25 16:09:50 +0200] information/ApiListener: 'api' started.
[2019-09-25 16:09:50 +0200] information/ApiListener: Started new listener on 'xxx'
[2019-09-25 16:09:50 +0200] information/CheckerComponent: 'checker' started.
[2019-09-25 16:09:50 +0200] information/ConfigItem: Activated all objects.

Which connection direction is used for the satellites and agents? Does the agent connect to the satellite, or vice versa?

sharkyzz commented 5 years ago

That's because part of the zone configuration is managed by the Director:

root@xxx:/var/lib/icinga2/api/zones/master/director# cat zones.conf 
object Zone "ws-int" {
    parent = "master"
    endpoints = [
        "sat01",
        "sat02"
    ]
}

object Zone "ws-dmz-ext" {
    parent = "master"
    endpoints = [
        "sat01",
        "sat02"
    ]
}

object Zone "ws-tip" {
    parent = "master"
    endpoints = [
        "sat01",
        "sat02"
    ]
}

object Zone "ws-cw" {
    parent = "master"
    endpoints = [
        "sat01",
        "sat02"
    ]
}

I will look into the log file and see if I can redact it the way you suggested.

The restart is triggered by the Director deployment, so there is no manual intervention. So this really is what seems to happen.

sharkyzz commented 5 years ago

Which connection direction is used for the satellites and agents? Does the agent connect to the satellite, or vice versa?

Only the satellites are making active connections towards the agents.
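
For context: in Icinga 2, the connection direction follows the host attribute. Whichever side defines host (and optionally port) on the remote Endpoint object actively establishes the connection. A minimal sketch of the satellite-to-agent direction described here, with placeholder names:

# On the satellite: the agent's endpoint carries a host,
# so the satellite actively connects to the agent.
object Endpoint "agent01" {
        host = "x.x.x.x"
        port = "5665"
}

# On the agent: the satellite's endpoint has no host,
# so the agent only waits for the incoming connection.
object Endpoint "sat01" {
}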

sharkyzz commented 5 years ago

As far as I can tell now, it only happens while adding new host(s). When modifying a host (e.g. adding a service) it doesn't happen (could be just a coincidence).

sharkyzz commented 5 years ago

Also experienced it right now while changing something in the service apply rules. Hundreds of unknowns :(

sharkyzz commented 5 years ago

I will send you all of the unedited configs/logs. What is the best way to get them to you? We don't want all of our infrastructure information exposed on the internet.

dnsmichi commented 5 years ago

Ok. So you're managing zones inside other zones, which is prohibited with 2.11. That being said, remove the satellite zone from the Director and put the details into the zones.conf file on both masters, keeping the hierarchy intact.

As far I can tell now. It only happens while adding new host(s). When modifying a host (eg. adding a service) it doesn't happen (could be just a coincidence)

Likely explains the above even more.

This is now mentioned in the upgrading docs: https://icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/#config-sync-zones-in-zones

Before sending any logs, please first remove the satellite zone from the Infrastructure tab, set this up in both zones.conf files, and lastly restart both nodes, then trigger a Director kickstart to re-import the external satellite zone/endpoint, plus a new deployment.

Unfortunately you're using something which we never supported. Somehow it worked before; now the config sync is hardened, and zone-in-zone cluster config sync doesn't work anymore.
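
A sketch of that sequence on the masters, assuming the standard systemd unit and the Director CLI module (verify the command names against your installation):

# 1. move the satellite zone/endpoint definitions into /etc/icinga2/zones.conf on both masters
icinga2 daemon -C                    # validate the configuration first
systemctl restart icinga2            # restart both master nodes
icingacli director kickstart run     # re-import the satellite zones/endpoints into the Director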

sharkyzz commented 5 years ago

Thanks a lot! I will try asap.

dnsmichi commented 5 years ago

The linked issue targets the same question, with some more explanations: https://github.com/Icinga/icinga2/issues/7542#issuecomment-535874740 The solution was already implemented with the suggested config change.

sharkyzz commented 5 years ago

I will start migrating to the new zone hierarchy in our production environment tomorrow morning. I will perform tests afterwards and will get back to you with my results.

sharkyzz commented 5 years ago

So there we are. It was a hell of a ride to get it all straight, but we are now using the zone hierarchy of the Icinga API. We removed all the zones and imported them with kickstart:

This is how our zones.conf on the masters looks right now:

# Masters
object Endpoint "master01" {
        host = "master01"
        port = "5665"
}
object Endpoint "master02" {
        host = "master02"
        port = "5665"
}

# Satellites zone "ws-cw"
object Endpoint "sat01" {
        host = "sat01"
        port = "5665"
}
object Endpoint "sat02" {
        host = "sat02"
        port = "5665"
}

# Satellites zone "ws-dmz-ext"
object Endpoint "sat01" {
        host = "sat01"
        port = "5665"
}
object Endpoint "sat02" {
        host = "sat02"
        port = "5665"
}

# Satellites zone "ws-int"
object Endpoint "sat01" {
        host = "sat01"
        port = "5665"
}
object Endpoint "sat02" {
        host = "sat02"
        port = "5665"
}

# Satellites zone "ws-tip"
object Endpoint "sat01" {
        host = "sat01"
        port = "5665"
}
object Endpoint "sat02" {
        host = "sat02"
        port = "5665"
}

## ZONES ##
object Zone "master" {
        endpoints = [ "master01", "master02" ]
}

object Zone "ws-cw" {
        endpoints = [ "sat01", "sat02" ]
        parent = "master"
}

object Zone "ws-dmz-ext" {
        endpoints = [ "sat01", "sat02" ]
        parent = "master"
}

object Zone "ws-int" {
        endpoints = [ "sat01", "sat02" ]
        parent = "master"
}

object Zone "ws-tip" {
        endpoints = [ "sat01", "sat02" ]
        parent = "master"
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
} 

sharkyzz commented 5 years ago

Unfortunately this didn't fix our problem. The issue remains the same.

Can I somehow send you all of the logs? It is a bit of a pain to rewrite all the logs so our internal infrastructure isn't visible.

sharkyzz commented 5 years ago

This is, by the way, how it looks on the agent side. Maybe this helps you figure out what is going on:

icinga_agent_log.TXT

sharkyzz commented 5 years ago

I have also created a log of one of the satellites while the issue occurred. I replaced all the master/satellite names with master01/02 and sat01/02. Hope this helps:

icinga_satellite.TXT

MohamedRedaLyoubi commented 5 years ago

Hello, we also have the same problem since we upgraded Icinga 2 to 2.11 (masters & satellites) this morning. All zones and endpoints are declared in zones.conf and imported into the Director by kickstart.

master01 zones.conf:

object Endpoint "master02" {
        host = "master2"
        port = "icinga_port"
}

object Endpoint "master01" {
}

object Endpoint "sat01" {
  host = "sat01"
  port = "icinga_port"
}

object Endpoint "sat02" {
  host = "sat02"
  port = "icinga_port"
}
object Endpoint "sat03" {
  host = "sat03"
  port = "icinga_port"
}

object Zone "master" {
        endpoints = [ "master01,"master02" ]
}

object Zone "zone01" {
  endpoints = [ "sat01","sat02"]
  parent = "master"
}
object Zone "zone02" {
  endpoints = [ "sat03"]
  parent = "master"
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

sat01 zones.conf:

object Endpoint "master02" {
}

object Endpoint "master01" {
}

object Endpoint "sat01" {
}

object Endpoint "sat02" {
#   host = "sat02"
#   port = "icinga_port"
}

object Zone "master" {
        endpoints = [ "master02", "master01" ]
}

object Zone "zone01" {
        endpoints = [ "sat01","sat02" ]
        parent = "master"
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

Thank you for your help ^^

sharkyzz commented 5 years ago

@MohamedRedaLyoubi can you please also send the logfiles of the satellite and agent while you do a deployment with Icinga Director? I'm very curious whether I see the same behaviour in your setup.

MohamedRedaLyoubi commented 5 years ago

@sharkyzz yes, take a look: client.log sat01.log

sharkyzz commented 5 years ago

@MohamedRedaLyoubi you have indeed got exactly the same issue. Let's wait for @dnsmichi to reply.

dnsmichi commented 5 years ago

Satellite triggers a reload.

[2019-10-01 17:30:47 +0200] information/ApiListener: Applying config update from endpoint 'master01' of zone 'master'.
[2019-10-01 17:30:47 +0200] information/ApiListener: Received configuration for zone 'zone01' from endpoint 'master01'. Comparing the checksums.

...

[2019-10-01 17:30:47 +0200] information/ApiListener: Received configuration updates (3) from endpoint 'master01' are different to production, triggering validation and reload.
[2019-10-01 17:30:47 +0200] information/ApiListener: Config validation for stage '/var/lib/icinga2/api/zones-stage/' was OK, replacing into '/var/lib/icinga2/api/zones/' and triggering reload.

Unfortunately the timestamps of the agent log don't match; the logs start at 17:33:25, which is too late.

Try to match the timestamps and see what's causing the agent disconnect (obviously a satellite reload) and also analyse how often these satellite reloads happen.

Also analyse the satellite's performance (see the troubleshooting & advanced topics docs) while I am away on vacation.
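
One way to quantify the reload frequency, assuming the default main log path listed earlier and the log messages quoted in this thread:

# count reload-triggered shutdowns in the main log
grep -c 'Received request to shut down' /var/log/icinga2/icinga2.log
# or via systemd, over the last hour
journalctl -u icinga2 --since '1 hour ago' | grep -c 'Reloading'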

sharkyzz commented 5 years ago

@dnsmichi please check my logs, because they are correct.

This is so awkward (agent):

[2019-10-01 13:45:40 +0200] information/ApiListener: Received configuration updates (1) from endpoint 'sat02' are equal to production, not triggering reload.
[2019-10-01 13:45:40 +0200] information/Application: Received request to shut down.
[2019-10-01 13:45:41 +0200] information/Application: Shutting down...

sharkyzz commented 5 years ago

And Master:

[2019-10-01 13:45:33 +0200] information/ApiListener: Received configuration updates (2) from endpoint 'master01' are different to production, triggering validation and reload.
[2019-10-01 13:45:33 +0200] information/ApiListener: Config validation for stage '/var/lib/icinga2/api/zones-stage/' was OK, replacing into '/var/lib/icinga2/api/zones/' and triggering reload.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//.checksums' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//.timestamp' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//director/001-director-basics.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//director/commands.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//director/host_templates.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//director/hostgroups.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//director/service_apply.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//director/service_templates.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//director/servicegroups.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//director/timeperiod_templates.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//director/timeperiods.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'director-global//director/user_templates.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'ws-int//.checksums' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'ws-int//.timestamp' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'ws-int//director/agent_endpoints.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'ws-int//director/agent_zones.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'ws-int//director/hosts.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'ws-int//director/services.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:33 +0200] information/ApiListener: Copying file 'ws-int//director/servicesets.conf' from config sync staging to production zones directory.
[2019-10-01 13:45:35 +0200] information/Application: Received request to shut down.

sharkyzz commented 5 years ago

This doesn't make sense. To me it really looks like a bug. For the record: we didn't have this issue with 2.10-5.

MohamedRedaLyoubi commented 5 years ago

@dnsmichi Thank you for your reply. Here are logs (debug logs) at the same timestamps: debug_sat01.log debug_client.log

I discovered a new error on the satellite:

[2019-10-02 10:04:23 +0200] notice/JsonRpcConnection: Error while reading JSON-RPC message for identity 'sat02': Error: short read

    (0) icinga2: icinga::JsonRpc::ReadMessage(std::shared_ptr<icinga::AsioTlsStream> const&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >, long) (+0x95) [0x92d695]
    (1) icinga2: icinga::JsonRpcConnection::HandleIncomingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) (+0xf4) [0xbc20a4]
    (2) /usr/lib64/icinga2/sbin/icinga2() [0xbc2811]
    (3) libboost_context.so.1.69.0: make_fcontext (+0x2f) [0x7f8023a1218f]

Got the same error on master01

[2019-10-02 03:19:47 +0200] notice/JsonRpcConnection: Error while reading JSON-RPC message for identity 'sat01': Error: short read

  (0) icinga2: icinga::JsonRpc::ReadMessage(std::shared_ptr<icinga::AsioTlsStream> const&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >, long) (+0x95) [0x92d695]
  (1) icinga2: icinga::JsonRpcConnection::HandleIncomingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) (+0xf4) [0xbc20a4]
  (2) /usr/lib64/icinga2/sbin/icinga2() [0xbc2811]
  (3) libboost_context.so.1.69.0: make_fcontext (+0x2f) [0x7f5d26a6618f]

We didn't have this issue with 2.10-5

I'll try to analyse more...

have a nice vacation ^^

sharkyzz commented 5 years ago

Wow nice catch @MohamedRedaLyoubi

Seems I experience the exact same error (which occurred during a deployment):

[2019-10-02 08:32:09 +0200] notice/JsonRpcConnection: Error while reading JSON-RPC message for identity 'master01': Error: stream truncated

        (0) icinga2: icinga::NetString::ReadStringFromStream(std::shared_ptr<icinga::AsioTlsStream> const&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >, long) (+0x406) [0x5578abfecf06]
        (1) icinga2: icinga::JsonRpc::ReadMessage(std::shared_ptr<icinga::AsioTlsStream> const&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >, long) (+0x49) [0x5578ac0a92c9]
        (2) icinga2: icinga::JsonRpcConnection::HandleIncomingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) (+0x115) [0x5578ac0b9875]
        (3) icinga2: <unknown function> (+0x43a00d) [0x5578ac1a100d]
        (4) libboost_context.so.1.67.0: make_fcontext (+0x2f) [0x7ffa7654c72f]

sharkyzz commented 5 years ago

Enjoy your well-deserved vacation. Is there someone else on the Icinga team who can help with this bug while you are away?

MohamedRedaLyoubi commented 5 years ago

The satellite and master execute the reload command frequently; in 5 minutes sat01 did this 66 times:

[2019-10-02 10:05:34 +0200] notice/Application: Got reload command, forwarding to umbrella process (PID 89312)
[2019-10-02 10:05:35 +0200] information/Application: Received request to shut down.
[2019-10-02 10:05:35 +0200] notice/JsonRpcConnection: Error while reading JSON-RPC message for identity 'sat02': Error: short read

    (0) icinga2: icinga::JsonRpc::ReadMessage(std::shared_ptr<icinga::AsioTlsStream> const&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >, long) (+0x95) [0x92d695]
    (1) icinga2: icinga::JsonRpcConnection::HandleIncomingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) (+0xf4) [0xbc20a4]
    (2) /usr/lib64/icinga2/sbin/icinga2() [0xbc2811]
    (3) libboost_context.so.1.69.0: make_fcontext (+0x2f) [0x7f8023a1218f]

Woytek-Polnik commented 5 years ago

I think this is connected with Icinga being restarted/reloaded:

Oct 03 14:12:47 safe-reload[27257]: Reloading Icinga 2: Done
Oct 03 14:13:05 systemd[1]: Reloaded Icinga host/service/network monitoring system.
Oct 03 14:49:22 systemd[1]: Reloading Icinga host/service/network monitoring system.
Oct 03 14:49:24 safe-reload[22268]: Validating config files: Done
Oct 03 14:49:24 safe-reload[22268]: Reloading Icinga 2: Done
Oct 03 14:49:39 systemd[1]: Reloaded Icinga host/service/network monitoring system.
Oct 03 14:50:19 systemd[1]: Reloading Icinga host/service/network monitoring system.
Oct 03 14:50:21 safe-reload[23109]: Validating config files: Done
Oct 03 14:50:21 safe-reload[23109]: Reloading Icinga 2: Done
Oct 03 14:50:28 systemd[1]: Reloaded Icinga host/service/network monitoring system.

For me, hosts also lose connection, especially for services on the master which is being reloaded/restarted.

Woytek-Polnik commented 5 years ago

Update to this ^: I believe the issue was a version incompatibility between master and clients. We had clients using Ubuntu 14, 16 and 18. The incompatibility increased the service reload time from 1-2 seconds to 15, effectively causing clients to lose connection.
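
To rule this out, it may help to compare the exact Icinga 2 version on every node; a quick sketch:

# run on each master, satellite and agent, then compare the output
icinga2 --version | head -n 1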

MohamedRedaLyoubi commented 5 years ago

@sharkyzz any updates on the problem?

sharkyzz commented 5 years ago

@MohamedRedaLyoubi nope. Waiting for @dnsmichi I guess..

dnsmichi commented 5 years ago

The Error: short read messages are "normal" in the sense that they happen on one side when the remote end did not shut down the TLS layer as expected. This also happens with REST APIs where browsers don't close the TLS connection in order to save CPU cycles; Google Chrome does that.

In this specific case, it seems that the other end does not or cannot wait for all TLS connections to shut down, but continues with the reload. Error: stream truncated points towards the same problem.

The reason why you are not seeing these errors in 2.10 is mainly that the error handling there did not catch these errors and silently logged nothing. One could of course argue that the informational value of these messages is close to zero; on the other hand, they certainly unveil connections which are forcefully closed, which is not the standard behaviour.

The real question is why this happens:

The satellite and master execute the reload command frequently; in 5 minutes sat01 did this 66 times:

Analyse this further in the way of

MohamedRedaLyoubi commented 5 years ago

Strange behaviour!

@dnsmichi Is that the same problem as in https://github.com/Icinga/icinga2/issues/7565 ?

Thank you ^^

dnsmichi commented 5 years ago

Highly likely. The linked ticket originates from a customer report, and your last analysis shed some light on what I guessed (this time, I am right ;)). I'll reference/update here once we've got a solution for the linked issue.

sharkyzz commented 5 years ago

Nice! Thanks a lot for confirming! Will the fix make it into 2.11.1?

dnsmichi commented 5 years ago

If it is easy, 2.11.2 with a target of CW43. If it requires a cluster message change, 2.12 as intermediate release for OSMC & IcingaDB. We'll be looking into it end of CW42.

dnsmichi commented 5 years ago

This applies to 2.11.2; the linked PR contains all the details, including tests.

sharkyzz commented 5 years ago

@dnsmichi I thought it was fixed, but it still happens. What kind of information do you need?

@MohamedRedaLyoubi how are things going on your side?

HOSTED-POWER commented 4 years ago

Just to let you know that when we reload icinga2 manually with /bin/systemctl reload icinga2, we also see these issues each time...
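
A quick way to reproduce and observe this, assuming the default main log location; the grep patterns come from the messages quoted earlier in this thread:

# on the parent node (master/satellite):
systemctl reload icinga2
# on the agent, watch the disconnect/reconnect cycle:
tail -f /var/log/icinga2/icinga2.log | grep -E 'ApiListener|Application'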

sharkyzz commented 4 years ago

What is your setup, @HOSTED-POWER?

dnsmichi commented 4 years ago

Please continue in #7597 - this issue has been resolved for an already released version, and therefore cannot be re-opened.