Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 578 forks source link

Agents will be disconnected if check execution time is longer then 60s #5449

Closed jkroepke closed 6 years ago

jkroepke commented 7 years ago

Our agents will be disconnected, if they must run long running checks.

See: https://monitoring-portal.org/index.php?thread/41439-agents-verbinden-sporadisch-die-verbindung/

Expected Behavior

Current Behavior

[2017-08-02 12:57:49 +0200] warning/JsonRpcConnection: API client disconnected for identity 'agent'
[2017-08-02 12:57:49 +0200] warning/ApiListener: Removing API client for endpoint 'agent'. 0 API clients left.

Possible Solution

If we add an exit 0 at the top to the long running check, the agent will not disconnect anymore.

Steps to Reproduce (for bugs)

  1. Setup distributed icinga2
  2. Configure multiple services for the agent, set the timeout to 2m

Your Environment

jkroepke commented 7 years ago

After some research:

Some agents have the same behavior but have only 4 services running 30 seconds.

lazyfrosch commented 7 years ago

That's strange, the API connections are not related to the actual execution of checks.

The command_endpoint setting sends a check request to the agent, which then asynchronously executes the check, and sends a result back to the parent.

What's about the log on the agent? Any details there?

Have you enabled debugging on the agent side to see whats going on?

jkroepke commented 7 years ago

Hi,

thanks for your answer. I posted some debug logs in the forum thead. Here is a typical roundtrip:

[2017-08-01 16:10:54 +0200] information/ApiListener: New client connection for identity 'mon01' from [10.207.193.186]:38008
[2017-08-01 16:10:54 +0200] notice/ApiListener: New JSON-RPC client
[2017-08-01 16:10:54 +0200] notice/WorkQueue: Spawning WorkQueue threads for 'JsonRpcConnection, #2'
[2017-08-01 16:10:54 +0200] information/ApiListener: Sending config updates for endpoint 'mon01'.
[2017-08-01 16:10:54 +0200] information/ApiListener: Syncing runtime objects to endpoint 'mon01'.
[2017-08-01 16:10:54 +0200] notice/JsonRpcConnection: Received 'icinga::Hello' message from 'mon01'
[2017-08-01 16:10:54 +0200] information/ApiListener: Finished syncing runtime objects to endpoint 'mon01'.
[2017-08-01 16:10:54 +0200] information/ApiListener: Finished sending config updates for endpoint 'mon01'.
[2017-08-01 16:10:54 +0200] information/ApiListener: Sending replay log for endpoint 'mon01'.
[2017-08-01 16:10:54 +0200] information/ApiListener: Finished sending replay log for endpoint 'mon01'.
[2017-08-01 16:10:54 +0200] notice/JsonRpcConnection: Received 'config::Update' message from 'mon01'
[2017-08-01 16:10:54 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/global-templates/.timestamp'
[2017-08-01 16:10:54 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/global-templates/_etc/.checkcommands.conf.swp'
[2017-08-01 16:10:54 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/global-templates/_etc/checkcommands.conf'
[2017-08-01 16:10:54 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/global-templates/_etc/checkcommands.ps.conf'
[2017-08-01 16:10:55 +0200] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from 'mon01'
[2017-08-01 16:10:56 +0200] notice/JsonRpcConnection: Received 'event::ExecuteCommand' message from 'mon01'
[2017-08-01 16:10:58 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #19 (JsonRpcConnection, #11) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #11 (JsonRpcConnection, #3) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #17 (JsonRpcConnection, #9) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #23 (JsonRpcConnection, #15) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #12 (JsonRpcConnection, #4) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #28 (JsonRpcConnection, #20) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #24 (JsonRpcConnection, #16) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #26 (JsonRpcConnection, #18) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #39 (JsonRpcConnection, #31) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #38 (JsonRpcConnection, #30) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #33 (JsonRpcConnection, #25) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #25 (JsonRpcConnection, #17) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #20 (JsonRpcConnection, #12) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #27 (JsonRpcConnection, #19) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #14 (JsonRpcConnection, #6) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #35 (JsonRpcConnection, #27) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #31 (JsonRpcConnection, #23) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #30 (JsonRpcConnection, #22) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #34 (JsonRpcConnection, #26) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #15 (JsonRpcConnection, #7) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #22 (JsonRpcConnection, #14) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #21 (JsonRpcConnection, #13) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #18 (JsonRpcConnection, #10) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #37 (JsonRpcConnection, #29) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #9 (JsonRpcConnection, #1) tasks: 28
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #13 (JsonRpcConnection, #5) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #36 (JsonRpcConnection, #28) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #29 (JsonRpcConnection, #21) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #10 (JsonRpcConnection, #2) tasks: 1
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #16 (JsonRpcConnection, #8) tasks: 0
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #8 (JsonRpcConnection, #0) tasks: 26
[2017-08-01 16:10:59 +0200] notice/WorkQueue: #32 (JsonRpcConnection, #24) tasks: 0
[2017-08-01 16:11:03 +0200] notice/WorkQueue: #4 (ApiListener, RelayQueue) tasks: 0
[2017-08-01 16:11:03 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:03 +0200] notice/WorkQueue: #5 (ApiListener, SyncQueue) tasks: 0
[2017-08-01 16:11:08 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:08 +0200] notice/ThreadPool: Pool #1: Pending tasks: 0; Average latency: 0ms; Threads: 4; Pool utilization: 0.00440086%
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #19 (JsonRpcConnection, #11) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #11 (JsonRpcConnection, #3) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #17 (JsonRpcConnection, #9) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #28 (JsonRpcConnection, #20) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #24 (JsonRpcConnection, #16) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #12 (JsonRpcConnection, #4) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #23 (JsonRpcConnection, #15) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #31 (JsonRpcConnection, #23) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #26 (JsonRpcConnection, #18) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #39 (JsonRpcConnection, #31) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #25 (JsonRpcConnection, #17) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #15 (JsonRpcConnection, #7) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #20 (JsonRpcConnection, #12) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #33 (JsonRpcConnection, #25) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #38 (JsonRpcConnection, #30) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #27 (JsonRpcConnection, #19) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #21 (JsonRpcConnection, #13) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #35 (JsonRpcConnection, #27) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #14 (JsonRpcConnection, #6) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #30 (JsonRpcConnection, #22) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #22 (JsonRpcConnection, #14) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #37 (JsonRpcConnection, #29) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #9 (JsonRpcConnection, #1) tasks: 28
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #34 (JsonRpcConnection, #26) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #16 (JsonRpcConnection, #8) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #29 (JsonRpcConnection, #21) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #13 (JsonRpcConnection, #5) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #8 (JsonRpcConnection, #0) tasks: 26
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #32 (JsonRpcConnection, #24) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #18 (JsonRpcConnection, #10) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #36 (JsonRpcConnection, #28) tasks: 0
[2017-08-01 16:11:09 +0200] notice/WorkQueue: #10 (JsonRpcConnection, #2) tasks: 5
[2017-08-01 16:11:13 +0200] notice/WorkQueue: #4 (ApiListener, RelayQueue) tasks: 0
[2017-08-01 16:11:13 +0200] notice/WorkQueue: #5 (ApiListener, SyncQueue) tasks: 0
[2017-08-01 16:11:13 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:18 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #11 (JsonRpcConnection, #3) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #19 (JsonRpcConnection, #11) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #17 (JsonRpcConnection, #9) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #39 (JsonRpcConnection, #31) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #24 (JsonRpcConnection, #16) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #26 (JsonRpcConnection, #18) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #28 (JsonRpcConnection, #20) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #20 (JsonRpcConnection, #12) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #33 (JsonRpcConnection, #25) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #12 (JsonRpcConnection, #4) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #15 (JsonRpcConnection, #7) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #27 (JsonRpcConnection, #19) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #23 (JsonRpcConnection, #15) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #14 (JsonRpcConnection, #6) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #35 (JsonRpcConnection, #27) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #25 (JsonRpcConnection, #17) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #31 (JsonRpcConnection, #23) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #29 (JsonRpcConnection, #21) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #38 (JsonRpcConnection, #30) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #16 (JsonRpcConnection, #8) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #36 (JsonRpcConnection, #28) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #13 (JsonRpcConnection, #5) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #30 (JsonRpcConnection, #22) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #21 (JsonRpcConnection, #13) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #34 (JsonRpcConnection, #26) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #18 (JsonRpcConnection, #10) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #37 (JsonRpcConnection, #29) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #9 (JsonRpcConnection, #1) tasks: 28
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #22 (JsonRpcConnection, #14) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #8 (JsonRpcConnection, #0) tasks: 26
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #32 (JsonRpcConnection, #24) tasks: 0
[2017-08-01 16:11:19 +0200] notice/WorkQueue: #10 (JsonRpcConnection, #2) tasks: 9
[2017-08-01 16:11:23 +0200] notice/WorkQueue: #5 (ApiListener, SyncQueue) tasks: 0
[2017-08-01 16:11:23 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:23 +0200] notice/WorkQueue: #4 (ApiListener, RelayQueue) tasks: 1
[2017-08-01 16:11:23 +0200] notice/ApiListener: Relaying 'event::UpdateRepository' message
[2017-08-01 16:11:23 +0200] notice/ApiListener: Sending message 'event::UpdateRepository' to 'mon01'
[2017-08-01 16:11:23 +0200] debug/ApiListener: Not connecting to Endpoint 'mon01' because the host/port attributes are missing.
[2017-08-01 16:11:23 +0200] debug/ApiListener: Not connecting to Endpoint 'mon02' because the host/port attributes are missing.
[2017-08-01 16:11:23 +0200] debug/ApiListener: Not connecting to Endpoint 'agent' because that's us.
[2017-08-01 16:11:23 +0200] notice/ApiListener: Current zone master: agent
[2017-08-01 16:11:23 +0200] notice/ApiListener: Connected endpoints: mon01 (1)
[2017-08-01 16:11:23 +0200] notice/ThreadPool: Pool #1: Pending tasks: 0; Average latency: 0ms; Threads: 4; Pool utilization: 0.00379595%
[2017-08-01 16:11:28 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #11 (JsonRpcConnection, #3) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #19 (JsonRpcConnection, #11) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #26 (JsonRpcConnection, #18) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #17 (JsonRpcConnection, #9) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #39 (JsonRpcConnection, #31) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #33 (JsonRpcConnection, #25) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #31 (JsonRpcConnection, #23) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #35 (JsonRpcConnection, #27) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #37 (JsonRpcConnection, #29) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #25 (JsonRpcConnection, #17) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #22 (JsonRpcConnection, #14) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #24 (JsonRpcConnection, #16) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #29 (JsonRpcConnection, #21) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #28 (JsonRpcConnection, #20) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #8 (JsonRpcConnection, #0) tasks: 26
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #16 (JsonRpcConnection, #8) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #15 (JsonRpcConnection, #7) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #23 (JsonRpcConnection, #15) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #27 (JsonRpcConnection, #19) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #36 (JsonRpcConnection, #28) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #12 (JsonRpcConnection, #4) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #21 (JsonRpcConnection, #13) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #14 (JsonRpcConnection, #6) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #30 (JsonRpcConnection, #22) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #20 (JsonRpcConnection, #12) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #38 (JsonRpcConnection, #30) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #13 (JsonRpcConnection, #5) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #34 (JsonRpcConnection, #26) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #18 (JsonRpcConnection, #10) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #9 (JsonRpcConnection, #1) tasks: 28
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #32 (JsonRpcConnection, #24) tasks: 0
[2017-08-01 16:11:29 +0200] notice/WorkQueue: #10 (JsonRpcConnection, #2) tasks: 15
[2017-08-01 16:11:33 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:33 +0200] notice/WorkQueue: #5 (ApiListener, SyncQueue) tasks: 0
[2017-08-01 16:11:33 +0200] notice/WorkQueue: #4 (ApiListener, RelayQueue) tasks: 0
[2017-08-01 16:11:38 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:38 +0200] notice/ThreadPool: Pool #1: Pending tasks: 0; Average latency: 0ms; Threads: 4; Pool utilization: 0.00170543%
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #11 (JsonRpcConnection, #3) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #33 (JsonRpcConnection, #25) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #19 (JsonRpcConnection, #11) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #39 (JsonRpcConnection, #31) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #26 (JsonRpcConnection, #18) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #31 (JsonRpcConnection, #23) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #16 (JsonRpcConnection, #8) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #17 (JsonRpcConnection, #9) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #35 (JsonRpcConnection, #27) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #27 (JsonRpcConnection, #19) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #37 (JsonRpcConnection, #29) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #30 (JsonRpcConnection, #22) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #34 (JsonRpcConnection, #26) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #25 (JsonRpcConnection, #17) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #15 (JsonRpcConnection, #7) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #22 (JsonRpcConnection, #14) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #36 (JsonRpcConnection, #28) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #29 (JsonRpcConnection, #21) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #24 (JsonRpcConnection, #16) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #14 (JsonRpcConnection, #6) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #28 (JsonRpcConnection, #20) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #23 (JsonRpcConnection, #15) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #8 (JsonRpcConnection, #0) tasks: 26
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #18 (JsonRpcConnection, #10) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #21 (JsonRpcConnection, #13) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #12 (JsonRpcConnection, #4) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #38 (JsonRpcConnection, #30) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #20 (JsonRpcConnection, #12) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #32 (JsonRpcConnection, #24) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #9 (JsonRpcConnection, #1) tasks: 28
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #13 (JsonRpcConnection, #5) tasks: 0
[2017-08-01 16:11:39 +0200] notice/WorkQueue: #10 (JsonRpcConnection, #2) tasks: 20
[2017-08-01 16:11:39 +0200] information/ApiListener: New client connection for identity 'mon02' from [10.207.193.188]:47956
[2017-08-01 16:11:39 +0200] notice/ApiListener: New JSON-RPC client
[2017-08-01 16:11:39 +0200] notice/WorkQueue: Spawning WorkQueue threads for 'JsonRpcConnection, #3'
[2017-08-01 16:11:39 +0200] information/ApiListener: Sending config updates for endpoint 'mon02'.
[2017-08-01 16:11:39 +0200] information/ApiListener: Syncing runtime objects to endpoint 'mon02'.
[2017-08-01 16:11:39 +0200] notice/JsonRpcConnection: Received 'icinga::Hello' message from 'mon02'
[2017-08-01 16:11:39 +0200] information/ApiListener: Finished syncing runtime objects to endpoint 'mon02'.
[2017-08-01 16:11:39 +0200] information/ApiListener: Finished sending config updates for endpoint 'mon02'.
[2017-08-01 16:11:39 +0200] information/ApiListener: Sending replay log for endpoint 'mon02'.
[2017-08-01 16:11:39 +0200] information/ApiListener: Finished sending replay log for endpoint 'mon02'.
[2017-08-01 16:11:39 +0200] notice/JsonRpcConnection: Received 'config::Update' message from 'mon02'
[2017-08-01 16:11:39 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/global-templates/.timestamp'
[2017-08-01 16:11:39 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/global-templates/_etc/.checkcommands.conf.swp'
[2017-08-01 16:11:39 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/global-templates/_etc/checkcommands.conf'
[2017-08-01 16:11:39 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/global-templates/_etc/checkcommands.ps.conf'
[2017-08-01 16:11:43 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:43 +0200] notice/WorkQueue: #5 (ApiListener, SyncQueue) tasks: 0
[2017-08-01 16:11:43 +0200] notice/WorkQueue: #4 (ApiListener, RelayQueue) tasks: 0
[2017-08-01 16:11:44 +0200] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from 'mon02'
[2017-08-01 16:11:44 +0200] notice/JsonRpcConnection: Received 'event::ExecuteCommand' message from 'mon02'
[2017-08-01 16:11:48 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #39 (JsonRpcConnection, #31) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #11 (JsonRpcConnection, #3) tasks: 2
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #33 (JsonRpcConnection, #25) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #19 (JsonRpcConnection, #11) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #22 (JsonRpcConnection, #14) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #17 (JsonRpcConnection, #9) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #28 (JsonRpcConnection, #20) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #30 (JsonRpcConnection, #22) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #21 (JsonRpcConnection, #13) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #26 (JsonRpcConnection, #18) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #31 (JsonRpcConnection, #23) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #37 (JsonRpcConnection, #29) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #10 (JsonRpcConnection, #2) tasks: 27
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #15 (JsonRpcConnection, #7) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #16 (JsonRpcConnection, #8) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #36 (JsonRpcConnection, #28) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #27 (JsonRpcConnection, #19) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #25 (JsonRpcConnection, #17) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #35 (JsonRpcConnection, #27) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #29 (JsonRpcConnection, #21) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #8 (JsonRpcConnection, #0) tasks: 26
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #34 (JsonRpcConnection, #26) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #24 (JsonRpcConnection, #16) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #18 (JsonRpcConnection, #10) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #14 (JsonRpcConnection, #6) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #23 (JsonRpcConnection, #15) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #32 (JsonRpcConnection, #24) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #12 (JsonRpcConnection, #4) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #13 (JsonRpcConnection, #5) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #38 (JsonRpcConnection, #30) tasks: 0
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #9 (JsonRpcConnection, #1) tasks: 28
[2017-08-01 16:11:49 +0200] notice/WorkQueue: #20 (JsonRpcConnection, #12) tasks: 0
[2017-08-01 16:11:53 +0200] notice/WorkQueue: #5 (ApiListener, SyncQueue) tasks: 0
[2017-08-01 16:11:53 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:53 +0200] notice/WorkQueue: #4 (ApiListener, RelayQueue) tasks: 0
[2017-08-01 16:11:53 +0200] notice/ApiListener: Relaying 'event::UpdateRepository' message
[2017-08-01 16:11:53 +0200] notice/ApiListener: Sending message 'event::UpdateRepository' to 'mon01'
[2017-08-01 16:11:53 +0200] notice/ThreadPool: Pool #1: Pending tasks: 0; Average latency: 0ms; Threads: 4; Pool utilization: 0.0109463%
[2017-08-01 16:11:58 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #39 (JsonRpcConnection, #31) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #21 (JsonRpcConnection, #13) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #11 (JsonRpcConnection, #3) tasks: 6
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #28 (JsonRpcConnection, #20) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #22 (JsonRpcConnection, #14) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #35 (JsonRpcConnection, #27) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #33 (JsonRpcConnection, #25) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #34 (JsonRpcConnection, #26) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #26 (JsonRpcConnection, #18) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #17 (JsonRpcConnection, #9) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #24 (JsonRpcConnection, #16) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #37 (JsonRpcConnection, #29) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #19 (JsonRpcConnection, #11) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #30 (JsonRpcConnection, #22) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #8 (JsonRpcConnection, #0) tasks: 26
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #36 (JsonRpcConnection, #28) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #10 (JsonRpcConnection, #2) tasks: 32
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #31 (JsonRpcConnection, #23) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #38 (JsonRpcConnection, #30) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #16 (JsonRpcConnection, #8) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #15 (JsonRpcConnection, #7) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #29 (JsonRpcConnection, #21) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #25 (JsonRpcConnection, #17) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #27 (JsonRpcConnection, #19) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #18 (JsonRpcConnection, #10) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #23 (JsonRpcConnection, #15) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #32 (JsonRpcConnection, #24) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #9 (JsonRpcConnection, #1) tasks: 28
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #14 (JsonRpcConnection, #6) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #20 (JsonRpcConnection, #12) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #12 (JsonRpcConnection, #4) tasks: 0
[2017-08-01 16:11:59 +0200] notice/WorkQueue: #13 (JsonRpcConnection, #5) tasks: 0
[2017-08-01 16:12:03 +0200] notice/WorkQueue: #4 (ApiListener, RelayQueue) tasks: 0
[2017-08-01 16:12:03 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:12:03 +0200] notice/WorkQueue: #5 (ApiListener, SyncQueue) tasks: 0
[2017-08-01 16:12:08 +0200] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2017-08-01 16:12:08 +0200] notice/ThreadPool: Pool #1: Pending tasks: 0; Average latency: 0ms; Threads: 4; Pool utilization: 0.00212955%
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #11 (JsonRpcConnection, #3) tasks: 12
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #26 (JsonRpcConnection, #18) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #33 (JsonRpcConnection, #25) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #39 (JsonRpcConnection, #31) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #35 (JsonRpcConnection, #27) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #21 (JsonRpcConnection, #13) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #30 (JsonRpcConnection, #22) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #28 (JsonRpcConnection, #20) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #34 (JsonRpcConnection, #26) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #22 (JsonRpcConnection, #14) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #17 (JsonRpcConnection, #9) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #24 (JsonRpcConnection, #16) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #10 (JsonRpcConnection, #2) tasks: 38
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #37 (JsonRpcConnection, #29) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #32 (JsonRpcConnection, #24) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #19 (JsonRpcConnection, #11) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #14 (JsonRpcConnection, #6) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #15 (JsonRpcConnection, #7) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #36 (JsonRpcConnection, #28) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #16 (JsonRpcConnection, #8) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #31 (JsonRpcConnection, #23) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #8 (JsonRpcConnection, #0) tasks: 26
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #38 (JsonRpcConnection, #30) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #29 (JsonRpcConnection, #21) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #25 (JsonRpcConnection, #17) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #23 (JsonRpcConnection, #15) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #27 (JsonRpcConnection, #19) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #18 (JsonRpcConnection, #10) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #20 (JsonRpcConnection, #12) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #9 (JsonRpcConnection, #1) tasks: 28
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #12 (JsonRpcConnection, #4) tasks: 0
[2017-08-01 16:12:09 +0200] notice/WorkQueue: #13 (JsonRpcConnection, #5) tasks: 0
[2017-08-01 16:12:09 +0200] information/JsonRpcConnection: No messages for identity 'mon01' have been received in the last 60 seconds.
[2017-08-01 16:12:09 +0200] warning/JsonRpcConnection: API client disconnected for identity 'mon01'
[2017-08-01 16:12:09 +0200] warning/JsonRpcConnection: API client disconnected for identity 'mon01'
[2017-08-01 16:12:09 +0200] warning/ApiListener: Removing API client for endpoint 'mon01'. 0 API clients left.
[2017-08-01 16:12:09 +0200] warning/ApiListener: Removing API client for endpoint 'mon01'. 0 API clients left.

Error like

[2017-08-01 16:12:09 +0200] information/JsonRpcConnection: No messages for identity 'mon01' have been received in the last 60 seconds.

Does not appears, if we disable the long running checks.

lazyfrosch commented 7 years ago

Then the agent willingly disconnects from the master, since he thinks the connection is dead.

Happening here: https://github.com/Icinga/icinga2/blob/e8463887fdb586b1e787f4461388d09178967f0f/lib/remote/jsonrpcconnection.cpp#L323

Heartbeats are supposed to be sent every 10 seconds to all active clients: https://github.com/Icinga/icinga2/blob/e8463887fdb586b1e787f4461388d09178967f0f/lib/remote/jsonrpcconnection-heartbeat.cpp#L41

How big is this setup in terms of clients directly connected to the master nodes?

jkroepke commented 7 years ago

Hi,

How big is this setup in terms of clients directly connected to the master nodes?

We have 2 endpoints in the master zone. No satellites.

icinga2 object list -t Endpoint | grep Object -c
292

Debug Log from mon01:

[root@i-mon01 ~]# zgrep "2017-08-01 16:12" /var/log/icinga2/debug.log-20170802.gz | egrep 'Api|Json' | grep agent
[2017-08-01 16:12:00 +0200] notice/ApiListener: Setting log position for identity 'agent': 2017/08/01 16:11:53
[2017-08-01 16:12:00 +0200] notice/ApiListener: Sending message 'event::ExecuteCommand' to 'agent'
[2017-08-01 16:12:02 +0200] notice/JsonRpcConnection: Received 'event::Heartbeat' message from 'agent'
[2017-08-01 16:12:03 +0200] notice/ApiListener: Sending message 'event::ExecuteCommand' to 'agent'
[2017-08-01 16:12:05 +0200] notice/ApiListener: Setting log position for identity 'agent': 2017/08/01 16:11:53
[2017-08-01 16:12:08 +0200] notice/ApiListener: Sending message 'event::ExecuteCommand' to 'agent'
[2017-08-01 16:12:09 +0200] warning/JsonRpcConnection: API client disconnected for identity 'agent'
[2017-08-01 16:12:09 +0200] warning/ApiListener: Removing API client for endpoint 'agent'. 0 API clients left.
[2017-08-01 16:12:54 +0200] information/JsonRpcConnection: Reconnecting to API endpoint 'agent' via host '10.0.0.3' and port '5665'
[2017-08-01 16:12:54 +0200] information/ApiListener: New client connection for identity 'agent' to [10.0.0.3]:5665
[2017-08-01 16:12:54 +0200] information/ApiListener: Sending config updates for endpoint 'agent'.
[2017-08-01 16:12:54 +0200] information/ApiListener: Syncing configuration files for global zone 'global-templates' to endpoint 'agent'.
[2017-08-01 16:12:54 +0200] information/ApiListener: Syncing runtime objects to endpoint 'agent'.
[2017-08-01 16:12:54 +0200] information/ApiListener: Finished syncing runtime objects to endpoint 'agent'.
[2017-08-01 16:12:54 +0200] information/ApiListener: Finished sending config updates for endpoint 'agent'.
[2017-08-01 16:12:54 +0200] information/ApiListener: Sending replay log for endpoint 'agent'.
[2017-08-01 16:12:54 +0200] information/ApiListener: Finished sending replay log for endpoint 'agent'.
[2017-08-01 16:12:55 +0200] notice/ApiListener: Setting log position for identity 'agent': 2017/08/01 16:11:53
[2017-08-01 16:12:56 +0200] notice/ApiListener: Sending message 'event::ExecuteCommand' to 'agent'
jkroepke commented 7 years ago

@lazyfrosch. It's possible log events like Sending 'event::Heartbeat' message to Server?

Currently, icinga2 logs only received 'event::Heartbeat'.

dnsmichi commented 7 years ago

I'm still wondering what the issue is ... are these check results not returned to the master, or is it just about the disconnect?

jkroepke commented 7 years ago

Hi,

after upgrade to 2.7.0. The error still appears on a host.

Our icinga agent check just reports

Remote Icinga instance 'agent' is not connected to 'mon02'

dnsmichi commented I'm still wondering what the issue is ... are these check results not returned to the master, or is it just about the disconnect?

How Can i verify it?

The history from the check is flapping between Remote Icinga instance .. is not connected and just the normal output.

Anyway if we disable multiple long running checks (> 30 seconds and more). The connections problem are gone.

dnsmichi commented 7 years ago

The connection "problem" is not really one, as the connection will be re-established once there are new messages to send from the client to the master.

Attach to the API event streams and filter for that specific host/service. Look whether the master receives the check results for those long lasting checks.