Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Replay-log not writing performance data #7704

Open · schuettloeffel-elsa opened this issue 4 years ago

schuettloeffel-elsa commented 4 years ago

Describe the bug

In a single-master setup, while the master is down, replay logs are written on all agents to /var/lib/icinga2/api/log/current. However, once the master is back up, no performance data for the downtime is replayed to our InfluxDB backend, and the state history is also empty for that period.

It seems replaying to the master is working:

Agents log

[2019-12-09 12:21:12 +0100] information/ApiListener: Sending replay log for endpoint 'xxx' in zone 'master'.
[2019-12-09 12:21:12 +0100] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/current
[2019-12-09 12:21:12 +0100] information/ApiListener: Replayed 129 messages.

Masters log

[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::CheckResult' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
...

But for some reason the master is not processing the replay logs after receiving them.

To Reproduce

  1. On the master, stop the icinga2 service
  2. On the agents, wait for the replay log to be filled
  3. Start the icinga2 service on the master again
  4. Check for history data covering the downtime

Expected behavior

  1. Performance data is replayed to the data backends (e.g. InfluxDB)
  2. State histories are updated for the downtime

Your Environment

Object 'yyy' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 60:1-60:42
  * __name = "yyy"
  * host = "yyy"
    % = modified in '/etc/icinga2/zones.conf', lines 60:45-60:75
  * log_duration = 86400
  * name = "yyy"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 1
    * first_line = 60
    * last_column = 42
    * last_line = 60
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "yyy" ]
    % = modified in '/etc/icinga2/zones.conf', lines 60:1-60:42
  * type = "Endpoint"
  * zone = ""
dnsmichi commented 4 years ago

Hi,

do you see the InfluxDB metrics being collected from these check results? This can be seen/correlated in the debug log as well.
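
(For reference, metric delivery is handled by the InfluxdbWriter feature, and the debug log mentioned above is the debuglog feature, i.e. a FileLogger with severity "debug". A rough sketch of both objects follows; the connection values are placeholders, and the exact defaults live in features-available/influxdb.conf and debuglog.conf.)

object InfluxdbWriter "influxdb" {
  host = "127.0.0.1"        // placeholder; point this at your InfluxDB instance
  port = 8086
  database = "icinga2"
  flush_threshold = 1024
  flush_interval = 10s
}

object FileLogger "debug-file" {
  severity = "debug"
  path = LogDir + "/debug.log"
}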

@lippserd Maybe there's something weird going on with the JSON messages here as well.

schuettloeffel-elsa commented 4 years ago

@dnsmichi: no, the next log entry for that agent writing to InfluxDB is:

[2019-12-09 12:21:29 +0100] debug/InfluxdbWriter: Checkable 'yyy!workers' adds to metric list:'procs,hostname=yyy,service=workers,metric=procs crit=2500,value=6,warn=1850 1575890489'.

And that is already new data from after the master came back up.

dgoetz commented 4 years ago

Does the agent schedule the checks itself, or is it only used as a command endpoint? In the first case replaying should work; in the second, no checks are executed while the connection is down, because the master schedules the checks but cannot tell the endpoint to execute them. I cannot tell from the issue how this is set up, so could you provide at least one service configuration?
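
To illustrate the distinction: with command_endpoint the master schedules the check and only delegates execution to the agent, so (as described above) nothing runs while the master is down; a service deployed into the agent's own zone is scheduled by the agent itself, and its results end up in the replay log. A rough sketch, where the service names, the agent_endpoint custom variable, and the "agent-zone" name are made up for illustration:

// Variant 1: command endpoint. The master schedules the check and only
// delegates execution, so no results are produced while the master is down.
apply Service "procs-bridged" {
  check_command = "procs"
  command_endpoint = host.vars.agent_endpoint   // hypothetical custom variable
  assign where host.vars.agent_endpoint
}

// Variant 2: service deployed into the agent's zone (e.g. zones.d/agent-zone/
// on the config master). The agent schedules the check itself and replays
// the results to the master after reconnecting.
apply Service "procs-local" {
  check_command = "procs"
  assign where host.zone == "agent-zone"        // hypothetical zone name
}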

schuettloeffel-elsa commented 4 years ago

@dgoetz: Since the agent's replay log is filled during the master's downtime, I assumed the agent still runs the checks itself.

Here is an excerpt from the replay log where you can see performance data being recorded:

897:{"message":"{\"jsonrpc\":\"2.0\",\"method\":\"event::CheckResult\",\"params\":{\"cr\":{\"active\":true,\"check_source\":\"yyy\",\"command\":[\"/usr/lib/nagios/plugins/check_procs\",\"-c\",\"400\",\"-w\",\"250\"],\"execution_end\":1576141528.909151,\"execution_start\":1576141528.880641,\"exit_status\":0.0,\"output\":\"PROCS OK: 151 processes \",\"performance_data\":[\"procs=151;250;400;0;\"],\"schedule_end\":1576141528.909239,\"schedule_start\":1576141528.8799999,\"state\":0.0,\"ttl\":0.0,\"type\":\"CheckResult\",\"vars_after\":{\"attempt\":1.0,\"reachable\":true,\"state\":0.0,\"state_type\":1.0},\"vars_before\":{\"attempt\":1.0,\"reachable\":true,\"state\":0.0,\"state_type\":1.0}},\"host\":\"yyy\",\"service\":\"procs\"},\"ts\":1576141528.909676}","secobj":{"name":"yyy!procs","type":"Service"},"timestamp":1576141528.909676}

The service definition:

apply Service "processes" {
  import "generic-service"

  check_command = "procs"
  command_endpoint = host.vars.remote_client

  /* Set W and C Values for all Hosts. */
  vars.procs_warning = 1500
  vars.procs_critical = 2000

  /* assign where a remote client is set */
  assign where host.vars.remote_client
}
dnsmichi commented 4 years ago

@lippserd this could correlate with the observation you've told me about.

Al2Klimov commented 4 years ago

@dnsmichi @lippserd Which observation?

lippserd commented 4 years ago

> @dnsmichi @lippserd Which observation?

Performance data is missing in a cluster setup. And this one could be one of the reasons. Since @htriem is working on a setup to reproduce this, I'll hand this over.

schuettloeffel-elsa commented 4 years ago

Any progress here?

Al2Klimov commented 2 years ago

@htriem Give us a status update.