Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 578 forks source link

Same downtime being created repeatedly in a cluster loop #7198

Closed jhou4 closed 5 years ago

jhou4 commented 5 years ago

Describe the bug

Occasionally icinga constantly recreates same downtime over and over again, and barely does anything else. Issue goes away after reloads.

To Reproduce

Not sure how to reproduce.

Expected behavior

A downtime gets created once.

Screenshots

[2019-05-16 23:24:05 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:07 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:07 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.

Your Environment

Include as many relevant details about the environment you experienced the problem in

object Zone "master" { endpoints = ["master1", "master2"] }

object Zone "zone1" { endpoints = ["sat1", "sat2"] parent = "master" }

object Zone "zone2" { endpoints = ["sat3", "sat4"] parent = "master" } Note here master[12]/sat[1234] are simplified hostnames just to show the configuration.

Additional context

Add any other context about the problem here. We first saw this issue on 2.10.2, was hoping 2.10.4 has a fix but it doesn't. The behavior is slightly different though. In 2.10.2, issue happened on one master and two endpoints in the same zone at the same time. With 2.10.4, so far it happend twice, both on two endpoints in the same zone only, not on master.

dnsmichi commented 5 years ago

How's that Downtime being added initially - via API action/Icinga Web 2 or generated from a ScheduledDowntime object?

jhou4 commented 5 years ago

Hi, thanks for looking into this. We schedule downtimes from API on one of the master. One condition that could trigger this, was I found the downtime being recreated happened to have been scheduled on a master, when the master just started reloading, but the parent process was still processing everything. We have a large number of hosts/services, usually it takes a few minutes for the parent to finish up before the children processes take over.

omattsson commented 5 years ago

Hi,

We have seen the same issue with Comments. From time to time our satellites go into spin and recreates comments about 200K times an hour. It is only two comments that are recreated over and over again. It looks like the config master disconnects from the satellite: [2019-06-11 09:46:03 +0000] warning/TlsStream: TLS stream was disconnected. [2019-06-11 09:46:03 +0000] warning/JsonRpcConnection: API client disconnected for identity 'eu-west-1-icinga-01.hosted.axis.com' [2019-06-11 09:46:03 +0000] warning/ApiListener: Removing API client for endpoint 'eu-west-1-icinga-01.hosted.axis.com'. 0 API clients left.

And then the two comments are recreated: [2019-06-11 09:46:08 +0000] information/ConfigObjectUtility: Created and activated object 'HOSTNAME!reboot-required!f4a48c8c-d6c4-4f2c-9fb6-0f1 3d5545762' of type 'Comment'. [2019-06-11 09:46:08 +0000] information/ConfigObjectUtility: Created and activated object 'HOSTNAME!reboot-required!a1e4a26e-3e6b-45df-9fa6-5 13447d29d90' of type 'Comment'.

Brgds

Crunsher commented 5 years ago

This may be related to another bug. Are you using the IDO and see those replicated Downtimes/Comments there?

omattsson commented 5 years ago

Hi,

We use IDO on the config master and the comments are only stored once in the database.

omattsson commented 5 years ago

Did a restart of Icinga2 on the config master and got this on the satellites:

[2019-06-18 06:34:07 +0000] information/ApiListener: Applying config update from endpoint 'eu-west-1-icinga-01.hosted.axis.com' of zone 'master'. [2019-06-18 06:34:07 +0000] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/global-templates//.timestamp [2019-06-18 06:34:07 +0000] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/global-templates' (17 Bytes). Received timestamp '2019-06-18 06:33:55 +0000' (1560839635.996240), Current timestamp '2019-06-13 02:00:43 +0000' (1560391243.082262). [2019-06-18 06:34:07 +0000] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/us-west-2//.timestamp [2019-06-18 06:34:07 +0000] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/us-west-2' (17 Bytes). Received timestamp '2019-06-18 06:33:55 +0000' (1560839635.993111), Current timestamp '2019-06-13 02:00:42 +0000' (1560391242.987072).

After this the create bonanza starts and continues until Icinga is restarted on the satellite.

I've started debuglog on one of our satellites and hopefully I can trigger the error while debuglog is running.

omattsson commented 5 years ago

I restart the config master and the debuglog on satellite 1 shows : [2019-06-18 06:52:18 +0000] notice/JsonRpcConnection: Received 'config::UpdateObject' message from 'SATELLITE2' [2019-06-18 06:52:18 +0000] notice/ApiListener: Received update for object: {"config":"object Comment \"c1a064e3-9d59-4973-a018-bc1a7c8f2164\" ignore_on_error {\n\tauthor = \"admin\"\n\tentry_time = 1549964140.574786\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"HOSTNAME\"\n\tpersistent = false\n\tservice_name = \"reboot-required\"\n\ttext = \"Reboot\"\n\tversion = 1549964140.574807\n\tzone = \"us-west-2\"\n}\n","modified_attributes":{},"name":"HOSTNAME!reboot-required!c1a064e3-9d59-4973-a018-bc1a7c8f2164","original_attributes":[],"type":"Comment","version":1549964140.5748069286} [2019-06-18 06:52:18 +0000] notice/ApiListener: Discarding config update for object 'HOSTNAME!reboot-required!c1a064e3-9d59-4973-a018-bc1a7c8f2164': Object version 1549964140.574807 is more recent than the received version 1549964140.574807.

This repeats for two comments until Icinga2 is restarted. So it looks like the error occurs when the object update comes from satellite 2

dnsmichi commented 5 years ago

Investigate on whether your _api package is broken on the affected endpoints, this is the internal storage for runtime created objects including downtimes/comments. Fixes for this are included in the upcoming 2.11 with snapshot packages available for testing.

You can also analyse this on your own with using this guide.

omattsson commented 5 years ago

It looks alright when I follow the guide. I've also stopped icinga on the satellite, removed packages and zones from /var/lib/icinga2/api and started icinga again to make sure that it gets a completely fresh sync from the config master. I will continue to try and trigger the error and see if I can collect any more information.

dnsmichi commented 5 years ago

@omattsson which Icinga version are you using?

omattsson commented 5 years ago

@dnsmichi We run 2.10.3-1 in production at the moment.

dnsmichi commented 5 years ago

Hm, ok. And there are no time difference between the instances, e.g. ntp is configured and the local ntp checks don't say that the offset has gone nuts?

omattsson commented 5 years ago

They use the same NTP server and the offset on satellite 1 is -0.0005632638931 secs and satellite 2 is 0.007149338722 secs. The master has an offset of 0.0002931952477 secs. I consider that to be an okay offset at least.

dnsmichi commented 5 years ago

Ok, thanks. Since you're not the original issue author, I'm just going the detail collect route again here. I'm wondering about the "comment object route" here, like which endpoint is the source for the comment creation and which other endpoints do the routing of that message, each creating the comment ... that would be awesome if extracted from the debug logs.

dnsmichi commented 5 years ago

@jhou4 are you using ScheduledDowntime objects which generate such runtime Downtime objects? I've now spotted a similar case where one of my masters had a local host object with services, generated a downtime from an SD, and the first master couldn't apply the object because there was a missing object reference.

[2019-06-19 16:53:58 +0200] notice/JsonRpcConnection: Received 'config::UpdateObject' message from 'master2'
[2019-06-19 16:53:58 +0200] notice/ApiListener: Received update for object: {"config":"object Downtime \"08998dc7-c0e4-46fa-a382-39b438097c75\" ignore_on_error {\n\tauthor = \"icingaadmin\"\n\tauthoritative_zone = \"master\"\n\tcomment = \"Scheduled downtime for backup\"\n\tconfig_owner = \"master2!load!backup-downtime\"\n\tduration = 0.000000\n\tend_time = 1560992400.000000\n\tentry_time = 1560955436.475581\n\tfixed = true\n\thost_name = \"master2\"\n\tscheduled_by = \"master2!load!backup-downtime\"\n\tservice_name = \"load\"\n\tstart_time = 1560988800.000000\n\ttriggered_by = \"\"\n\tversion = 1560955436.475701\n}\n","modified_attributes":{},"name":"master2!load!08998dc7-c0e4-46fa-a382-39b438097c75","original_attributes":[],"type":"Downtime","version":1560955436.475701}
[2019-06-19 16:53:58 +0200] notice/ConfigCompiler: Compiling config file: var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf
[2019-06-19 16:53:58 +0200] debug/configitem: Committing 1 new items.
[2019-06-19 16:53:58 +0200] notice/WorkQueue: Spawning WorkQueue threads for 'ConfigObjectUtility::CreateObject'
[2019-06-19 16:53:58 +0200] debug/ConfigItem: Commit called for ConfigItem Type=Downtime, Name=08998dc7-c0e4-46fa-a382-39b438097c75
[2019-06-19 16:53:58 +0200] notice/ConfigObject: Ignoring config object '08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime' due to errors: Error: Validation failed for object 'master2!load!08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime'; Attribute 'host_name': Object 'master2' of type 'Host' does not exist.
Location: in var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf: 1:0-1:69
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(1): object Downtime "08998dc7-c0e4-46fa-a382-39b438097c75" ignore_on_error {
                                                                                                                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(2):  author = "icingaadmin"
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(3):  authoritative_zone = "master"

[2019-06-19 16:53:58 +0200] debug/configitem: Committed 1 items of type 'Downtime'.
[2019-06-19 16:53:58 +0200] debug/configitem: Committed 1 items.
[2019-06-19 16:53:58 +0200] information/ConfigObjectUtility: Created and activated object 'master2!load!08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime'.
[2019-06-19 16:53:58 +0200] notice/WorkQueue: Stopped WorkQueue threads for 'ConfigObjectUtility::CreateObject'

That would also explain the always same error message, since the object never is really created on the endpoint.

jhou4 commented 5 years ago

In our environment most downtimes are scheduled from schedule-downtime API call. Occasionally user could create adhoc downtime from Icingaweb. We do not use ScheduledDowntime(there is only a sample ScheduledDowntime rule in downtimes.conf, but I verified it did not create any object).

dnsmichi commented 5 years ago

Ok, then a different thought - is the object where the downtime is created for, really existing on the first master/satellite endpoint which seems to log a loop of created objects?

Your initial log shows a host object, which might be defined not inside the master zone, can you share the output of

icinga2 object list --type Host --name appbig-rr-931

on both masters?

jhou4 commented 5 years ago

Unless configuration was out of sync somehow, I would assume that the object existed on masters/satellites. Didn't verify that particular host object last time when I looked at it. Will keep it in mind and do that next time.

Here are the outputs from two masters. Config master: iob1-cas-icinga01 /etc/icinga2/zones.d/global-templates/base-checks # icinga2 object list --type Host --name appbig-rr-931 Object 'appbig-rr-931' of type 'Host': % declared in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114751:1-114751:27

Second master: inj1-cas-icinga01 /var/log/icinga2 # icinga2 object list --type Host --name appbig-rr-931 Object 'appbig-rr-931' of type 'Host': % declared in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114751:1-114751:27

jhou4 commented 5 years ago

We had a new occurrence last night. It was auto-fixed(we put in a script to monitor and auto-fix the condition). So I couldn't tell if configuration was out of sync. But there were some extra errors, which hopefully can help to identify the cuase:

[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] critical/config: Error: An object with type 'Downtime' and name 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' already exists (in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69), new declaration: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
Location: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(1): object Downtime "492f19a6-f587-485b-9f4c-faa1b322169c" ignore_on_error {
                                                                                                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(2):  author = "icinga2adm"
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(3):  comment = "OS REBOOT"

[2019-06-24 22:57:16 -0400] critical/config: 1 error
[2019-06-24 22:57:16 -0400] critical/ApiListener: Could not create object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c':
[2019-06-24 22:57:16 -0400] critical/ApiListener: Error: An object with type 'Downtime' and name 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' already exists (in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69), new declaration: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
Location: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
[2019-06-24 22:57:16 -0400] warning/PluginCheckTask: Check command for object 'nj-dfm3!check_ntp_proc' (PID: 67154, arguments: '/usr/lib64/nagios/plugins/check_nrpe' '-H' 'nj-dfm3' '-c' 'check_ntp_proc' '-t' '30') terminated with exit code 255, output: connect to address 10.126.187.94 port 5666: Connection refused
connect to host nj-dfm3 port 5666: Connection refused
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
...
jhou4 commented 5 years ago

Above logs were from one of the satellites. The same downtime was also being created repeatedly on two masters and the other satellite in the same zone at the same time. Below was the configure master(where the downtime was initially created via api), it started 10 minutes after the downtime was created:

iob1-cas-icinga01 /var/log/icinga2 # grep p203 icinga2.log-20190625 |more
[2019-06-24 22:47:33 -0400] information/ConfigObjectUtility: Created and activated object 'p203!c86ed88e-2dee-4773-8151-e0c68d9c34df' of type 'Downtime'.
[2019-06-24 22:47:35 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!53e5662c-aa4f-470a-a67e-8bb72a88916b' of type 'Downtime'.
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:13 -0400] information/ConfigObjectUtility: Created and activated object 'p203!45aed60e-cd3c-446a-90e4-9aac8e179c37' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
...
jhou4 commented 5 years ago

Correction, the one created at 22:47:35 was actually different from the repeating one starting at 22:57:15.

Crunsher commented 5 years ago

Could it be /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/$problematicDowntime.conf gets deleted and recreated after every load?

I'm looking at the code but don't understand how this happens after the Object Registration.

jhou4 commented 5 years ago

The issue goes away after a reload("systemctl reload icinga2" command).

dnsmichi commented 5 years ago

@jhou4 can you please collect all the debug logs from both instances when this exact situation happens again? (or, if you have them already).

Here's a safe place to upload, no-one except you and us can see the files. Thanks. https://nextcloud.icinga.com/index.php/s/eLDRTLyKLeA2mEG

jhou4 commented 5 years ago

It may be a little challenging to collect debug logs due to the fast rate it grows. But I can try logrotate. Will do some testing. Just want to confirm, you want the debug logs from one endpoint in each zone, not the two endpoints in the same zone, correct?

jhou4 commented 5 years ago

It looks like we can't rotate debug.log without restarting icinga2 service. The disk space I have available can hold a few hours' logs. That means I will have to restart icinga2 every few hours, which is not very practical, especially considering this issue only happens once or twice a month. Any suggestions on how to collect the debug logs?

dnsmichi commented 5 years ago

you want the debug logs from one endpoint in each zone, not the two endpoints in the same zone, correct?

All of them, if possible.

It looks like we can't rotate debug.log without restarting icinga2 service.

There's a special signal which is sent by logrotated - Icinga then knows to release the file handles while logrotate already moved the files.

/bin/kill -USR1 $(pidof icinga2) 2> /dev/null || true
dnsmichi commented 5 years ago

Btw, anyone who experiences this issue, is invited to upload the debug logs to the nextcloud URL. The more logs are there, the better the error can be analysed.

jhou4 commented 5 years ago

That works. Thanks. I set it up on all our Icinga instances. Now just need to wait until it happens again.

jhou4 commented 5 years ago

We got lucky, the issue happened at 01:28 this AM. I uploaded two debug logs(one from the config master, the other one was from a satellite in the affected zone) for now due to the size. The uploaded files are showing a "!" sign after finishing. Not sure if it is normal. Please check. Also, let me know if you need logs from other icinga nodes. Thanks.

dnsmichi commented 5 years ago

Didn't work unfortunately. How big is the size of the compressed logs?

jhou4 commented 5 years ago

The compressed files were 1.3G/0.5G. I tried again but still got the same thing, it took a while, then gave the "!" sign but didn't say what was wrong though.

dnsmichi commented 5 years ago

Ok, we had a limit there, it is now raised and should work. Sorry for the trouble.

jhou4 commented 5 years ago

I just uploaded the file again, result was still the same so I am not sure if it worked. Please check. Thanks.

dnsmichi commented 5 years ago

Didn't work, and we re-tested our nextcloud instance with a dummy file of 1.9 GB. Seems your connection causes trouble here. Do you have the chance to put it somewhere safe and send me the URL to michael.friedrich@icinga.com?

jhou4 commented 5 years ago

Company network may have firewall blocking the upload indeed. I am looking into this.

jhou4 commented 5 years ago

Sorry for the delay. I am still trying to get an OK for sharing the data.

Crunsher commented 5 years ago

Replay log investigation

With comments (acknowledgement) only the set-acknowledgement and clear-acknowledgement events are logged. They are relevant to the Checkable they belong to but not the comment.

Time traveling

Next steps

The issue is either with the config sync or the events sent during runtime. I'll investigate under which circumstances we are sending an UpdateObject event, maybe we are sending one when we shouldn't which then easily could be faulty and confusing.

Crunsher commented 5 years ago

I ran into the bug and was able to extract some information with the help of @dnsmichi

Circumstances

Master Master setup, henceforth named M1 and M2. M1 is the config master. Both instances are 2.10.5 + a few debug log messages patched in. They are not under much load and have ntpd active. M2 crashed with a segfault (my mistake) previously to the bug occurring.

Timeline

The issue resolved when we tried to attach gdb on M1 to the process and thereby stopped execution:

[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::DeleteObject' message. Write log is 0
[2019-07-17 11:42:54 +0200] information/Application: We jumped forward in time: 151.2 seconds
[2019-07-17 11:42:54 +0200] warning/JsonRpcConnection: API client disconnected for identity 'icinga2-centos7-dev.vagrant.demo.icinga.com'
[2019-07-17 11:42:54 +0200] warning/ApiListener: Removing API client for endpoint 'icinga2-centos7-dev.vagrant.demo.icinga.com'. 0 API clients left.
[2019-07-17 11:42:54 +0200] warning/TlsStream: TLS stream was disconnected.
[2019-07-17 11:42:54 +0200] warning/TlsStream: TLS stream was disconnected.

While this was happening the file in question was deleted and an created all the time (M2):

[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
ls: cannot access lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf: No such file or directory
[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf
[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf

Conclusion

While I don't think datetime offset between the servers has anything to do with this, this is certainly a timing issue. The version (which did not change) is the most important change indicator, the check for it may be problematic. Other than that the existence of two CreateObject event is notable. One has the instance it self as origin, the other that of the remote. It would be interesting to know if they are the same ones as on the other master

Next Steps

I will play around with the versions a bit, maybe I'm able to force the bug this way. If so it will go a long way to finally put this issue to rest.

jhou4 commented 5 years ago

I am glad to see that we are getting closer to the root cause of this issue. Sorry I couldn't provide the full debug logs. But maybe I can extract some info from the logs if that helps. Just let me know. Thanks.

Crunsher commented 5 years ago

Another run in with this + more log info:

[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","type":"Comment","version":1563370041.8454220295},"ts":1563370102.3430480957}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295}
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
Context:
        (0) Activating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'

[2019-07-17 15:28:22 +0200] information/ConfigObjectUtility: Created and activated object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'.
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295},"ts":1563370102.3733160496}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Discarding config update for object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47': Object version 1563370041.845422 is more recent than the received version 1563370041.845422.
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295},"ts":1563370102.3735001087}' message from '0x7f76b0023380' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is inactive and Extension is set to:1
Context:
        (0) Deactivating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'

[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","type":"Comment","version":1563370041.8454220295},"ts":1563370102.3739030361}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295}
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
Context:
        (0) Activating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'

[2019-07-17 15:28:22 +0200] information/ConfigObjectUtility: Created and activated object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'.

It's flipping between activation and deactiviation, triggering the activeChanged event as Michi suspected. In this case I had the following patches compiled in in hopes of forcing this (there is no reliable way to reproduce this yet)

+++ b/lib/remote/configobjectutility.cpp
@@ -89,6 +89,8 @@ String ConfigObjectUtility::CreateObjectConfig(const Type::Ptr& type, const Stri

        /* update the version for config sync */
        allAttrs->Set("version", Utility::GetTime());
+       allAttrs->Set("version", Utility::GetTime()+60);
+

+++ b/lib/remote/apilistener-configsync.cpp
@@ -191,7 +195,7 @@ Value ApiListener::ConfigUpdateObjectAPIHandler(const MessageOrigin::Ptr& origin
        }

        /* keep the object version in sync with the sender */
-       object->SetVersion(objVersion, false, origin);
+       object->SetVersion(objVersion+2, false, origin);

And hank you @jhou4 for your help so far. I think we have a good idea of where this is coming from but since we still can't reproduce this as needed, keep an eye out for this bug. Anything regarding the circumstances of this is interesting, eg. does it also happen during runtime or shutdown? Have other objects that Downtimes and Comments been affected etc.

dnsmichi commented 5 years ago

The main question is - why would SetVersion() generate a Delete event. That's not yet clear to me.

[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is inactive and Extension is set to:1
C

Trace this event down. Maybe it is related to the reload handling.

Crunsher commented 5 years ago

Since it's only Downtimes and Comments (Ack comments in the cases I observed) it is also possible this happens as intended. Icinga starts, Host is checked and up again, Acknowledgement deleted.

dnsmichi commented 5 years ago

Logs should prove that ;-)

Crunsher commented 5 years ago

35vhk1

In this case I had two acknowledgements on the Host, so double the fun. This also makes me think it's not something specific to that one object but tied to the ClearAcks event. My workflow to semi reliably reproduce this:

  1. Config: object Host "horst" { check_command = "dummy"; check_interval=10s }
  2. Process Check result for "horst" to be down
  3. Acknowledge "horst"
  4. Restart a bunch

M2 (restarted a bunch):

[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::ClearAcknowledgement","params":{"host":"horst"},"ts":1563375558.5422809124}' message from '0x7fe470000fb0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is inactive and Extension is set to:1
Context:
        (0) Deactivating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'

[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is inactive and Extension is set to:1
Context:
        (0) Deactivating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'

[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","type":"Comment","version":1563375558.2112770081},"ts":1563375558.5436320305}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","type":"Comment","version":1563375351.7715198994},"ts":1563375558.5441069603}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::SetNextCheck","params":{"host":"horst","next_check":1563375568.0342299938},"ts":1563375558.5444951057}' message from '0x7fe4700017a0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::CheckResult","params":{"cr":{"active":true,"check_source":"dfws","command":null,"execution_end":1563375556.9986350536,"execution_start":1563375556.9986350536,"exit_status":0.0,"output":"Check was successful.","performance_data":[],"schedule_end":1563375556.9986488819,"schedule_start":1563375556.998208046,"state":0.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":2.0,"state_type":0.0}},"host":"horst"},"ts":1563375558.5446588993}' message from '0x7fe4700017a0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
Context:
        (0) Activating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'

[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"ts":1563375558.5641889572}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"ts":1563375558.5649349689}' message from '0x7fe470001db0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
Context:
        (0) Activating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'

[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'.

M1 (left running):

[2019-07-17 16:59:18 +0200] warning/ApiListener: Ignoring config update for zone 'master' because we have an authoritative version of the zone's config.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_tim
e = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n"
,"modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994}
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
Context:
        (0) Activating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'

[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_
error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563
375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"t
s":1563375558.5496549606}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_
error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563
375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"t
s":1563375558.550219059}' message from '0x7f5fa0017fe0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e': Object version 1563375351.771520 is more recent than the receiv
ed version 1563375351.771520.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081}
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
Context:
        (0) Activating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'

[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081},"ts":1563375558.5557129383}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081},"ts":1563375558.5573120117}' message from '0x7f5fa0017ae0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c': Object version 1563375558.211277 is more recent than the received version 1563375558.211277.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Downtime \"ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379\" ignore_on_error {\n\tauthor = \"baum\"\n\tcomment = \"tree\"\n\tconfig_owner = \"\"\n\tduration = 0.000000\n\tend_time = 1563370435.000000\n\tentry_time = 1563370075.939388\n\tfixed = true\n\thost_name = \"horst\"\n\tscheduled_by = \"\"\n\tstart_time = 1563370075.000000\n\ttriggered_by = \"\"\n\tversion = 1563370135.939455\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379","original_attributes":[],"type":"Downtime","version":1563370135.9394550323}
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379': Object version 1563370135.939455 is more recent than the received version 1563370135.939455.
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is inactive and Extension is set to:1
Context:
        (0) Deactivating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","type":"Comment","version":1563375558.2112770081},"ts":1563375558.5586109161}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is inactive and Extension is set to:1
Context:
        (0) Deactivating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'
jhou4 commented 5 years ago

@Crunsher - In the past with version 2.10.2, this issue(when it happened) always starts shortly after a reload. With version 2.10.4, it started on the primary master or satellites without an obvious reload(although I am not able to verify if the second master actually crashed right before that). I have only seen this behavior with downtimes, not other objects.

dnsmichi commented 5 years ago

@Crunsher I don't exactly see what the logs are telling here. Our main focus should be on why a Comment/Downtime would be deleted thus following the create and delete again. Obviously it has to happen in the same moment, with a delay (stop or gdb attached process) it stops immediately. This is why I do think that it is related to the object version and activation handling in this specific scope.

We should also try whether the version-patch works to trigger the Downtime loop.

@jhou4 how is it going with the debug logs on your side?

Crunsher commented 5 years ago

The removal of the Acknowledgement is the trigger, something similar probably has to happen for downtimes. I had a downtime while testing this but only the Ack comment would be duplicated and only when the ClearAcknowledgement Event was involved.

dnsmichi commented 5 years ago

Since this issue is about downtimes, we should try to reproduce it with this type as well. This points to specific object activation/deactivation and our handlers with receiving/denying config updates based on their version.

jhou4 commented 5 years ago

@dnsmichi There is some internal concern with sharing the logs. Very likely we will have to scrub the logs before uploading. I am still trying to get it approved. Sorry about that.