Closed jhou4 closed 5 years ago
How's that Downtime being added initially - via API action/Icinga Web 2 or generated from a ScheduledDowntime object?
Hi, thanks for looking into this. We schedule downtimes from API on one of the master. One condition that could trigger this, was I found the downtime being recreated happened to have been scheduled on a master, when the master just started reloading, but the parent process was still processing everything. We have a large number of hosts/services, usually it takes a few minutes for the parent to finish up before the children processes take over.
Hi,
We have seen the same issue with Comments. From time to time our satellites go into spin and recreates comments about 200K times an hour. It is only two comments that are recreated over and over again. It looks like the config master disconnects from the satellite: [2019-06-11 09:46:03 +0000] warning/TlsStream: TLS stream was disconnected. [2019-06-11 09:46:03 +0000] warning/JsonRpcConnection: API client disconnected for identity 'eu-west-1-icinga-01.hosted.axis.com' [2019-06-11 09:46:03 +0000] warning/ApiListener: Removing API client for endpoint 'eu-west-1-icinga-01.hosted.axis.com'. 0 API clients left.
And then the two comments are recreated: [2019-06-11 09:46:08 +0000] information/ConfigObjectUtility: Created and activated object 'HOSTNAME!reboot-required!f4a48c8c-d6c4-4f2c-9fb6-0f1 3d5545762' of type 'Comment'. [2019-06-11 09:46:08 +0000] information/ConfigObjectUtility: Created and activated object 'HOSTNAME!reboot-required!a1e4a26e-3e6b-45df-9fa6-5 13447d29d90' of type 'Comment'.
Brgds
This may be related to another bug. Are you using the IDO and see those replicated Downtimes/Comments there?
Hi,
We use IDO on the config master and the comments are only stored once in the database.
Did a restart of Icinga2 on the config master and got this on the satellites:
[2019-06-18 06:34:07 +0000] information/ApiListener: Applying config update from endpoint 'eu-west-1-icinga-01.hosted.axis.com' of zone 'master'. [2019-06-18 06:34:07 +0000] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/global-templates//.timestamp [2019-06-18 06:34:07 +0000] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/global-templates' (17 Bytes). Received timestamp '2019-06-18 06:33:55 +0000' (1560839635.996240), Current timestamp '2019-06-13 02:00:43 +0000' (1560391243.082262). [2019-06-18 06:34:07 +0000] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/us-west-2//.timestamp [2019-06-18 06:34:07 +0000] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/us-west-2' (17 Bytes). Received timestamp '2019-06-18 06:33:55 +0000' (1560839635.993111), Current timestamp '2019-06-13 02:00:42 +0000' (1560391242.987072).
After this the create bonanza starts and continues until Icinga is restarted on the satellite.
I've started debuglog on one of our satellites and hopefully I can trigger the error while debuglog is running.
I restart the config master and the debuglog on satellite 1 shows : [2019-06-18 06:52:18 +0000] notice/JsonRpcConnection: Received 'config::UpdateObject' message from 'SATELLITE2' [2019-06-18 06:52:18 +0000] notice/ApiListener: Received update for object: {"config":"object Comment \"c1a064e3-9d59-4973-a018-bc1a7c8f2164\" ignore_on_error {\n\tauthor = \"admin\"\n\tentry_time = 1549964140.574786\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"HOSTNAME\"\n\tpersistent = false\n\tservice_name = \"reboot-required\"\n\ttext = \"Reboot\"\n\tversion = 1549964140.574807\n\tzone = \"us-west-2\"\n}\n","modified_attributes":{},"name":"HOSTNAME!reboot-required!c1a064e3-9d59-4973-a018-bc1a7c8f2164","original_attributes":[],"type":"Comment","version":1549964140.5748069286} [2019-06-18 06:52:18 +0000] notice/ApiListener: Discarding config update for object 'HOSTNAME!reboot-required!c1a064e3-9d59-4973-a018-bc1a7c8f2164': Object version 1549964140.574807 is more recent than the received version 1549964140.574807.
This repeats for two comments until Icinga2 is restarted. So it looks like the error occurs when the object update comes from satellite 2
Investigate on whether your _api package is broken on the affected endpoints, this is the internal storage for runtime created objects including downtimes/comments. Fixes for this are included in the upcoming 2.11 with snapshot packages available for testing.
You can also analyse this on your own with using this guide.
It looks alright when I follow the guide. I've also stopped icinga on the satellite, removed packages and zones from /var/lib/icinga2/api and started icinga again to make sure that it gets a completely fresh sync from the config master. I will continue to try and trigger the error and see if I can collect any more information.
@omattsson which Icinga version are you using?
@dnsmichi We run 2.10.3-1 in production at the moment.
Hm, ok. And there are no time difference between the instances, e.g. ntp is configured and the local ntp checks don't say that the offset has gone nuts?
They use the same NTP server and the offset on satellite 1 is -0.0005632638931 secs and satellite 2 is 0.007149338722 secs. The master has an offset of 0.0002931952477 secs. I consider that to be an okay offset at least.
Ok, thanks. Since you're not the original issue author, I'm just going the detail collect route again here. I'm wondering about the "comment object route" here, like which endpoint is the source for the comment creation and which other endpoints do the routing of that message, each creating the comment ... that would be awesome if extracted from the debug logs.
@jhou4 are you using ScheduledDowntime objects which generate such runtime Downtime objects? I've now spotted a similar case where one of my masters had a local host object with services, generated a downtime from an SD, and the first master couldn't apply the object because there was a missing object reference.
[2019-06-19 16:53:58 +0200] notice/JsonRpcConnection: Received 'config::UpdateObject' message from 'master2'
[2019-06-19 16:53:58 +0200] notice/ApiListener: Received update for object: {"config":"object Downtime \"08998dc7-c0e4-46fa-a382-39b438097c75\" ignore_on_error {\n\tauthor = \"icingaadmin\"\n\tauthoritative_zone = \"master\"\n\tcomment = \"Scheduled downtime for backup\"\n\tconfig_owner = \"master2!load!backup-downtime\"\n\tduration = 0.000000\n\tend_time = 1560992400.000000\n\tentry_time = 1560955436.475581\n\tfixed = true\n\thost_name = \"master2\"\n\tscheduled_by = \"master2!load!backup-downtime\"\n\tservice_name = \"load\"\n\tstart_time = 1560988800.000000\n\ttriggered_by = \"\"\n\tversion = 1560955436.475701\n}\n","modified_attributes":{},"name":"master2!load!08998dc7-c0e4-46fa-a382-39b438097c75","original_attributes":[],"type":"Downtime","version":1560955436.475701}
[2019-06-19 16:53:58 +0200] notice/ConfigCompiler: Compiling config file: var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf
[2019-06-19 16:53:58 +0200] debug/configitem: Committing 1 new items.
[2019-06-19 16:53:58 +0200] notice/WorkQueue: Spawning WorkQueue threads for 'ConfigObjectUtility::CreateObject'
[2019-06-19 16:53:58 +0200] debug/ConfigItem: Commit called for ConfigItem Type=Downtime, Name=08998dc7-c0e4-46fa-a382-39b438097c75
[2019-06-19 16:53:58 +0200] notice/ConfigObject: Ignoring config object '08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime' due to errors: Error: Validation failed for object 'master2!load!08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime'; Attribute 'host_name': Object 'master2' of type 'Host' does not exist.
Location: in var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf: 1:0-1:69
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(1): object Downtime "08998dc7-c0e4-46fa-a382-39b438097c75" ignore_on_error {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(2): author = "icingaadmin"
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(3): authoritative_zone = "master"
[2019-06-19 16:53:58 +0200] debug/configitem: Committed 1 items of type 'Downtime'.
[2019-06-19 16:53:58 +0200] debug/configitem: Committed 1 items.
[2019-06-19 16:53:58 +0200] information/ConfigObjectUtility: Created and activated object 'master2!load!08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime'.
[2019-06-19 16:53:58 +0200] notice/WorkQueue: Stopped WorkQueue threads for 'ConfigObjectUtility::CreateObject'
That would also explain the always same error message, since the object never is really created on the endpoint.
In our environment most downtimes are scheduled from schedule-downtime API call. Occasionally user could create adhoc downtime from Icingaweb. We do not use ScheduledDowntime(there is only a sample ScheduledDowntime rule in downtimes.conf, but I verified it did not create any object).
Ok, then a different thought - is the object where the downtime is created for, really existing on the first master/satellite endpoint which seems to log a loop of created objects?
Your initial log shows a host object, which might be defined not inside the master zone, can you share the output of
icinga2 object list --type Host --name appbig-rr-931
on both masters?
Unless configuration was out of sync somehow, I would assume that the object existed on masters/satellites. Didn't verify that particular host object last time when I looked at it. Will keep it in mind and do that next time.
Here are the outputs from two masters. Config master: iob1-cas-icinga01 /etc/icinga2/zones.d/global-templates/base-checks # icinga2 object list --type Host --name appbig-rr-931 Object 'appbig-rr-931' of type 'Host': % declared in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114751:1-114751:27
Second master: inj1-cas-icinga01 /var/log/icinga2 # icinga2 object list --type Host --name appbig-rr-931 Object 'appbig-rr-931' of type 'Host': % declared in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114751:1-114751:27
We had a new occurrence last night. It was auto-fixed(we put in a script to monitor and auto-fix the condition). So I couldn't tell if configuration was out of sync. But there were some extra errors, which hopefully can help to identify the cuase:
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] critical/config: Error: An object with type 'Downtime' and name 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' already exists (in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69), new declaration: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
Location: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(1): object Downtime "492f19a6-f587-485b-9f4c-faa1b322169c" ignore_on_error {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(2): author = "icinga2adm"
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(3): comment = "OS REBOOT"
[2019-06-24 22:57:16 -0400] critical/config: 1 error
[2019-06-24 22:57:16 -0400] critical/ApiListener: Could not create object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c':
[2019-06-24 22:57:16 -0400] critical/ApiListener: Error: An object with type 'Downtime' and name 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' already exists (in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69), new declaration: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
Location: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
[2019-06-24 22:57:16 -0400] warning/PluginCheckTask: Check command for object 'nj-dfm3!check_ntp_proc' (PID: 67154, arguments: '/usr/lib64/nagios/plugins/check_nrpe' '-H' 'nj-dfm3' '-c' 'check_ntp_proc' '-t' '30') terminated with exit code 255, output: connect to address 10.126.187.94 port 5666: Connection refused
connect to host nj-dfm3 port 5666: Connection refused
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
...
Above logs were from one of the satellites. The same downtime was also being created repeatedly on two masters and the other satellite in the same zone at the same time. Below was the configure master(where the downtime was initially created via api), it started 10 minutes after the downtime was created:
iob1-cas-icinga01 /var/log/icinga2 # grep p203 icinga2.log-20190625 |more
[2019-06-24 22:47:33 -0400] information/ConfigObjectUtility: Created and activated object 'p203!c86ed88e-2dee-4773-8151-e0c68d9c34df' of type 'Downtime'.
[2019-06-24 22:47:35 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!53e5662c-aa4f-470a-a67e-8bb72a88916b' of type 'Downtime'.
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:13 -0400] information/ConfigObjectUtility: Created and activated object 'p203!45aed60e-cd3c-446a-90e4-9aac8e179c37' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
...
Correction, the one created at 22:47:35 was actually different from the repeating one starting at 22:57:15.
Could it be /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/$problematicDowntime.conf
gets deleted and recreated after every load?
I'm looking at the code but don't understand how this happens after the Object Registration.
The issue goes away after a reload("systemctl reload icinga2" command).
@jhou4 can you please collect all the debug logs from both instances when this exact situation happens again? (or, if you have them already).
Here's a safe place to upload, no-one except you and us can see the files. Thanks. https://nextcloud.icinga.com/index.php/s/eLDRTLyKLeA2mEG
It may be a little challenging to collect debug logs due to the fast rate it grows. But I can try logrotate. Will do some testing. Just want to confirm, you want the debug logs from one endpoint in each zone, not the two endpoints in the same zone, correct?
It looks like we can't rotate debug.log without restarting icinga2 service. The disk space I have available can hold a few hours' logs. That means I will have to restart icinga2 every few hours, which is not very practical, especially considering this issue only happens once or twice a month. Any suggestions on how to collect the debug logs?
you want the debug logs from one endpoint in each zone, not the two endpoints in the same zone, correct?
All of them, if possible.
It looks like we can't rotate debug.log without restarting icinga2 service.
There's a special signal which is sent by logrotated - Icinga then knows to release the file handles while logrotate already moved the files.
/bin/kill -USR1 $(pidof icinga2) 2> /dev/null || true
Btw, anyone who experiences this issue, is invited to upload the debug logs to the nextcloud URL. The more logs are there, the better the error can be analysed.
That works. Thanks. I set it up on all our Icinga instances. Now just need to wait until it happens again.
We got lucky, the issue happened at 01:28 this AM. I uploaded two debug logs(one from the config master, the other one was from a satellite in the affected zone) for now due to the size. The uploaded files are showing a "!" sign after finishing. Not sure if it is normal. Please check. Also, let me know if you need logs from other icinga nodes. Thanks.
Didn't work unfortunately. How big is the size of the compressed logs?
The compressed files were 1.3G/0.5G. I tried again but still got the same thing, it took a while, then gave the "!" sign but didn't say what was wrong though.
Ok, we had a limit there, it is now raised and should work. Sorry for the trouble.
I just uploaded the file again, result was still the same so I am not sure if it worked. Please check. Thanks.
Didn't work, and we re-tested our nextcloud instance with a dummy file of 1.9 GB. Seems your connection causes trouble here. Do you have the chance to put it somewhere safe and send me the URL to michael.friedrich@icinga.com?
Company network may have firewall blocking the upload indeed. I am looking into this.
Sorry for the delay. I am still trying to get an OK for sharing the data.
With comments (acknowledgement) only the set-acknowledgement and clear-acknowledgement events are logged. They are relevant to the Checkable they belong to but not the comment.
The issue is either with the config sync or the events sent during runtime. I'll investigate under which circumstances we are sending an UpdateObject event, maybe we are sending one when we shouldn't which then easily could be faulty and confusing.
I ran into the bug and was able to extract some information with the help of @dnsmichi
Master Master setup, henceforth named M1 and M2. M1 is the config master. Both instances are 2.10.5 + a few debug log messages patched in. They are not under much load and have ntpd active. M2 crashed with a segfault (my mistake) previously to the bug occurring.
[2019-07-17 11:04:11 +0200] information/HttpServerConnection: Request: POST /v1/actions/acknowledge-problem?host=horst (from [127.0.0.1]:35686), user: root)
[2019-07-17 11:04:11 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying 'event::SetAcknowledgement' message. Write log is 1
[2019-07-17 11:04:11 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying '0x7f2b740a9600' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying '0x7f2b740aaee0' message from '0x7f2b74001870' to 'master' Want log: 0 need log: 0
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying '0x7f2b740012b0' message from '0x7f2b74001910' to 'master' Want log: 1 need log: 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "event::ClearAcknowledgement"
params = {
host = "horst"
}
ts = 1563354586.692563
}' message from '0x7f3da8001280' to 'master' Want log: 1 need log: 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::DeleteObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::DeleteObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
....
M2: Starts with a single Create followed by Delete before switching to two creates
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "config::UpdateObject"
params = {
config = "object Comment \"291e38d4-c273-4013-bfdf-d2233a51360d\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563354251.591243\n\tentry_type = 4.000000\n\texpire_time =
0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563354251.591259\n\tzone = \"master\"\n}\n"
modified_attributes = {
}
name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
original_attributes = [ ]
type = "Comment"
version = 1563354251.591259
}
ts = 1563354586.626437
}' message from '0' to 'master' Want log: 0 need log: 1
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "config::DeleteObject"
params = {
name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
type = "Comment"
version = 1563354251.591259
}
ts = 1563354586.693694
}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "config::UpdateObject"
params = {
config = "object Comment \"291e38d4-c273-4013-bfdf-d2233a51360d\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563354251.591243\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563354251.591259\n\tzone = \"master\"\n}\n"
modified_attributes = {
}
name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
original_attributes = [ ]
type = "Comment"
version = 1563354251.591259
}
ts = 1563354586.713520
}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "config::UpdateObject"
params = {
config = "object Comment \"291e38d4-c273-4013-bfdf-d2233a51360d\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563354251.591243\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563354251.591259\n\tzone = \"master\"\n}\n"
modified_attributes = {
}
name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
original_attributes = [ ]
type = "Comment"
version = 1563354251.591259
}
ts = 1563354586.718544
}' message from '0x7f3da8001640' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "config::DeleteObject"
params = {
name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
type = "Comment"
version = 1563354251.591259
}
ts = 1563354586.718717
}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
...
Note that the version stays the same but the timestamp changes.
The issue resolved when we tried to attach gdb on M1 to the process and thereby stopped execution:
[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::DeleteObject' message. Write log is 0
[2019-07-17 11:42:54 +0200] information/Application: We jumped forward in time: 151.2 seconds
[2019-07-17 11:42:54 +0200] warning/JsonRpcConnection: API client disconnected for identity 'icinga2-centos7-dev.vagrant.demo.icinga.com'
[2019-07-17 11:42:54 +0200] warning/ApiListener: Removing API client for endpoint 'icinga2-centos7-dev.vagrant.demo.icinga.com'. 0 API clients left.
[2019-07-17 11:42:54 +0200] warning/TlsStream: TLS stream was disconnected.
[2019-07-17 11:42:54 +0200] warning/TlsStream: TLS stream was disconnected.
While this was happening the file in question was deleted and an created all the time (M2):
[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
ls: cannot access lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf: No such file or directory
[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf
[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf
While I don't think datetime offset between the servers has anything to do with this, this is certainly a timing issue. The version (which did not change) is the most important change indicator, the check for it may be problematic. Other than that the existence of two CreateObject event is notable. One has the instance it self as origin, the other that of the remote. It would be interesting to know if they are the same ones as on the other master
I will play around with the versions a bit, maybe I'm able to force the bug this way. If so it will go a long way to finally put this issue to rest.
I am glad to see that we are getting closer to the root cause of this issue. Sorry I couldn't provide the full debug logs. But maybe I can extract some info from the logs if that helps. Just let me know. Thanks.
Another run in with this + more log info:
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","type":"Comment","version":1563370041.8454220295},"ts":1563370102.3430480957}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295}
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
Context:
(0) Activating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'
[2019-07-17 15:28:22 +0200] information/ConfigObjectUtility: Created and activated object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'.
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295},"ts":1563370102.3733160496}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Discarding config update for object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47': Object version 1563370041.845422 is more recent than the received version 1563370041.845422.
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295},"ts":1563370102.3735001087}' message from '0x7f76b0023380' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is inactive and Extension is set to:1
Context:
(0) Deactivating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","type":"Comment","version":1563370041.8454220295},"ts":1563370102.3739030361}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295}
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
Context:
(0) Activating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'
[2019-07-17 15:28:22 +0200] information/ConfigObjectUtility: Created and activated object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'.
It's flipping between activation and deactiviation, triggering the activeChanged event as Michi suspected. In this case I had the following patches compiled in in hopes of forcing this (there is no reliable way to reproduce this yet)
+++ b/lib/remote/configobjectutility.cpp
@@ -89,6 +89,8 @@ String ConfigObjectUtility::CreateObjectConfig(const Type::Ptr& type, const Stri
/* update the version for config sync */
allAttrs->Set("version", Utility::GetTime());
+ allAttrs->Set("version", Utility::GetTime()+60);
+
+++ b/lib/remote/apilistener-configsync.cpp
@@ -191,7 +195,7 @@ Value ApiListener::ConfigUpdateObjectAPIHandler(const MessageOrigin::Ptr& origin
}
/* keep the object version in sync with the sender */
- object->SetVersion(objVersion, false, origin);
+ object->SetVersion(objVersion+2, false, origin);
And hank you @jhou4 for your help so far. I think we have a good idea of where this is coming from but since we still can't reproduce this as needed, keep an eye out for this bug. Anything regarding the circumstances of this is interesting, eg. does it also happen during runtime or shutdown? Have other objects that Downtimes and Comments been affected etc.
The main question is - why would SetVersion() generate a Delete event. That's not yet clear to me.
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is inactive and Extension is set to:1
C
Trace this event down. Maybe it is related to the reload handling.
Since it's only Downtimes and Comments (Ack comments in the cases I observed) it is also possible this happens as intended. Icinga starts, Host is checked and up again, Acknowledgement deleted.
Logs should prove that ;-)
In this case I had two acknowledgements on the Host, so double the fun. This also makes me think it's not something specific to that one object but tied to the ClearAcks event. My workflow to semi reliably reproduce this:
object Host "horst" { check_command = "dummy"; check_interval=10s }
M2 (restarted a bunch):
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::ClearAcknowledgement","params":{"host":"horst"},"ts":1563375558.5422809124}' message from '0x7fe470000fb0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is inactive and Extension is set to:1
Context:
(0) Deactivating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is inactive and Extension is set to:1
Context:
(0) Deactivating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","type":"Comment","version":1563375558.2112770081},"ts":1563375558.5436320305}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","type":"Comment","version":1563375351.7715198994},"ts":1563375558.5441069603}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::SetNextCheck","params":{"host":"horst","next_check":1563375568.0342299938},"ts":1563375558.5444951057}' message from '0x7fe4700017a0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::CheckResult","params":{"cr":{"active":true,"check_source":"dfws","command":null,"execution_end":1563375556.9986350536,"execution_start":1563375556.9986350536,"exit_status":0.0,"output":"Check was successful.","performance_data":[],"schedule_end":1563375556.9986488819,"schedule_start":1563375556.998208046,"state":0.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":2.0,"state_type":0.0}},"host":"horst"},"ts":1563375558.5446588993}' message from '0x7fe4700017a0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
Context:
(0) Activating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'
[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"ts":1563375558.5641889572}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"ts":1563375558.5649349689}' message from '0x7fe470001db0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
Context:
(0) Activating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'
[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'.
M1 (left running):
[2019-07-17 16:59:18 +0200] warning/ApiListener: Ignoring config update for zone 'master' because we have an authoritative version of the zone's config.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_tim
e = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n"
,"modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994}
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
Context:
(0) Activating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'
[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_
error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563
375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"t
s":1563375558.5496549606}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_
error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563
375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"t
s":1563375558.550219059}' message from '0x7f5fa0017fe0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e': Object version 1563375351.771520 is more recent than the receiv
ed version 1563375351.771520.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081}
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
Context:
(0) Activating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'
[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081},"ts":1563375558.5557129383}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081},"ts":1563375558.5573120117}' message from '0x7f5fa0017ae0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c': Object version 1563375558.211277 is more recent than the received version 1563375558.211277.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Downtime \"ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379\" ignore_on_error {\n\tauthor = \"baum\"\n\tcomment = \"tree\"\n\tconfig_owner = \"\"\n\tduration = 0.000000\n\tend_time = 1563370435.000000\n\tentry_time = 1563370075.939388\n\tfixed = true\n\thost_name = \"horst\"\n\tscheduled_by = \"\"\n\tstart_time = 1563370075.000000\n\ttriggered_by = \"\"\n\tversion = 1563370135.939455\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379","original_attributes":[],"type":"Downtime","version":1563370135.9394550323}
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379': Object version 1563370135.939455 is more recent than the received version 1563370135.939455.
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is inactive and Extension is set to:1
Context:
(0) Deactivating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","type":"Comment","version":1563375558.2112770081},"ts":1563375558.5586109161}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is inactive and Extension is set to:1
Context:
(0) Deactivating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'
@Crunsher - In the past with version 2.10.2, this issue(when it happened) always starts shortly after a reload. With version 2.10.4, it started on the primary master or satellites without an obvious reload(although I am not able to verify if the second master actually crashed right before that). I have only seen this behavior with downtimes, not other objects.
@Crunsher I don't exactly see what the logs are telling here. Our main focus should be on why a Comment/Downtime would be deleted thus following the create and delete again. Obviously it has to happen in the same moment, with a delay (stop or gdb attached process) it stops immediately. This is why I do think that it is related to the object version and activation handling in this specific scope.
We should also try whether the version-patch works to trigger the Downtime loop.
@jhou4 how is it going with the debug logs on your side?
The removal of the Acknowledgement is the trigger, something similar probably has to happen for downtimes. I had a downtime while testing this but only the Ack comment would be duplicated and only when the ClearAcknowledgement Event was involved.
Since this issue is about downtimes, we should try to reproduce it with this type as well. This points to specific object activation/deactivation and our handlers with receiving/denying config updates based on their version.
@dnsmichi There is some internal concern with sharing the logs. Very likely we will have to scrub the logs before uploading. I am still trying to get it approved. Sorry about that.
Describe the bug
Occasionally icinga constantly recreates same downtime over and over again, and barely does anything else. Issue goes away after reloads.
To Reproduce
Not sure how to reproduce.
Expected behavior
A downtime gets created once.
Screenshots
[2019-05-16 23:24:05 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:07 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'. [2019-05-16 23:24:07 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
Your Environment
Include as many relevant details about the environment you experienced the problem in
Version used (
icinga2 --version
): r2.10.4-1, but this happened on 2.10.2 as well.Operating System and version:
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Enabled features (
icinga2 feature list
): Enabled features: api checker mainlogIcinga Web 2 version and modules (System - About): 2.6.2 monitoring | 2.6.2 setup | 2.6.2
Config validation (
icinga2 daemon -C
): ... [2019-05-23 14:55:33 -0400] information/cli: Finished validating the configuration file(s).If you run multiple Icinga 2 instances, the
zones.conf
file (oricinga2 object list --type Endpoint
andicinga2 object list --type Zone
) from all affected nodes. object Zone "global-templates" { global = true }object Zone "master" { endpoints = ["master1", "master2"] }
object Zone "zone1" { endpoints = ["sat1", "sat2"] parent = "master" }
object Zone "zone2" { endpoints = ["sat3", "sat4"] parent = "master" } Note here master[12]/sat[1234] are simplified hostnames just to show the configuration.
Additional context
Add any other context about the problem here. We first saw this issue on 2.10.2, was hoping 2.10.4 has a fix but it doesn't. The behavior is slightly different though. In 2.10.2, issue happened on one master and two endpoints in the same zone at the same time. With 2.10.4, so far it happend twice, both on two endpoints in the same zone only, not on master.