Open w1ll-i-code opened 1 month ago
On vanilla icinga2 I can only reproduce it on the rhel8 package:
Here is everything you will need to reproduce the same errors locally:
tar -xaf icinga2-bugreport-reproduce.tar.gz
dropped_state_query
contains a script in rust to locate the lost events more easilyreproduce_local
contains a docker-compose to setup the test environment-icinga2-bugreport-reproduce/reproduce_local/icinga2-init/rhel8.Dockerfile
docker-compose up --build
in reproduce_local
docker exec reproduce_local_icingaweb_1 bash -c "while true; do icingacli director config deploy --force ; sleep 30; done"
dropped_state_query
run
cargo run -- --host localhost --user icingadb --password icingadb --database icingadb
Here is the archive with all needed data: icinga2-bugreport-reproduce.tar.gz
I could also reproduce the same problem on both Debian and RHEL 9, but it was orders of magnitude less likely to happen there.
I have now also reproduced the same issue with icingadb:
git clone https://github.com/lippserd/docker-compose-icinga
cd docker-compose-icinga
docker-compose up
docker exec docker-compose-icinga_director_1 icingacli director basket restore < icinga-lost-statechange-basket.json
docker exec docker-compose-icinga_director_1 bash -c 'for i in $(seq -f "%05g" 0 999); do
icingacli director host create "host-icinga-lost-statehistory-${i}" --imports "ht-icinga-lost-statechange"
done'
docker exec docker-compose-icinga_director_1 icingacli director config deploy
docker exec docker-compose-icinga_director_1 bash -c 'while true; do sleep 300; icingacli director config deploy --force; done'
From my local tests on a VM with 4 Cores and 8GB of RAM, the error should be observable in around 10 Minutes / two deploys. To locate the errors automatically from the db, I have attached the rust program I used to analyze the history tables. dropped_state_query.tar.gz
To run that, download the dropped_state_query.tar.gz
file and run:
tar -xaf dropped_state_query.tar.gz
cd dropped_state_query
cargo run --release -- --host 127.0.0.1 --user icingadb --password icingadb --icingadb | tee analyzed-history.log
Hint: Rust can be installed easily from https://rustup.rs/
Once the rust program has found some missing state_history and/or notification, you can verify that by looking at the service history.
Hi all,
Thank you for all the details. I am trying to reproduce the scenario, but so far without success. When was the screenshot taken? Immediately after executing the queries to check if something is missing in the database? As there is a high chance that Icinga DB has not inserted everything yet.
Also, missing entries in the database do not necessarily mean that Icinga has not sent a notification. That should rather be verified using custom check and notification plugins.
Best regards, Eric
Hello Eric.
I have run the tests over the weekend. Even after several days they do not appear in icingadb. This is of course obvious to you, as I posted the screenshot on the 14th, while the state changes happened on the 10th. And as you know, if it were a few minutes ago, it would not have shown the date, but rather the delta time since then.
As for the notifications, we became aware of the problem, because one of our services went into critical without sending any notifications. That does normally work and we have tried the configuration to make sure it works. It was on pure coincident that we noticed that, which lead us to investigate.
@lippserd I have opened a PR to fix the issue.
@lippserd Have you seen my PR? I would really appreciate some feedback.
The discussion in #10191 was rapidly superseded by the other comments, which is why I am attempting to clarify it her. I would like to begin by emphasising that I do really appreciate the analyses and PR you have presented and have no intention of discrediting them in any way. However, your script relies heavily on data from the database. I am not saying that the data from the database is entirely inaccurate, but there are other factors at play like Redis, Icinga DB etc. Thus, you can't trace a bug in Icinga 2 solely based on the database info. Though, while writing down this comment, I was able to successfully trigger some of the described behaviours with just a few hosts.
But first, I'll start by outlining the current deactivation order (without your PR) of the config objects. Initially, the checker is deactivated before anything else, and typically, during a normal circumstance, there won't be running checks after scheduler stops its thread. https://github.com/Icinga/icinga2/blob/7a20d987f6b44cf1e15d9a86e4cb84abfdecda15/lib/checker/checkercomponent.cpp#L79-L81 https://github.com/Icinga/icinga2/blob/7a20d987f6b44cf1e15d9a86e4cb84abfdecda15/lib/checker/checkercomponent.cpp#L107-L108
Afterwards, it will not schedule any new checks, but it will not stop already scheduled checks either. As a result, you may still observe running checks even after deactivating the checker. However, such running checks left overs should not cause a state change as once the Host or Service is also deactivated, no check result is processed (should not are the key words here! It is challenging, but it is possible to trigger state change events when the timings are perfectly matched). As illustrated in the example below, even if the check plugin has reported a problem state, most of the check results are nevertheless discarded because the corresponding hosts have already been deactivated. https://github.com/Icinga/icinga2/blob/7a20d987f6b44cf1e15d9a86e4cb84abfdecda15/lib/icinga/checkable-check.cpp#L151-L152
$ icinga2 daemon -c icinga2.conf
...
[2024-11-13 15:17:24 +0100] information/config: Host recovered
[2024-11-13 15:17:24 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:24 +0100] information/Checkable: Checkable 'test' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:24 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:24 +0100] information/config: Host recovered
[2024-11-13 15:17:24 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:24 +0100] information/Checkable: Checkable 'test1' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:24 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:24 +0100] information/config: Host recovered
[2024-11-13 15:17:24 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:24 +0100] information/Checkable: Checkable 'test3' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:24 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/config: Host recovered
[2024-11-13 15:17:25 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:25 +0100] information/Checkable: Checkable 'test4' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/config: Host recovered
[2024-11-13 15:17:25 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:25 +0100] information/Checkable: Checkable 'test2' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduling a check
^C[2024-11-13 15:17:26 +0100] information/Application: Received request to shut down.
[2024-11-13 15:17:26 +0100] information/Application: Shutting down...
[2024-11-13 15:17:26 +0100] information/ConfigObject: Deactivate() called for config object 'checker' with type 'CheckerComponent'.
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduler stopped
[2024-11-13 15:17:26 +0100] information/CheckerComponent: 'checker' stopped.
[2024-11-13 15:17:26 +0100] information/ConfigObject: Deactivate() called for config object 'notification' with type 'NotificationComponent'.
[2024-11-13 15:17:26 +0100] information/NotificationComponent: 'notification' stopped.
[2024-11-13 15:17:26 +0100] information/ConfigObject: Deactivate() called for config object 'icingadb' with type 'IcingaDB'.
[2024-11-13 15:17:26 +0100] information/IcingaDB: Flushing history data buffer to Redis.
[2024-11-13 15:17:28 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:28 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 2 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:28 +0100] information/Checkable: Checkable 'test' has 2 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-11-13 15:17:28 +0100] information/Notification: Sending 'Problem' notification 'test!problem' for user 'icingaadmin'
IcingaDB::Checkable::OnNotificationSentToAllUsers.
IcingaDB::SendSentNotification dropped a notification <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
[2024-11-13 15:17:28 +0100] information/Notification: Completed sending 'Problem' notification 'test!problem' for checkable 'test' and user 'icingaadmin' using command 'send'.
[2024-11-13 15:17:28 +0100] critical/IcingaDB: history: 2 queries failed (attempt #3) while we're about to shut down. Giving up and discarding additional 8 queued history queries.
[2024-11-13 15:17:28 +0100] information/IcingaDB: 'icingadb' stopped.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'icingaadmin' with type 'User'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test4' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test3' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test1' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test2' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'icingadb' with type 'CheckCommand'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test1!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test2!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test4!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test3!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test3!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test2!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test4!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test1!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'send' with type 'NotificationCommand'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'app' with type 'IcingaApplication'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Dumping program state to file '/Users/yhabteab/Workspace/icinga2/prefix/var/lib/icinga2/icinga2.state'
[2024-11-13 15:17:28 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:28 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something
[2024-11-13 15:17:28 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:28 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something
[2024-11-13 15:17:29 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:29 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something <<<<<<<<<<<<< is discarded
[2024-11-13 15:17:29 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:29 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something <<<<<<<<<<<< is discarded
[2024-11-13 15:17:29 +0100] information/IcingaApplication: Icinga has shut down.
It is possible that the issue you are experiencing may be resolved by the proposed PR #10191, but not because of the checker is stopped last, rather because of the changes to IDO and the Notification class. Therefore, I guess it makes sense to stop IDO, IcingaDB and the Notification component last, as well as to start them before the other objects, as is the case now.
This did indeed alleviate the problem, but it still does not completely solve it.
This did indeed alleviate the problem, but it still does not completely solve it.
Would you mind sharing a few more details on what you've done and what isn't still fixed? :)
Therefore, I guess it makes sense to stop IDO, IcingaDB and the Notification component last, as well as to start them before the other objects, as is the case now.
I took this suggestion and changed the activation_priority
of them to -50
for the ido/icingadb and to -25
for the NotificationComponent. I only experienced a single lost statechange (and subsequent missing notification).
I had also a look on the CheckerComponent::ObjectHandler function and saw that if an object was paused, it was erased from m_IdleCheckables and m_PendingCheckables. For testing I cleared both before the shutdown and have not seen any problems since, but I'm still testing. I hope I understood the use of runtimeRemoved
correctly?
void CheckerComponent::Stop(bool runtimeRemoved)
{
{
std::unique_lock<std::mutex> lock(m_Mutex);
m_Stopped = true;
if (!runtimeRemoved) {
m_IdleCheckables.clear();
m_PendingCheckables.clear();
}
m_CV.notify_all();
}
m_ResultTimer->Stop(true);
m_Thread.join();
Log(LogInformation, "CheckerComponent")
<< "'" << GetName() << "' stopped.";
ObjectImpl<CheckerComponent>::Stop(runtimeRemoved);
}
I took this suggestion and changed the
activation_priority
of them to-50
for the ido/icingadb and to-25
for the NotificationComponent. I only experienced a single lost statechange (and subsequent missing notification).
Have you verified in the logs whether these objects were stopped last? And how did you detect the missing state change? You can also add some custom logs to trace the execution of the check without having to enable debug logs for all components.
I just quickly did some poor man's patching and testing, and couldn't find any ignored events from Icinga DB.
diff --git a/lib/base/configobject.cpp b/lib/base/configobject.cpp
index 4317771d1..4435456f4 100644
--- a/lib/base/configobject.cpp
+++ b/lib/base/configobject.cpp
@@ -586,6 +586,13 @@ void ConfigObject::StopObjects()
std::vector<Type::Ptr> types = Type::GetAllTypes();
std::sort(types.begin(), types.end(), [](const Type::Ptr& a, const Type::Ptr& b) {
+ if (a->GetName() == "IcingaDB" || a->GetName() == "NotificationComponent") {
+ return false;
+ }
+ if (b->GetName() == "IcingaDB" || b->GetName() == "NotificationComponent") {
+ return true;
+ }
+
if (a->GetActivationPriority() > b->GetActivationPriority())
return true;
return false;
@@ -599,7 +606,7 @@ void ConfigObject::StopObjects()
for (const ConfigObject::Ptr& object : dtype->GetObjects()) {
#ifdef I2_DEBUG
- Log(LogDebug, "ConfigObject")
+ Log(LogInformation, "ConfigObject")
<< "Deactivate() called for config object '" << object->GetName() << "' with type '" << type->GetName() << "'.";
I hope I understood the use of
runtimeRemoved
correctly?
That parameter indicates whether a configuration object was deleted at runtime, e.g. via the REST API or via a cluster message from some other endpoint. It is unlikely that this will happen for the Checker component though, as it is typically managed in configuration files and not through the API.
Have you verified in the logs whether these objects were stopped last?
Yes, I did as you suggested in the comment and logged the deactivate on LogInformation.
And how did you detect the missing state change?
As always by looking at the database.
That parameter indicates whether a configuration ...
Thanks, exactly as I expected. With the last fix I posted, I can no longer see any missed State changes. I have seen one lost notifications, which is a problem, as we rely on the notifications to trigger a webhook, but that might have effectively just been the db connection, as I can't replicate it anymore.
@yhabteab Thank you for your insight. It was really helpful to get at the bottom of this bug. With the changes I have made to the PR, this issue does no longer occur in my test environment. If there are any concerns left, please let me know and I'll do my best to address them.
Describe the bug
If an object has a state change during an icinga2 restart (e.g. during a deploy), it is sometimes not written to the database and does not trigger the notifications.
To Reproduce
icingacli director basket restore < icinga-lost-statechange-basket.json
icingacli director config deploy
With that configuration running, deploy icinga2 a few times:
icingacli director config deploy --force --wait
Soon there will be state changes in the state history that should not be possible:
In this case, the service went from hard warning into soft warning. The soft warning history says that the last state was Ok, but that was never written into the history.
To find lost state histories quicker I used the following script: dropped_state_query.tar.gz
It needs as parameters the endpoint, user and password. If the db is postgres, it can be run with the
--postgres
flag.Expected behavior
I expect icinga2 to not loose state changes like that.
Your Environment
Include as many relevant details about the environment you experienced the problem in
icinga2 --version
):Copyright (c) 2012-2024 Icinga GmbH (https://icinga.com/) License GPLv2+: GNU GPL version 2 or later https://gnu.org/licenses/gpl2.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.
Build information: Compiler: GNU 8.5.0 Build host: staging5591master OpenSSL version: OpenSSL 1.1.1k FIPS 25 Mar 2021
System information: Platform: Red Hat Enterprise Linux Platform version: 8.10 (Ootpa) Kernel: Linux Kernel version: 4.18.0-553.el8_10.x86_64 Architecture: x86_64
Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb2 journald opentsdb perfdata syslog Enabled features: api checker ido-mysql influxdb livestatus mainlog notification
Icinga Web 2 NetEye release 4.39 (Traditional bock)
PHP Version 7.4.33 MODULE VERSION analytics 1.58.0 auditlog 1.15.1 cube 1.1.0 customproblemview 0.0.0 director 1.11.1 geomap 1.22.0 grafana 1.4.2 neteye 1.155.0-1 host2servicedetailview 1.4.0 idoreports 0.10.1 incubator 0.22.0 ipl v0.5.0 lampo 1.2.2 leafletjs 1.9.4 loginaudit 0.0.1 mapDatatype 0.1.0 monitoring 2.10.5 monitoringview 1.7.0 nagvis 1.1.1 pdfexport 0.10.2 reactbundle 0.9.0 reporting 1.0.0 shutdownmanager 0.0.0 srwebbackend 0.0.0 tornado 2.19.2 update 1.44.1-2
[2024-09-30 14:45:34 +0200] information/cli: Icinga application loader (version: r2.14.2-1) [2024-09-30 14:45:34 +0200] information/cli: Loading configuration file(s). [2024-09-30 14:45:34 +0200] information/ConfigItem: Committing config item(s). [2024-09-30 14:45:34 +0200] information/ApiListener: My API identity: localhost.localdomain [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 NotificationComponent. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 LivestatusListener. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 CheckerComponent. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 ServiceGroup. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 902 Services. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 3 Zones. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 2 NotificationCommands. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 FileLogger. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 IcingaApplication. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 101 Hosts. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 Endpoint. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 6 ApiUsers. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 ApiListener. [2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 251 CheckCommands. [2024-09-30 14:45:34 +0200] information/ScriptGlobal: Dumping variables to file '/neteye/shared/icinga2/data/cache/icinga2/icinga2.vars' [2024-09-30 14:45:34 +0200] information/cli: Finished validating the configuration file(s).