Closed BTMichel closed 1 month ago
We've recently been affected multiple times by this issue in our production environment and have had to invest a lot of manpower to get our systems to operate as usual. Looking forward to a fix in the near future.
You could also try to disable IDO HA, so that both nodes write to database, as a workaround. https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#high-availability-with-db-ido
@Al2Klimov
You could also try to disable IDO HA, so that both nodes write to database, as a workaround. https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#high-availability-with-db-ido
Thank you very much for your input on this issue. Do you have any additional info on possible issues/drawbacks of disabling the HA feature this way? The documentation does not elaborate much on this, and we would make this change in a large production environment, where issues would have a big impact.
Well, you'll double the write load on your DB. If in doubt, set up a test DB first and create an additional IdoMysqlConnection on each master w/ HA disabled. Then watch your test DB.
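A minimal sketch of such an additional, non-HA connection object (object name, host and credentials are placeholders, not taken from this setup):

object IdoMysqlConnection "ido-mysql-test" {
  // Hypothetical test database; adjust host and credentials to your environment.
  host = "test-db.example.com"
  port = 3306
  user = "icinga2"
  password = "changeme"
  database = "icinga2_test"

  // With HA disabled, both masters keep writing to this database.
  enable_ha = false
}

Watching the write load on that test database for a while should give a realistic picture of the doubled load before touching the production connection.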
Hi @BTMichel and all the others, first of all thanks for reporting.
Icinga could at least try a fallback to the other master in that case, since connectivity issues often affect only a single server, and standard load balancing also falls back to the other node in such cases
Icinga 2 should/does automatically fall back to the other master as per the configured failover_timeout, which happens to be 30s by default. So my question is, did any of you actually wait for at least 30s, or whatever number of seconds you have configured for your IDO, before manually restarting the Icinga 2 service? If not, and waiting for the default timeout of 30s is not an option for you, ~I suggest changing the failover timeout to a lower value that suits you~ (the minimum failover timeout is 30s).
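For reference, failover_timeout is an attribute of the IDO connection object itself; a sketch of where it would go (values are illustrative; the object usually lives in /etc/icinga2/features-available/ido-mysql.conf):

object IdoMysqlConnection "ido-mysql" {
  host = "localhost"
  user = "icinga2"
  password = "changeme"
  database = "icinga2"

  // HA failover timeout; per the discussion above, 30s is both the
  // default and the minimum accepted value.
  failover_timeout = 30s
}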
Thanks for coming back to this topic @Al2Klimov and @yhabteab! 👍 @yhabteab we did actually wait much longer than the default failover_timeout: our external monitoring alerted that the cluster status was unhealthy about 10 minutes after the database connection stopped working. After that, when our standby team tried to find the root cause of the monitoring issues, quite some more time went by. Since we do not have a failover_timeout set in the IDO (so the 30s default applies), this auto fallback feature does not seem to work correctly.
Edit: Looking at the documentation:
If the instance with the active DB IDO connection dies, the HA functionality will automatically elect a new DB IDO master.
Maybe this auto failover mechanism is designed to fail over only if the node that is currently the master stops responding entirely? It sounds as if a timeout towards the database alone would not force a failover, only issues on the node itself.
Can you please share some logs from both of your masters for the time where Icinga 2 does not automatically fail over after the default 30s? Does the active endpoint even notice that it cannot reach the database and complain about it? Also, the passive endpoint should actually try to take over the responsibility every 10s and cancel the attempt if it thinks that the other endpoint is still active and is writing to the database, e.g. with [2024-08-05 14:01:00 +0000] information/IdoMysqlConnection: Last update by endpoint 'master2' was ... ago ... Retrying.
@yhabteab unfortunately, because of log rotation and because the last time this issue occurred was (fortunately) in April, I no longer have those specific logs at hand. I can however provide you with the following information:
Apr 19 03:06:53 <MASTER1> icingadirector: Zend_Db_Adapter_Exception in /usr/share/icinga-php/vendor/vendor/shardj/zf1-future/library/Zend/Db/Adapter/Pdo/Abstract.php:171 with message: SQLSTATE[HY000] [2002] Connection timed out <- PDOException in /usr/share/icinga-php/vendor/vendor/shardj/zf1-future/library/Zend/Db/Adapter/Pdo/Abstract.php:145 with message: SQLSTATE[HY000] [2002] Connection timed out
While the other node did not have such logs, since it was able to contact the database just fine.

Last update by endpoint 'master2' was ... ago ... Retrying.
[2024-08-02 02:18:25 +0200] information/IdoMysqlConnection: Last update by endpoint '<config_master>' was 3.17505s ago (< failover timeout of 30s). Retrying.
[2024-08-02 02:18:55 +0200] information/IdoMysqlConnection: Last update by endpoint '<config_master>' was 33.1017s ago. Taking over 'ido-mysql' in HA zone '<zone_name>'.
[2024-08-03 02:23:32 +0200] information/IdoMysqlConnection: Last update by endpoint '<config_master>' was 27.3017s ago (< failover timeout of 30s). Retrying.
Hi @BTMichel, sorry for the delay!
Is this expected behavior or should there normally be a line like this every 10 seconds?
I was expecting the passive node to monitor the icinga_programstatus table every 10s, but after testing this on my end, this is not the case. It appears, instead, that the IDO feature on the passive node is getting totally paused, i.e. it closes all its database connections and stops its timers when the IDO feature on the other node becomes active. Meaning, as long as the IDO object on the active node is not paused, the IDO feature on the passive node is effectively dead. However, the active node does not give up trying to reconnect to the database even after the fail-over timeout.
I actually was expecting that when it can't establish a functional database connection within the failover timeout, it would simply pause itself to let the IDO feature on the other node resume, but that's not the case, and I don't know if we're going to change that now. CC @julianbrost, @Al2Klimov
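For anyone who wants to verify this behaviour themselves, a rough reproduction sketch (assuming the active master reaches the database over TCP port 3306 and that you can run commands as root on it; the logs below were produced by breaking the connection differently, as the 'Access denied' errors show) is to cut the database connection on the active master only and watch both masters' logs for well over the failover timeout:

# On the ACTIVE master only: simulate a network problem towards the
# database (3306 is the default MySQL/MariaDB port; adjust if needed).
iptables -A OUTPUT -p tcp --dport 3306 -j REJECT

# On BOTH masters, follow the IDO-related log lines for well over 30s:
tail -f /var/log/icinga2/icinga2.log | grep IdoMysqlConnection

# Remove the rule again once you are done:
iptables -D OUTPUT -p tcp --dport 3306 -j REJECT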
Passive node:
[2024-09-06 14:11:48 +0000] information/IdoMysqlConnection: Pending queries: 73605 (Input: 888/s; Output: 2716/s)
[2024-09-06 14:11:49 +0000] information/DbConnection: Pausing IDO connection: ido-mysql
[2024-09-06 14:12:01 +0000] information/ApiListener: Replayed 61960 messages.
[2024-09-06 14:12:12 +0000] information/IdoMysqlConnection: Disconnected from 'ido-mysql' database 'icinga2'.
[2024-09-06 14:12:12 +0000] information/IdoMysqlConnection: 'ido-mysql' paused.
Active node:
[2024-09-06 14:30:32 +0000] critical/IdoMysqlConnection: Exception during database operation: Verify that your database is operational!
[2024-09-06 14:30:43 +0000] critical/IdoMysqlConnection: Connection to database 'icinga2' with user 'icinga2' on 'localhost:3306' failed: "Access denied for user 'icinga2'@'localhost' (using password: YES)"
Context:
(0) Reconnecting to MySQL IDO database 'ido-mysql'
[2024-09-06 14:30:43 +0000] critical/IdoMysqlConnection: Exception during database operation: Verify that your database is operational!
[2024-09-06 14:30:54 +0000] critical/IdoMysqlConnection: Connection to database 'icinga2' with user 'icinga2' on 'localhost:3306' failed: "Access denied for user 'icinga2'@'localhost' (using password: YES)"
Context:
(0) Reconnecting to MySQL IDO database 'ido-mysql'
[2024-09-06 14:30:54 +0000] critical/IdoMysqlConnection: Exception during database operation: Verify that your database is operational!
Programstatus table:
MariaDB [icinga2]> SELECT endpoint_name, is_currently_running, status_update_time FROM icinga_programstatus;
+---------------+----------------------+---------------------+
| endpoint_name | is_currently_running | status_update_time |
+---------------+----------------------+---------------------+
| master2 | 0 | 2024-09-06 14:11:44 |
+---------------+----------------------+---------------------+
@yhabteab thanks for analyzing the issue! So there is actually a bug in the IDO HA feature.
I don't know if we're going to change that now.
By saying that, do you refer to the fact that the IDO feature is deprecated and you are not sure if this will be fixed, since IcingaDB is the de facto standard going forward?
Is this certain or could this be discussed with other Icinga team members, since this bug can cause total loss of monitoring functionality?
If fixing the IDO bug is not an option: since we have to make an informed decision whether we need to migrate to IcingaDB in the near future or if we should wait for a bug fix in IDO, could you verify whether or not this bug is also present when using IcingaDB?
I don't know if we're going to change that now.
By saying that, do you refer to the fact that the IDO feature is deprecated and you are not sure if this will be fixed, since IcingaDB is the de facto standard going forward?
Exactly! This bug is a shortcoming of the IDO HA implementation/design and we won't be changing this behaviour any time soon.
Since we have to make an informed decision whether we need to migrate to IcingaDB in the near future or if we should wait for a bug fix in IDO;
If you have the option to migrate to Icinga DB, then I would definitely recommend it as well - like I said, we don't plan to fix this, since design flaws like this are exactly why Icinga DB was introduced in the first place.
Could you verify whether or not this bug is also present when using IcingaDB?
The Icinga DB HA feature works completely differently to the IDO and does not suffer from these kinds of problems.
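For orientation when planning the migration: Icinga DB is enabled as its own feature on both masters (icinga2 feature enable icingadb), and the database itself is written by the separate icingadb daemon rather than by Icinga 2. A minimal sketch of the feature object, with placeholder values pointing at the local icingadb-redis instance:

object IcingaDB "icingadb" {
  // Redis instance shared with the icingadb daemon;
  // 6380 is the default port of icingadb-redis.
  host = "127.0.0.1"
  port = 6380
}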
@yhabteab thank you very much for checking and confirming that it is in fact a bug; now we know for sure that we need to migrate to IcingaDB to avoid this in the future 👍 I'll mark this issue as closed, since all questions have been answered. I appreciate it! 🙂
Describe the bug
In a distributed monitoring setup with two monitoring master nodes, icinga chooses one node as the "Active Endpoint" (as seen in icingaweb2 under _https:///monitoring/health/info_).
This active icinga endpoint updates the icinga2 database and also updates the "Last Status Update" (also seen in icingaweb2 under _https:///monitoring/health/info_).
If this active endpoint cannot reach the database, updates are obviously not possible, the message "The monitoring backend 'icinga' is not running" is shown in icingaweb2, and check results are no longer written to the database.
This is of course just fine and expected if the database itself is not reachable. The issue arises when only one master cannot reach the database, for example because of network issues, while the other master can still reach it just fine. If the master which cannot reach the database is also the "Active Endpoint", that's a big issue, because icinga basically stops working in that case, since it does not seem to actively check the database connectivity of this active endpoint.
We experienced this exact issue today and could only bring icinga back to life by stopping the icinga2 systemd service on the active endpoint. After that, icinga promptly switched the active endpoint to the other master node and worked properly again. This raised questions, since we had no other way of switching the active endpoint; only stopping the affected icinga master had an effect, which we did not expect.
We expected icinga master load balancing to switch the active endpoint to the other master in case of connectivity issues, since it should be clear that this endpoint cannot perform its job without database connectivity. Icinga could at least try a fallback to the other master in that case, since connectivity issues often affect only a single server, and standard load balancing also falls back to the other node in such cases.
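A quick way to check whether the node currently shown as "Active Endpoint" still considers its IDO connection healthy is the Icinga 2 REST API status endpoint (a sketch; 'apiuser:apipassword' is a placeholder for an ApiUser with appropriate permissions):

# Run on the master shown as "Active Endpoint"; the response includes a
# "connected" flag for the ido-mysql connection object.
curl -k -s -u apiuser:apipassword \
  'https://localhost:5665/v1/status/IdoMysqlConnection'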
To Reproduce
Expected behavior
When the current "Active Endpoint" is not able to connect to the database, loadbalancing should kick in and try to fallback to the other master node, to see if that node can connect to the database. If that node also can not connect, there is of course nothing else to try and there is a general issue with the database, but not trying at all, is leading to huge issues in production environments.
Screenshots
No screenshots are necessary.
Your Environment
Version (icinga2 --version): r2.14.2-1
Enabled features (icinga2 feature list): api checker graphite ido-mysql mainlog notification
Config validation (icinga2 daemon -C):
[2024-04-19 10:49:59 +0200] information/cli: Icinga application loader (version: r2.14.2-1)
[2024-04-19 10:50:06 +0200] information/cli: Finished validating the configuration file(s).
//ICINGA MASTER HOSTS
object Endpoint "master-host-1" { host = ""
port = "5665"
}
object Endpoint "master-host-2" { host = ""
port = "5665"
}
//ICINGA MASTER ZONE
object Zone "master-zone" { endpoints = [ "master-host-1","master-host-2" ] }
//ICINGA SATELLITES
object Endpoint "satellite-1" { host = ""
port = "5665"
}
object Endpoint "satellite-2" { host = ""
port = "5665"
}
object Endpoint "satellite-3" { host = ""
port = "5665"
}
object Endpoint "satellite-4" { host = ""
port = "5665"
}
//ICINGA SATELLITE ZONE
object Zone "satellite-zone-1" { endpoints = [ "satellite-1","satellite-2" ] parent = "master-zone" }
object Zone "satellite-zone-2" { endpoints = [ "satellite-3","satellite-4" ] parent = "master-zone" }
//ICINGA GLOBAL ZONE
object Zone "global-templates" { global = true }
object Zone "director-global" { global = true }
object Zone "windows-commands" { global = true }