Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Icinga Master Endpoint Loadbalancing Ignoring Database Connectivity Issues -> Icinga becomes unusable #10047

Closed BTMichel closed 1 month ago

BTMichel commented 7 months ago

Describe the bug

In a distributed monitoring setup with two master nodes, Icinga chooses one node as the "Active Endpoint" (shown in icingaweb2 under _https:///monitoring/health/info_). This active endpoint updates the icinga2 database and also updates the "Last Status Update" (likewise shown under _https:///monitoring/health/info_). If this active endpoint cannot reach the database, updates are obviously not possible: the message "The monitoring backend 'icinga' is not running" is shown in icingaweb2, and check results are no longer written to the database.

This is of course fine and expected if the database itself is unreachable. The issue arises when only one master cannot reach the database, for example because of network issues, while the other master can still reach it just fine. If the master that cannot reach the database is also the "Active Endpoint", that is a big problem, because Icinga basically stops working in that case; it does not seem to actively check the database connectivity of the active endpoint.

We experienced exactly this issue today and could only bring Icinga back to life by stopping the icinga2 systemd service on the active endpoint. After that, Icinga promptly switched the active endpoint to the other master node and worked properly again. This raised questions, since we had no other way of switching the active endpoint; only stopping the affected Icinga master had an effect, which we did not expect.

We expected the Icinga master load balancing to switch the active endpoint to the other master in case of connectivity issues, since it should be clear that this endpoint cannot perform its job without database connectivity. Icinga could at least try a fallback to the other master in that case, since connectivity issues often affect only a single server and standard load balancing also falls back to the other node in such cases.

To Reproduce

  1. Set up a distributed Icinga 2 environment with two master nodes
  2. Configure icingaweb2 in that cluster
  3. Set up an external database
  4. After setup, check the following URL to see which master is the "Active Endpoint" -> _https:///monitoring/health/info_
  5. Block network connectivity from the active endpoint to the database, for example using firewalld, so that connections to the database time out or fail
  6. Wait at least 1 minute, then check the icingaweb2 WebUI to verify that the banner "The monitoring backend 'icinga' is not running" is shown
  7. Verify that the active endpoint did in fact not switch to the other master, even though it cannot perform its job -> _https:///monitoring/health/info_

Expected behavior

When the current "Active Endpoint" is not able to connect to the database, load balancing should kick in and fall back to the other master node to see whether that node can connect to the database. If that node cannot connect either, there is of course nothing else to try and there is a general issue with the database, but not trying at all leads to severe issues in production environments.

Screenshots

No screenshots are necessary.

Your Environment

Include as many relevant details about the environment you experienced the problem in

//ICINGA MASTER HOSTS

object Endpoint "master-host-1" { host = "" port = "5665" }

object Endpoint "master-host-2" { host = "" port = "5665" }

//ICINGA MASTER ZONE

object Zone "master-zone" { endpoints = [ "master-host-1","master-host-2" ] }

//ICINGA SATELLITES

object Endpoint "satellite-1" { host = "" port = "5665" }

object Endpoint "satellite-2" { host = "" port = "5665" }

object Endpoint "satellite-3" { host = "" port = "5665" }

object Endpoint "satellite-4" { host = "" port = "5665" }

//ICINGA SATELLITE ZONE

object Zone "satellite-zone-1" { endpoints = [ "satellite-1","satellite-2" ] parent = "master-zone" }

object Zone "satellite-zone-2" { endpoints = [ "satellite-3","satellite-4" ] parent = "master-zone" }

//ICINGA GLOBAL ZONE

object Zone "global-templates" { global = true }

object Zone "director-global" { global = true }

object Zone "windows-commands" { global = true }



Additional context

No additional context is necessary currently.
fabiankleint commented 7 months ago

We've recently been affected multiple times by this issue in our production environment and have had to invest a lot of manpower to get our systems to operate as usual. Looking forward to a fix in the near future.

Al2Klimov commented 5 months ago

You could also try to disable IDO HA as a workaround, so that both nodes write to the database. https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#high-availability-with-db-ido

BTMichel commented 3 months ago

@Al2Klimov

You could also try to disable IDO HA as a workaround, so that both nodes write to the database. https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#high-availability-with-db-ido

Thank you very much for your input on this issue. Do you have any additional info on possible issues/drawbacks of disabling the HA feature this way? The documentation does not elaborate much on this, and we would be making this change in a large production environment, where issues would have a big impact.

Al2Klimov commented 3 months ago

Well, you'll double the write load on your DB. If in doubt, set up a test DB first and create an additional IdoMysqlConnection on each master with HA disabled. Then watch your test DB.
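
For illustration, a minimal sketch of what such an additional, non-HA test connection could look like (the object name, host, database and credentials below are placeholders; enable_ha is the attribute that disables the HA election for this connection):

object IdoMysqlConnection "ido-mysql-test" {
  host = "test-db.example.com"   // placeholder: dedicated test database host
  database = "icinga2_test"      // placeholder: separate schema for the test DB
  user = "icinga2"
  password = "..."
  enable_ha = false              // with HA disabled, this connection stays active on both masters
}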

yhabteab commented 3 months ago

Hi @BTMichel and all the others, first of all thanks for reporting.

Icinga could at least try a fallback to the other master in that case, since connectivity issues often affect only a single server and standard load balancing also falls back to the other node in such cases

Icinga 2 should/does automatically fall back to the other master as per the configured failover_timeout, which happens to be 30s by default. So my question is: did any of you actually wait for at least 30s, or whatever number of seconds you have configured for your IDO, before manually restarting the Icinga 2 service? If not and waiting for the default timeout of 30s is not an option for you, ~I suggest changing the failover timeout to a lower value that suits you~ (the minimum failover timeout is 30s).
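
For reference, the failover timeout is configured per IdoMysqlConnection object; a minimal sketch (host and credentials are placeholders), keeping in mind that values below the 30s default/minimum are not accepted:

object IdoMysqlConnection "ido-mysql" {
  host = "db.example.com"        // placeholder database host
  database = "icinga2"
  user = "icinga2"
  password = "..."
  enable_ha = true               // HA enabled: only the active endpoint writes to the database
  failover_timeout = 30s         // default and minimum value
}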

BTMichel commented 3 months ago

Thanks for coming back to this topic @Al2Klimov and @yhabteab! 👍 @yhabteab we did actually wait much longer than the default failover_timeout: our external monitoring alerted about 10 minutes after the database connection stopped working that the cluster status was unhealthy, and after that, while our standby team tried to find the root cause of the monitoring issues, quite some more time went by. Since we do not have a failover_timeout set in the IDO (so the default of 30s should apply), this auto-fallback feature does not seem to work correctly.

Edit: Looking at the documentation:

If the instance with the active DB IDO connection dies, the HA functionality will automatically elect a new DB IDO master.

Maybe this auto-failover mechanism is designed to fail over only if the node that currently is the master stops responding entirely? This sounds as if a timeout connecting to the database would not force a failover, only issues on the node itself would.

yhabteab commented 3 months ago

Can you please share some logs from both of your masters for the time when Icinga 2 did not automatically fail over after the default 30s? Does the active endpoint even notice that it cannot reach the database and complain about it? Also, the passive endpoint should actually try to take over responsibility every 10s and cancel the attempt if it thinks that the other endpoint is still active and writing to the database, e.g. with [2024-08-05 14:01:00 +0000] information/IdoMysqlConnection: Last update by endpoint 'master2' was ... ago ... Retrying.

BTMichel commented 3 months ago

@yhabteab unfortunately, because of log rotation and because the last time this issue occurred was (fortunately) in April, I no longer have those specific logs at hand. I can however provide you with the following information:

  1. This was the view visible under /monitoring/health/info during the first half of the investigation; it shows that the active node was not switched for at least 25 minutes, as the "Last Status Update" indicates: [screenshot]
  2. On the node which was not able to contact the database, the director service also had logs like the following, while the other node did not have such logs, since it was able to contact the database just fine:
    Apr 19 03:06:53 <MASTER1> icingadirector: Zend_Db_Adapter_Exception in /usr/share/icinga-php/vendor/vendor/shardj/zf1-future/library/Zend/Db/Adapter/Pdo/Abstract.php:171 with message: SQLSTATE[HY000] [2002] Connection timed out <- PDOException in /usr/share/icinga-php/vendor/vendor/shardj/zf1-future/library/Zend/Db/Adapter/Pdo/Abstract.php:145 with message: SQLSTATE[HY000] [2002] Connection timed out
  3. Regarding your info that there should be multiple occurrences of log lines like Last update by endpoint 'master2' was ... ago ... Retrying. in the icinga2 logs:
    -> This is actually very interesting. On our config master (which typically is not the active Icinga endpoint, i.e. not the node that writes to the database) I was not able to find any log lines like that. Only on the other node, which is the active Icinga endpoint 99% of the time, was I able to find a few occurrences of those lines, but only from 2 days ago and nowhere else. Is this expected behavior or should there normally be a line like this every 10 seconds?
    [2024-08-02 02:18:25 +0200] information/IdoMysqlConnection: Last update by endpoint '<config_master>' was 3.17505s ago (< failover timeout of 30s). Retrying.
    [2024-08-02 02:18:55 +0200] information/IdoMysqlConnection: Last update by endpoint '<config_master>' was 33.1017s ago. Taking over 'ido-mysql' in HA zone '<zone_name>'.
    [2024-08-03 02:23:32 +0200] information/IdoMysqlConnection: Last update by endpoint '<config_master>' was 27.3017s ago (< failover timeout of 30s). Retrying.
yhabteab commented 2 months ago

Hi @BTMichel, sorry for the delay!

Is this expected behavior or should there normally be a line like this every 10 seconds?

I was expecting the passive node to monitor the icinga_programstatus table every 10s, but after testing this on my end, that is not the case. It appears instead that the IDO feature on the passive node is paused entirely, i.e. it closes all its database connections and stops its timers once the IDO feature on the other node becomes active. In other words, as long as the IDO object on the active node is not paused, the IDO feature on the passive node is effectively dead. The active node, however, does not give up trying to reconnect to the database even after the failover timeout.

I was actually expecting that when the active node can't establish a working database connection within the failover timeout, it would simply pause itself to let the IDO feature on the other node resume, but that's not the case, and I don't know if we're going to change that now. CC @julianbrost, @Al2Klimov

Passive node:

[2024-09-06 14:11:48 +0000] information/IdoMysqlConnection: Pending queries: 73605 (Input: 888/s; Output: 2716/s)
[2024-09-06 14:11:49 +0000] information/DbConnection: Pausing IDO connection: ido-mysql
[2024-09-06 14:12:01 +0000] information/ApiListener: Replayed 61960 messages.
[2024-09-06 14:12:12 +0000] information/IdoMysqlConnection: Disconnected from 'ido-mysql' database 'icinga2'.
[2024-09-06 14:12:12 +0000] information/IdoMysqlConnection: 'ido-mysql' paused.

Active node:

[2024-09-06 14:30:32 +0000] critical/IdoMysqlConnection: Exception during database operation: Verify that your database is operational!
[2024-09-06 14:30:43 +0000] critical/IdoMysqlConnection: Connection to database 'icinga2' with user 'icinga2' on 'localhost:3306' failed: "Access denied for user 'icinga2'@'localhost' (using password: YES)"
Context:
    (0) Reconnecting to MySQL IDO database 'ido-mysql'

[2024-09-06 14:30:43 +0000] critical/IdoMysqlConnection: Exception during database operation: Verify that your database is operational!
[2024-09-06 14:30:54 +0000] critical/IdoMysqlConnection: Connection to database 'icinga2' with user 'icinga2' on 'localhost:3306' failed: "Access denied for user 'icinga2'@'localhost' (using password: YES)"
Context:
    (0) Reconnecting to MySQL IDO database 'ido-mysql'

[2024-09-06 14:30:54 +0000] critical/IdoMysqlConnection: Exception during database operation: Verify that your database is operational!

Programstatus table:

MariaDB [icinga2]> SELECT endpoint_name, is_currently_running, status_update_time FROM icinga_programstatus;
+---------------+----------------------+---------------------+
| endpoint_name | is_currently_running | status_update_time  |
+---------------+----------------------+---------------------+
| master2       |                    0 | 2024-09-06 14:11:44 |
+---------------+----------------------+---------------------+
BTMichel commented 2 months ago

@yhabteab thanks for analyzing the issue! So there actually is a bug in the IDO HA feature.

I don't know if we're going to change that now.

By saying that, you refer to the fact that the IDO feature is deprecated and you are not sure if this will be fixed, since IcingaDB is the de facto standard going forward?

Is this certain, or could it be discussed with other Icinga team members, since this bug can cause a total loss of monitoring functionality?

If fixing the IDO bug is not an option: since we have to make an informed decision on whether we need to migrate to IcingaDB in the near future or whether we should wait for a bug fix in the IDO, could you verify whether or not this bug is also present when using IcingaDB?

yhabteab commented 2 months ago

I don't know if we're going to change that now.

By saying that, you refer to the fact that the IDO feature is deprecated and you are not sure if this will be fixed, since IcingaDB is the de facto standard going forward?

Exactly! This bug is a shortcoming of the IDO HA implementation/design and we won't be changing this behaviour any time soon.

since we have to make an informed decision on whether we need to migrate to IcingaDB in the near future or whether we should wait for a bug fix in the IDO

When you have the option to migrate to Icinga DB, then I would definitely recommend it as well - like I said, we don't plan to fix this, since design flaws like this are exactly why Icinga DB was introduced in the first place.

Could you verify whether or not this bug is also present when using IcingaDB?

The Icinga DB HA feature works completely differently from the IDO and does not suffer from these kinds of problems.
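
For anyone planning the migration: on the Icinga 2 side, the Icinga DB setup is essentially the icingadb feature pointing at Redis, while the database writes and the HA handover are handled by the separate icinga-db daemon. A minimal sketch of the feature object (the values shown are the usual Redis defaults, adjust as needed):

object IcingaDB "icingadb" {
  host = "127.0.0.1"   // Redis instance the Icinga 2 masters write to
  port = 6380          // default port of the Icinga DB Redis
}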

BTMichel commented 1 month ago

@yhabteab thank you very much for checking and confirming that it is in fact a bug; now we know for sure that we need to migrate to IcingaDB to avoid this in the future 👍 I'll mark this issue as closed, since all questions have been answered. I appreciate it! 🙂