Icinga / icingadb

Icinga configuration and state database supporting multiple environments
https://icinga.com
GNU General Public License v2.0

Explicitly reset responsible of expired icingadb_instance rows #424

Closed by julianbrost 2 years ago

julianbrost commented 2 years ago

In HA setups, when stopping icinga2 (but keeping icingadb running) or killing icingadb with SIGKILL, it's possible to end up with the following state in the database:

+--------------------------------------------+-------------+-------------------------------+
| environment_id                             | responsible | FROM_UNIXTIME(heartbeat/1000) |
+--------------------------------------------+-------------+-------------------------------+
| 0x6CC4013ECE4FFDBB275A4A763460507D17203146 | y           | 2021-12-06 14:08:03.8610      |
| 0x6CC4013ECE4FFDBB275A4A763460507D17203146 | y           | 2021-12-06 14:21:10.2890      |
+--------------------------------------------+-------------+-------------------------------+

So there is a left-over row with responsible='y' but with an expired heartbeat.

  1. If the icingadb process is still running, it should actively retract by writing responsible='n' to its own instance (not sure why this doesn't happen already)
  2. On takeover, the responsible should explicitly be reset for other expired rows
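A minimal sketch of point 2, not the actual Icinga DB implementation (which is written in Go and targets MySQL/PostgreSQL). It uses Python's built-in sqlite3 with a simplified icingadb_instance table. The id column, the take_over helper and the 60-second HEARTBEAT_TIMEOUT_MS are assumptions made for illustration only.

```python
import sqlite3

# Assumed timeout after which a heartbeat counts as expired (illustrative value).
HEARTBEAT_TIMEOUT_MS = 60_000


def take_over(conn, instance_id, environment_id, now_ms):
    """Hypothetical helper: make instance_id responsible and reset expired rows."""
    with conn:  # both statements run in a single transaction
        # Explicitly retract responsibility from other instances in the same
        # environment whose heartbeat has expired.
        conn.execute(
            "UPDATE icingadb_instance SET responsible = 'n' "
            "WHERE environment_id = ? AND id != ? "
            "AND responsible = 'y' AND heartbeat < ?",
            (environment_id, instance_id, now_ms - HEARTBEAT_TIMEOUT_MS),
        )
        # Mark the instance that takes over as responsible.
        conn.execute(
            "UPDATE icingadb_instance SET responsible = 'y', heartbeat = ? "
            "WHERE id = ?",
            (now_ms, instance_id),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE icingadb_instance ("
    "id TEXT PRIMARY KEY, environment_id TEXT, "
    "responsible TEXT, heartbeat INTEGER)"
)
now_ms = 10_000_000  # arbitrary "current time" in milliseconds
conn.executemany(
    "INSERT INTO icingadb_instance VALUES (?, ?, ?, ?)",
    [
        ("a", "env", "y", now_ms - 786_428),  # left-over row, heartbeat expired
        ("b", "env", "n", now_ms - 1_000),    # instance about to take over
    ],
)

take_over(conn, "b", "env", now_ms)
rows = conn.execute(
    "SELECT id, responsible FROM icingadb_instance ORDER BY id"
).fetchall()
print(rows)  # [('a', 'n'), ('b', 'y')]
```

After the takeover, only one row per environment is left with responsible='y', which is the state described as the cleaner fix.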

At the moment, this bug can result in a situation where icingadb-web shows the "Icinga is not running" warning even though everything is fine, as it seems to only consider responsible but not heartbeat when selecting the row for displaying status information. So this could also be fixed in icingadb-web by using WHERE responsible = 'y' ORDER BY heartbeat DESC LIMIT 1, but I think it's cleaner to fix it here and only have one row with responsible='y' in the first place.
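To illustrate the web-side workaround, here is a small sketch that runs the suggested query against an in-memory SQLite table containing two responsible='y' rows, as in the state shown above. This is not icingadb-web code (which is PHP). The simplified schema, the id column, the status_row helper and the heartbeat values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE icingadb_instance ("
    "id TEXT PRIMARY KEY, responsible TEXT, heartbeat INTEGER)"
)
# Two rows both claim responsibility; heartbeats are arbitrary millisecond values.
conn.executemany(
    "INSERT INTO icingadb_instance VALUES (?, ?, ?)",
    [
        ("stale", "y", 1_000),    # left-over row with an expired heartbeat
        ("live", "y", 787_428),   # instance that is actually running
    ],
)


def status_row(conn):
    """Hypothetical helper: pick the responsible row with the newest heartbeat."""
    return conn.execute(
        "SELECT id, heartbeat FROM icingadb_instance "
        "WHERE responsible = 'y' ORDER BY heartbeat DESC LIMIT 1"
    ).fetchone()


print(status_row(conn))  # ('live', 787428)
```

Ordering by heartbeat makes the display robust against a stale row, but it leaves the inconsistent data in place, which is why resetting responsible in Icinga DB itself is still worthwhile.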

lippserd commented 2 years ago

We should fix this in Icinga DB and Web.