ITRS-Group / monitor-merlin

Module for Effortless Redundancy and Loadbalancing In Naemon
https://itrs-group.github.io/monitor-merlin/
GNU General Public License v2.0

Loadbalancing Irregularity #116

Open eschoeller opened 3 years ago

eschoeller commented 3 years ago

My apologies for posting this here - I was looking for the op5 mailing list (which I was subscribed to years ago) but it appears op5.org is gone ...

I am switching from a very old version of Nagios/Merlin ... probably 6 years old at least. I am seeing some behavior with the newer version that I do not understand. I am running Naemon 1.2.4-1 and pulled Merlin from git today.

In the past, the nagios node running a check would log the 'SERVICE ALERT' for that check. I had all three of our peered nagios/merlin machines syslogging to each other so I had a nice view of where checks were being run and what was happening.

With the new version, it appears certain events are being "echoed" by all the naemon daemons, like this:

Jul  3 21:41:31 node_a naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;WARNING;SOFT;4;SNMP WARNING - Supply Humidity *673*
Jul  3 21:41:31 node_c naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;WARNING;SOFT;4;SNMP WARNING - Supply Humidity *673*
Jul  3 21:41:31 node_b naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;WARNING;SOFT;4;SNMP WARNING - Supply Humidity *673*
Jul  3 21:42:32 node_a naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;OK;SOFT;5;SNMP OK - Supply Humidity 666
Jul  3 21:42:32 node_c naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;OK;SOFT;5;SNMP OK - Supply Humidity 666
Jul  3 21:42:32 node_b naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;OK;SOFT;5;SNMP OK - Supply Humidity 666

However, on some occasions the behavior is quite different:

Jul  3 21:04:12 node_a naemon: SERVICE NOTIFICATION SUPPRESSED: pdu-f10;Infeed Power Factor;Re-notification blocked for this problem.
Jul  3 21:04:12 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;58;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul  3 21:09:14 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;51;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul  3 21:09:14 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;59;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul  3 21:11:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;52;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.66
Jul  3 21:11:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;60;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.66
Jul  3 21:21:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;61;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul  3 21:21:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;53;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul  3 21:31:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;62;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul  3 21:31:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;54;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul  3 21:41:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;55;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul  3 21:41:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;63;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67

In the second example only "node_b" and "node_c" appear to be echoing these events, but what is additionally concerning is that the retry values do not sync up correctly, i.e. at 21:41:37 node_b thought this was the 55th retry while node_c thought it was the 63rd.

Here is the output of mon node status:

Total checks (host / service): 259 / 4780

#00 2/2:2 local ipc: ACTIVE - 0.000s latency
Uptime: 38m 43s. Connected: 38m 43s. Last alive: 0s ago
Host checks (handled, expired, total)   : 86, 0, 259 (33.20% : 33.20%)
Service checks (handled, expired, total): 1593, 0, 4780 (33.33% : 33.33%)

#01 1/2:2 peer node_b: ACTIVE - 0.000s latency - (ENCRYPTED)
Uptime: 38m 43s. Connected: 38m 43s. Last alive: 0s ago
Host checks (handled, expired, total)   : 86, 0, 259 (33.20% : 33.20%)
Service checks (handled, expired, total): 1593, 0, 4780 (33.33% : 33.33%)

#02 0/2:2 peer node_c: ACTIVE - 0.000s latency - (ENCRYPTED)
Uptime: 38m 45s. Connected: 38m 43s. Last alive: 0s ago
Host checks (handled, expired, total)   : 87, 0, 259 (33.59% : 33.59%)
Service checks (handled, expired, total): 1594, 0, 4780 (33.35% : 33.35%)

I looked through the troubleshooting notes and did get identical hashes from this command: mon node ctrl --type=peer -- mon oconf hash. However, my object.cache files do not have the same hash value. I looked through the files side-by-side (and with diff), and it looks like Naemon lists the 'members' of certain host/service/etc. groups in a randomized order, which throws the hashes off from each other.

I looked at the merlin database and found that the 'report_data' table had wildly different rows of data, which could have been the result of testing the system. So I truncated that table and started over ... now all three have roughly the same number of rows: 612, 619 and 618. But that hasn't really resolved the problem with what is getting logged by Naemon.

I ran a specific query for the device in question, "pdu-f10", in the report_data table, and all three databases have exactly the same info: 'timestamp' and 'retry' are all consistent, but the 'id' for each row is different (which I guess makes sense since it's auto-increment). So one Naemon node logs the right retry value, another logs a retry value that is less, and the third logs nothing at all.
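
For reference, the query was along these lines (trimmed to just the columns shown, for that one service):

select id, timestamp, host_name, service_description, state, hard, retry
from report_data
where host_name = 'pdu-f10' and service_description = 'Infeed Power Factor'
order by timestamp;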

+-----+------------+-----------+---------------------+-------+------+-------+
| id  | timestamp  | host_name | service_description | state | hard | retry |
+-----+------------+-----------+---------------------+-------+------+-------+
| 130 | 1625363424 | pdu-f10   | Infeed Power Factor |     1 |    0 |    36 |
| 160 | 1625363725 | pdu-f10   | Infeed Power Factor |     1 |    0 |    37 |
| 178 | 1625364029 | pdu-f10   | Infeed Power Factor |     1 |    0 |    38 |
| 193 | 1625364330 | pdu-f10   | Infeed Power Factor |     1 |    0 |    47 |
| 222 | 1625364632 | pdu-f10   | Infeed Power Factor |     1 |    0 |    48 |
| 286 | 1625364933 | pdu-f10   | Infeed Power Factor |     1 |    0 |    49 |
| 359 | 1625365235 | pdu-f10   | Infeed Power Factor |     1 |    0 |    50 |
| 384 | 1625365537 | pdu-f10   | Infeed Power Factor |     1 |    0 |    51 |
| 401 | 1625365838 | pdu-f10   | Infeed Power Factor |     1 |    0 |    52 |
| 411 | 1625366140 | pdu-f10   | Infeed Power Factor |     1 |    0 |    45 |
| 430 | 1625366646 | pdu-f10   | Infeed Power Factor |     1 |    0 |    54 |
| 456 | 1625366948 | pdu-f10   | Infeed Power Factor |     1 |    0 |    55 |
| 483 | 1625367249 | pdu-f10   | Infeed Power Factor |     1 |    0 |    56 |
| 499 | 1625367551 | pdu-f10   | Infeed Power Factor |     1 |    0 |    57 |
| 518 | 1625367852 | pdu-f10   | Infeed Power Factor |     1 |    0 |    50 |
| 531 | 1625368154 | pdu-f10   | Infeed Power Factor |     1 |    0 |    51 |
+-----+------------+-----------+---------------------+-------+------+-------+

Again my sincere apologies for posting this as an 'issue' and not a general support request elsewhere!

eschoeller commented 3 years ago

So I went ahead and blew away my retention.dat files on all three naemon machines, and then truncated the report_data and notification tables, to try and 'start over'. Things started to work in a more synchronized manner.

Then I shut off merlin and naemon on one of the machines for about 12 hours. As I would expect, the report_data table on the machine that wasn't running naemon became out of date. When I turned merlin and naemon back on, that machine started to 'catch up', and I watched its report_data table gradually converge with the other two machines. While this happened, the naemon log on the machine that had been down was flooded with SERVICE ALERT messages for each 'update'. Unfortunately, in the process that machine also started to blast out notifications for issues that had already been reported on earlier in the day - but only a certain subset of notifications (we had a very busy day of failures today).

Now, as it stands, all three seem to be synchronized back together, albeit just off by a couple of IDs in the report_data table:

| 11657 | 1625622927 | pdu-f10 | Infeed Power Factor |     1 |    0 |    84 |
| 11663 | 1625623228 | pdu-f10 | Infeed Power Factor |     1 |    0 |    85 |
| 11679 | 1625623834 | pdu-f10 | Infeed Power Factor |     1 |    0 |    86 |
+-------+------------+---------------------+---------------------+-------+------+-------+
375 rows in set (0.015 sec)

| 11657 | 1625622927 | pdu-f10 | Infeed Power Factor |     1 |    0 |    84 |
| 11663 | 1625623228 | pdu-f10 | Infeed Power Factor |     1 |    0 |    85 |
| 11679 | 1625623834 | pdu-f10 | Infeed Power Factor |     1 |    0 |    86 |
+-------+------------+---------------------+---------------------+-------+------+-------+
375 rows in set (0.015 sec)

| 11661 | 1625622927 | pdu-f10 | Infeed Power Factor |     1 |    0 |    84 |
| 11664 | 1625623228 | pdu-f10 | Infeed Power Factor |     1 |    0 |    85 |
| 11675 | 1625623834 | pdu-f10 | Infeed Power Factor |     1 |    0 |    86 |
+-------+------------+---------------------+---------------------+-------+------+-------+
375 rows in set (0.013 sec)
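
To be explicit, the 'start over' step above was just clearing these two tables on each node (table names as they appear in our merlin database), on top of removing retention.dat:

truncate table report_data;
truncate table notification;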

Do I have a configuration error somewhere? I was wondering whether having check_freshness enabled could be messing up merlin, or is that necessary for making the redundancy work?

eschoeller commented 3 years ago

I actually spoke too soon. I did find a particular service check that now appears to be de-synchronized:

Jul  6 20:57:58 node_c naemon: SERVICE ALERT: pdu-f8;Infeed Power Factor;WARNING;SOFT;97;WARNING - PDU F8 (CORE), Link_L1(BA) PowerFactor: 0.68
Jul  6 20:57:58 node_a naemon: SERVICE ALERT: pdu-f8;Infeed Power Factor;WARNING;SOFT;97;WARNING - PDU F8 (CORE), Link_L1(BA) PowerFactor: 0.68
Jul  6 20:57:58 node_b naemon: SERVICE ALERT: pdu-f8;Infeed Power Factor;WARNING;SOFT;73;WARNING - PDU F8 (CORE), Link_L1(BA) PowerFactor: 0.68

node_b was the machine that was not running naemon/merlin for 12 hours. Its report_data table seems OK in comparison to the others:

| 11864 | 1625626376 | pdu-f8 | Infeed Power Factor |     1 |    0 |    96 |
| 11879 | 1625626678 | pdu-f8 | Infeed Power Factor |     1 |    0 |    97 |
+-------+------------+--------------------+---------------------+-------+------+-------+
348 rows in set (0.014 sec)

| 11864 | 1625626376 | pdu-f8 | Infeed Power Factor |     1 |    0 |    96 |
| 11879 | 1625626678 | pdu-f8 | Infeed Power Factor |     1 |    0 |    97 |
+-------+------------+--------------------+---------------------+-------+------+-------+
348 rows in set (0.012 sec)

| 11860 | 1625626376 | pdu-f8 | Infeed Power Factor |     1 |    0 |    96 |
| 11875 | 1625626678 | pdu-f8 | Infeed Power Factor |     1 |    0 |    97 |
+-------+------------+--------------------+---------------------+-------+------+-------+
348 rows in set (0.015 sec)

But looking in the retention.dat files ....

current_attempt=90
current_attempt=90
current_attempt=73

I can see they aren't synchronized. There are some other metrics that aren't matching up either.

These appear to be different no matter what:

last_state_change
last_hard_state_change
last_time_ok
last_time_warning

(and obviously check_latency)

However, these are the others that differ only on the machine that was offline:

last_event_id
current_event_id
current_problem_id
current_attempt

jacobbaungard commented 3 years ago

Merlin works (at least now; I can't speak for 6 years ago) by sending the results of checks to all nodes. This way all nodes should see the same checks, even if they are executed elsewhere. Naemon doesn't really know where the check results come from, so the fact that Naemon logs the check results/alerts across all nodes is, I would imagine, normal behaviour here, although I have not spent any time really looking into this.

The issue with notifications being sent from the node that went offline is a bit strange. Do you have any logs for the notifications that were sent out?

We don't generally touch the retention data file, although some use Merlin's inbuilt file sync to sync it. That usually means that if nodes come up at different stages, it can be a little out of sync. Normally things like check_attempts will sync fairly quickly, i.e. in your case they will sync when the object goes back to an OK state. That's also true for the retry_interval reported in the database.

This is the correct place for issues etc., so that's all good! Did you see references to op5.org anywhere? We should get rid of those.

eschoeller commented 3 years ago

Hi, so sorry for my delayed response. I appreciate your attention to my initial issue. I've been running in production for quite some time now - and things have been relatively stable. We had one 'split-brain' incident on our network where one of our data centers had some serious network problems (and thus one of our monitoring nodes did as well) while the other two monitoring nodes (and parent data centers) remained fully functional. The system weathered through that better than I would have expected.

I've noticed that our merlin tables have just continued to increase in size. I wonder if these should be truncated occasionally?

node_a:

select COUNT(*) from report_data;
+----------+
| COUNT(*) |
+----------+
|   856136 |
+----------+

node_b:

select COUNT(*) from report_data;
+----------+
| COUNT(*) |
+----------+
|   853782 |
+----------+

node_c:

select COUNT(*) from report_data;
+----------+
| COUNT(*) |
+----------+
|   855393 |
+----------+

I also encountered an odd notification logic issue the other night which has stumped me. There are two examples. First, an incident where the host notification escalation kicked in as it should have:

Sep 12 20:12:56 node_c naemon: HOST ALERT: pdu-m19;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Sep 12 20:12:56 node_a naemon: HOST ALERT: pdu-m19;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Sep 12 20:12:56 node_b naemon: HOST ALERT: pdu-m19;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Sep 12 20:13:22 node_c naemon: HOST ALERT: pdu-m19;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Sep 12 20:13:22 node_a naemon: HOST ALERT: pdu-m19;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Sep 12 20:13:22 node_b naemon: HOST ALERT: pdu-m19;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Sep 12 20:14:00 node_c naemon: HOST ALERT: pdu-m19;DOWN;HARD;3;PING CRITICAL - Packet loss = 100%
Sep 12 20:14:00 node_c naemon: HOST NOTIFICATION: dcops_bogus;pdu-m19;DOWN;dcops_bogus-notification;PING CRITICAL - Packet loss = 100%
Sep 12 20:14:00 node_a naemon: HOST ALERT: pdu-m19;DOWN;HARD;3;PING CRITICAL - Packet loss = 100%
Sep 12 20:14:00 node_b naemon: HOST ALERT: pdu-m19;DOWN;HARD;3;PING CRITICAL - Packet loss = 100%
Sep 12 20:14:24 node_c naemon: HOST ALERT: pdu-m19;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 2.10 ms
Sep 12 20:14:24 node_c naemon: HOST NOTIFICATION: dcops_bogus;pdu-m19;UP;dcops_bogus-notification;PING OK - Packet loss = 0%, RTA = 2.10 ms
Sep 12 20:14:24 node_a naemon: HOST ALERT: pdu-m19;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 2.10 ms
Sep 12 20:14:24 node_b naemon: HOST ALERT: pdu-m19;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 2.10 ms

Then, in this second case, 'dcops_bogus' was never notified first; it simply jumped straight to paging:

Sep 13 04:10:17 node_a naemon: HOST ALERT: pdu-m19;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Sep 13 04:10:17 node_c naemon: HOST ALERT: pdu-m19;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Sep 13 04:10:17 node_b naemon: HOST ALERT: pdu-m19;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:00 node_a naemon: HOST ALERT: pdu-m19;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:00 node_b naemon: HOST ALERT: pdu-m19;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:00 node_c naemon: HOST ALERT: pdu-m19;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:26 node_a naemon: HOST ALERT: pdu-m19;DOWN;HARD;3;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:26 node_a naemon: HOST NOTIFICATION: admin1+admin2+admin3_email;pdu-m19;DOWN;dcops_host-notify-by-email;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:26 node_a naemon: HOST NOTIFICATION: admin1-pager;pdu-m19;DOWN;dcops_host-notify-by-email-pager;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:26 node_a naemon: HOST NOTIFICATION: admin1-pushover;pdu-m19;DOWN;dcops_host-notify-by-pushover;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:26 node_a naemon: HOST NOTIFICATION: admin2-pushover;pdu-m19;DOWN;dcops_host-notify-by-pushover;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:26 node_a naemon: HOST NOTIFICATION: admin3-pushover;pdu-m19;DOWN;dcops_host-notify-by-pushover;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:26 node_b naemon: HOST ALERT: pdu-m19;DOWN;HARD;3;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:26 node_b naemon: HOST NOTIFICATION SUPPRESSED: pdu-m19;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_a.colorado.edu)'
Sep 13 04:11:26 node_c naemon: HOST ALERT: pdu-m19;DOWN;HARD;3;PING CRITICAL - Packet loss = 100%
Sep 13 04:11:26 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-m19;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_a.colorado.edu)'
Sep 13 04:11:27 node_b naemon: HOST NOTIFICATION SUPPRESSED: pdu-m19;Re-notification blocked for this problem.
Sep 13 04:11:27 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-m19;Re-notification blocked for this problem.
Sep 13 04:11:27 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-m19;Re-notification blocked for this problem.
Sep 13 04:11:49 node_a naemon: HOST ALERT: pdu-m19;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 1.49 ms
Sep 13 04:11:49 node_a naemon: HOST NOTIFICATION: admin1+admin2+admin3_email;pdu-m19;UP;dcops_host-notify-by-email;PING OK - Packet loss = 0%, RTA = 1.49 ms
Sep 13 04:11:49 node_a naemon: HOST NOTIFICATION: admin1-pager;pdu-m19;UP;dcops_host-notify-by-email-pager;PING OK - Packet loss = 0%, RTA = 1.49 ms
Sep 13 04:11:49 node_a naemon: HOST NOTIFICATION: admin1-pushover;pdu-m19;UP;dcops_host-notify-by-pushover;PING OK - Packet loss = 0%, RTA = 1.49 ms
Sep 13 04:11:49 node_a naemon: HOST NOTIFICATION: admin2-pushover;pdu-m19;UP;dcops_host-notify-by-pushover;PING OK - Packet loss = 0%, RTA = 1.49 ms
Sep 13 04:11:49 node_a naemon: HOST NOTIFICATION: admin3-pushover;pdu-m19;UP;dcops_host-notify-by-pushover;PING OK - Packet loss = 0%, RTA = 1.49 ms
Sep 13 04:11:49 node_b naemon: HOST ALERT: pdu-m19;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 1.49 ms
Sep 13 04:11:49 node_b naemon: HOST NOTIFICATION SUPPRESSED: pdu-m19;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_a.colorado.edu)'
Sep 13 04:11:49 node_c naemon: HOST ALERT: pdu-m19;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 1.49 ms
Sep 13 04:11:49 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-m19;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_a.colorado.edu)'

Here is the relevant configuration:

define host {
        active_checks_enabled           1
        check_command                   dcops_check-host-alive
        check_freshness                 0
        check_period                    dcops_24x7
        contact_groups                  dcops_bogus
        event_handler_enabled           0
        flap_detection_enabled          1
        max_check_attempts              3
        name                            dcops_generic_host_template
        notification_interval           5
        notification_options            d,r
        notification_period             dcops_24x7
        notifications_enabled           1
        obsess_over_host                0
        passive_checks_enabled          0
        process_perf_data               1
        register                        0
        retain_nonstatus_information    1
        retain_status_information       1
}

And then the host escalation:

define hostescalation {
        contact_groups          admin1+admin2+admin3_email
        escalation_options      d,r
        escalation_period       dcops_24x7
        first_notification      2
        hostgroup_name          dcops
        last_notification       0
        notification_interval   0
}

The logic behind the hostescalation dates back to over a decade ago. I discussed it briefly with my counterpart who helped build this at the time and we both suspect it was a work-around, likely before 'first_notification_delay' existed as an option. So I may just simplify this scenario and remove the escalation logic entirely (it's confusing) and try using 'first_notification_delay' instead.
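
Roughly what I have in mind, as a sketch only (the other existing template directives would stay as they are, the delay value of 2 is just a placeholder, and first_notification_delay is counted in time units, normally minutes, from when the host first goes down):

define host {
        name                            dcops_generic_host_template
        contact_groups                  admin1+admin2+admin3_email   ; notify the real contacts directly
        first_notification_delay        2                            ; placeholder: time units to wait before the first notification
        notification_interval           5
        notification_options            d,r
        notification_period             dcops_24x7
        register                        0
}

The dcops_bogus contact group and the hostescalation definition would then go away entirely.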

jacobbaungard commented 3 years ago

We had one 'split-brain' incident on our network where one of our data centers had some serious network problems (and thus one of our monitoring nodes did as well) while the other two monitoring nodes (and parent data centers) remained fully functional. The system weathered through that better than I would have expected.

Yes, this is a problem with Merlin's architecture, given that there are no consensus algorithms or anything like that. For now you either live with it and adjust the report data manually when required, or you put all the peers as close as possible on the network, preferably behind the same switch.

I've noticed that our merlin tables have just continued to increase in size. I wonder if these should be truncated occasionally?

We don't make any assumptions about what kind of retention people want on their report data. I think it does make sense to truncate the tables once in a while, perhaps after generating a report over the period. Merlin only logs state changes, not every check, so usually it's fine to keep the report_data for a few years.
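
If you would rather prune than truncate, something along these lines should work (assuming the MySQL/MariaDB backend; the two-year cutoff is only an example, and you would run it on each peer since the databases are independent):

delete from report_data where timestamp < unix_timestamp(now() - interval 2 year);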

The logic behind the hostescalation dates back to over a decade ago. I discussed it briefly with my counterpart who helped build this at the time and we both suspect it was a work-around, likely before 'first_notification_delay' existed as an option. So I may just simplify this scenario and remove the escalation logic entirely (it's confusing) and try using 'first_notification_delay' instead.

Yes, that does look slightly weird. It is perhaps worth investigating whether you see similar behaviour without Merlin installed. I haven't heard of anyone with issues regarding first_notification_delay, so that will probably work fine for you (although I haven't seen any issues regarding the escalation logic either...).