ITRS-Group / monitor-merlin

Module for Effortless Redundancy and Loadbalancing In Naemon
https://itrs-group.github.io/monitor-merlin/
GNU General Public License v2.0
22 stars 14 forks source link

WIP: daemon: Log to report db if state differs to last state #78

Closed sjoegren closed 3 years ago

sjoegren commented 3 years ago

This is a workaround to log to report_data in some cases when nebattr flags is not set from Naemon. For each host/service check result that daemon gets from module (ipc), in addition to checking for nebattr flags to determine if it's a state change that should be logged in report db, it also checks if the current state is different to the last state logged for the object in db. This means that this patch queries the db for an objects last recorded state, for each check result processed, which might not be ideal...

This fixes MON-9116.

Signed-off-by: Aksel Sjögren asjogren@itrsgroup.com

sjoegren commented 3 years ago

I couldn't find any obvious ways of creating tests for this patch, there doesn't seem to be similar tests around and I don't know if/how to create cucumber feature tests including multiple peers and a database...

jacobbaungard commented 3 years ago

Code looks good.

Did you do any kind of performance testing? I am a little worried we're putting a lot of additional load on the DB, as every check will then be one DB lookup.


I am a little confused as to why this happens. I digged around a bit.

When we get a host event from a remote node we this function call path: handle_event (module.c) handle_host_result (module.c) handle_checkresult (module.c) which calls process_check_result which is a Naemon function.

Following the calls in Naemon we get: process_check_result (src/naemon/checks.c) handle_async_host_check_result (src/naemon/checks_host.c)

Now we call: process_host_check_result (src/naemon/checks_host.c)

And for this function we get the variable alert_recorded which is forwarded to the broker module.

I assume this is the variable we are using to determine a state change, or is that wrong?

If it is that variable we use why does Naemon not set it correctly? I don't see Naemon actually making use of the attr flags forwarded from the remote merlin node (if it is forwarded at all).

In Naemon process_host_check_result we first: 1) Sets the state type (soft/hard) in some cases, i.e. here . It seems like we get a HARD state given the log output HOST ALERT: segot-ad01;UP;HARD;1;OK - segot-ad01.intra.itrsent.local responds to ICMP. Packet 1, rtt 0.429ms 2) Calls handle_host_state 3) handle_host_state then determines if we have a state change . In our case it should be true I would assume? 4) A bit further down, we then set the alert_recorded flag alongside logging to the main log file

We then return back up and end up with sending the alert_recorded variable, to the broker module (to NEBCALLBACK_HOST_CHECK_DATA)

I am very confused as to where it goes wrong, here. To me it looks like it should work? We even, as far as I can see, do the host alert logging (which we do get correctly right?) at the same place we set the alert_recoded variable.

That neb callback is handled here https://github.com/ITRS-Group/monitor-merlin/blob/3fb08a5c21f9999abc68ce7f4673b57872d5304e/module/hooks.c#L384

And because the result was actually processed locally (because we ened up calling process_check_result in naemon, we end up sending that event to IPC down the line here: https://github.com/ITRS-Group/monitor-merlin/blob/3fb08a5c21f9999abc68ce7f4673b57872d5304e/module/hooks.c#L184 -> https://github.com/ITRS-Group/monitor-merlin/blob/3fb08a5c21f9999abc68ce7f4673b57872d5304e/module/hooks.c#L97

sjoegren commented 3 years ago

Did you do any kind of performance testing? I am a little worried we're putting a lot of additional load on the DB, as every check will then be one DB lookup.

No, I didn't. I don't like having to do the db queries either, obviously it's not really a desirable solution. For a system with 100k host/service checks and 5m check_interval, it should be 100,000/300sec = 333 check results / second.

I am a little confused as to why this happens. I digged around a bit.

When we get a host event from a remote node we this function call path: handle_event (module.c) handle_host_result (module.c) handle_checkresult (module.c) which calls process_check_result which is a Naemon function.

Following the calls in Naemon we get: process_check_result (src/naemon/checks.c) handle_async_host_check_result (src/naemon/checks_host.c)

Now we call: process_host_check_result (src/naemon/checks_host.c)

And for this function we get the variable alert_recorded which is forwarded to the broker module.

I assume this is the variable we are using to determine a state change, or is that wrong?

Yes, alert_recorded | first_recorded_state forwarded to NEB and eventually daemon (as nebattr) where we use it to determine if it was a state change originally when Naemon dispatched the check result.

If it is that variable we use why does Naemon not set it correctly? I don't see Naemon actually making use of the attr flags forwarded from the remote merlin node (if it is forwarded at all).

Naemon is setting it correctly. Think of a disconnected pair of peers, where peerA still sees hostA as UP, so peerA never detects a state change. PeerB however sees hostA as DOWN. After peers re-connect, peerA check result backlog is sent to peerA, who updates host state in Naemon to UP. But when merlind looks at nebattr flags, there is no state change.

In Naemon process_host_check_result we first:

1. Sets the state type (soft/hard) in some cases, i.e. [here](https://github.com/naemon/naemon-core/blob/70944db34113e41d9ebe621c75c320ba1956d930/src/naemon/checks_host.c#L736) . It seems like we get a `HARD` state given the log output `HOST ALERT: segot-ad01;UP;HARD;1;OK - segot-ad01.intra.itrsent.local responds to ICMP. Packet 1, rtt 0.429ms`

2. [Calls `handle_host_state`](https://github.com/naemon/naemon-core/blob/70944db34113e41d9ebe621c75c320ba1956d930/src/naemon/checks_host.c#L875)

3. `handle_host_state` then [determines if we have a state change](https://github.com/naemon/naemon-core/blob/70944db34113e41d9ebe621c75c320ba1956d930/src/naemon/checks_host.c#L951) . In our case it should be true I would assume?

It's not true in our problem case, see answer above.

4. [A bit further down, we then set the](https://github.com/naemon/naemon-core/blob/70944db34113e41d9ebe621c75c320ba1956d930/src/naemon/checks_host.c#L1019) `alert_recorded` flag alongside logging to the main log file

We then return back up and end up with sending the alert_recorded variable, to the broker module (to NEBCALLBACK_HOST_CHECK_DATA)

I am very confused as to where it goes wrong, here. To me it looks like it should work? We even, as far as I can see, do the host alert logging (which we do get correctly right?) at the same place we set the alert_recoded variable.

That neb callback is handled here

https://github.com/ITRS-Group/monitor-merlin/blob/3fb08a5c21f9999abc68ce7f4673b57872d5304e/module/hooks.c#L384

And because the result was actually processed locally (because we ened up calling process_check_result in naemon, we end up sending that event to IPC down the line here:

https://github.com/ITRS-Group/monitor-merlin/blob/3fb08a5c21f9999abc68ce7f4673b57872d5304e/module/hooks.c#L184 ->

https://github.com/ITRS-Group/monitor-merlin/blob/3fb08a5c21f9999abc68ce7f4673b57872d5304e/module/hooks.c#L97