Closed sjoegren closed 3 years ago
I couldn't find any obvious ways of creating tests for this patch, there doesn't seem to be similar tests around and I don't know if/how to create cucumber feature tests including multiple peers and a database...
Code looks good.
Did you do any kind of performance testing? I am a little worried we're putting a lot of additional load on the DB, as every check will then be one DB lookup.
I am a little confused as to why this happens. I digged around a bit.
When we get a host event from a remote node we this function call path:
handle_event
(module.c)
handle_host_result
(module.c)
handle_checkresult
(module.c)
which calls process_check_result
which is a Naemon function.
Following the calls in Naemon we get:
process_check_result
(src/naemon/checks.c)
handle_async_host_check_result
(src/naemon/checks_host.c)
Now we call:
process_host_check_result
(src/naemon/checks_host.c)
And for this function we get the variable alert_recorded
which is forwarded to the broker module.
I assume this is the variable we are using to determine a state change, or is that wrong?
If it is that variable we use why does Naemon not set it correctly? I don't see Naemon actually making use of the attr flags forwarded from the remote merlin node (if it is forwarded at all).
In Naemon process_host_check_result
we first:
1) Sets the state type (soft/hard) in some cases, i.e. here . It seems like we get a HARD
state given the log output HOST ALERT: segot-ad01;UP;HARD;1;OK - segot-ad01.intra.itrsent.local responds to ICMP. Packet 1, rtt 0.429ms
2) Calls handle_host_state
3) handle_host_state
then determines if we have a state change . In our case it should be true I would assume?
4) A bit further down, we then set the alert_recorded
flag alongside logging to the main log file
We then return back up and end up with sending the alert_recorded
variable, to the broker module (to NEBCALLBACK_HOST_CHECK_DATA
)
I am very confused as to where it goes wrong, here. To me it looks like it should work? We even, as far as I can see, do the host alert logging (which we do get correctly right?) at the same place we set the alert_recoded
variable.
That neb callback is handled here https://github.com/ITRS-Group/monitor-merlin/blob/3fb08a5c21f9999abc68ce7f4673b57872d5304e/module/hooks.c#L384
And because the result was actually processed locally (because we ened up calling process_check_result
in naemon, we end up sending that event to IPC down the line here: https://github.com/ITRS-Group/monitor-merlin/blob/3fb08a5c21f9999abc68ce7f4673b57872d5304e/module/hooks.c#L184 -> https://github.com/ITRS-Group/monitor-merlin/blob/3fb08a5c21f9999abc68ce7f4673b57872d5304e/module/hooks.c#L97
Did you do any kind of performance testing? I am a little worried we're putting a lot of additional load on the DB, as every check will then be one DB lookup.
No, I didn't. I don't like having to do the db queries either, obviously it's not really a desirable solution. For a system with 100k host/service checks and 5m check_interval, it should be 100,000/300sec = 333 check results / second.
I am a little confused as to why this happens. I digged around a bit.
When we get a host event from a remote node we this function call path:
handle_event
(module.c)handle_host_result
(module.c)handle_checkresult
(module.c) which callsprocess_check_result
which is a Naemon function.Following the calls in Naemon we get:
process_check_result
(src/naemon/checks.c)handle_async_host_check_result
(src/naemon/checks_host.c)Now we call:
process_host_check_result
(src/naemon/checks_host.c)And for this function we get the variable
alert_recorded
which is forwarded to the broker module.I assume this is the variable we are using to determine a state change, or is that wrong?
Yes, alert_recorded | first_recorded_state
forwarded to NEB and eventually daemon (as nebattr
) where we use it to determine if it was a state change originally when Naemon dispatched the check result.
If it is that variable we use why does Naemon not set it correctly? I don't see Naemon actually making use of the attr flags forwarded from the remote merlin node (if it is forwarded at all).
Naemon is setting it correctly. Think of a disconnected pair of peers, where peerA still sees hostA as UP, so peerA never detects a state change. PeerB however sees hostA as DOWN. After peers re-connect, peerA check result backlog is sent to peerA, who updates host state in Naemon to UP. But when merlind looks at nebattr
flags, there is no state change.
In Naemon
process_host_check_result
we first:1. Sets the state type (soft/hard) in some cases, i.e. [here](https://github.com/naemon/naemon-core/blob/70944db34113e41d9ebe621c75c320ba1956d930/src/naemon/checks_host.c#L736) . It seems like we get a `HARD` state given the log output `HOST ALERT: segot-ad01;UP;HARD;1;OK - segot-ad01.intra.itrsent.local responds to ICMP. Packet 1, rtt 0.429ms` 2. [Calls `handle_host_state`](https://github.com/naemon/naemon-core/blob/70944db34113e41d9ebe621c75c320ba1956d930/src/naemon/checks_host.c#L875) 3. `handle_host_state` then [determines if we have a state change](https://github.com/naemon/naemon-core/blob/70944db34113e41d9ebe621c75c320ba1956d930/src/naemon/checks_host.c#L951) . In our case it should be true I would assume?
It's not true in our problem case, see answer above.
4. [A bit further down, we then set the](https://github.com/naemon/naemon-core/blob/70944db34113e41d9ebe621c75c320ba1956d930/src/naemon/checks_host.c#L1019) `alert_recorded` flag alongside logging to the main log file
We then return back up and end up with sending the
alert_recorded
variable, to the broker module (toNEBCALLBACK_HOST_CHECK_DATA
)I am very confused as to where it goes wrong, here. To me it looks like it should work? We even, as far as I can see, do the host alert logging (which we do get correctly right?) at the same place we set the
alert_recoded
variable.That neb callback is handled here
And because the result was actually processed locally (because we ened up calling
process_check_result
in naemon, we end up sending that event to IPC down the line here:
This is a workaround to log to report_data in some cases when
nebattr
flags is not set from Naemon. For each host/service check result that daemon gets from module (ipc), in addition to checking for nebattr flags to determine if it's a state change that should be logged in report db, it also checks if the current state is different to the last state logged for the object in db. This means that this patch queries the db for an objects last recorded state, for each check result processed, which might not be ideal...This fixes MON-9116.
Signed-off-by: Aksel Sjögren asjogren@itrsgroup.com