Closed. mittma closed this issue 6 years ago
Probably it will break other things, which is why we need proper tests for such changes.
I did some research in the last couple of days. You're right, resetting the counter is pretty much wrong. It tries to fake a hard state in order to prevent notifications. Later on this leads to the SOFT recovery.
Nagios removed those parts of the code a while ago.
https://github.com/NagiosEnterprises/nagioscore/commit/7477b95ce211335d3ff9ced45595d903d0f95d31
http://tracker.nagios.org/view.php?id=128 (Patch at the end is not applied in Icinga).
https://github.com/NagiosEnterprises/nagioscore/commit/38050baa4bd20574b148602a3a34c3b194b2a300 (patch is applied in Icinga)
90af9261c41525de9085b5ea658a265f456e9f85
Note: This if condition https://github.com/Icinga/icinga-core/blob/master/base/checks.c#L1645 is redundant with the if condition already enclosing it here https://github.com/Icinga/icinga-core/blob/master/base/checks.c#L1618 (dear god, that code would need cleanup)
I'd guess that removing the block from HARD_STATE up to current_attempt = 1 (3 lines) does not hurt. It will leave the service in its current state. Can you try commenting those lines out and check whether it fixes your problem?
We had a server reboot that took too long. A service problem notification was sent, but no recovery message was sent out after the service was running again.
According to our logs the service state was reset from hard to soft after the host recovered from its short down state (check_interval is 5, retry_interval is 5 and max_check_attempts is 4):
No recovery notification was sent because it was a soft recovery - which is wrong!?
Google found an old bug (note 0000139) which seems to be right:
In checks.c, current_attempt gets set to 1 if the host is down:
This causes the service state to be set to SOFT as soon as the host recovers:
In my opinion current_attempt shouldn't be set to 1 (but I'm not sure whether that causes any problems in other parts of the code).
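To make the sequence easier to follow, here is a minimal sketch in C that replays the reboot scenario. It is not the real checks.c code: `svc_model`, `svc_state_type`, `svc_problem_while_host_down`, `svc_recover` and `replay_reboot` are hypothetical names, and the sketch assumes that the state type at recovery time follows from comparing current_attempt with max_check_attempts, which is how I read the behaviour described above.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins; not the real Icinga structs. */
enum state_type { SOFT_STATE, HARD_STATE };

struct svc_model {
    int current_attempt;
    int max_attempts;   /* max_check_attempts, 4 in the report above */
};

/* Assumption: the state type is derived from the attempt counter. */
static enum state_type svc_state_type(const struct svc_model *s)
{
    return s->current_attempt >= s->max_attempts ? HARD_STATE : SOFT_STATE;
}

/* A service problem is processed while the host is down.
 * reset_attempt == true mimics the block under discussion,
 * which sets current_attempt back to 1. */
static void svc_problem_while_host_down(struct svc_model *s, bool reset_attempt)
{
    if (reset_attempt)
        s->current_attempt = 1;
}

/* The service returns to OK. Returns true if this counts as a hard
 * recovery, i.e. a recovery notification would be sent in this model. */
static bool svc_recover(struct svc_model *s)
{
    bool hard_recovery = (svc_state_type(s) == HARD_STATE);
    s->current_attempt = 1;   /* an OK service starts over at attempt 1 */
    return hard_recovery;
}

/* Replay: service is already in a hard problem state (attempt 4 of 4),
 * the host goes down, then host and service come back. */
static bool replay_reboot(bool reset_attempt)
{
    struct svc_model s = { .current_attempt = 4, .max_attempts = 4 };
    svc_problem_while_host_down(&s, reset_attempt);
    return svc_recover(&s);
}
```

In this model `replay_reboot(true)` ends in a soft recovery without a notification, while `replay_reboot(false)` keeps the counter at 4 and ends in a hard recovery. Whether leaving the counter alone has side effects elsewhere in the real code is exactly the open question.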