NagiosEnterprises / nagioscore

Nagios Core
GNU General Public License v2.0

"auto_reschedule_checks" Causes Indefinite Delay in Service Checks #947

Open benbyr opened 7 months ago

benbyr commented 7 months ago

I've encountered an issue where enabling the auto_reschedule_checks option in Nagios causes some service checks to stop running as expected. The "Next Scheduled Check" time for affected services keeps getting pushed forward, so these checks never execute on their intended schedule. It also leaves some newly added services stuck in a "pending" state indefinitely. I understand that this option is still considered experimental, but it's the only option that effectively decreases the monitoring load on our system.

Expected Behavior: When auto_reschedule_checks is enabled, all services should continue to run at their scheduled intervals, with reasonable adjustments to distribute the load evenly. The "Next Scheduled Check" should be rescheduled within a practical timeframe, ensuring timely execution of all checks.
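To make the expected behavior concrete, here is a minimal Python sketch of window-based rescheduling. It is not Nagios Core's actual algorithm; the `Check` class and `spread_checks` function are hypothetical names used only for illustration. The property it demonstrates is the one described above: checks due inside the window are spaced evenly, and none of them is ever moved past the end of the window.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    next_check: float  # epoch seconds


def spread_checks(checks, now, window=180):
    """Evenly space checks that are due within [now, now + window].

    Illustrative only: checks outside the window are left untouched,
    and no rescheduled check lands later than now + window.
    """
    due = sorted(
        (c for c in checks if now <= c.next_check <= now + window),
        key=lambda c: c.next_check,
    )
    if not due:
        return checks
    step = window / len(due)  # equal spacing across the window
    for i, check in enumerate(due):
        check.next_check = now + i * step
    return checks


if __name__ == "__main__":
    now = 1000.0
    checks = [Check("a", 1010.0), Check("b", 1010.0), Check("c", 1010.0), Check("d", 1500.0)]
    for c in spread_checks(checks, now):
        print(c.name, c.next_check)
```

Under this model a check can shift by at most one window length per pass, so a "Next Scheduled Check" that keeps moving forward without the check ever running would indicate something other than ordinary load spreading.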

Actual Behavior: After enabling auto_reschedule_checks, some service checks do not run, and their "Next Scheduled Check" time is postponed indefinitely. Some newly added services also remain stuck in a "pending" state. The issue persists for the affected services, leaving gaps in monitoring where critical problems could go unnoticed.
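For anyone trying to confirm the same symptoms, the following Python sketch scans the contents of the Nagios status file for services that are still pending or have not been checked for several intervals. It assumes the usual `servicestatus { key=value ... }` block layout and the `has_been_checked`, `last_check`, and `check_interval` fields, and it assumes the default `interval_length` of 60 seconds. The function names and the `max_lag_factor` threshold are hypothetical, chosen only for this example.

```python
import re

# Matches one "servicestatus { ... }" block; the closing brace sits on its own line.
BLOCK_RE = re.compile(r"servicestatus\s*\{(.*?)\n[ \t]*\}", re.S)


def parse_service_blocks(text):
    """Return one dict of key=value fields per servicestatus block."""
    services = []
    for match in BLOCK_RE.finditer(text):
        fields = {}
        for line in match.group(1).splitlines():
            line = line.strip()
            if "=" in line:
                key, value = line.split("=", 1)
                fields[key] = value
        services.append(fields)
    return services


def find_stuck(services, now, max_lag_factor=3):
    """Flag services that are pending or overdue by several check intervals."""
    stuck = []
    for svc in services:
        # Assumption: check_interval is in units of 60 seconds (default interval_length).
        interval_s = float(svc.get("check_interval", 0)) * 60
        last_check = int(svc.get("last_check", 0))
        pending = svc.get("has_been_checked") == "0"
        overdue = (
            interval_s > 0
            and last_check > 0
            and now - last_check > max_lag_factor * interval_s
        )
        if pending or overdue:
            reason = "pending" if pending else "overdue"
            stuck.append((svc.get("host_name"), svc.get("service_description"), reason))
    return stuck
```

To use it, read the file named by the `status_file` directive in nagios.cfg, pass its text to `parse_service_blocks`, and call `find_stuck` with `int(time.time())` as `now`. A service added moments ago will also show up as pending, so only entries that stay flagged across repeated runs are meaningful.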

Nagios Version: Nagios Core 4.4.10

Nagios Config:
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=180

Just to note, we have not experienced this issue across all our monitoring nodes; it has only occurred with servers from a specific provider. This inconsistency is confusing, especially considering that it works correctly most of the time. All of our nodes are deployed exactly the same.

benbyr commented 7 months ago

Looks like this might be related to: https://github.com/NagiosEnterprises/nagioscore/issues/893