Open MHP47 opened 6 years ago
If I had to guess without looking into it too deeply, I'd bet the check_period on the SSL services might play a part. What happens if you open the SSL services up to the 24x7 check period?
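From what I understand, the scheduling of a check and its check_period are evaluated separately.
When the core engine is about to execute a check, it first does some sanity checks, including whether the current time falls within the check_period. If it does not, the check is not run and is rescheduled instead.
Hence you end up with a situation where checks are not executed for days because they keep getting rescheduled until a scheduled time happens to fall into a valid time period.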
Your check interval is 1200 (minutes, with the default interval_length of 60), which is 20 hours.
Check runs at 11:01am on Monday
Check is rescheduled to run on Tuesday at 07:01am
Check does not run at 07:01am on Tuesday because it is not in the check_period
Check is rescheduled to run on Wednesday at 03:01am
Check does not run at 03:01am on Wednesday because it is not in the check_period
Check is rescheduled to run on Wednesday at 11:01pm
Check does not run at 11:01pm on Wednesday because it is not in the check_period
Check is rescheduled to run on Thursday at 07:01pm
... and so on
@box293 thank you, that makes perfect sense :)
I thought (incorrectly) that Nagios would run a check at the first available opportunity, so in your example, the check scheduled for Tuesday at 7:01am would instead run at 11:01am, as that falls within the time period.
Thanks again for the explanation, I will change my config to align with those rules.
I think it is a worthwhile feature request: add logic so that a check that cannot run is rescheduled to a time that falls within a valid time period.
I agree, like I said above, I thought the reschedule code would schedule a check for the 'next available valid time' within time-periods. Should I open a new issue (feature request)?
The dev team will be able to mark this as a feature request so no need to open another issue :)
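To make the difference between the two behaviours discussed above concrete, here is a minimal Python sketch. It is not Nagios code. It assumes the 20-hour interval mentioned above and a hypothetical check_period of 10:00 to 14:00 every day (the real time period is not shown in this thread), and compares "skip and try again one interval later" with "move the check to the next valid time".

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=1200)  # 20 hours, as stated in the thread

# Hypothetical check_period, for illustration only: 10:00-14:00 every day.
START_HOUR, END_HOUR = 10, 14


def in_period(t):
    """True if t falls inside the hypothetical check_period."""
    return START_HOUR <= t.hour < END_HOUR


def next_valid_time(t):
    """Earliest time >= t that lies inside the hypothetical check_period."""
    if in_period(t):
        return t
    start = t.replace(hour=START_HOUR, minute=0, second=0, microsecond=0)
    if t.hour >= END_HOUR:
        start += timedelta(days=1)  # today's window is over, use tomorrow's
    return start


def run_times(first_run, days, snap_to_period):
    """Times at which the check actually executes within `days` days."""
    end = first_run + timedelta(days=days)
    executed = []
    t = first_run
    while True:
        if snap_to_period:
            t = next_valid_time(t)  # behaviour requested in the feature request
        if t >= end:
            break
        if in_period(t):
            executed.append(t)      # check runs
        t += INTERVAL               # otherwise it is skipped; try one interval later
    return executed


monday = datetime(2018, 1, 15, 11, 1)
skipping = run_times(monday, days=7, snap_to_period=False)
snapped = run_times(monday, days=7, snap_to_period=True)

for label, times in (("skip and retry", skipping), ("next valid time", snapped)):
    print(label, [t.strftime("%a %H:%M") for t in times])
```

Under these assumptions the first policy executes the check only twice in a week (Monday, then the following Saturday), while the second executes it once a day.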
@MHP47 , Are you getting a bunch of lines like the following in your log file? (Maybe not bunches, but a few):
Warning: Check of service 'SITE_NAME' on host 'SSL Domains' could not be rescheduled properly. Scheduling check for [something]...
This may help me narrow down what's going on.
@hedenface I changed my config to align with the comments provided by box293, so things seem to be working OK at the moment, and I can't find your warning string anywhere in the logs or archive logs.
I can revert one of the checks back to the previous, incorrect config and monitor it to see what shows up in the logs.
The weird thing is that two days ago I set up a config similar to yours, and it automatically scheduled its next check in the appropriate time period. Reverting (or creating a dummy one that you know is set up for failure) may be a good idea; I'd like to see the output of your logs. Even better would be if you could spin up a sandbox with just a few of those checks running, turn the debug_level to 24 (EVENTS | CHECKS | FLAPPING) and the debug_verbosity to 2, and submit your debug log as well. You can email it to devteam@nagios.com if you're uncomfortable publishing it here.
Thanks!
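For anyone wondering where 24 comes from: debug_level is a bitmask. The sketch below is only an illustration in Python, not Nagios code. The values 8 (events) and 16 (checks and flapping) are taken from the `[008.x]` and `[016.x]` prefixes in the debug log later in this thread. The rule that a message is written when its level shares at least one bit with debug_level is my assumption, although it is consistent with the `[2320.2]` lines showing up in that log.

```python
# Bit values inferred from the debug log prefixes in this thread.
EVENTS = 8      # lines prefixed [008.x]
CHECKS = 16     # lines prefixed [016.x]
FLAPPING = 16   # flapping messages in the log also carry level 016

debug_level = EVENTS | CHECKS | FLAPPING


def bits(level):
    """Split a level into the individual power-of-two bits it contains."""
    return [1 << i for i in range(level.bit_length()) if level & (1 << i)]


def would_log(message_level, configured_level):
    """Assumed rule: log when the two levels share at least one bit."""
    return bool(message_level & configured_level)


print(debug_level)                   # 8 | 16 | 16
print(bits(2320))                    # bits that make up the 2320 prefix
print(would_log(2320, debug_level))  # shares bit 16 with debug_level
```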
@hedenface reverted the config last week. Symptoms have returned, but nothing in the log file. Have repurposed a dev environment to run these checks with the same config. Have set the debugging per your post as well.
Will leave it running for a few days, and let you know when I have something.
@hedenface Left this config running for several days before the symptoms occurred. The debug log is quite large, a million lines. Can still send that through if you would like, but I think the below is what you are after:
[1516409894.689490] [008.0] [pid=119137] ** Service Check Event ==> Host: 'SSL Domains', Service: 'Website 17', Options: 0, Latency: 0.000000 sec
[1516409894.689499] [016.0] [pid=119137] Attempting to run scheduled check of service 'Website 17' on host 'SSL Domains': check options=0, latency=0.000000
[1516409894.689524] [016.0] [pid=119137] Checking service 'Website 17' on host 'SSL Domains'...
[1516409894.689533] [2320.2] [pid=119137] Raw Command Input: $USER2$/check_ssl.sh -h $ARG1$ -w $ARG2$ -c $ARG3$
[1516409894.689544] [2320.2] [pid=119137] Expanded Command Output: $USER2$/check_ssl.sh -h $ARG1$ -w $ARG2$ -c $ARG3$
[1516409894.689593] [008.1] [pid=119137] ** Event Check Loop
[1516409894.689601] [008.1] [pid=119137] Next Event Time: Sat Jan 20 11:58:21 2018
[1516409894.689603] [008.1] [pid=119137] Current/Max Service Checks: 1/0 (inf% saturation)
[1516409894.846049] [016.2] [pid=119137] Processing check result for service 'Website 17' on host 'SSL Domains'
[1516409894.846075] [016.0] [pid=119137] ** Handling check result for service 'Website 17' on host 'SSL Domains' from 'Core Worker 119141'...
[1516409894.846080] [016.1] [pid=119137] HOST: SSL Domains, SERVICE: Website 17, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: OK: 161 days remaining for SSL certificate
[1516409894.846092] [016.2] [pid=119137] Parsing check output...
[1516409894.846104] [016.2] [pid=119137] Short Output: OK: 161 days remaining for SSL certificate
[1516409894.846108] [016.2] [pid=119137] Long Output: NULL
[1516409894.846112] [016.2] [pid=119137] Perf Data: NULL
[1516409894.846116] [016.2] [pid=119137] ST: HARD CA: 1 MA: 1 CS: 0 LS: 0 LHS: 0
[1516409894.846121] [016.1] [pid=119137] Service is OK.
[1516409894.846124] [016.1] [pid=119137] Service did not change state.
[1516409894.846153] [016.1] [pid=119137] Rescheduling next check of service at Sun Jan 21 07:58:14 2018
[1516409894.846186] [016.0] [pid=119137] Scheduling a non-forced, active check of service 'Website 17' on host 'SSL Domains' @ Sun Jan 21 18:02:43 2018
[1516409894.846191] [016.2] [pid=119137] Scheduling new service check event.
[1516409894.846199] [016.1] [pid=119137] Checking service 'Website 17' on host 'SSL Domains' for flapping...
[1516409894.846204] [016.2] [pid=119137] LFT=5.00, HFT=20.00, CPC=0.00, PSC=0.00%
[1516409894.846214] [016.1] [pid=119137] Service is not flapping (0.00% state change).
[1516409894.846218] [016.1] [pid=119137] Checking host 'SSL Domains' for flapping...
[1516409894.846222] [016.2] [pid=119137] LFT=5.00, HFT=20.00, CPC=0.00, PSC=0.00%
[1516409894.846228] [016.1] [pid=119137] Host is not flapping (0.00% state change).
[1516409894.846247] [016.2] [pid=119137] Raw service performance data file output: $LASTSERVICECHECK$||$HOSTNAME$||$SERVICEDESC$||$SERVICEOUTPUT$||$SERVICEPERFDATA$
[1516409894.846262] [016.2] [pid=119137] Processed service performance data file output: 1516409894||SSL Domains||Website 17||OK: 161 days remaining for SSL certificate||
...
[1516518163.821946] [008.0] [pid=119137] ** Timed Event ** Type: EVENT_SERVICE_CHECK, Run Time: Sun Jan 21 18:02:43 2018
[1516518163.821972] [008.0] [pid=119137] ** Service Check Event ==> Host: 'SSL Domains', Service: 'Website 17', Options: 0, Latency: 0.000000 sec
[1516518163.821980] [016.0] [pid=119137] Attempting to run scheduled check of service 'Website 17' on host 'SSL Domains': check options=0, latency=0.000000
[1516518163.822000] [016.2] [pid=119137] This is not a valid time for this service to be actively checked.
[1516518163.822004] [016.1] [pid=119137] Unable to run scheduled service check at this time
[1516518163.822038] [016.1] [pid=119137] Rescheduled next service check for Tue Jan 23 11:00:00 2018
[1516518163.822055] [016.0] [pid=119137] Scheduling a non-forced, active check of service 'Website 17' on host 'SSL Domains' @ Tue Jan 23 12:12:36 2018
[1516518163.822061] [016.2] [pid=119137] Scheduling new service check event.
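Since the full debug log is around a million lines, a small filter can help pull out just the scheduling decisions. The following is a rough Python sketch, not an official tool. The line layout (`[epoch.micros] [level.verbosity] [pid=N] message`) and the message fragments it searches for are taken from the excerpt above; the function names are made up for this example.

```python
import re
from datetime import datetime, timezone

# Layout taken from the excerpt: [epoch.micros] [level.verbosity] [pid=N] message
LINE_RE = re.compile(r"^\[(\d+)\.(\d+)\] \[(\d+)\.(\d+)\] \[pid=(\d+)\] (.*)$")

# Message fragments that mark a scheduling decision in the excerpt.
KEYWORDS = (
    "not a valid time",
    "Rescheduling next check",
    "Rescheduled next service check",
    "Scheduling a non-forced",
)


def parse_line(line):
    """Return (utc_timestamp, message) or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    secs, micros, _level, _verbosity, _pid, message = m.groups()
    ts = datetime.fromtimestamp(int(secs), tz=timezone.utc)
    return ts.replace(microsecond=int(micros)), message


def scheduling_events(lines):
    """Keep only the lines that describe a scheduling decision."""
    events = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and any(k in parsed[1] for k in KEYWORDS):
            events.append(parsed)
    return events


sample = [
    "[1516518163.822000] [016.2] [pid=119137] This is not a valid time for this service to be actively checked.",
    "[1516518163.822004] [016.1] [pid=119137] Unable to run scheduled service check at this time",
    "[1516518163.822038] [016.1] [pid=119137] Rescheduled next service check for Tue Jan 23 11:00:00 2018",
    "[1516518163.822055] [016.0] [pid=119137] Scheduling a non-forced, active check of service 'Website 17' on host 'SSL Domains' @ Tue Jan 23 12:12:36 2018",
]

events = scheduling_events(sample)
for ts, message in events:
    print(ts.isoformat(), message)
```

To run it over a real file, pass an open file object to `scheduling_events` instead of `sample`. Timestamps are printed in UTC. The epoch 1516518163 corresponds to 07:02:43 UTC, while the log itself prints 18:02:43 for the same event, so the server's local time appears to be UTC+11.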
I have quite a few checks set up to check once per day, and for some reason some of them, sometimes, just don't seem to execute for days at a time. I haven't found any consistency between ones that do or do not, or what makes them start working again.
Definition of example, pulled from the precache file:
I took some screenshots from the web interface to show the scheduling:
This is the overview of a group of services (sensitive info removed). You can see that the "Last Check" is days ago, whereas, as per the definitions above (all checks are the same), the checks should run daily:
Screenshot of the Scheduling Queue, with other services blacked out and sensitive info removed. This looks strange to me; a lot of these times are all over the place.
And here is a grep from the archive log files, for the example service:
I would expect this to show the "days remaining" value decreasing by one each day, based on what the check is doing. Where the same value appears multiple days in a row, it looks like the check was not scheduled correctly.
In conclusion, is there anywhere else to look, or other things to check, to work out why these services are not scheduled correctly?