Once per day checks not scheduling times correctly

MHP47 commented 6 years ago

I have quite a few checks set up to check once per day, and for some reason some of them, sometimes, just don't seem to execute for days at a time. I haven't found any consistency between ones that do or do not, or what makes them start working again.

Definition of example, pulled from the precache file:

define host {
  host_name SSL Domains
  alias SSL Domains
  address 127.0.0.1
  check_period  24x7
  check_command check_ping!100.0,20%!500.0,60%!!!!!!
  contacts  none
  initial_state o
  importance  0
  check_interval  5.000000
  retry_interval  1.000000
  max_check_attempts  1
  active_checks_enabled 1
  passive_checks_enabled  1
  obsess  1
  event_handler_enabled 1
  low_flap_threshold  0.000000
  high_flap_threshold 0.000000
  flap_detection_enabled  1
  flap_detection_options  a
  freshness_threshold 0
  check_freshness 0
  notification_options  a
  notifications_enabled 1
  notification_interval 30.000000
  first_notification_delay  0.000000
  stalking_options  n
  process_perf_data 1
  retain_status_information 1
  retain_nonstatus_information  1
  }

define service {
  host_name SSL Domains
  service_description SITE_NAME
  check_period  ssl_checking
  check_command check_ssl_cert!$SERVICEDESC$!60!30
  contacts  CONTACT_NAME
  contact_groups  admins
  notification_period 24x7
  initial_state o
  importance  0
  check_interval  1200.000000
  retry_interval  2.000000
  max_check_attempts  1
  is_volatile 0
  parallelize_check 1
  active_checks_enabled 1
  passive_checks_enabled  1
  obsess  1
  event_handler_enabled 1
  low_flap_threshold  0.000000
  high_flap_threshold 0.000000
  flap_detection_enabled  1
  flap_detection_options  a
  freshness_threshold 0
  check_freshness 0
  notification_options  r,w,u,c
  notifications_enabled 1
  notification_interval 0.000000
  first_notification_delay  0.000000
  stalking_options  n
  process_perf_data 1
  retain_status_information 1
  retain_nonstatus_information  1
  _AUTOMATED_TEST 1
  }

define timeperiod {
  timeperiod_name ssl_checking
  alias Time of day to check SSL certificates
  sunday  11:00-13:00
  monday  11:00-13:00
  tuesday 11:00-13:00
  wednesday 11:00-13:00
  thursday  11:00-13:00
  friday  11:00-13:00
  saturday  11:00-13:00
  }

define timeperiod {
  timeperiod_name 24x7
  alias 24 Hours A Day, 7 Days A Week
  sunday  00:00-24:00
  monday  00:00-24:00
  tuesday 00:00-24:00
  wednesday 00:00-24:00
  thursday  00:00-24:00
  friday  00:00-24:00
  saturday  00:00-24:00
  }

I took some screenshots from the web interface to show the scheduling:

This is the overview of a group of services (removed sensitive info). Can see the "Last Check" is days ago, which, as per the definitions above (all checks are the same), should be daily: all_service_details

Screenshot of the Scheduling Queue, other services blacked out, and sensitive info removed: scheduled_details This looks strange to me, a lot of these times are all over the place.

And here is a grep from the archive log files, for the example service:

nagios-10-07-2017-00.log:[Fri Oct  6 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 901 days remaining for SSL certificate
nagios-10-08-2017-00.log:[Sat Oct  7 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 900 days remaining for SSL certificate
nagios-10-09-2017-00.log:[Sun Oct  8 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 899 days remaining for SSL certificate
nagios-10-10-2017-00.log:[Mon Oct  9 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 899 days remaining for SSL certificate
nagios-10-11-2017-00.log:[Tue Oct 10 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 897 days remaining for SSL certificate
nagios-10-12-2017-00.log:[Wed Oct 11 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 896 days remaining for SSL certificate
nagios-10-13-2017-00.log:[Thu Oct 12 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 895 days remaining for SSL certificate
nagios-10-14-2017-00.log:[Fri Oct 13 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 894 days remaining for SSL certificate
nagios-10-15-2017-00.log:[Sat Oct 14 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 893 days remaining for SSL certificate
nagios-10-16-2017-00.log:[Sun Oct 15 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 893 days remaining for SSL certificate
nagios-10-17-2017-00.log:[Mon Oct 16 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 893 days remaining for SSL certificate
nagios-10-18-2017-00.log:[Tue Oct 17 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 893 days remaining for SSL certificate
nagios-10-19-2017-00.log:[Wed Oct 18 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 889 days remaining for SSL certificate
nagios-10-20-2017-00.log:[Thu Oct 19 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 889 days remaining for SSL certificate
nagios-10-21-2017-00.log:[Fri Oct 20 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 889 days remaining for SSL certificate
nagios-10-22-2017-00.log:[Sat Oct 21 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 889 days remaining for SSL certificate
nagios-10-23-2017-00.log:[Sun Oct 22 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 885 days remaining for SSL certificate
nagios-10-24-2017-00.log:[Mon Oct 23 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 885 days remaining for SSL certificate
nagios-10-25-2017-00.log:[Tue Oct 24 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 883 days remaining for SSL certificate
nagios-10-26-2017-00.log:[Wed Oct 25 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 882 days remaining for SSL certificate
nagios-10-27-2017-00.log:[Thu Oct 26 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 881 days remaining for SSL certificate
nagios-10-28-2017-00.log:[Fri Oct 27 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 880 days remaining for SSL certificate
nagios-10-29-2017-00.log:[Sat Oct 28 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 880 days remaining for SSL certificate
nagios-10-30-2017-00.log:[Sun Oct 29 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 880 days remaining for SSL certificate
nagios-10-31-2017-00.log:[Mon Oct 30 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 880 days remaining for SSL certificate
nagios-11-01-2017-00.log:[Tue Oct 31 00:00:00 2017] CURRENT SERVICE STATE: SSL Domains;SITE_NAME;OK;HARD;1;OK: 880 days remaining for SSL certificate

I would expect this to show the "day remaining" value to be one less each day, based on what the check is doing. Where the same value is multiple days in a row, looks to be where the check was not scheduled correctly.

In conclusion, is there anywhere else to look, or other things to check, to work out why these services are not scheduled correctly?

hedenface commented 6 years ago

If I had to guess without looking into it too deeply - I'd bet the check_period on the ssl services might play a part. What happens if you open the ssl services up to the 24x7 check period?

box293 commented 6 years ago

From what I understand, the scheduling of the check and the check_period are evaluated separately.

When a check is executed by core engine, it does some sanity checks.

It looks at the check_period that is defined on the object
If the time right now does not fall within that time period then the execution of the check is aborted and it is rescheduled
The check is scheduled based on the value defined in check_interval, there is no testing to see if that value will place the next scheduled check inside the timeperiod defined on the object

Hence you end up with a situation where checks are not executed for days because they keep getting rescheduled, until they fall into a valid time period.

Your check interval is 1200 which is 20 hours.

Check runs at 11:01am on Monday Check is rescheduled to run on Tuesday at 09:01am Check does not run at 09:01am on Tuesday because it is not in the check_period Check is rescheduled to run on Wednesday at 07:01am Check does not run at 07:01am on Wednesday because it is not in the check_period Check is rescheduled to run on Thursday at 05:01am Check does not run at 05:01am on Thursday because it is not in the check_period Check is rescheduled to run on Friday at 03:01am ... and so on

MHP47 commented 6 years ago

@box293 thank you, that makes perfect sense :)

I thought (incorrectly) that Nagios would run a check at the first available opportunity, so in your example, the Tuesday is scheduled for 9:01am, but will then run at 11:01am, as that falls within the time-period.

Thanks again for the explanation, I will change my config to align with those rules

box293 commented 6 years ago

I think it is a worthwhile feature request, to add the logic so that it does reschedule it to fall within a valid time period.

MHP47 commented 6 years ago

I agree, like I said above, I thought the reschedule code would schedule a check for the 'next available valid time' within time-periods. Should I open a new issue (feature request)?

box293 commented 6 years ago

The dev team will be able to mark this as a feature request so no need to open another issue :)

hedenface commented 6 years ago

@MHP47 , Are you getting a bunch of lines like the following in your log file? (Maybe not bunches, but a few):

Warning: Check of service 'SITE_NAME' on host 'SSL Domains' could not be rescheduled properly.  Scheduling check for [something]...

This may help me narrow down what's going on.

MHP47 commented 6 years ago

@hedenface I changed my config to align with the comments provided by box293, so things seem to be working OK at the moment, and I can't find your warning string any where in the logs or archive logs.

I can revert one of the checks back to the previous, incorrect, config, and monitor it to see what shows up in the logs

hedenface commented 6 years ago

The thing about it that's weird, is two days ago I set up a similar config to yours. It automatically scheduled its next check in the appropriate time period. Reverting (or creating a dummy one that you know is set up for failure) may be a good idea, I'd like to know the output of your logs. Even better would be if you could spin up a sandbox and just have a few of those checks running, and turn the debug_level to 24 (EVENTS | CHECKS | FLAPPING) and the debug_verbosity to 2 and submit your debug log as well. You can email it to devteam@nagios.com if you're uncomfortable publishing it here.

Thanks!

MHP47 commented 6 years ago

@hedenface reverted the config last week. Symptoms have returned, but nothing in the log file. Have re purposed a dev environment to run these checks with the same config. Have set the debugging per your post as well.

Will leave it running for a few days, and let you know when I have something

MHP47 commented 6 years ago

@hedenface Left this config running for several days before the symptoms occurred. The debug log is quite large, a million lines. Can still send that through if you would like, but I think the below is what you are after:

[1516409894.689490] [008.0] [pid=119137] ** Service Check Event ==> Host: 'SSL Domains', Service: 'Website 17', Options: 0, Latency: 0.000000 sec
[1516409894.689499] [016.0] [pid=119137] Attempting to run scheduled check of service 'Website 17' on host 'SSL Domains': check options=0, latency=0.000000
[1516409894.689524] [016.0] [pid=119137] Checking service 'Website 17' on host 'SSL Domains'...
[1516409894.689533] [2320.2] [pid=119137] Raw Command Input: $USER2$/check_ssl.sh -h $ARG1$ -w $ARG2$ -c $ARG3$
[1516409894.689544] [2320.2] [pid=119137] Expanded Command Output: $USER2$/check_ssl.sh -h $ARG1$ -w $ARG2$ -c $ARG3$
[1516409894.689593] [008.1] [pid=119137] ** Event Check Loop
[1516409894.689601] [008.1] [pid=119137] Next Event Time: Sat Jan 20 11:58:21 2018
[1516409894.689603] [008.1] [pid=119137] Current/Max Service Checks: 1/0 (inf% saturation)
[1516409894.846049] [016.2] [pid=119137] Processing check result for service 'Website 17' on host 'SSL Domains'
[1516409894.846075] [016.0] [pid=119137] ** Handling check result for service 'Website 17' on host 'SSL Domains' from 'Core Worker 119141'...
[1516409894.846080] [016.1] [pid=119137] HOST: SSL Domains, SERVICE: Website 17, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: OK: 161 days remaining for SSL certificate
[1516409894.846092] [016.2] [pid=119137] Parsing check output...
[1516409894.846104] [016.2] [pid=119137] Short Output: OK: 161 days remaining for SSL certificate
[1516409894.846108] [016.2] [pid=119137] Long Output:  NULL
[1516409894.846112] [016.2] [pid=119137] Perf Data:    NULL
[1516409894.846116] [016.2] [pid=119137] ST: HARD  CA: 1  MA: 1  CS: 0  LS: 0  LHS: 0
[1516409894.846121] [016.1] [pid=119137] Service is OK.
[1516409894.846124] [016.1] [pid=119137] Service did not change state.
[1516409894.846153] [016.1] [pid=119137] Rescheduling next check of service at Sun Jan 21 07:58:14 2018
[1516409894.846186] [016.0] [pid=119137] Scheduling a non-forced, active check of service 'Website 17' on host 'SSL Domains' @ Sun Jan 21 18:02:43 2018
[1516409894.846191] [016.2] [pid=119137] Scheduling new service check event.
[1516409894.846199] [016.1] [pid=119137] Checking service 'Website 17' on host 'SSL Domains' for flapping...
[1516409894.846204] [016.2] [pid=119137] LFT=5.00, HFT=20.00, CPC=0.00, PSC=0.00%
[1516409894.846214] [016.1] [pid=119137] Service is not flapping (0.00% state change).
[1516409894.846218] [016.1] [pid=119137] Checking host 'SSL Domains' for flapping...
[1516409894.846222] [016.2] [pid=119137] LFT=5.00, HFT=20.00, CPC=0.00, PSC=0.00%
[1516409894.846228] [016.1] [pid=119137] Host is not flapping (0.00% state change).
[1516409894.846247] [016.2] [pid=119137] Raw service performance data file output: $LASTSERVICECHECK$||$HOSTNAME$||$SERVICEDESC$||$SERVICEOUTPUT$||$SERVICEPERFDATA$
[1516409894.846262] [016.2] [pid=119137] Processed service performance data file output: 1516409894||SSL Domains||Website 17||OK: 161 days remaining for SSL certificate||
...
[1516518163.821946] [008.0] [pid=119137] ** Timed Event ** Type: EVENT_SERVICE_CHECK, Run Time: Sun Jan 21 18:02:43 2018
[1516518163.821972] [008.0] [pid=119137] ** Service Check Event ==> Host: 'SSL Domains', Service: 'Website 17', Options: 0, Latency: 0.000000 sec
[1516518163.821980] [016.0] [pid=119137] Attempting to run scheduled check of service 'Website 17' on host 'SSL Domains': check options=0, latency=0.000000
[1516518163.822000] [016.2] [pid=119137] This is not a valid time for this service to be actively checked.
[1516518163.822004] [016.1] [pid=119137] Unable to run scheduled service check at this time
[1516518163.822038] [016.1] [pid=119137] Rescheduled next service check for Tue Jan 23 11:00:00 2018
[1516518163.822055] [016.0] [pid=119137] Scheduling a non-forced, active check of service 'Website 17' on host 'SSL Domains' @ Tue Jan 23 12:12:36 2018
[1516518163.822061] [016.2] [pid=119137] Scheduling new service check event.

:)

NagiosEnterprises / nagioscore

Once per day checks not scheduling times correctly #447