NOAA-GSL / VxIngest

Ingest jobs won't run unless it's 30-45 past the hour #268

Open ian-noaa opened 9 months ago

ian-noaa commented 9 months ago

Describe the bug

Our ingest jobs don't run unless they're triggered between 30 and 45 minutes past the hour (i.e., running a job between 12:30 and 12:45 will work, but at no other time in the hour).

To Reproduce

Steps to reproduce the behavior:

  1. Run an ingest job between 30 and 45 minutes past the hour (e.g., 12:30-12:45).
  2. The job runs as normal.
  3. Run an ingest job outside of 30-45 minutes past the hour (e.g., 12:00-12:30 or 12:45-1:00).
  4. The job doesn't run, reporting "no scheduled and active jobs are currently available at this time".

Expected behavior

Our ingest jobs should run whenever they're triggered, whether via cron, by a developer, or by an event.

Additional context

This is most likely due to the schedule entry in our JOB documents and the filtering logic in the Couchbase N1QL query that main.py and run-ingest.sh use to select the jobs to ingest.

Sample job doc:

{
  "id": "JOB:V01:METAR:CTC:CEILING:MODEL:OPS",
  "ingest_document_ids": [
    "MD:V01:METAR:HRRR_OPS:E_US:CTC:CEILING:ingest",
    "MD:V01:METAR:HRRR_OPS:ALL_HRRR:CTC:CEILING:ingest",
    "MD:V01:METAR:HRRR_OPS:E_HRRR:CTC:CEILING:ingest",
    "MD:V01:METAR:HRRR_OPS:W_HRRR:CTC:CEILING:ingest",
    "MD:V01:METAR:HRRR_OPS:GtLk:CTC:CEILING:ingest",
    "MD:V01:METAR:RAP_OPS_130:E_US:CTC:CEILING:ingest",
    "MD:V01:METAR:RAP_OPS_130:ALL_HRRR:CTC:CEILING:ingest",
    "MD:V01:METAR:RAP_OPS_130:E_HRRR:CTC:CEILING:ingest",
    "MD:V01:METAR:RAP_OPS_130:W_HRRR:CTC:CEILING:ingest",
    "MD:V01:METAR:RAP_OPS_130:GtLk:CTC:CEILING:ingest"
  ],
  "offset_minutes": 0,
  "run_priority": 5,
  "schedule": "30 * * * *",
  "status": "active",
  "subDoc": "CEILING",
  "subDocType": "MODEL",
  "subType": "CTC",
  "subset": "METAR",
  "type": "JOB",
  "version": "V01"
}

Filtering logic: https://github.com/NOAA-GSL/VxIngest/blob/0ffc659ca8a368d0d400176570c79930a6c08429/main.py#L248-L262
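
For illustration only, here is a minimal Python sketch of the kind of minute-window check that would produce this symptom. It is not the actual main.py/N1QL code, and the 15-minute window is an assumption inferred from the observed behavior:

# Hypothetical sketch only; not the actual VxIngest filtering code.
# Assumes a job is selected when the current minute falls inside a 15-minute
# window starting at the minute field of the JOB doc's cron "schedule".
from datetime import datetime

def job_is_due(schedule: str, now: datetime, window_minutes: int = 15) -> bool:
    """Return True if `now` falls in the window implied by the schedule's minute field."""
    scheduled_minute = int(schedule.split()[0])  # "30 * * * *" -> 30
    return scheduled_minute <= now.minute < scheduled_minute + window_minutes

# With schedule "30 * * * *" this is True from :30 through :44 and False otherwise,
# which matches the reported "only between 30 and 45 past the hour" behavior.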

randytpierce commented 9 months ago

That is a feature of the scheduling. I had set the schedules to 30 because I was thinking they would be better that way; I should have made a note. You can edit the job docs by hand or use the utility … scripts/VXIngestUtilities/usefulthings/setsisplayshedule.sh (or something like that). Normally the ingest script runs at a much higher frequency than the scheduler; I may have messed it up Wednesday because I was going too fast. Just look at the crontab and at the schedule, and make sure the crontab is triggering the ingest at something like */15.

Randy
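
For reference, a system crontab entry along the lines Randy describes might look like the following (the path to run-ingest.sh is illustrative, not taken from the thread):

# Illustrative crontab entry only; the real location of run-ingest.sh may differ.
# Firing every 15 minutes guarantees at least one trigger lands inside the
# 15-minute window implied by each JOB doc's "schedule" field.
*/15 * * * * /path/to/VxIngest/run-ingest.sh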

ian-noaa commented 8 months ago

Thanks for sharing the intent behind this feature, Randy!

For now, I've gotten the jobs processing again by setting the system cron to run the job in line with the JOB doc schedule (30 */2 * * *). I contemplated setting the cron to run every 10 or 15 minutes, but I suspect we'd still get an extra, undesired job if I had done that.


I've been giving this some thought over the weekend, and in the longer term, I'm concerned about support challenges and unintended consequences with the current design, particularly with having two schedulers in place: the local system cron and the Couchbase JOB doc's schedule entry.

We've already encountered a situation where the system behaved unexpectedly and never ran jobs because the system cron was running too infrequently (10 */2 * * *), and I can envision scenarios where the system cron might trigger more ingest jobs than desired as well. My initial reaction to your comment about intending to have the ingest script run more frequently was to propose setting the schedule to run every minute (* * * * *). However, that would trigger 15 jobs within the 15-minute window (from 30 to 45 past the hour).

I'm wondering if there's a specific reason we couldn't rely solely on the local system cron for job scheduling (e.g., if we were intending to run the ingest as a long-lived process that wasn't triggered by cron). It seems like it would simplify the logic and reduce the risk of getting into a weird corner case with our cron schedules. It'd also help us in the move to event-driven ingest in the cloud.

randytpierce commented 8 months ago

The problem I was trying to solve was the ability to schedule the jobs in a data-driven way that wasn't tied to a specific server, i.e., someone adds a new builder and just has to set the schedule in the database instead of having to add another line in a specific crontab, or someone wants to disable a particular job for a while so they just set the status. If the ingest was triggered by an event the whole thing would work, but we don't have a suitable event. ITS has always used crontab entries for this kind of thing, and in my opinion it got terribly confusing with very convoluted crontabs on all kinds of different servers. I think it isn't an elegant solution, but it's better than a bunch of crontab entries on different machines. A message queue that told us when data was ready would be better, for sure.

Randy

ian-noaa commented 8 months ago

Gotcha, I can definitely sympathize with convoluted cron entries! I've come across a few in my time. 😅

In that case, I wonder if running the on-prem ingest as a long-lived process that polls the DB for new jobs every minute would work, so that we don't get bitten by scheduling complexities.
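
As a minimal sketch of that idea (hypothetical code, not anything in VxIngest today; fetch_active_job_docs and run_job are placeholder names, and croniter is an assumed dependency for cron-expression matching):

# Hypothetical sketch; not existing VxIngest code. fetch_active_job_docs()
# and run_job() are placeholder names, and croniter is an assumed dependency
# used here only to match cron expressions against the current time.
import time
from datetime import datetime, timezone

from croniter import croniter

def fetch_active_job_docs() -> list[dict]:
    """Placeholder for the N1QL query that returns active JOB documents."""
    raise NotImplementedError

def run_job(job_doc: dict) -> None:
    """Placeholder for kicking off an ingest run for one JOB document."""
    raise NotImplementedError

def poll_forever(interval_seconds: int = 60) -> None:
    """Single long-lived loop; the JOB doc's schedule field is the only scheduler."""
    while True:
        now = datetime.now(timezone.utc)
        for job in fetch_active_job_docs():
            # croniter.match() is True only when `now` satisfies the cron spec,
            # so each job fires at most once per scheduled minute.
            if croniter.match(job["schedule"], now):
                run_job(job)
        time.sleep(interval_seconds)

With a loop like this, the schedule field in the JOB doc would be the only scheduler, and the system crontab would no longer need to line up with it.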

The only downside of that approach would be that it'd be harder to partition jobs out to various VMs if we needed to (e.g., having a NetCDF/GRIB VM and a Partial Sums/CTC VM, or having one machine do HRRR_OPS jobs and another do GFS jobs, etc.). In that case, going with a cron per machine would be attractive since we could specify which machines ran which jobs.

For now, I think the current solution is alright. However, if we started adding more jobs with different schedules, I'd want to look into getting us down to one scheduler: either running the ingest as a long-lived process or switching to local crons.

Either way, I agree that an event-driven system will remove the need for complicated cron schedules.

randytpierce commented 8 months ago

I agree with you. A publish/subscribe mechanism could work, which I think is basically a message queue.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 90 days with no activity.