DBTrenches / tsqlscheduler

Retry mechanism #8

Closed: petervandivier closed this issue 2 years ago

petervandivier commented 2 years ago

Jobs fail through no fault of their own during failover (among other events), and jobs that run infrequently are no less important just because they run infrequently.

Add a Task attribute to define whether a job should be retried after failure.

Add a proc and Task template that runs with reasonably high frequency to poll for jobs that have failed AND have a retry attribute defined. Have this job initiate a retry via sp_start_job.

Possibly add sanity checks?
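
A rough sketch of the shape I have in mind, assuming a `scheduler.Task` table; the `RetryOnFailure` flag, the `JobName` mapping column, and the failure-detection query are all illustrative rather than final:

```sql
-- Illustrative only: the flag, mapping column, and proc name are placeholders.
ALTER TABLE scheduler.Task ADD RetryOnFailure bit NOT NULL DEFAULT 0;
GO

CREATE OR ALTER PROCEDURE scheduler.RetryFailedTasks
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @jobName sysname;

    -- Retry any job whose most recent outcome was a failure and whose task
    -- has opted in to retries.
    DECLARE retry_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT j.name
        FROM msdb.dbo.sysjobs AS j
        JOIN scheduler.Task AS t
            ON t.JobName = j.name               -- hypothetical mapping column
        CROSS APPLY (
            SELECT TOP (1) h.run_status
            FROM msdb.dbo.sysjobhistory AS h
            WHERE h.job_id = j.job_id
              AND h.step_id = 0                 -- job outcome rows only
            ORDER BY h.instance_id DESC
        ) AS lastRun
        WHERE t.RetryOnFailure = 1
          AND lastRun.run_status = 0;           -- run_status 0 = failed

    OPEN retry_cursor;
    FETCH NEXT FROM retry_cursor INTO @jobName;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        BEGIN TRY
            -- sp_start_job errors if the job is already running, which doubles
            -- as a cheap guard against double-starts.
            EXEC msdb.dbo.sp_start_job @job_name = @jobName;
        END TRY
        BEGIN CATCH
            PRINT ERROR_MESSAGE();  -- log and continue with the next job
        END CATCH

        FETCH NEXT FROM retry_cursor INTO @jobName;
    END

    CLOSE retry_cursor;
    DEALLOCATE retry_cursor;
END
GO
```

This proc would itself be deployed as a frequently-scheduled Task, which is probably also where the sanity checks (max retry count, don't retry within N minutes of the failure, etc.) would need to live.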

taddison commented 2 years ago

What would the manual solution look like? Something like a script that could be executed (maybe even a proc in the schema) that accepts a lookback window (go back X hours and find anything that needs retrying). You could then automate that with a job, which lets each environment implement custom rules on what to retry (e.g. retry this particular job if it might finish before 8AM, but not if it'll overlap business hours...).

In the presence of such a script (or proc), how valuable is extending the schema to capture retry yes/no and the in-built polling job?

Some specific concerns on the automation:

petervandivier commented 2 years ago

I think monitoring and predicting job schedule overruns is a separate problem to solve. Certainly I've observed them, but more commonly I've seen problem jobs with wildly variable run times driven by external environmental conditions (or consistent, predictable failures caused by concurrent unrelated jobs). Trying to predict when a job will finish is nice when it works, but it falls flat often enough that I wouldn't want to rely on it.

Direct proc execution is always going to be a concurrency problem. One of the reasons I like sp_start_job is that SQL Agent already handles attempts to start a running job. I like the idea of pushing the semaphore down a level (say into ExecuteTask), but I had some trouble with initial tests in a production environment (probably not insurmountable, just a weird blocker atm).

Pushing the semaphore down to the proc level could be done, but it feels pretty overbearing as a design choice. Requiring each proc to check a semaphore deployed by the scheduler logic (i.e. to confirm it has permission to run) before doing any work is sort of possible, but it's a pretty big departure from the current model (not to mention a fair chunk of work to deploy once designed).

As for large-batch starts after an incident... we're sort of already vulnerable to this (what with the high incidence of tasks scheduled for on-the-hour starts, for example). A stagger-start mechanism is a lot of fun to conceptualize, but it's definitely a rabbit hole for me. How do you prioritize what gets delayed by the stagger-start, for instance? Do you need more intelligent detection for jobs that conflict?

Suffice to say, I hadn't really thought about the implications of every (or even most) jobs having a retry mechanism. The problem to be solved in my mind was first and foremost those low-frequency jobs that get forgotten in a storm of job failure alerts and won't run again for a long time.

taddison commented 2 years ago

I think it's reasonable to assume that any procedure which absolutely cannot be executed twice should be protected in the procedure direction (probably via sp_getapplock), as everything else is so vulnerable to being circumvented.
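
For example, a minimal sketch of that protection; the proc name, lock resource name, and timeout are placeholders:

```sql
CREATE OR ALTER PROCEDURE dbo.DoCriticalWork
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRAN;

    DECLARE @lockResult int;
    EXEC @lockResult = sp_getapplock
        @Resource    = 'dbo.DoCriticalWork',   -- any agreed-upon resource name
        @LockMode    = 'Exclusive',
        @LockOwner   = 'Transaction',
        @LockTimeout = 0;                      -- fail fast rather than queue up

    IF @lockResult < 0
    BEGIN
        -- Another execution already holds the lock; bail out (or raise, if a
        -- silent skip would hide problems).
        ROLLBACK TRAN;
        RETURN;
    END

    -- ... actual work here ...

    COMMIT TRAN;   -- releases the transaction-owned applock
END
GO
```

Because the lock lives inside the procedure itself, it holds regardless of whether the call came from the scheduler, a retry, or someone running it by hand.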

If we focus purely on minimising the number of jobs that are aborted and don't get re-triggered quickly enough, what's the smallest possible solution? I'm thinking a proc or even an ad-hoc script, which could even spit out a bunch of sp_start_job statements so someone could eyeball the list and see if anything looks awry.
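
Something like this, maybe; the msdb objects are standard (agent_datetime being the undocumented helper in msdb), while the lookback window and filtering logic are purely illustrative:

```sql
-- Ad-hoc sketch: look back @hours and emit an sp_start_job statement for every
-- job whose most recent outcome was a failure inside that window.
DECLARE @hours int = 4;
DECLARE @cutoff datetime = DATEADD(HOUR, -@hours, GETDATE());

SELECT
    N'EXEC msdb.dbo.sp_start_job @job_name = N'''
        + REPLACE(j.name, N'''', N'''''') + N''';' AS RetryStatement,
    lastRun.run_date_time,
    lastRun.run_status
FROM msdb.dbo.sysjobs AS j
CROSS APPLY (
    SELECT TOP (1)
        msdb.dbo.agent_datetime(h.run_date, h.run_time) AS run_date_time,
        h.run_status
    FROM msdb.dbo.sysjobhistory AS h
    WHERE h.job_id = j.job_id
      AND h.step_id = 0                      -- job outcome rows only
    ORDER BY h.instance_id DESC
) AS lastRun
WHERE lastRun.run_status = 0                 -- last outcome was a failure...
  AND lastRun.run_date_time >= @cutoff       -- ...and it happened recently
ORDER BY lastRun.run_date_time;
```

Copy out the statements you actually want to run and skip the ones you don't.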

The limitations and potential impact of setting a retry bit on a task don't seem like something that will be easy to articulate when asking people to self-service their scheduling.

And one workaround, for anything truly critical, is to schedule the job to run hourly and have it contain logic to understand when there is work to do.
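
i.e. something along these lines, where the work table and proc are placeholders for whatever the task actually does:

```sql
CREATE OR ALTER PROCEDURE dbo.ProcessPendingWork
AS
BEGIN
    SET NOCOUNT ON;

    -- Nothing outstanding? Exit cheaply. Because the job runs hourly, a run
    -- lost to a failover is simply picked up on the next tick - no retry
    -- mechanism required.
    IF NOT EXISTS (SELECT 1 FROM dbo.WorkQueue WHERE ProcessedAt IS NULL)
        RETURN;

    -- ... process the outstanding rows here ...
END
GO
```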

In deliberately not providing a retry mechanism we'd be looking to imitate cron (and do it well).

petervandivier commented 2 years ago

...any procedure which absolutely cannot be executed [concurrently] should be protected in the procedure direction...

...the limitations and potential impact of setting a retry bit on a task don't seem like something that will be easy to articulate when asking people to self-service...

...for anything truly critical, is to schedule the job to run hourly and have it contain logic to understand when there is work to do.

All salient points. I'm happy to close this issue as will-not-do.


P.S. I think updating critical jobs to run with higher frequency and check for work largely obviates the need for an ad-hoc retry script (which itself is complicated and of dubious value IMO). If I do end up crafting an ad-hoc script of some kind for our internal environment though, I'll be sure to check it in here and reference this issue.

taddison commented 2 years ago

I think it probably does merit something in the readme/docs that tackles this, potentially offering examples of patterns that could be used (don't do it; incremental logic in each task; a retry task that YOLO-retries any failed jobs).

The other potential follow-up would be making each task exclusive (via ExecuteTask and looking at context_info). If either of us does any work along those lines, let's add it here.
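
If anyone does pick that up, a very rough and untested sketch of the shape (the task identifier, token derivation, and exact placement inside ExecuteTask are all assumptions, and the DMV visibility timing would want verifying):

```sql
-- Tag the session with a task-specific token, and refuse to run if any other
-- active session/request already carries the same token.
DECLARE @taskIdentifier nvarchar(256) = N'SomeDatabase.SomeTask';   -- hypothetical
DECLARE @token binary(128) = CONVERT(binary(128), HASHBYTES('SHA2_256', @taskIdentifier));

IF EXISTS (SELECT 1 FROM sys.dm_exec_requests WHERE context_info = @token AND session_id <> @@SPID)
   OR EXISTS (SELECT 1 FROM sys.dm_exec_sessions WHERE context_info = @token AND session_id <> @@SPID)
BEGIN
    RAISERROR(N'Task appears to already be running in another session; skipping.', 16, 1);
    RETURN;
END

SET CONTEXT_INFO @token;
-- ... the task body would run here ...
```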