Occasional Duplicate Runs of System Jobs

Description

We're seeing what appear to be occasional and random duplicate runs of system jobs. The below screenshots, taken from pre-alpha, show what seems to be the most common manifestation of this issue. In this case, the Update Persisted DataView job was launched twice at 2:24 pm. An apparent duplicate entry appears in the service job history log and was never updated from a running status, even after subsequent job runs completed successfully.

2024-10-04_11-40-26

Duplicate job launches seem more common (and understandable) during an overlapped IIS app pool recycle. However, we're seeing this appear to occur at random, unrelated to app pool recycles or Rock restarts. In the case of the example shown above from pre-alpha, Rock reports last being restarted at 2 am when this occurred, and the duplicated job run occurred at 2:24 pm.

2024-10-04_11-45-15

There doesn't seem to be a pattern with this (we've seen it affecting multiple job types), but it seems more common with the very frequently processed job types (such as persisted datasets and persisted data views).

A scenario we do see occasionally is multiple web worker processes running and causing all job runs to be duplicated. That usually seems to happen when an application pool is recycled and the old worker process doesn't seem to exit properly, though it happens very infrequently. That doesn't seem to be what's happening here, since we've only seen a single job appear to be affected at one time.

Actual Behavior

System jobs occasionally launched multiple times, seemingly at random. One of the "duplicates" often appears to get stuck in the running state.

Expected Behavior

System jobs would only launch once when scheduled.

Steps to Reproduce

I don't have concrete steps to reproduce this, since it seems to happen at random. I used the SQL query below to look for this across several Rock environments, found a handful of examples, and posted the example I found on pre-alpha above showing pattern idenfied. Environments with examples were on different point releases of v16, including pre-alpha on v17 as described above.

This query looks in service job history for any examples of jobs with the same service job id starting within 10 seconds of each other within the last week.

SELECT
      [ServiceJobId]      = [sj].[Id]
    , [ServiceJobName]    = [sj].[Name]
    , [History1StartTime] = [sjh].[StartDateTime]
    , [History2StartTime] = [sjh2].[StartDateTime]
    , [History1Id]        = [sjh].[Id]
    , [History1Status]    = [sjh].[Status]
    , [History1Duration]  = DATEDIFF(MINUTE, [sjh].[StopDateTime], [sjh].[StartDateTime])
    , [History2Id]        = [sjh2].[Id]
    , [History2Status]    = [sjh2].[Status]
    , [History2Duration]  = DATEDIFF(MINUTE, [sjh].[StopDateTime], [sjh].[StartDateTime])
FROM
    [ServiceJobHistory] [sjh]
    JOIN [ServiceJobHistory] [sjh2]
            ON [sjh2].[ServiceJobId] = [sjh].[ServiceJobId] AND
               ABS(DATEDIFF(SECOND, [sjh].[StartDateTime], [sjh2].[StartDateTime])) <= 10 AND [sjh].[Id] != [sjh2].[Id]
    JOIN [ServiceJob] [sj]
            ON [sj].[Id] = [sjh].[ServiceJobId]
WHERE
    [sjh].[StartDateTime] >
    DATEADD(WEEK, -1, CAST(SYSDATETIMEOFFSET() AT TIME ZONE 'Eastern Standard Time' AS DATETIME))
ORDER BY
    [sjh].[ServiceJobId]
    , [sjh].[StartDateTime] DESC

Note that I couldn't find an example of this on Rock Solid Demo but did on pre-alpha, which is why I've shown that example above.

Issue Confirmation

[X] Perform a search on the Github Issues to see if your bug or enhancement is already reported.
[X] Reproduced the problem on a fresh install or on the demo site.

Rock Version

v16.3

Client Culture Setting

en-US

SparkDevNetwork / Rock