MichaelXavier / cron

Cron data structure and parser for Haskell
Other
53 stars 33 forks source link

Cron job is sometimes triggered twice in a row #38

Open rdelgatte opened 4 years ago

rdelgatte commented 4 years ago

Hello

First of all, thank you for this very useful library I recently started to work with.

I am experiencing some troubles when scheduling a simple job to start every minutes as it sometimes starts two threads within the same second.

ThreadId 20 at 2019-12-13 13:37:59.999691 UTC
ThreadId 21 at 2019-12-13 13:38:00.001119 UTC

I tried to reproduce in a standalone / simple example which is the following:

module Example where

import           Control.Concurrent (myThreadId)
import           Data.Time          (getCurrentTime)
import           System.Cron

runApplication :: IO ()
runApplication = do
  threadIds <- execSchedule setupCronJob
  putStrLn $ "Scheduling job " <> show threadIds <> " is running"
  _ <- getLine
  -- to stop only when the user types
  pure ()

setupCronJob :: Schedule ()
setupCronJob = addJob runJob "* * * * *"

runJob :: IO ()
runJob = do
  threadId <- myThreadId
  currentTime <- getCurrentTime
  putStrLn $ show threadId <> " at " <> show currentTime

This example triggers a job which prints the threadId with the current execution time and I get the following output:

Scheduling job [ThreadId 12] is running
ThreadId 13 at 2019-12-13 13:31:00.004439 UTC
ThreadId 14 at 2019-12-13 13:32:00.003727 UTC
ThreadId 15 at 2019-12-13 13:33:00.003082 UTC
ThreadId 16 at 2019-12-13 13:34:00.000189 UTC
ThreadId 17 at 2019-12-13 13:35:00.001204 UTC
ThreadId 18 at 2019-12-13 13:36:00.002054 UTC
ThreadId 19 at 2019-12-13 13:37:00.001719 UTC
ThreadId 20 at 2019-12-13 13:37:59.999691 UTC
ThreadId 21 at 2019-12-13 13:38:00.001119 UTC
ThreadId 22 at 2019-12-13 13:39:00.003116 UTC
ThreadId 23 at 2019-12-13 13:40:00.00399 UTC
ThreadId 24 at 2019-12-13 13:41:00.00291 UTC
ThreadId 25 at 2019-12-13 13:42:00.001187 UTC
ThreadId 26 at 2019-12-13 13:43:00.002664 UTC

I sometimes need to wait around 10 minutes to get this error.

I experience this issue in a MacOs Mojave as another colleague of mine but others don't get it in a Linux based OS.

Did anybody experienced the same behavior or do you have any hint to help on this topic?

Thank you in advance!

sir4ur0n commented 4 years ago

FWIW I run Nixos 20.03 and I could never reproduce the issue:

Scheduling job [ThreadId 16] is running
ThreadId 17 at 2019-12-13 13:58:00.020057591 UTC
ThreadId 18 at 2019-12-13 13:59:00.060958819 UTC
ThreadId 19 at 2019-12-13 14:00:00.061017632 UTC
ThreadId 20 at 2019-12-13 14:01:00.020861997 UTC
ThreadId 21 at 2019-12-13 14:02:00.060972721 UTC
ThreadId 22 at 2019-12-13 14:03:00.060917503 UTC
ThreadId 23 at 2019-12-13 14:04:00.059921579 UTC
ThreadId 24 at 2019-12-13 14:05:00.060109616 UTC
ThreadId 25 at 2019-12-13 14:06:00.061022934 UTC
ThreadId 26 at 2019-12-13 14:07:00.005278865 UTC
ThreadId 27 at 2019-12-13 14:08:00.006685337 UTC
ThreadId 28 at 2019-12-13 14:09:00.060965175 UTC
ThreadId 29 at 2019-12-13 14:10:00.014495628 UTC
ThreadId 30 at 2019-12-13 14:11:00.061147486 UTC
ThreadId 31 at 2019-12-13 14:12:00.060977273 UTC
ThreadId 32 at 2019-12-13 14:13:00.060930933 UTC
ThreadId 33 at 2019-12-13 14:14:00.021287285 UTC
ThreadId 34 at 2019-12-13 14:15:00.061036648 UTC
ThreadId 35 at 2019-12-13 14:16:00.061111415 UTC
ThreadId 36 at 2019-12-13 14:17:00.060988905 UTC
ThreadId 37 at 2019-12-13 14:18:00.017851708 UTC
ThreadId 38 at 2019-12-13 14:19:00.06113853 UTC
ThreadId 39 at 2019-12-13 14:20:00.013101366 UTC
ThreadId 40 at 2019-12-13 14:21:00.043022321 UTC
ThreadId 41 at 2019-12-13 14:22:00.06097804 UTC
ThreadId 42 at 2019-12-13 14:23:00.061058808 UTC
ThreadId 43 at 2019-12-13 14:24:00.03177008 UTC
ThreadId 44 at 2019-12-13 14:25:00.061060371 UTC
ThreadId 45 at 2019-12-13 14:26:00.060961921 UTC
ThreadId 46 at 2019-12-13 14:27:00.061071756 UTC
ThreadId 47 at 2019-12-13 14:28:00.041607719 UTC
ThreadId 48 at 2019-12-13 14:29:00.026571498 UTC
ThreadId 49 at 2019-12-13 14:30:00.061066224 UTC
ThreadId 50 at 2019-12-13 14:31:00.030521381 UTC
ThreadId 51 at 2019-12-13 14:32:00.061087097 UTC
ThreadId 52 at 2019-12-13 14:33:00.028795206 UTC
ThreadId 53 at 2019-12-13 14:34:00.011789461 UTC
ThreadId 54 at 2019-12-13 14:35:00.060919944 UTC
ThreadId 55 at 2019-12-13 14:36:00.060982183 UTC

Please note, my logs are always after 00s, while @rdelgatte logs always contain 1 log before 00s and 1 log after whene there is a duplicate job.

robinp commented 10 months ago

Having a look at the source code:

-- | Start a job-runner thread that runs a job at appropriate
-- intervals. Each time it is run, a new thread is forked for it,
-- meaning that a single exception does not take down the
-- scheduler.
forkJob :: Job -> IO ThreadId
forkJob (Job s a) = forkIO $ forever $ do
            (timeAt, delay) <- findNextMinuteDelay
            threadDelay delay
            when (scheduleMatches s timeAt) (void $ forkIO a)

findNextMinuteDelay :: IO (UTCTime, Int)
findNextMinuteDelay = findNextMinuteDelay' <$> getCurrentTime

findNextMinuteDelay' :: UTCTime -> (UTCTime, Int)
findNextMinuteDelay' now = (next, delay)
  where
    oneMinuteLater = addUTCTime oneMinute now
    plainMinute = truncateToPlainMinute $ utctDayTime oneMinuteLater
    next = oneMinuteLater { utctDayTime = plainMinute }
    diff = diffUTCTime next now
    delay = round (realToFrac (diff * 1000000) :: Double) :: Int

oneMinute :: NominalDiffTime
oneMinute = 60

truncateToPlainMinute :: DiffTime -> DiffTime
truncateToPlainMinute = fromIntegral . (* 60) . (`quot` 60) . (truncate :: DiffTime -> Integer)

-- | Does the given cron schedule match for the given timestamp? This
-- is usually used for implementing polling-type schedulers like cron
-- itself.
scheduleMatches
    :: CronSchedule
    -> UTCTime
    -> Bool
scheduleMatches cs@CronSchedule {..} (UTCTime d t) =
  maybe False go (expand cs)
  where
    go Expanded {..} = and
      [ FT.elem mn minF
      , FT.elem hr hourF
      , FT.elem mth monthF
      , checkDOMAndDOW
      ]
      where
        -- turns out if neither dom and dow are stars, you're supposed to
        -- OR and not AND them:
        --
        -- Note: The day of a command's execution can
        -- be specified by two fields — day of month, and day of week. If
        -- both fields are restricted (i.e., aren't *), the command will
        -- be run when either field matches the current time. For example,
        -- ``30 4 1,15 * 5'' would cause a command to be run at 4:30 am on
        -- the 1st and 15th of each month, plus every Friday. One can,
        -- however, achieve the desired result by adding a test to the
        -- command (see the last example in EXAMPLE CRON FILE below).
        checkDOMAndDOW
          | restricted (dayOfMonthSpec dayOfMonth) && restricted (dayOfWeekSpec dayOfWeek) =
              domMatches || dowMatches
          | otherwise = domMatches && dowMatches
        domMatches = FT.elem dom domF
        dowMatches = FT.elem dow dowF
    (_, mth, dom) = toGregorian d
    (hr, mn) = timeOfDay t
    dow = getDOW d

restricted :: CronField -> Bool
restricted = not . isStar

isStar :: CronField -> Bool
isStar (Field Star)    = True
isStar (ListField bfs) = FT.any (== Star) bfs
isStar (StepField' sf) = sfField sf == Star && sfStepping sf == 1
isStar _               = False

Let's see some problematic things:

  1. Using threadDelay with a specific delay - usually there's no guarantee it wouldn't take longer (or less). Might be better to have a single loop delaying a few seconds, and re-checking all cron entries all the time.
  2. After threadDelay finishes, the precalculated supposed time is passed on to check the match, instead of the actual time. Ok, passing actual might miss a schedule, but passing the supposed (but no true) time might lead to surprises when debugging. Maybe not a big problem if (1) is fixed but some clarity around would be nice
  3. scheduleMatches checks the given (now the supposed) time against the "expanded" schedule.. didn't paste it, but that means all possible values are in a list, for example for the minute. Means that for 1-min schedule, all the minutes are valid. Thus scheduleMatches will blindly accept any minute, regardless if there was a run in that minute already.
  4. Now, that scheduleMatches problem itself wouldn't be enough usually due to the delay (there might be dup minutes, but the delay should more or less work out), except there's the truncateToPlainMinute part - before determining the delay, the next timestamp is trunced to 0 seconds. So this is how you get an execution at N:59 and N+1:00.

Leaving how to properly fix it as a separate exercise.