citusdata / pg_cron

Run periodic jobs in PostgreSQL
PostgreSQL License
2.91k stars 195 forks source link

pg_cron issues when a postgres process crashes #12

Open alexeyklyukin opened 7 years ago

alexeyklyukin commented 7 years ago

When postmaster reinitializes postgres processes because of the crash of one of them, pg_cron is terminated and is not subsequently restarted. The reason is that it supplies the BGW_NEVER_RESTART flag to the RegisterBackgroundWorker. I wonder why?

The other issue is that pg_cron continues to run when postmaster dies (i.e. killed by -9). The docs actually deal with that:

Backends which need to suspend execution only temporarily should use an interruptible sleep rather than exiting; this can be achieved by calling WaitLatch(). Make sure the WL_POSTMASTER_DEATH flag is set when calling that function, and verify the return code for a prompt exit in the emergency case that postgres itself has terminated.

That's quite easy to fix:

$ git diff
diff --git a/src/pg_cron.c b/src/pg_cron.c
index a7e42c5..8d4ab05 100644
--- a/src/pg_cron.c
+++ b/src/pg_cron.c
@@ -152,7 +152,7 @@ _PG_init(void)
    /* set up common data for all our workers */
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
-   worker.bgw_restart_time = BGW_NEVER_RESTART;
+   worker.bgw_restart_time = 1;
    worker.bgw_main = PgCronWorkerMain;
    worker.bgw_main_arg = Int32GetDatum(0);
    worker.bgw_notify_pid = 0;
@@ -544,16 +544,23 @@ ShouldRunTask(entry *schedule, TimestampTz currentTime, bool doWild,
 static void
 WaitForCronTasks(List *taskList)
 {
+   int     rc;
    int taskCount = list_length(taskList);

    if (taskCount > 0)
    {
        PollForTasks(taskList);
    }
-   else
+
+   /* wait for new jobs */
+   rc = WaitLatch(MyLatch,
+                  WL_LATCH_SET | WL_POSTMASTER_DEATH,
+                  taskCount > 0 ? 0: MaxWait, PG_WAIT_EXTENSION);
+   ResetLatch(MyLatch);
+   if (rc & WL_POSTMASTER_DEATH)
    {
-       /* wait for new jobs */
-       pg_usleep(MaxWait*1000L);
+       /* Postmaster died and we should bail out immediately */
+       proc_exit(1);
    }
 }

I think there should be a few additional changes, namely WaitLatch should go to the main loop and the signals should SetLatch as well. What do you think?

marcocitus commented 7 years ago

Hey Oleksii, thanks for the report.

The first issue seems most pressing and is now resolved on master.

For the second issue, your fix looks ok (with SetLatch), but it only addresses the case in which there are no jobs. Usually, pg_cron waits in poll. The problem is that we cannot easily wait for both sockets and a latch. PostgreSQL does have a convenient WaitLatckOrSocket function, but it only takes one socket, while pg_cron needs to wait for multiple sockets.

In PostgreSQL 9.6 this issue can be resolved using wait event sets, which allow you to wait for a latch and multiple sockets. To support PostgreSQL 9.5, I'm considering using a self-pipe, which also takes away the need to do pg_usleep/WaitLatch when there are no tasks.