DataONEorg / slinky

Slinky, the DataONE Graph Store
Apache License 2.0

Deployment ordering #53

Closed ThomasThelen closed 2 years ago

ThomasThelen commented 2 years ago

This is an issue for cleaning up the dependency ordering in the deployment. Right now we're using the Makefile, which interacts with kubectl, to wait for pods to reach the ready state. This works at the moment, but it doesn't apply to the Docker stack deployment and adds a layer of complexity to the deployment. It also makes things tricky with Helm charts (see #52).

The general idea is to bring the 'waiting' logic into the codebase and remove it from the deployment layer.

Scheduler

The scheduler is deployed in two steps: the first is initializing the scheduler (which happens in the slinky cli); the second is starting rqscheduler (which happens on the command line). Both steps can be seen in the deployment file.

Both steps require an active instance of redis, and each should be able to start independently of the other without issue. Since rqscheduler is effectively moving jobs to different queues, it should be fine if the scheduler from the first step hasn't submitted the update_job job yet, since it'll pick it up the next time it checks.

Solution

Making the scheduler portion of startup wait on redis can be achieved by adding a method that checks for redis with a timeout and threshold. This same code can be used with the workers (see below).
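That check could look something like the sketch below: a generic poll-with-timeout helper plus a cheap Redis reachability probe. The names (`wait_for`, `redis_ready`) and the TCP-level probe are illustrative assumptions; a real implementation would likely call `ping()` on a redis-py client instead.

```python
import socket
import time


def wait_for(check, timeout=60.0, interval=2.0):
    """Poll `check` until it returns True or `timeout` seconds elapse.

    Returns True on success, False if the deadline passed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False


def redis_ready(host="localhost", port=6379):
    """Cheap readiness probe: can we open a TCP connection to redis?"""
    try:
        with socket.create_connection((host, port), timeout=2.0):
            return True
    except OSError:
        return False
```

Something like `wait_for(redis_ready, timeout=120)` could then run before the scheduler touches redis, and the same helper is what the workers would reuse.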

Unfortunately, rqscheduler doesn't have a retry flag, but we can use the same logic as above. I'd like to bring the call to rqscheduler inside the Slinky cli, either in `def schedule` or as a separate command. This would let us use the blocking code from above and manage the dependency in the code.
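One way that separate command could look, sketched with stdlib argparse (the actual Slinky CLI's framework and command names may differ): an `rqscheduler` subcommand that blocks on Redis with retries and then runs rq-scheduler in-process via its `Scheduler` class.

```python
import argparse
import time


def run_scheduler(redis_url, interval=60.0, retries=30, wait=2.0):
    """Block until Redis answers a PING, then run rq-scheduler in-process.

    Imports are deferred so this module loads even where the optional
    `rq-scheduler` dependency isn't installed.
    """
    from redis import Redis
    from rq_scheduler import Scheduler

    conn = Redis.from_url(redis_url)
    for attempt in range(retries):
        try:
            conn.ping()
            break
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(wait)
    Scheduler(connection=conn, interval=interval).run()


def build_parser():
    """CLI with an `rqscheduler` subcommand (name is illustrative)."""
    parser = argparse.ArgumentParser(prog="slinky")
    sub = parser.add_subparsers(dest="command")
    sched = sub.add_parser("rqscheduler", help="run rq-scheduler in-process")
    sched.add_argument("--redis-url", default="redis://localhost:6379")
    return parser
```

Running `slinky rqscheduler` would then own both the retry loop and the scheduler process, so the deployment layer no longer has to sequence them.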

Workers

The workers need to be able to perform database transactions (which requires Virtuoso to be online). They also depend on redis.

Solution

Redis is easily tackled by using the blocking call from the scheduler solution.

A similar approach can be taken with Virtuoso, along the same lines as the Kubernetes readinessProbe.
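A sketch of what that probe could be, assuming Virtuoso's default SPARQL endpoint URL (`http://localhost:8890/sparql`). Like a readinessProbe, it treats any HTTP response from the server as "up" and only connection-level failures as "down".

```python
import time
import urllib.error
import urllib.request


def virtuoso_ready(url="http://localhost:8890/sparql", timeout=2.0):
    """HTTP-level readiness probe against the SPARQL endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        return err.code < 500  # a 4xx still means the server is serving
    except (urllib.error.URLError, OSError):
        return False  # connection refused, DNS failure, timeout, ...


def wait_for_virtuoso(url="http://localhost:8890/sparql",
                      timeout=120.0, interval=5.0):
    """Block until Virtuoso answers, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if virtuoso_ready(url):
            return True
        time.sleep(interval)
    return False
```

Workers would call `wait_for_virtuoso()` (and the redis equivalent) at startup before entering their job loop, instead of relying on deployment ordering.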

amoeba commented 2 years ago

Thanks for writing this up, @ThomasThelen. Adding a CLI command for the rqscheduler and a retry mechanism sounds great.

Just as a note,

> Since rqscheduler is effectively moving jobs to different queues

rqscheduler just submits any scheduled jobs to the queue they're scheduled on. Currently that's just enqueueing the Update job every n minutes in a cron-like fashion; no jobs are moved between queues. A worker listening to the Update queue processes the next Update job, and that worker itself enqueues the jobs that actually process datasets onto the Dataset queue, from which a worker listening to the Dataset queue can pick them up.
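To make that fan-out concrete, here's a minimal sketch of the Update job's shape with the queue interactions injected as callables. All names here are illustrative, not Slinky's actual identifiers; with rq, `enqueue_dataset_job` would be something like `Queue("dataset", connection=redis).enqueue` bound to the dataset-processing function.

```python
def update_job(find_new_datasets, enqueue_dataset_job):
    """The cron-enqueued Update job: it discovers work and enqueues one
    Dataset-queue job per identifier. Nothing is moved between queues;
    the worker running this job *creates* the Dataset jobs itself.
    """
    enqueued = []
    for identifier in find_new_datasets():
        enqueue_dataset_job(identifier)
        enqueued.append(identifier)
    return enqueued
```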

amoeba commented 2 years ago

This was done by @ThomasThelen and merged to develop in https://github.com/DataONEorg/slinky/pull/54.