Betterment / delayed

a multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day
MIT License

how to monitor worker processes #21

Closed daichi5 closed 1 year ago

daichi5 commented 1 year ago

We plan to run 'delayed' as an asynchronous worker in Kubernetes pods, so we need a way to health-check the worker process. At the moment we use the following liveness probe:

livenessProbe:
    exec:
      command:
        - bash
        - -lc
        - '[[ $(ps aux | grep delayed | grep -v grep | wc -l) -gt 0 ]]'

However, if there is a better method, we would like to adopt it. Do you use any other approach? I'm asking for reference because I saw in other issues that you run on Kubernetes.

smudge commented 1 year ago

Hi @daichi5! We currently don't run our worker pods with a readinessProbe or livenessProbe config, largely because the processes don't accept any outside HTTP traffic, so we don't need the health checks for load balancing purposes. Instead we rely on the default behavior, which is that if the main process (at PID 1) exits, the container restarts.

We also use our cron/scheduler process to enqueue a background job once per minute, and that job emits a metric that we can monitor to alert ourselves if there are no workers running. But this exists outside of our k8s infrastructure, and we haven't shipped a generic version of this behavior, since it depends on the specifics of our internal monitoring/alerting infrastructure.
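For readers who want to try the heartbeat idea, here is a minimal sketch in plain Ruby. It is not Betterment's implementation, which is not public. `HeartbeatStore`, `WorkerHeartbeatJob`, and `workers_stalled?` are hypothetical names, the file-based store is only a stand-in for a real metrics client, and the 180-second threshold is an arbitrary assumption.

```ruby
require "time"

# Hypothetical stand-in for a metrics client: it records the time of the
# last heartbeat in a file. A real setup would emit a metric to whatever
# monitoring system is in use.
class HeartbeatStore
  def initialize(path)
    @path = path
  end

  # Persist the heartbeat time as an ISO 8601 UTC timestamp.
  def record(now = Time.now)
    File.write(@path, now.utc.iso8601)
  end

  # Return the last recorded heartbeat, or nil if none is readable.
  def last_beat
    Time.iso8601(File.read(@path).strip)
  rescue Errno::ENOENT, ArgumentError
    nil
  end
end

# Hypothetical job body. In a Rails app this would be an ActiveJob subclass
# that the scheduler enqueues once per minute; it only runs if a worker
# picks it up, which is what makes it a useful liveness signal.
class WorkerHeartbeatJob
  def initialize(store)
    @store = store
  end

  def perform(now = Time.now)
    @store.record(now)
  end
end

# True when no heartbeat has been recorded within max_age seconds.
def workers_stalled?(store, max_age: 180, now: Time.now)
  last = store.last_beat
  last.nil? || (now - last) > max_age
end
```

An alert (or an exec-based probe) could then call `workers_stalled?` and fail when it returns true. Note that a local file is only visible inside one pod, so with several worker pods a shared metric, as described above, is the more natural fit.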

daichi5 commented 1 year ago

Hi @smudge! Thanks for the response. I understand how you operate worker pods.

> Instead we rely on the default behavior, which is that if the main process (at PID 1) exits, the container restarts.

As you said, a health check may not be necessary because the container is restarted when the main process exits. However, we have enabled shareProcessNamespace, so our situation may be a little different.

> We also use our cron/scheduler process to enqueue a background job once per minute, and that job emits a metric that we can monitor to alert ourselves if there are no workers running

The idea of a monitoring job is very helpful. I think we'll try a similar approach to manage our worker processes. Thank you!