supervisor program:rqworker dies and stalls processing jobs

lifenautjoe commented 4 years ago

Today we had a queue of 500+ jobs not being processed by rq workers.

The second machine was running several worker processes:

root      1104  0.0  1.9 395228 77472 ?        S    Aug15   0:21 python manage.py rqworker default --pid /var/run/rqworker
root      1756  0.0  1.9 395228 77580 ?        S    Aug14   0:22 python manage.py rqworker default --pid /var/run/rqworker
root      3195  0.0  1.9 395284 77416 ?        S    Aug23   0:09 python manage.py rqworker default --pid /var/run/rqworker
root      5229  0.0  1.9 395228 77676 ?        S    Aug14   0:22 python manage.py rqworker default --pid /var/run/rqworker
root      6642  0.0  1.9 395228 77496 ?        S    Aug14   0:24 python manage.py rqworker default --pid /var/run/rqworker
root      7178  0.0  1.9 395228 77504 ?        S    Aug15   0:21 python manage.py rqworker default --pid /var/run/rqworker
root      8669  0.0  1.9 395228 77632 ?        S    Aug14   0:21 python manage.py rqworker default --pid /var/run/rqworker
root     12302  0.0  1.9 395228 77704 ?        S    Aug14   0:24 python manage.py rqworker default --pid /var/run/rqworker
root     12460  0.0  1.9 395228 77616 ?        S    Aug15   0:20 python manage.py rqworker default --pid /var/run/rqworker
root     13779  0.0  0.0 115272  3200 ?        S    Aug28   0:00 /bin/bash -c source /opt/python/current/env && source /opt/p

But supervisor wasn't aware of a running program:rqworker. I tried to stop the service and got:

/usr/local/bin/supervisorctl -c /opt/python/etc/supervisord.conf -s unix:///opt/python/run/supervisor.sock stop rqworker
rqworker: ERROR (not running)

Once I started it:

/usr/local/bin/supervisorctl -c /opt/python/etc/supervisord.conf -s unix:///opt/python/run/supervisor.sock start rqworker

Jobs started to process again.
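
The mismatch described above (rqworker processes visible in `ps` while supervisor reports the program as not running) could be checked for mechanically. A minimal sketch, assuming the input is `ps aux`-style text with the command in the eleventh column; `find_rqworker_pids` is a hypothetical helper, not part of rq or supervisor:

```python
from typing import List


def find_rqworker_pids(ps_output: str) -> List[int]:
    """Return the PIDs of all processes whose command mentions rqworker.

    Assumes `ps aux` layout: 10 whitespace-separated columns followed by
    the full command line as the 11th field.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # keep the command line in one piece
        if len(fields) < 11:
            continue  # blank or malformed line
        pid, command = fields[1], fields[10]
        if pid.isdigit() and "rqworker" in command:
            pids.append(int(pid))
    return pids
```

If this returns PIDs while `supervisorctl status` does not list rqworker as running, the workers are probably orphans that supervisor is no longer managing.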

Perhaps related to https://github.com/rq/rq/issues/758

Is our supervisor config for rqworker correct?

We MUST ensure that this doesn't happen again as we will now use rq workers to process post media.

If they stall, no posting will be possible.
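
One way to notice a stall early would be a periodic check of the backlog against worker activity. A rough sketch under stated assumptions: `QueueSnapshot`, `looks_stalled`, and the threshold of 100 are hypothetical, and the worker state strings (`"busy"`, `"idle"`) are assumed to match what rq reports:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueSnapshot:
    queued_jobs: int          # jobs currently waiting in the queue
    worker_states: List[str]  # one state string per registered worker


def looks_stalled(snapshot: QueueSnapshot, max_backlog: int = 100) -> bool:
    """Heuristic: jobs are waiting and nothing appears to be working on them."""
    if snapshot.queued_jobs == 0:
        return False  # nothing to process, so nothing can be stalled
    if not snapshot.worker_states:
        return True   # jobs are waiting but no worker is registered
    any_busy = any(state == "busy" for state in snapshot.worker_states)
    return snapshot.queued_jobs > max_backlog and not any_busy
```

With rq, the inputs might come from `len(Queue("default", connection=conn))` and `[w.get_state() for w in Worker.all(connection=conn)]`; treat those calls as an assumption to verify against the rq version in use.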

@evict halp

lifenautjoe commented 4 years ago

One possible cause is a redeploy? We have a post-deploy script that reloads supervisord with the custom config, which includes the djangorq program.

Perhaps this doesn't get run in all situations?

files:
  "/opt/elasticbeanstalk/hooks/appdeploy/post/04_update_supervisor.sh":
    mode: "000755"
    owner: root
    group: root
    content: |
      #!/usr/bin/env bash
      /usr/local/bin/supervisorctl -c /opt/python/etc/supervisord.conf -s unix:///opt/python/run/supervisor.sock reload
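
A hook like this could also verify that every program came back up after the reload. A minimal sketch, assuming `supervisorctl status` prints one line per program in the form `name STATE description`; both helper names are hypothetical:

```python
from typing import Dict, List


def parse_supervisor_status(status_output: str) -> Dict[str, str]:
    """Map each program name to its state (e.g. RUNNING, STOPPED, FATAL)."""
    states = {}
    for line in status_output.splitlines():
        parts = line.split(None, 2)  # name, state, free-form description
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def programs_not_running(status_output: str) -> List[str]:
    """Return the sorted names of programs whose state is not RUNNING."""
    states = parse_supervisor_status(status_output)
    return sorted(name for name, state in states.items() if state != "RUNNING")
```

The post-deploy script could feed the captured status text into `programs_not_running` and fail the deploy, or raise an alert, when rqworker shows up in the result.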

evict commented 4 years ago

Nope, there was no redeploy at that time. The last one was on the 28th of August. There are no errors in the logs whatsoever.

OkunaOrg / okuna-api

supervisor program:rqworker dies and stalls processing jobs #510