malefice opened this issue 4 years ago
@malefice Hey, I filed an error report about the PID files in celery a while back. Somehow, right after celery v4.4.2, the PID file started being placed incorrectly.
I haven't used celery in a while, but that might be the case. In my opinion, your best bet is to first delete the pid file before running compose up and see if the problem comes back. Otherwise, look in compose/production/django/celery for a start file that should be deleting the pid file... I think, not sure.
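For reference, a minimal sketch of what such a start script looks like (this mirrors the pattern cookiecutter-django uses locally; the app module `config.celery_app` and the pidfile name are template defaults, so treat them as assumptions):

```sh
#!/bin/sh
set -o errexit
set -o nounset

# Drop any pidfile left over from a previous run so that
# celery beat does not refuse to start.
rm -f './celerybeat.pid'

celery -A config.celery_app beat -l info
```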
Found this question which seems to be related?
I'm not sure why we remove it locally but not in production, to be honest. One guess might be that it's because we mount the local directory in /app.

I presume the error happens in production because the stack reuses the same celerybeat container; you'd need to remove the containers between two runs.

Anyway, this answer suggests disabling the pid file by passing an empty value, `--pidfile=`. I have no idea of the implications, but maybe we could bring our dev/prod a bit more in line with that.
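For what it's worth, that would be a one-flag tweak to the beat command (a sketch; `config.celery_app` is the template's default app module):

```sh
# An empty --pidfile= tells celery beat not to write a pidfile at all,
# so a stale file can never block startup.
celery -A config.celery_app beat -l info --pidfile=
```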
From the celery documentation:

> `--pidfile` — File used to store the process pid. Defaults to celerybeat.pid.
> The program won't start if this file already exists and the pid is still alive.
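That "pid is still alive" check also hints at why containers trip over it: in a recycled container the recorded pid can now belong to a different process (often pid 1), so the file looks live even though beat is gone. A hedged shell sketch of what the check amounts to, not celery's actual code:

```sh
# Approximation of the staleness test: does the recorded pid still exist?
if [ -f celerybeat.pid ]; then
  pid="$(cat celerybeat.pid)"
  if kill -0 "$pid" 2>/dev/null; then
    # In a reused container this can be a false positive: some other
    # process may now own this pid.
    echo "pid $pid looks alive; beat will refuse to start"
  else
    echo "pid $pid is gone; the pidfile is stale and safe to delete"
  fi
fi
```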
The only reason this file persists between celerybeat and celeryworker service restarts is the yml templating: the celerybeat and celeryworker services inherit the bind-mounted current directory from django, as seen below: https://github.com/pydanny/cookiecutter-django/blob/9b67d828f68a7d145400f19da119f11bb6830fe3/%7B%7Bcookiecutter.project_slug%7D%7D/local.yml#L19-L20
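For context, a condensed sketch of how that inheritance works (paraphrased from the linked local.yml; the exact mount options may differ):

```yaml
django: &django
  # ...
  volumes:
    - .:/app  # the project directory, celerybeat.pid included, is bind mounted

celerybeat:
  <<: *django  # the merge key copies everything from django, volumes too
```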
The easiest way to solve this is to redefine `volumes` as an empty array, like so:
```yaml
celerybeat:
  <<: *django
  image: {{ cookiecutter.project_slug }}_local_celerybeat
  container_name: celerybeat
  depends_on:
    - redis
    - postgres
    {% if cookiecutter.use_mailhog == 'y' -%}
    - mailhog
    {%- endif %}
  ports: []
  command: /start-celerybeat
  volumes: []
```
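With `volumes: []`, the pidfile beat writes stays inside the container's own filesystem and disappears with it. If you want to sanity-check the override (a sketch, assuming the default local.yml file name):

```sh
# Recreate the service from scratch; beat should start cleanly every time.
docker-compose -f local.yml up -d --force-recreate celerybeat
docker-compose -f local.yml logs celerybeat
```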
**What happened?**

The `celerybeat` service in production can fail to run because of an existing pidfile.

**What should've happened instead?**

The development version does not suffer from this issue, because its start script manually removes the pidfile. I am not sure why the production version doesn't, so if I am missing some details, please weigh in.

Ideally, `celery` should properly clean up after itself, and it does attempt to detect whether the pidfile is stale, but for some reason that detection does not always work when dockerized. I have never encountered this issue in traditional setups, so this is probably an upstream `celery` and/or docker issue. On that note, a quick workaround is to manually remove the pidfile.
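A hedged sketch of that workaround on the production stack (the compose file name production.yml and the service name celerybeat are the cookiecutter-django defaults; adjust as needed):

```sh
# Remove the old celerybeat container so the pidfile stored inside it
# is discarded, then bring the service back up.
docker-compose -f production.yml stop celerybeat
docker-compose -f production.yml rm -f celerybeat
docker-compose -f production.yml up -d celerybeat
```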
**Steps to reproduce**

Tested on Ubuntu 18.04.4 LTS, Docker Engine version 19.03.12, docker-compose version 1.17.1.