celery / django-celery-beat

Celery Periodic Tasks backed by the Django ORM

Long running celery-beat queries crash postgres database #671

Open bioworkflows opened 1 year ago

bioworkflows commented 1 year ago

Summary:

The database backend of my website crashed today. Upon inspection, I found all connection slots filled with the following query, which had been running for more than 8 hours:

SELECT "django_celery_beat_periodictasks"."ident", "django_celery_beat_periodictasks"."last_update" FROM 
"django_celery_beat_periodictasks" WHERE "django_celery_beat_periodictasks"."ident" = 1 LIMIT 21 FOR UPDATE

My periodic task table had hundreds of insert/delete operations today, but it held fewer than 50 entries at peak, so I am wondering what is going on here. Is it some sort of deadlock that prevented these queries from completing?
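For anyone who hits this, here is a minimal diagnostic sketch (my own suggestion, not something from this thread) assuming PostgreSQL and access to the Django shell. It lists backends that have been running a query against django_celery_beat_periodictasks for more than 10 minutes; the threshold and the decision to terminate a stuck backend are assumptions you should adapt:

# Hypothetical diagnostic, run from `python manage.py shell`.
from django.db import connection

with connection.cursor() as cur:
    cur.execute("""
        SELECT pid, state, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE query ILIKE '%django_celery_beat_periodictasks%'
          AND pid <> pg_backend_pid()
          AND now() - query_start > interval '10 minutes'
        ORDER BY runtime DESC
    """)
    for pid, state, runtime, query in cur.fetchall():
        print(pid, state, runtime, query[:80])

# If a backend is clearly stuck and holding the row lock, it can be
# terminated (this aborts that backend's transaction):
#     SELECT pg_terminate_backend(<pid>);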

Exact steps to reproduce the issue:

I am not sure how to reproduce this.

Detailed information

(screenshot attached)
rjcampion3 commented 5 months ago

We've experienced this issue a few times now and it does eventually eat up all of the available connections and take down the database. Any updates on a possible root cause?

curtisim0 commented 3 months ago

Hey @bioworkflows or @rjcampion3, did either of you come up with a workaround? I just upgraded a project and connections are getting eaten up by this.

rjcampion3 commented 3 months ago

@curtisim0 We added the following settings in Django:

import socket

# TCP keepalive options for the broker connection (Linux socket constants)
CELERY_BROKER_TRANSPORT_OPTIONS = {
    "socket_keepalive": True,
    "socket_keepalive_options": {
        socket.TCP_KEEPIDLE: 60,
        socket.TCP_KEEPCNT: 5,
        socket.TCP_KEEPINTVL: 10,
    },
}

And we added --without-mingle to the celery worker command. That seemed to do the trick; I think --without-mingle was the main thing that resolved it, but the other settings are probably good to have anyway.

sehmaschine commented 3 weeks ago

Having the same issue (eventually takes down the database).

@rjcampion3 Can you explain why --without-mingle would solve this issue? My understanding is that mingle is only relevant during worker startup, so it is unclear to me how it relates to this problem.

rjcampion3 commented 2 weeks ago

@sehmaschine I don't know whether our workaround really has much to do with the celery table locking up. We had an issue where Celery stopped responding to tasks, had to be restarted, and then pulled all the waiting tasks from Redis. That turned out to be a kombu issue, plus we updated the settings above.

We've run into the table corruption issue a couple of times, and the only workaround for fixing it was to create a new database. Not much of a workaround, but a simple solution when needed.
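For what it's worth, rebuilding the whole database may not be necessary. A less drastic sketch (my own assumption, not verified in this thread, and it deletes all stored PeriodicTask schedules) is to stop beat and the workers, roll the django_celery_beat migrations back to zero, and re-apply them so only that app's tables are recreated:

# Hypothetical recovery, assuming losing the stored schedules is acceptable.
# Stop celery beat and all workers first, then run from `python manage.py shell`
# (or use the equivalent `python manage.py migrate ...` commands).
from django.core.management import call_command

# Drop the django_celery_beat tables by unapplying its migrations...
call_command("migrate", "django_celery_beat", "zero")
# ...and recreate them from scratch.
call_command("migrate", "django_celery_beat")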

sehmaschine commented 2 weeks ago

@rjcampion3 Creating a new database sounds like a horror scenario. Our DB is > 100 GB, so that's not a straightforward process. I guess I'll check whether we need django-celery-beat in the first place (I only used it because of the ephemeral filesystem on DigitalOcean, but maybe there's a workaround for that, e.g. using Spaces).