nmgeek opened this issue 7 years ago
I noticed that the SQLAlchemy backend wraps calls like Database._store_result() in a retry decorator which catches exceptions like this one and tries again. That would fix this problem, would it not?
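For illustration, a retry wrapper along those lines might look like the minimal sketch below. The names, retry count, and delay are mine, not the actual celery SQLAlchemy backend code:

```python
import functools
import time

from django.db import OperationalError


def retry_on_operational_error(max_retries=3, delay=1.0):
    """Retry a DB operation a few times if the connection turns out to be broken."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except OperationalError:
                    if attempt == max_retries - 1:
                        raise  # give up after the last attempt
                    time.sleep(delay)
        return wrapper
    return decorator
```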
Here is a theory as to why the db connection gets broken in the first place. This is conjecture. (If you could find the root cause, then no backend, not even the SQLAlchemy one, would need the less-than-optimal retry fix.)
And why would the parent and child process both be disconnected? It could be that when the child process was forked it inherited a copy of the parent's db connection. If that connection is then closed in the child, would it no longer be valid in the parent?
I do see that the child process db connections are closed using a hook called on_worker_process_init in the backend loader class. But that happens in the child process, and I believe our traceback is coming from the parent process.
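For reference, closing the inherited connection in each child can be done with celery's worker_process_init signal. This is just a sketch of the pattern, not the djcelery loader code itself:

```python
from celery.signals import worker_process_init
from django.db import connection


@worker_process_init.connect
def reset_db_connection(**kwargs):
    # Drop the connection the child inherited from the parent at fork time,
    # so the child opens its own fresh connection on the next query.
    connection.close()
```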
At https://github.com/celery/django-celery/blob/3.1/djcelery/loaders.py#L150 you will find some hackery which may lead to db connections being closed without proper bookkeeping at the Django db connection level. But again, I think that runs in the child process.
A MySQL OperationalError is raised while djcelery is updating the database, and nothing catches it.
This is not a timeout problem. I can reproduce it within 10 minutes of starting the worker. The conditions to reproduce are:
I don't know why the database connection gets broken, but this is a common theme with djcelery. Exception handling code was added recently to schedulers.py and loaders.py. But this traceback unwinds through backends/database.py, so I think there should be a similar try/except there.
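Something along these lines would do it. This is my own sketch of a generic guard (call_with_reconnect is a made-up helper name), not a patch against backends/database.py:

```python
from django.db import OperationalError, connection


def call_with_reconnect(fn, *args, **kwargs):
    """Call fn, retrying once on a fresh connection if the current one is broken."""
    try:
        return fn(*args, **kwargs)
    except OperationalError:
        connection.close()  # discard the broken connection; Django reconnects lazily
        return fn(*args, **kwargs)
```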
In my app we worked around this problem by defining a revoke handler for every task and calling django.db.connection.close() from the revoke handler. The database update for the terminated task (the first revoked task in the sequence) succeeds, so closing the connection afterwards seems to clean up the broken-connection problem.
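Roughly, the workaround looks like the sketch below. It uses celery's task_revoked signal rather than our exact per-task handlers, but the effect is the same:

```python
from celery.signals import task_revoked
from django.db import connection


@task_revoked.connect
def close_db_connection_on_revoke(request=None, terminated=None, **kwargs):
    # Close the (possibly broken) connection so the next DB access
    # from this process opens a fresh one.
    connection.close()
```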