Closed: daleevans closed this issue 5 years ago
I confirm having the same error every time I do a warm shutdown with SIGTERM.
@daleevans Thank you for spotting this. I spent many hours debugging my Celery setup and trying to understand what I did wrong, since my dev environment was on the previous billiard version while production went out with the new one; it was driving me nuts working out why it failed. Pinning billiard==3.5.0.4 in my requirements fixed the error.
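For anyone else who lands here, the workaround amounts to a one-line pin; a requirements.txt sketch (the version is the one reported working in this thread, your other pins will differ):

# requirements.txt: pin billiard to the version reported working in this thread
billiard==3.5.0.4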
It also introduces a slightly different crash if any of the ApplyResults has a schedule set:
Traceback (most recent call last):
  (...)
  File "/usr/lib/python3.5/copy.py", line 223, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.5/copy.py", line 174, in deepcopy
    rv = reductor(4)
  File "/usr/local/lib/python3.5/dist-packages/billiard/queues.py", line 348, in __getstate__
    context.assert_spawning(self)
  File "/usr/local/lib/python3.5/dist-packages/billiard/context.py", line 421, in assert_spawning
    ' through inheritance' % type(obj).__name__
RuntimeError: _SimpleQueue objects should only be shared between processes through inheritance
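For anyone trying to reproduce this outside a full Celery worker, here is a minimal sketch of my reading of that traceback, using the stdlib multiprocessing API that billiard forks (the cache dict below is just a stand-in, not billiard's actual Pool cache): deep-copying anything that holds one of these queues goes through __getstate__, which calls assert_spawning() and raises because no child process is being spawned at that point.

import copy
import multiprocessing

# Stand-in for the pool's result cache: a value that hangs on to a SimpleQueue,
# the way an unresolved ApplyResult can.
cache = {0: multiprocessing.SimpleQueue()}

try:
    copy.deepcopy(cache)  # deepcopy -> __reduce_ex__ -> __getstate__ -> assert_spawning
except RuntimeError as exc:
    print(exc)  # "... objects should only be shared between processes through inheritance"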
@thedrow Does the cache need to be deep-copied? Copying unresolved ApplyResults would give them new locks and new job IDs, and things would likely fall out of sync. If the goal is just to make the cache dict itself thread-safe, maybe a shallow copy would be enough: cache = self.cache.copy()?
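To make that distinction concrete, a small sketch (again with the stdlib SimpleQueue standing in for whatever the ApplyResults hold; 'job-1' is a made-up key): a shallow copy gives the caller its own dict to iterate over while keeping the very same result objects, so no locks, job IDs or queues get recreated.

import multiprocessing

cache = {'job-1': multiprocessing.SimpleQueue()}   # stand-in for the pool's cache dict

snapshot = cache.copy()                      # a new dict object, safe to iterate
assert snapshot['job-1'] is cache['job-1']   # but the very same value objects
cache.clear()                                # concurrent mutation of the original
assert 'job-1' in snapshot                   # no longer affects the snapshot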
That change works for me, but I don't know how to reproduce #248, and #249 did not provide any test cases.
@adamdelman Maybe you could help, as you were the original reporter of #248?
Celery still depends on the buggy version as of today (celery 4.2.1 -> billiard<3.6.0,>=3.5.0.2)
Is the issue resolved now?
While waiting for the final release of Celery 4.3, can anyone name a good reason not to run Celery 4.2.2 with Billiard 3.6.0.0?
Commit 309f9663663c6dad6d40bf017514695a7c154fd appears to have introduced an issue with warm restarts. Now when I send a SIGTERM, I get a crash at line 697 of billiard/pool.py which kills all the workers, including those with jobs in progress. Rolling back to 3.5.0.4 resolves this for me.