We use celery 5.2.3 and billiard 3.6.1.0. What we observe is that the main worker receives a task (we can see the task_received signal being fired) and then there is no further trace of that task in the logs afterwards (not always, and only very rarely).
The main worker thinks it has sent the task down the pipe to a child, whereas the child is still stuck on the read. Because the ack never reaches the main worker, the soft and hard time-limit timers never fire either, so the task is stuck forever. When we kill the child process, the task is sent to another worker. The main worker, meanwhile, keeps directing new tasks to the other children without problems.
This happens very randomly (after weeks of running normally) in production.
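To make the suspected state concrete, here is a rough standalone sketch (not billiard code; all names are made up) of what we think is happening: the reader blocks waiting for a payload whose length it already knows, but the payload never fully arrives, so it blocks forever while no timer is running.

import os
import struct
import threading

def child_recv(read_fd):
    # Same pattern as recv_bytes(): read a 4-byte big-endian size header,
    # then keep reading until that many payload bytes have arrived.
    size, = struct.unpack("!i", os.read(read_fd, 4))
    remaining, chunks = size, []
    while remaining:
        chunk = os.read(read_fd, remaining)  # blocks here if the body never arrives
        if not chunk:
            raise EOFError("writer closed the pipe")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

if __name__ == "__main__":
    r, w = os.pipe()
    reader = threading.Thread(target=child_recv, args=(r,), daemon=True)
    reader.start()

    # Simulated parent-side bug: the header promises 1024 bytes but only part
    # of the body is ever written (lost/truncated write), and the parent then
    # behaves as if the job had been delivered.
    os.write(w, struct.pack("!i", 1024))
    os.write(w, b"x" * 100)

    reader.join(timeout=2)
    print("reader still blocked on read:", reader.is_alive())  # prints True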
Stacktrace of the child when it was sent SIGUSR1:
Pool process <celery.concurrency.asynpool.Worker object at 0x7f1d1c1cf850> error: SoftTimeLimitExceeded()
Traceback (most recent call last):
File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 292, in __call__
sys.exit(self.workloop(pid=pid))
File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 351, in workloop
req = wait_for_job()
File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 473, in receive
ready, req = _receive(1.0)
File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 445, in _recv
return True, loads(get_payload())
File ".../python/lib/python3.9/site-packages/billiard/queues.py", line 355, in get_payload
return self._reader.recv_bytes()
File ".../python/lib/python3.9/site-packages/billiard/connection.py", line 243, in recv_bytes
buf = self._recv_bytes(maxlength)
File ".../python/lib/python3.9/site-packages/billiard/connection.py", line 460, in _recv_bytes
return self._recv(size)
File ".../python/lib/python3.9/site-packages/billiard/connection.py", line 422, in _recv
chunk = read(handle, remaining)
File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 229, in soft_timeout_sighandler
raise SoftTimeLimitExceeded()
billiard.exceptions.SoftTimeLimitExceeded: SoftTimeLimitExceeded()
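Incidentally, the way this traceback was produced may itself help in interpreting it: billiard installs SIGUSR1 as the soft-time-limit signal in pool workers, and an exception raised from a signal handler propagates out of whatever frame the process is blocked in, which is why the trace points at the exact read. A small standalone demo of that behaviour (our own toy code, not billiard's):

import os
import signal
import struct
import threading
import traceback

class SoftTimeLimitExceeded(Exception):
    pass

def soft_timeout_sighandler(signum, frame):
    raise SoftTimeLimitExceeded()

signal.signal(signal.SIGUSR1, soft_timeout_sighandler)

r, w = os.pipe()
os.write(w, struct.pack("!i", 1024))  # size header only; the body never arrives

# Deliver SIGUSR1 to ourselves in 2 seconds, while the main thread is blocked below.
threading.Timer(2.0, os.kill, args=(os.getpid(), signal.SIGUSR1)).start()

try:
    size, = struct.unpack("!i", os.read(r, 4))
    body = os.read(r, size)  # blocked here when the signal lands
except SoftTimeLimitExceeded:
    traceback.print_exc()  # the trace ends at the blocked os.read() call

In the traceback above the handler fired inside self._recv(size), i.e. after the 4-byte size header had already been read, so if we read it correctly the child was waiting for a message body that never fully arrived. That seems consistent with the parent believing the write had completed.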
The fact that the task got resubmitted to another child once we killed this one makes me think that the parent believed it had sent the task down the pipe and was waiting for the ack, while the receiver was stuck on the read forever. Could this be caused by some race on the receiver or sender side?