celery / billiard

Multiprocessing Pool Extensions

Task randomly gets stuck (between parent and child) #389

Open · meensu23 opened 1 year ago

meensu23 commented 1 year ago

We use celery 5.2.3 and billiard 3.6.1.0. What we observe is that the main worker receives a task (we can see the task_received signal being fired) and then there is no further trace of the task in the logs; this happens only very rarely, not on every task. The main worker thinks it has sent the task on the pipe, whereas the child is still stuck on the read. Because the main worker never receives the ack, the soft and hard time-limit timers also never fire, and the task is now stuck forever. When we kill the child process, the task is sent to another worker. Meanwhile the main worker keeps dispatching new tasks to its other children successfully.

This happens very rarely and at random in production, sometimes after weeks of running normally.
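
To illustrate the stuck state, here is a minimal sketch (not billiard's actual code): a child sitting in a blocking pipe read hangs indefinitely, and since the soft/hard time limits are only armed once the parent gets the ack, nothing ever interrupts it:

import multiprocessing as mp

def child(conn):
    # Stand-in for the child's wait_for_job(): a plain blocking read on the
    # job pipe. If the frame never arrives in full, this blocks forever.
    payload = conn.recv_bytes()
    print("received", len(payload), "bytes")

if __name__ == "__main__":
    parent_end, child_end = mp.Pipe()
    worker = mp.Process(target=child, args=(child_end,))
    worker.start()
    # The parent "thinks" it dispatched a job but never actually writes, so no
    # ack ever comes back and no time-limit timer is armed for the task.
    worker.join(timeout=5)
    print("child still alive after 5s:", worker.is_alive())
    worker.terminate()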

Stacktrace of the child when it was sent SIGUSR1:

Pool process <celery.concurrency.asynpool.Worker object at 0x7f1d1c1cf850> error: SoftTimeLimitExceeded()
Traceback (most recent call last):
  File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 292, in __call__
    sys.exit(self.workloop(pid=pid))
  File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 351, in workloop
    req = wait_for_job()
  File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 473, in receive
    ready, req = _receive(1.0)
  File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 445, in _recv
    return True, loads(get_payload())
  File ".../python/lib/python3.9/site-packages/billiard/queues.py", line 355, in get_payload
    return self._reader.recv_bytes()
  File ".../python/lib/python3.9/site-packages/billiard/connection.py", line 243, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File ".../python/lib/python3.9/site-packages/billiard/connection.py", line 460, in _recv_bytes
    return self._recv(size)
  File ".../python/lib/python3.9/site-packages/billiard/connection.py", line 422, in _recv
    chunk = read(handle, remaining)
  File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 229, in soft_timeout_sighandler
    raise SoftTimeLimitExceeded()
billiard.exceptions.SoftTimeLimitExceeded: SoftTimeLimitExceeded()
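
For anyone wanting to capture the same kind of dump, a plain stdlib sketch (celery/billiard wire up their own SIGUSR1 handling; this is just an equivalent way to get stack traces out of a stuck process):

import faulthandler
import signal
import sys

# Dump the stack of every thread to stderr whenever the process gets SIGUSR1.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Then, from a shell:  kill -USR1 <pid of the stuck child>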

The fact that the task got resubmitted to another child once we killed this one makes me think that the parent believed it had sent the task on the pipe and was waiting for the ack, whereas the receiver was stuck on the read forever. Could this be caused by some race in the receiver or the sender?
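
To make that suspicion concrete, here is a toy sketch (not the actual billiard code). The connection frames every message as a length header followed by the body, and the traceback above is blocked inside _recv() reading the remaining body bytes rather than in the poll step, which suggests part of a frame arrived but the rest never did. A reader in that state blocks forever on the missing tail:

import os
import struct

r, w = os.pipe()

payload = b"x" * 100
# The writer sends the 4-byte length header plus only part of the body and
# then never completes the frame (standing in for whatever race or partial
# write would leave it incomplete).
os.write(w, struct.pack("!i", len(payload)) + payload[:10])

# The reader mirrors a recv_bytes()-style loop: read the header, then keep
# reading until the whole body has arrived. With 90 bytes missing and the
# write end still open, the second read blocks forever (by design, this
# script hangs at that point).
size, = struct.unpack("!i", os.read(r, 4))
remaining, chunks = size, []
while remaining > 0:
    chunk = os.read(r, remaining)   # hangs here once the pipe runs dry
    chunks.append(chunk)
    remaining -= len(chunk)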