agronholm / anyio

High level asynchronous concurrency and networking framework that works on top of either trio or asyncio
MIT License

Timing out when a cancelled process takes too long to die #757

Open jwodder opened 1 month ago

jwodder commented 1 month ago

Feature description

Currently, when a running Process is cancelled, anyio simply does:

self.kill()
with CancelScope(shield=True):
    await self.wait()

However, in pathological cases, the killed process may take arbitrarily long to actually exit, resulting in the program hanging indefinitely. I therefore request the ability to specify a timeout for the "wait" above; if the process doesn't exit in time, a dedicated error is raised so that the program can continue cleaning up and the programmer can know what went wrong.
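A minimal sketch of the requested behavior, written against the standard library's asyncio rather than anyio's internals so that it stays self-contained. `ProcessKillTimeout` and `kill_and_wait` are hypothetical names, not anyio API, and unlike the snippet above this version does not shield the wait from cancellation:

```python
import asyncio
import sys


class ProcessKillTimeout(Exception):
    """Hypothetical error: a killed child did not exit within the grace period."""


async def kill_and_wait(proc, timeout: float) -> int:
    """Kill ``proc``, then wait at most ``timeout`` seconds for it to exit."""
    proc.kill()
    try:
        # wait_for() cancels the inner wait() and raises once the timeout expires.
        return await asyncio.wait_for(proc.wait(), timeout)
    except asyncio.TimeoutError:
        raise ProcessKillTimeout(
            f"process {proc.pid} still has not exited {timeout}s after kill()"
        ) from None


async def main() -> int:
    # A child that would otherwise keep running for a minute.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await kill_and_wait(proc, timeout=5.0)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With a healthy child, `kill_and_wait` returns the exit code almost immediately. Only a child that cannot die, such as one in uninterruptible sleep, would reach the `ProcessKillTimeout` branch, at which point the caller can log the problem and go on cleaning up.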

Use case

We recently ran into a situation where some child processes got stuck in "uninterruptible sleep" (as reported by ps). As a result, when the timeouts we had wrapped them in expired, our program ended up hanging waiting for the subprocesses to acknowledge their deaths. We would prefer it if our program were to exit with an informative error message when this happened rather than just stalling forever.

agronholm commented 1 month ago

If we don't reap the child process, it becomes a zombie. Trio also waits indefinitely: https://github.com/python-trio/trio/blob/main/src/trio/_subprocess.py#L754-L764

yarikoptic commented 1 month ago

Depending on the setup or system, zombies are "nothing new" and are usually picked up (reaped) by the root process. That often requires a container running such services to have such a root process. There is even a dedicated option for that in docker run:

❯ docker run --help | grep -A1 -e --init
      --init                           Run an init inside the container that forwards signals and reaps
                                       processes

NB: more on zombies and containers at https://stackoverflow.com/questions/49162358/docker-init-zombies-why-does-it-matter and the links therein.

But without such functionality, anyio cannot be used in scenarios where processes cannot be killed gracefully for some reason (e.g., a stalled filesystem, as in our case): the process would keep running indefinitely, until some other actor or operator detects the stall and reacts to it. So in our particular case we do not really want to abandon or breed zombies; we want to react and exit with an error upon creating one.

Altogether, I do feel that waiting may well be the right default behavior to keep, but I would appreciate it if there were at least an option to kill more aggressively and eventually abandon the underlying process whenever the setup requires avoiding a stall of the whole application. Maybe the cited code block could be made "pluggable", so that applications have the flexibility to alter the handling of process interruption to their liking, with a few default handlers provided for common situations?
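To make the "pluggable" idea concrete, here is a rough sketch using plain asyncio. Everything in it (`KillHandler`, `ProcessAbandoned`, `kill_and_wait_forever`, `escalating`, `run_then_stop`) is a hypothetical name invented for illustration; none of it is existing anyio or trio API:

```python
import asyncio
import sys
from typing import Awaitable, Callable, Optional

# Hypothetical hook type: given a process, bring it to an end or give up.
KillHandler = Callable[[asyncio.subprocess.Process], Awaitable[None]]


class ProcessAbandoned(Exception):
    """Hypothetical error: the child outlived every signal and was left behind."""


async def kill_and_wait_forever(proc) -> None:
    """Default handler: kill, then wait however long it takes."""
    proc.kill()
    await proc.wait()


def escalating(term_grace: float, kill_grace: float) -> KillHandler:
    """Build a handler that terminates, then kills, then abandons the child."""

    async def handler(proc) -> None:
        proc.terminate()  # polite request first
        try:
            await asyncio.wait_for(proc.wait(), term_grace)
            return
        except asyncio.TimeoutError:
            pass
        proc.kill()  # forceful request second
        try:
            await asyncio.wait_for(proc.wait(), kill_grace)
        except asyncio.TimeoutError:
            raise ProcessAbandoned(f"process {proc.pid} did not exit") from None

    return handler


async def run_then_stop(on_cancel: KillHandler) -> Optional[int]:
    """Start a long-running child and stop it with the chosen handler."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    await on_cancel(proc)
    return proc.returncode


if __name__ == "__main__":
    print(asyncio.run(run_then_stop(escalating(term_grace=2.0, kill_grace=2.0))))
```

Keeping the wait-forever handler as the default would preserve the current behavior, while an application that must never stall could opt into the escalating handler and treat `ProcessAbandoned` as a fatal error.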