Open yarikoptic opened 2 months ago
@yarikoptic When I ran `kill -9` on the test processes last week, it was outside of and independent from the `dandisets-healthstatus` script. This has nothing to do with the timeout implemented within the script, which simply does:

```python
with anyio.fail_after(TIMEOUT):
    await anyio.run_process( ... )
```
When it comes to just `dandisets-healthstatus`, all of the termination of timed-out processes is handled by anyio, and I don't know what signals it sends.
So this code might need to be modified here after an analysis of what anyio does internally.
@yarikoptic I believe the relevant code in anyio is:
(Shouldn't GitHub show the code in the comment here? It's not doing it for me.)
Specifically, if a subprocess needs to be cancelled (e.g., due to an enclosing timeout), anyio kills the process and then waits, uncancellably, for it to exit; that is where our processes were hanging.
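For illustration, the problematic pattern can be sketched with the stdlib `subprocess` module. This is an assumed simplification of the behavior described above, not anyio's actual code:

```python
import subprocess

def cleanup_like_anyio(proc: subprocess.Popen) -> int:
    """Assumed simplification of anyio's cancellation cleanup:
    kill the child, then block until it is reaped. If the child is
    stuck in uninterruptible I/O (e.g. on a stalled davfs2 mount),
    this wait() never returns -- the hang we observed."""
    proc.kill()         # sends SIGKILL on POSIX
    return proc.wait()  # blocks with no timeout and cannot be cancelled

proc = subprocess.Popen(["sleep", "60"])
rc = cleanup_like_anyio(proc)
print(rc)  # negative signal number on POSIX, here -9
```

The uncancellable `wait()` is the crux: normally it returns almost immediately after `SIGKILL`, but a process blocked in kernel I/O cannot be reaped, so the caller hangs along with it.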
I believe that handling things any other way (without reimplementing anyio) would require changes to anyio itself.
> (Shouldn't GitHub show the code in the comment here? It's not doing it for me.)

It used to do that for me, I think, but it no longer does.
Could we subclass the `Process` class there with the desired logic and use it instead?

Could you then prepare a PR to anyio with the needed logic?
@yarikoptic Subclassing `Process` isn't enough; we'd need to copy & modify `anyio.run_process()` to make it use the new subclass, and I would rather not do that.

I've filed a feature request with anyio: https://github.com/agronholm/anyio/issues/757
@yarikoptic The maintainer of anyio replied on the issue I linked above; it seems the requested functionality would be undesirable because it would leave zombie processes behind.
I followed up there. Could we have another thread (or similar) that monitors whether any of the tests have stalled and does extra killing, or errors out?
@yarikoptic Even if we could come up with a decent way to check for stalling, if we wanted the program to error out on a stall, the exception would still trigger the same process cleanup code I linked to above, and any cleanup currently stalled would just continue to be stalled. I believe the only way out would be for the healthstatus program to send `SIGKILL` to itself, which doesn't seem like a good idea.
@yarikoptic Ping; do you still want to do this somehow?
@yarikoptic Ping.
Well, the main problem at the moment is that we simply have no indication that a stall has happened. If we detect it and exit with some error, causing some kind of email to be sent to notify us, we would be fine even if some stuck process remains: we would know there is an issue and could come mitigate it.
Overall it might also be a matter of establishing some "invocation/progress monitoring", e.g. via https://healthchecks.io/, where we curl the ping endpoint after each run completes and expect that to happen at least daily. WDYT?
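One way to wire that up, sketched here as a crontab entry (the schedule, the exact command invocation, and the check UUID are all placeholders for whatever we actually deploy):

```
# Run the healthstatus script daily; on successful completion, ping a
# healthchecks.io check so that a missed day triggers an alert email.
# <your-check-uuid> is a placeholder for the real check's UUID.
0 3 * * * dandisets-healthstatus && curl -fsS --retry 3 https://hc-ping.com/<your-check-uuid>
```

Because healthchecks.io alerts when an expected ping does *not* arrive, this catches stalls without the script having to detect them itself.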
@yarikoptic I think just using healthchecks.io or similar for monitoring should be cleaner.
Let's close this whenever we add some such monitoring (I thought to do it now, but I'm failing to log in... will try later).
Inspired by the davfs2 stall and by our test processes also stalling and not succumbing to `kill -9`: we should test (for up to a minute) that the process we kill upon timeout actually dies off. If it doesn't, even after we tried `kill -9`, error out the entire process to bring attention to the matter.
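A minimal sketch of that escalation, assuming the child is managed via the stdlib (the function name, grace period, and error message are illustrative, not existing healthstatus code):

```python
import subprocess

def kill_and_verify(proc: subprocess.Popen, grace: float = 60.0) -> None:
    """Illustrative sketch: terminate a timed-out child and verify it
    actually dies. Try SIGTERM first, escalate to SIGKILL, and if the
    process still isn't reaped within the grace period (e.g. stuck in
    uninterruptible I/O on a stalled davfs2 mount), raise so the whole
    run errors out loudly instead of hanging silently."""
    proc.terminate()  # SIGTERM: give the child a chance to exit cleanly
    try:
        proc.wait(timeout=grace / 2)
        return
    except subprocess.TimeoutExpired:
        pass
    proc.kill()  # escalate to SIGKILL
    try:
        proc.wait(timeout=grace / 2)
    except subprocess.TimeoutExpired:
        raise RuntimeError(
            f"process {proc.pid} survived SIGKILL for {grace / 2}s; "
            "a mount is likely stalled -- aborting to draw attention"
        )

proc = subprocess.Popen(["sleep", "300"])
kill_and_verify(proc, grace=10.0)
print(proc.returncode)  # -15: sleep exits on the initial SIGTERM
```

Unlike the uncancellable wait inside anyio's cleanup, every wait here is bounded, so a process wedged in the kernel produces a visible error rather than a silent hang.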