dandi / dandisets-healthstatus

Healthchecks of dandisets and support libraries (pynwb and matnwb)
0 stars 1 forks source link

Error out if we fail to kill child test processes #80

Open yarikoptic opened 2 months ago

yarikoptic commented 2 months ago

inspired by davfs2 stall and our test processes also stalling and not succumbing to kill -9. We should test (for up to a minute) that the process we kill upon time out dies off. If it doesn't and we did try kill -9 even -- error out entire process to bring attention to the matter.

jwodder commented 2 months ago

@yarikoptic When I ran kill -9 on the test processes last week, it was outside of/independent from the dandisets-healthstatus script. This has nothing to do with the timeout implemented within the script, which simply does:

with anyio.fail_after(TIMEOUT):
    await anyio.run_process( ... )

When it comes to just dandisets-healthstatus, all of the termination of timed-out processes is handled by anyio, and I don't know what signals it sends.

yarikoptic commented 2 months ago

so this code might need to be modified here after analysis of what anyio does internally.

jwodder commented 2 months ago

@yarikoptic I believe the relevant code in anyio is:

https://github.com/agronholm/anyio/blob/5675f09e7ac9e1b9c5e6d81ab523fd83f6a0b00f/src/anyio/_backends/_asyncio.py#L961-L968

(Shouldn't GitHub show the code in the comment here? It's not doing it for me.)

Specifically, if a subprocess needs to be cancelled (e.g., due to an enclosing timeout), anyio kills the process and then waits un-cancellablely for it to exit, and that's where our processes were hanging.

I believe that handling things any other way (without reimplementing anyio) would require changes to anyio itself.

yarikoptic commented 2 months ago

(Shouldn't GitHub show the code in the comment here? It's not doing it for me.)

it used to do that I think for me but no longer does

could we subclass Process class there with desired logic and use it instead?

could you then prepare PR to anyio with the needed logic?

jwodder commented 2 months ago

@yarikoptic Subclassing Process isn't enough; we'd need to copy & modify anyio.run_process() to make it use the new subclass, and I would rather not do that.

I've filed a feature request with anyio: https://github.com/agronholm/anyio/issues/757

jwodder commented 2 months ago

@yarikoptic The maintainer of anyio replied on the issue I linked above; it seems the requested functionality would be undesirable due to leaving behind zombies.

yarikoptic commented 2 months ago

I followed up there. Can we have another thread/whatever which would monitor if any of tests stall and do extra killing/erroring out?

jwodder commented 2 months ago

@yarikoptic Even if we could come up with a decent way to check for stalling, if we wanted the program to error out on a stall, the exception would still trigger the same process cleanup code I linked to above, and any cleanup currently stalled would just continue to be stalled. I believe the only way out would be for the healthstatus program to send SIGKILL to itself, which doesn't seem like a good idea.

jwodder commented 1 month ago

@yarikoptic Ping; do you still want to do this somehow?

jwodder commented 1 month ago

@yarikoptic Ping.

yarikoptic commented 1 month ago

Well, the main problem ATM is that we simply do not have an idea that the stall has happened. If we detect and exit with some ERROR and cause with that some kind of an email to be sent to notify us, we would be good even if there is still some stuck process -- we would know that there is an issue and would come to mitigate it.

Overall it might also be an issue of establishing some "invocation/progress monitoring" e.g. via https://healthchecks.io/ where we curl the ping point after one run completion and expect that to happen at least daily. WDYT?

jwodder commented 1 month ago

@yarikoptic I think just using healthchecks.io or similar for monitoring should be cleaner.

yarikoptic commented 1 month ago

Let's close whenever we add some such monitoring (thought to do now but failing to login... will try later)