com-lihaoyi / os-lib

OS-Lib is a simple, flexible, high-performance Scala interface to common OS filesystem and subprocess APIs
Other
672 stars 63 forks source link

`os.proc.call`'s `timeout` has a termination race-condition from `SIGTERM` and `SIGKILL` #284

Closed j-mie6 closed 1 month ago

j-mie6 commented 1 month ago

The current implementation of os.proc.call's timeout flag uses two consecutive calls to p.destroy() and p.forciblyDestroy(). The effects of these calls are to first send SIGTERM, and then send SIGKILL to the process.

The roles of these two signals are:

By sending these two signals back-to-back, the parent process produces a race-condition between how quickly the child can execute its SIGTERM handler and clean up resources and the issue of the SIGKILL. In my experiments, I've found that SIGKILL it the cause of process exit the vast majority of the time. This means that the (potentially necessary) clean-up of the process is often not performed or worse interrupted. If the handler itself contained code to write file contents back to disk, modify a database, and so on, these operations may be corrupted. If the child process itself has children that need terminating, this could not be issued, leading to the parent process hanging.

What are the possible desired outcomes?

There are three ways that the timeout should be terminating the process:

  1. Only send SIGTERM: it doesn't matter how long it takes, we need to ensure safe clean-up
  2. Only send SIGKILL: the process has no important state, it should be terminated immediately
  3. Send SIGTERM, wait an appropriate amount of time, then send SIGKILL: we want to offer the process an opportunity to clean-up, but if this takes too long (perhaps the clean-up process itself is hanging), we want to forcibly terminate -- this is the scenario done by os.lib, albeit without allowing sufficient time to perform the handler.

What is normally done?

The SIGKILL signal is useful to issue when a process is not responding in a timely fashion to its SIGTERM event and the two are usually sent together with a delay. The Linux timeout command offers this with the -k n flag, which sends a SIGKILL signal n seconds after the original timeout sent SIGTERM.

Solutions

The race condition caused by the consecutive calls to destroy and forciblyDestroy is a bug, and could be addressed by supporting a similar system allowing outcomes (1), (2), or (3) configurably and safely.

For backwards compatibility, however, it might be wise to just support (1) with the current system.

sake92 commented 1 month ago

The option 3 feels like the most sensible/useful to me.
One thing I found is that java 8 doesnt destroyForcibly() the same as java 9+.
https://stackoverflow.com/a/52090564/4496364. Just to take it into consideration/tests

j-mie6 commented 1 month ago

Option 3 can be ok if the kill delay is configurable or sufficient length. Some people might argue that's an unexpectedly longer timeout than they expected though. In my case, I don't want forcible termination at all.

At the very least, in addition to timeout you could have killAfter: Int, where killAfter = -1 is scenario (1), killAfter = 0 is scenario (2) and killAfter = n, n>0 is scenario (3) done correctly?

The default could then be -1 or perhaps 1000

lihaoyi commented 1 month ago

I think a configurable kill delay with a default sounds good. Such an addition can be made binary compatible and semantically compatible, and mirrors the Linux timeout command as you have mentioned. Feel free to send a PR

j-mie6 commented 1 month ago

What do you envision the binary compatible change looking like?

I think a configurable kill delay with a default sounds good. Such an addition can be made binary compatible and semantically compatible, and mirrors the Linux timeout command as you have mentioned. Feel free to send a PR

lihaoyi commented 1 month ago

Take all the methods currently taking timeout: Long, and also add a timeoutGracePeriod: Long. Keep a forwarder from the old overload to the new overload to preserve bincompat.

AFAICT, this affects os.proc.call, os.Subprocess#join, and os.Subprocess#waitFor

j-mie6 commented 1 month ago

Take all the methods currently taking timeout: Long, and also add a timeoutGracePeriod: Long. Keep a forwarder from the old overload to the new overload to preserve bincompat.

The problem is that you can't have two overloadings with default arguments. This is why default arguments is a bin-compat nightmare. I think if the argument is placed last and the underlying forwarder is stripped of its defaults this might work...

j-mie6 commented 1 month ago

os.Subprocess#waitFor is not affected, as it does not attempt to terminate the process.

lihaoyi commented 1 month ago

Yes, keep the default arguments only on the longest overload and strip them from the shorter ones.

j-mie6 commented 1 month ago

Also, as a side note, ProcessLike is sealed, is it intentional that SubProcess and ProcessPipeline are not final? The addition of the new join overload will be implemented by both of them, but in theory a user could have overridden them and their behaviour is broken.

lihaoyi commented 1 month ago

Probably not, but most people won't extend them anyway so a bit of sloppiness isn't a huge deal

j-mie6 commented 1 month ago

Well, it might well be because of the new overload: it might be better to @deprecatedInheritance them with a note about the new overload?

lihaoyi commented 1 month ago

sure that works

j-mie6 commented 1 month ago

Alternatively, I notice we are still in early-semver, and I suspect you'll want this to land in a 0.11 update? I could just mark them as final now if you think that's fine

lihaoyi commented 1 month ago

Let's preserve bincompat and release it as a 0.10.x for now. Although we can break compat whenever we want and just bump a version, it's also good to preserve compat where possible to give our users an easier time upgrading.

j-mie6 commented 1 month ago

Fair! (it's been so long since I've been in 0.x that I forgot you can release minor in the patch digit)