cpan-authors / IPC-Run

https://metacpan.org/pod/IPC::Run
Other
21 stars 38 forks source link

Reduce delays in detecting child exit #172

Closed nmisch closed 5 months ago

nmisch commented 6 months ago

In brief, this fixes most instances of #166 through the combination of [SIGCHLD=sub{}], [low-select-timeout-w32], and [waitpid-blocking]


Issue #166 reported delays of up to 0.5s in detecting child exit. When IPC::Run is interacting with the child via a file descriptor (FD), the delay likely doesn't happen. Child exit will make FDs ready, which select() will notice. With zero such FDs, _select_loop() alternates between waitpid($pid, WNOHANG) and timeout-only select(undef, undef, undef, $timeout). That limits detection latency to $timeout. A number of tactics can help, and we can combine tactics to reduce delays in more cases. I'm choosing each tactic in bold:

Tactic Race-free Helps if SIGCHLD blocked Helps Windows Helps if timeout active CPU cost Complexity
SIGCHLD=sub{} :x: :x: :x: :white_check_mark: + +
self-pipe :white_check_mark: :x: :x: :white_check_mark: + +++
low-select-timeout-w32 :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: +++ +
waitpid-blocking :white_check_mark: :white_check_mark: :white_check_mark: :x: - ++
Windows-wait :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: - ++++
helper-pid :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: ++++ ++++

[SIGCHLD=sub{}] makes Perl terminate select() in response to that signal.

[self-pipe] uses https://cr.yp.to/docs/selfpipe.html to fix a race condition in [SIGCHLD=sub{}], when SIGCHLD arrives just before select() begins.

[low-select-timeout-w32] caps the select() timeout at something less than today's 0.5s, in Windows zero-FD cases. This gains responsiveness, but it loses CPU efficiency.

[waitpid-blocking] uses waitpid($pid, 0) when we have no timeouts or FDs to interact with a child. This works with SIGCHLD unavailable (Windows) or blocked, but not working with timeouts is a major limitation.

[helper-pid] uses a separate process or thread of the current process to handle one of the tasks. For example, have the main process select(), and have a second process write to a pipe when it detects a child exit. The CPU overhead makes this a loss outside academic scenarios. It could win if SIGCHLD is blocked, but an application blocking SIGCHLD announces its disinterest in fast child exit detection. It could win on Windows with sufficiently-long child runtimes and timeouts, making the continuous cost of select() wake-ups exceed the fixed cost of creating the helper.

[Windows-wait] uses a C-language module to issue Windows API calls OpenProcess(), WSAEventSelect(), and WaitForMultipleObjects(). Together, they enable one thread to wait for child exit, FD readiness, and timeout. This may allow removing the existing Win32Helper.pm process, reducing today's overhead on Windows; search the source tree for WaitForMultipleObjects(). No CPAN module covers all those functions. This might pay off if Win32::API suffices or if bypassing Win32Helper.pm gains considerable efficiency. Otherwise, it might pay off on a long time horizon, if we can get its C code in the default installation of Windows Perl.

Are there other tactics that might outperform these? Are there other key considerations for evaluating the tactics listed?


I feel the combination of [SIGCHLD=sub{}], [low-select-timeout-w32], and [waitpid-blocking] has only minor drawbacks, hence that stopping point. The delay remains when SIGCHLD is blocked or arrives just before select(). Windows incurs CPU overhead via shorter select() timeouts, only when there's a harness timeout and zero FDs. The most-plausible alternative was to replace [SIGCHLD=sub{}] with [self-pipe], accepting a bit of CPU overhead and developer-facing complexity to remove the "just before select()" race condition. [Windows-wait] is a more-speculative alternative.

Since tests can't assume much about the passage of real time, I'm not adding test cases. Existing tests caught bugs in earlier versions of the waitpid-related commits. You can see the responsiveness of several cases via the output of temporary tests I had in t/eintr.t here: https://github.com/nmisch/IPC-Run/actions/runs/8219780209/job/22478157192.

nmisch commented 5 months ago

waitpid-blocking will change $? in some cases, e.g. the first one in the following program. I am inclined to accept that. As the other two examples show, $? already has been sensitive to implementation details. When a pipeline has multiple kids, use result(), results(), full_result(), or full_results() for stable behavior. The main alternative would be to drop waitpid-blocking, keeping the other two fixes. That would lose a probably-negligible bit of efficiency, more noticeable on Windows than elsewhere. Would that be better?

use IPC::Run 'run';

my $in;
my $out;

# master : print 2, because _select_loop -> pumpable -> pump_nb reacts in order of actual termination
# patched: print 1, because finish -> _cleanup -> waitpid follows $self->{KIDS} order
run([$^X, '-e', 'sleep 2; exit 2'], '&',
    [$^X, '-e', 'sleep 1; exit 1']);
printf "%d\n", $? >> 8;

# master : print 1, because the >$out channel remains open until both kids have
#   terminated, and _select_loop -> pumpable doesn't call pump_nb due to the
#   open channel
# patched : same
run([$^X, '-e', 'sleep 2; exit 2'], '<', \$in, '>', \$out, '&',
    [$^X, '-e', 'sleep 1; exit 1']);
printf "%d\n", $? >> 8;

# master : print 2, because _select_loop -> pumpable -> pump_nb reacts in order of actual termination
# patched: same
run([$^X, '-e', 'close STDOUT; sleep 2; exit 2'], '>', \$out, '&',
    [$^X, '-e', 'close STDOUT; sleep 1; exit 1']);
printf "%d\n", $? >> 8;