brettwooldridge / NuProcess

Low-overhead, non-blocking I/O, external Process implementation for Java
Apache License 2.0
710 stars 84 forks source link

Prevent onStart/onExit race for fast-exiting processes. #158

Closed bturner closed 2 months ago

bturner commented 3 months ago

Immediately after a LinuxProcess is registered with ProcessEpoll, it's possible for soft exit detection to detect process termination. If a process exits quickly, this means it's possible for NuProcessHandler.onExit to be called before onStart. When that happens, BasePosixProcess.callStart throws a NullPointerException because exit cleanup clears the process handler.

The top of such a call stack looks something like this:

May 01, 2024 1:38:24 AM com.zaxxer.nuprocess.internal.BasePosixProcess callStart
WARNING: Exception thrown from handler
java.lang.NullPointerException
    at com.zaxxer.nuprocess.internal.BasePosixProcess.callStart(BasePosixProcess.java:587)
    at com.zaxxer.nuprocess.linux.LinuxProcess.start(LinuxProcess.java:80)
    at com.zaxxer.nuprocess.linux.LinProcessFactory.createProcess(LinProcessFactory.java:40)
    at com.zaxxer.nuprocess.NuProcessBuilder.start(NuProcessBuilder.java:260)

This change splits the existing IEventProcessor.registerProcess method into two: registerProcess, and queueRead (a parallel to queueWrite). registerProcess sets up mappings (pidToProcessMap, fildesToProcessMap), and then queueRead does the epoll (or kevent, on macOS) operations.

This split ensures the process is registered before onStart is called, which is necessary to allow process handlers to call wantWrite during onStart, while also ensuring stdout/stderr aren't consumed until after onStart is called (which, in turn, prevents early soft-exit detection).

bturner commented 3 months ago

@wkritzinger-atlassian, if you have a moment, could you test this with Bitbucket?

I've verified it locally on Linux (x64 only) and macOS (x64 and aarch64, both on native hardware) and Windows (x64 only, not that there should be any risk of a regression on Windows here), so I know all the basics work, but I'd value the more thorough check.

wkritzinger-atlassian commented 3 months ago

@bturner Apologies for the delay in my response. I ran all of our internal tests against the version that is in this change and it is all green.

I also had a look at this change and it seems good as well. I'll attempt to reproduce our zombie process problem that we're still having and keep you updated when I find something.

bturner commented 3 months ago

No worries, @wkritzinger-atlassian. I appreciate you taking the time to run through it. I'm going to look at a couple other pull requests and then see if I can work through a 2.1 release. Even if it doesn't fix your zombie process issue (and if I'm honest I suspect it won't), at least it will clean up the NullPointerExceptions on racy starts for some of the faster git processes.

fanf commented 1 week ago

Hello, Do you know when a release with that correction will be cut ? We believe we saw it two times IRL now, but it is extremely hard to test/reproduce. Having a version with the patch would help decide if it's the same problem.