brettwooldridge / NuProcess

Low-overhead, non-blocking I/O, external Process implementation for Java
Apache License 2.0
710 stars 84 forks source link

Interrupts while registering processes with ProcessEpoll can leak processes #126

Closed bturner closed 2 years ago

bturner commented 3 years ago

ProcessEpoll maintains a BlockingQueue of EpollEvent instances. When processes are registered, an instance is pulled off the queue, used to register events for that process, and then put back onto the queue. The queue has 64 EpollEvent instances in it, so In a system that's lightly loaded there's always an instance available. In a heavily loaded Bitbucket Server instance, though, spikes can easily trigger the creation of 100+ processes concurrently.

If the thread is interrupted (e.g. Thread.interrupt()) while it's blocked in take(), a RuntimeException is thrown and caught by LinuxProcess's start or run code, which logs the exception with a message that the process couldn't be started and the process is abandoned. But that's not actually accurate. The process was started, but NuProcess couldn't register it. That means NuProcess never reaps the process and it either hangs (if it needs stdin or blocks after writing too much to stdout or stderr) or terminates and becomes a zombie.

bturner commented 3 years ago
2021-05-25 11:11:15,409 WARN  [threadpool:thread-19] jdoe *D51P0Nx671x3323004x13 1ygxqot 1.1.1.1,2.2.2.2 "GET /rest/api/latest/projects/KEY/repos/slug/branches HTTP/1.1" c.z.n.internal.BasePosixProcess Failed to start process
java.lang.RuntimeException: Interrupted while registering 7224
        at com.zaxxer.nuprocess.linux.ProcessEpoll.registerProcess(ProcessEpoll.java:139)
        at com.zaxxer.nuprocess.linux.ProcessEpoll.<init>(ProcessEpoll.java:74)
        at com.zaxxer.nuprocess.linux.LinuxProcess.run(LinuxProcess.java:102)
        at com.zaxxer.nuprocess.linux.LinProcessFactory.runProcess(LinProcessFactory.java:50)
        at com.zaxxer.nuprocess.NuProcessBuilder.run(NuProcessBuilder.java:273)
        at com.atlassian.bitbucket.internal.process.nu.NuNioProcessHelper.run(NuNioProcessHelper.java:75)
        at com.atlassian.bitbucket.internal.process.NioCommand.call(NioCommand.java:46)
        at com.atlassian.stash.internal.scm.git.command.TransformedGitCommand.call(TransformedGitCommand.java:45)
        at com.atlassian.bitbucket.internal.branch.LatestCommitMetadataProvider.getMetadata(LatestCommitMetadataProvider.java:45)
        at com.atlassian.stash.internal.repository.PluginRefMetadataMapProvider.lambda$null$3(PluginRefMetadataMapProvider.java:99)
        at com.atlassian.stash.internal.concurrent.StateTransferringCallable.call(StateTransferringCallable.java:33)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
        at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:400)
        at com.zaxxer.nuprocess.linux.ProcessEpoll.registerProcess(ProcessEpoll.java:116)
        ... 18 common frames omitted

Note that the line numbers for ProcessEpoll above won't line up with any NuProcess release. We had to switch to running a forked build of NuProcess while waiting for pull request #118 and #120 to be merged. What is ProcessEpoll.java:116 in our fork is line 112 In the canonical NuProcess codebase.

$ ps -ef | grep defunct
atlbitb+  7224 26859  0 May25 ?        00:00:00 [git] <defunct>
bturner commented 3 years ago

I'm happy to put together a pull request for this, but before I do I'm curious how you'd like to see it addressed, @brettwooldridge. I can think of a couple options:

Or maybe there's something else you prefer. I can put together whatever you think is the best option.