brettwooldridge / NuProcess

Low-overhead, non-blocking I/O, external Process implementation for Java
Apache License 2.0

Pump on calling thread #95

Closed: bturner closed this 2 years ago

bturner commented 6 years ago

The way NuProcess reuses a fixed set of pump threads for handling process output is a great feature, and it helps control thread bloat. It's one of the reasons we want to migrate Bitbucket Server over to it. However, as I make progress switching from Java's ProcessBuilder to NuProcess, I'm finding the shared pumping approach isn't always the one we want.

When processing requests in Bitbucket Server, the most common scenario is that the system starts a git process and then blocks in waitFor, on that thread, to let it complete. During that time, that calling thread is doing nothing while one of the fixed pool of pump threads interleaves pumping for the started process, and most likely other processes, until the process completes. At that point, waitFor returns and I can let the calling thread continue.
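To make that scenario concrete, the flow looks roughly like the following with the current NuProcess API (illustrative only; the git command is arbitrary, and the two-argument onStdout callback and waitFor(0, ...) meaning "wait indefinitely" are assumptions to verify against the version in use):

```java
import com.zaxxer.nuprocess.NuAbstractProcessHandler;
import com.zaxxer.nuprocess.NuProcess;
import com.zaxxer.nuprocess.NuProcessBuilder;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

public class StartAndWaitExample {
    public static void main(String[] args) throws InterruptedException {
        NuProcessBuilder builder = new NuProcessBuilder(Arrays.asList("git", "rev-parse", "HEAD"));
        builder.setProcessListener(new NuAbstractProcessHandler() {
            @Override
            public void onStdout(ByteBuffer buffer, boolean closed) {
                // Invoked on one of the shared pump threads, not on the request thread
                byte[] bytes = new byte[buffer.remaining()];
                buffer.get(bytes);
                System.out.print(new String(bytes, StandardCharsets.UTF_8));
            }
        });

        NuProcess process = builder.start();                  // assigned to a shared pump thread
        int exitCode = process.waitFor(0, TimeUnit.SECONDS);  // request thread parks here until exit
        System.out.println("git exited with " + exitCode);
    }
}
```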

What I'd like to propose is an alternate method, alongside the current NuProcessBuilder.start, that builds a NuProcess but, instead of calling registerProcess to assign it to one of the shared pump threads, (essentially) creates a dedicated ProcessEpoll (or equivalent) and runs the processing loop directly on the calling thread, returning only when the process completes (or some timeout expires).
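To sketch the control flow I have in mind, purely conceptually (none of the names below are real NuProcess API; the placeholder methods stand in for the ProcessEpoll/ProcessKqueue machinery that currently runs on the shared pump threads, and the timeout handling shown is just one possible choice):

```java
import java.util.concurrent.TimeUnit;

/**
 * Conceptual skeleton only: roughly what "pump on the calling thread" would do.
 * The private methods are placeholders for the real epoll/kqueue plumbing.
 */
final class CallingThreadPumpSketch {

    int runToCompletion(long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);

        // 1. Spawn the child and register its stdin/stdout/stderr descriptors with an
        //    epoll (or kqueue) instance created solely for this process, instead of
        //    handing them to one of the shared pump threads.
        spawnAndRegister();

        // 2. Drive the event loop on *this* thread.
        while (!hasExited()) {
            if (System.nanoTime() - deadline >= 0) {
                destroyForcibly();
                break;
            }
            // Blocks until one of the child's descriptors is ready, then invokes
            // onStdout/onStderr/onStdinReady right here, on the calling thread.
            waitForEventAndDispatch();
        }

        // 3. The exit code is available immediately; there is no handoff from a pump thread.
        return exitCode();
    }

    // --- placeholders for the real ProcessEpoll/ProcessKqueue machinery ---
    private void spawnAndRegister() { /* spawn the child and register its fds */ }
    private boolean hasExited() { return true; }
    private void waitForEventAndDispatch() { /* wait for a ready fd, dispatch callbacks */ }
    private void destroyForcibly() { /* kill the child on timeout */ }
    private int exitCode() { return 0; }
}
```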

I'm willing to build this myself. I'm not asking for someone else to dedicate their free time to enabling my use case. But if such a change has no chance of getting accepted, well then there's not much point.

Some context on why I feel this is important

Bitbucket Server has essentially three "classes" of processes: hosting processes, which serve git clone, fetch and push traffic over HTTP or SSH; command processes, which are run to support the rest of the application (often as part of rendering a web page); and processes started by third-party code (plugins), whose number the system can't directly control.

The numbers of hosting and command processes Bitbucket Server allows to run concurrently are configured separately, defaulting to 1.5*CPUs for hosting (e.g. 6 processes on a 4-core server) and 25 (a flat number) for commands. (Again, we can't actually control the number of processes started by third-party code.) With our current ProcessBuilder-based approach, every hosting process is guaranteed to use four threads: an HTTP or SSH request thread, waiting while the process runs, and three additional threads for pumping stdin, stdout and stderr. Command (and third-party) processes use 2, 3 or 4 threads--many don't need a stdin thread, since they provide all their arguments on the command line, and some are run "asynchronously" so they don't block the requesting thread.
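For reference, the ProcessBuilder-based pattern being described boils down to something like this (a simplified sketch; the real code uses pooled threads and proper error handling rather than raw threads):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ProcessBuilderPumping {

    public static int run(byte[] stdin, String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).start();

        // Three dedicated pump threads per process, plus the request thread blocked in waitFor()
        Thread stdinPump = new Thread(() -> {
            try (OutputStream out = process.getOutputStream()) {
                out.write(stdin);
            } catch (IOException ignored) {
            }
        }, "stdin-pump");
        Thread stdoutPump = new Thread(() -> drain(process.getInputStream()), "stdout-pump");
        Thread stderrPump = new Thread(() -> drain(process.getErrorStream()), "stderr-pump");

        stdinPump.start();
        stdoutPump.start();
        stderrPump.start();

        int exitCode = process.waitFor(); // the request thread parks here, doing no useful work
        stdoutPump.join();
        stderrPump.join();
        return exitCode;
    }

    private static void drain(InputStream in) {
        byte[] buffer = new byte[8192];
        try (InputStream stream = in) {
            while (stream.read(buffer) != -1) {
                // real code would copy to the HTTP/SSH socket or capture the output for parsing
            }
        } catch (IOException ignored) {
        }
    }
}
```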

For hosting processes, the system implements "adaptive throttling", which considers the available RAM on the server during startup and then monitors the CPU load over time to increase or decrease the number of allowed processes. The default lower bound is 1*CPUs, and the default upper bound is 4*CPUs. (4-16 processes for a 4-core server, but potentially much higher for larger servers.)

Hosting and command processes have extremely different load profiles and durations. Hosting processes can run for hours, moving gigabytes of data across (comparatively slow) socket connections (which is done using blocking I/O, for the moment; more on this below). Command processes, on the other hand, move relatively small amounts of data, generally ranging between a few KB and a few MB, and almost always have a hard 2 minute timeout applied--either the process completes within 2 minutes or it's interrupted/killed.

Hosting processes effectively distill down to starting the right git process and then copying bytes back and forth between the socket (HTTP or SSH) and the executing process. (It's often more nuanced than that, especially for SSH hosting, but at a basic level that's how it works.) Reading from and writing to sockets is handled via blocking I/O. That means the onStdout, onStderr and onStdinReady callbacks, running on NuProcess's finite processing threads, are almost certain to make blocking socket calls. In general, with 4-8K buffers in place, those blocking calls are likely to be reasonably "fast", but there's certainly no guarantee of that. (When a developer in your India office clones over their slow WAN link to your Bitbucket Data Center instance hosted in US East on AWS, for example.) That means it's very likely hosting processes will "dominate" NuProcess's pump threads.
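To illustrate that risk, an onStdout callback along these lines (a simplified sketch; the class and the socketOut field are made up for the example, and the two-argument onStdout signature is assumed) ties up a shared pump thread for as long as the client socket is slow to accept bytes:

```java
import com.zaxxer.nuprocess.NuAbstractProcessHandler;

import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Copies git output to the client over a classic blocking socket stream. */
class BlockingSocketHandler extends NuAbstractProcessHandler {

    private final OutputStream socketOut; // e.g. Socket.getOutputStream() for the HTTP/SSH client

    BlockingSocketHandler(OutputStream socketOut) {
        this.socketOut = socketOut;
    }

    @Override
    public void onStdout(ByteBuffer buffer, boolean closed) {
        byte[] chunk = new byte[buffer.remaining()];
        buffer.get(chunk);
        try {
            // This write runs on a shared NuProcess pump thread. If the client is on a
            // slow WAN link, the pump thread stalls here, and every other process that
            // is scheduled on the same thread waits behind it.
            socketOut.write(chunk);
        } catch (IOException e) {
            // Real code would destroy the process and release the client connection here.
            throw new UncheckedIOException(e);
        }
    }
}
```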

Additionally, NuProcess's internal handling of ready processes, at least on Linux, is not "fair". Only a single ready descriptor is returned from epoll_wait on each pass through the processing loop, and there's no guarantee that ready descriptors are returned in any particular order. That means if 10 processes are ready, over 10 iterations of the loop some of those processes may be returned multiple times while others are never returned at all. For Bitbucket Server, one possible manifestation of this would be fast-executing (and highly user-visible, since they're almost always part of rendering a web page) command processes trying to move a few KB getting held up behind hosting processes moving hundreds of MB--and that's even more likely if hosting processes are delaying pump threads by making slow-running, blocking socket calls.

By having control over whether a given process is assigned to the shared pool of pump threads or pumped directly on the thread that starts it, most, if not all, of these considerations can be addressed, and we can still achieve a substantial reduction in thread count.

(Long term, especially for hosting operations, we need to switch to NIO sockets, and that's another enhancement on our backlog. Part of what I'm trying to avoid here is coupling those two changes together, as doing so reduces the odds of either improvement ever shipping.)

(I want to add that I'm sensitive to the fact that this is a general-use process library, not Bitbucket Server's dedicated process library. It's not my goal to bloat it with additional functionality that doesn't serve common use cases. However, in this instance, I have to believe it's not uncommon to have code that can't move forward until an external process has completed.)

brettwooldridge commented 6 years ago

@bturner In principle I have no objection to the proposal of an alternate path that allows the user thread to act as the service thread. Of course, the acceptability of the solution comes down to implementation.

As a side note, NuProcess's use of EPOLLONESHOT with a poll of one event is, according to the internets, supposed to improve fairness (at some cost of throughput) compared to standard epoll usage.

bturner commented 6 years ago

> As a side note, NuProcess's use of EPOLLONESHOT with a poll of one event is, according to the internets, supposed to improve fairness (at some cost of throughput) compared to standard epoll usage.

I hadn't considered the one-shot behavior. Thanks for pointing it out. That said, though, it seems like NuProcess only applies that for stdin, not stdout or stderr. Does that mean output-heavy processes (like the server-side git upload-pack or git http-backend for a clone or fetch, which require minimal stdin beyond the initial ref advertisement and mostly just stream pack data to stdout) are not disarmed/rearmed to improve "fairness"?

> Of course, the acceptability of the solution comes down to implementation.

I'll take a stab at it, and see if I can work up an implementation that makes the grade. Thanks for your consideration!

brettwooldridge commented 6 years ago

It’s a good point; I had forgotten that oneshot is only used for stdin. I think I have a branch somewhere that uses it throughout.

Having said that, I would expect the Linux scheduler to result in roughly interleaved output, but you can measure it empirically to see.

bturner commented 3 years ago

This has been released as of NuProcess 2.0.0