Closed bturner closed 6 years ago
We're investigating using NuProcess in Bitbucket Server, so I put together a simple jmh benchmark to try and get a feel for its performance when pumping a large amount of data from a process (serving Git hosting operations can require pumping an enormous amount of data). My benchmark was very simple:
import static org.junit.Assert.assertEquals;

import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;
import java.util.zip.CRC32;

import org.junit.Test;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import com.zaxxer.nuprocess.NuAbstractProcessHandler;
import com.zaxxer.nuprocess.NuProcess;
import com.zaxxer.nuprocess.NuProcessBuilder;

public class NuProcessBenchmark {

    public static void main(String[] args) throws Exception {
        System.setProperty("com.zaxxer.nuprocess.threads", "cores");

        Options options = new OptionsBuilder()
                .addProfiler(GCProfiler.class)
                //.forks(0) // Set when running with JProfiler
                .forks(2)
                .jvmArgsAppend("-Dcom.zaxxer.nuprocess.threads=cores")
                .include(NuProcessBenchmark.class.getSimpleName())
                //.measurementIterations(20) // Set when running with JProfiler
                .measurementIterations(50)
                //.threads(5) // Set when running with JProfiler
                .threads(10)
                .timeUnit(TimeUnit.SECONDS)
                .warmupIterations(1)
                .build();

        new Runner(options).run();
    }

    @Benchmark
    @Test
    public void cat() throws Exception {
        NuProcessBuilder builder = new NuProcessBuilder("cat", "/tmp/random.dat");
        builder.setProcessListener(new NuAbstractProcessHandler() {

            private final CRC32 crc32 = new CRC32();

            @Override
            public void onExit(int statusCode) {
                assertEquals(3407529827L, crc32.getValue());
            }

            @Override
            public void onStdout(ByteBuffer buffer, boolean closed) {
                crc32.update(buffer);
            }
        });

        NuProcess process = builder.start();
        assertEquals(0, process.waitFor(1L, TimeUnit.DAYS));
    }
}
(/tmp/random.dat is a 250MB file created using head -c 250M /dev/urandom >/tmp/random.dat)
When running the benchmark, I noticed that the allocation churn appeared to be extremely high, triggering heavy GC (jmh showed ~34MB/s churn, compared to <200KB/s from a basically identical benchmark using our existing processutils library).
Running JProfiler (my benchmark above has some comments with the values I used when profiling), analyzing allocation hotspots showed ProcessEpoll.process, with a graph like this:
JNA is the primary culprit, triggering well over 500MB in allocations, as well as allocating a fair number of Integers.
With the changes above, profiling the same benchmark now shows:
Notice that:
- Integer allocations have been massively reduced
- Pointer allocations have been entirely removed

This reduces the reported churn rate from 34MB/s to 14MB/s, which results in significantly fewer GC runs.
The churn rate is extremely important to Bitbucket Server because we run an enormous number of Git processes and, on larger servers, can easily have over 100 git processes running in parallel. Not all of the processes will be pumping hundreds of megabytes of data, but such processes are anything but rare (and 250MB is far from the largest; repositories that are tens of gigabytes, while not common, are not exactly rare, either).
(Edit: For reference, the benchmark was run on a Linux Mint 18.3 x86_64 VM, with 8 VCPUs (AMD Threadripper 1950X) and 16GB of RAM, using Oracle Java 8u151 and Maven 3.3.9.)
I'm not sure about the Travis failure. That doesn't look like anything I broke.
Feel free to ignore my .gitignore and pom.xml changes. I tried not to adjust whitespace or other formatting, to keep my changes down and make them easier to review.
(Also, I guess I should note that this will fix #79)
I love these changes, thanks!
It looks like travis-ci no longer supports "oraclejdk7". Would you mind changing the .travis.yml to point at "openjdk7" instead so that we can get the build green?
Everything looks like it's fine on my local box here.
@lfbayer Thanks for the speedy review! Happy to tweak the Travis config.
Looks like CatTest.changeCwd is failing in Travis, but I'm not sure why. Platform, maybe? I tried Oracle Java 7u80 and 8u151 and OpenJDK 8u151, and all tests passed on all three.
It definitely has nothing to do with any of your changes.
I think error: 1 would indicate EPERM, meaning there is some permission error.
The internets indicate that travis-ci does not give CAP_SYS_ADMIN, which would mean that this call would fail. But it isn't clear to me how the build ever succeeded if that is the case. I'm guessing now that it must be something that changed in travis-ci. But it's strange to me that we even require additional permissions. Unfortunately I'm not really familiar with unshare or CLONE_FS, maybe we're actually using it wrong.
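For context, here's a minimal sketch of the kind of unshare(CLONE_FS) call under discussion, written as a hypothetical standalone JNA binding (NuProcess's actual binding may differ; the CLONE_FS value 0x200 comes from sched.h, and errno 1 is EPERM):

import com.sun.jna.Native;

public class UnshareSketch {
    static {
        // Direct-map the native method below against libc.
        Native.register(UnshareSketch.class, "c");
    }

    // From sched.h: detach this thread's filesystem attributes (cwd, umask)
    // so a chdir() for one child process doesn't affect sibling threads.
    private static final int CLONE_FS = 0x200;

    // unshare(2): returns 0 on success, -1 on failure.
    public static native int unshare(int flags);

    public static void main(String[] args) {
        if (unshare(CLONE_FS) != 0) {
            // An EPERM here would match the "error: 1" seen in the Travis
            // logs, e.g. if the sandbox withholds a needed capability.
            System.out.println("unshare(CLONE_FS) failed; errno=" + Native.getLastError());
        }
    }
}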
It looks like @brettwooldridge is committing some changes that he was sitting on. I've chatted briefly about this issue with him and there are a couple things we will investigate.
If you rebase onto master your travis-ci build should work now (you're likely to have a conflict in .travis.yml).
@bturner Finally, can you create a "benchmarks" directory and include your jmh benchmark?
@brettwooldridge Sure, happy to add that in. Would you like it in a Maven profile to keep it out of the main build?
@bturner If you have no further changes, I am prepared to merge this.
Sorry, had to step away to help put my son to bed. I'm just finishing up adding the benchmark. I can open a separate pull request for it if you prefer.
@bturner No it's fine, no hurry, just wait until the pull request is ready.
Sorry for the delay, @brettwooldridge and @lfbayer . I got hung up on some details that I couldn't get quite how I wanted them.
For my benchmark, it requires Java 8 because it uses CRC32.update(ByteBuffer), which is new in 8. Other approaches skew the benchmark because they require allocating byte[]s to copy everything from the ByteBuffer.

MessageDigest has had update(ByteBuffer) since Java 5, but all of the digest algorithms are quite slow compared to a simple CRC32, and skew the benchmark as a result. I don't need a cryptographically strong hash; I'm just using the CRC32 a) as something to "do" with the output and b) to perform a low-overhead check that the output was pumped "correctly"
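For illustration, a small sketch of the difference, assuming a direct buffer like the ones NuProcess hands to onStdout:

import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class Crc32UpdateSketch {
    public static void main(String[] args) {
        CRC32 crc32 = new CRC32();
        ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
        // ... buffer would be filled with process output here, then flipped ...
        buffer.flip();

        // Java 8+: consumes the buffer in place, no intermediate copy.
        crc32.update(buffer);

        // Pre-Java 8 equivalent: forces a heap byte[] allocation per chunk,
        // which would dominate an allocation-rate benchmark.
        // byte[] copy = new byte[buffer.remaining()];
        // buffer.get(copy);
        // crc32.update(copy, 0, copy.length);
    }
}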
It's fine if the benchmark requires Java 8.
@bturner Check out these memory results after my latest refactor:
Iteration 4: 21.312 ops/s
·gc.alloc.rate: 0.656 MB/sec
·gc.alloc.rate.norm: 52814.261 B/op
·gc.count: ≈ 0 counts
·threads.alive: 11.000 threads
·threads.daemon: 10.000 threads
·threads.started: 11.000 threads
...
Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate":
0.610 ±(99.9%) 0.127 MB/sec [Average]
(min, avg, max) = (0.411, 0.610, 0.834), stdev = 0.124
CI (99.9%): [0.484, 0.737] (assumes normal distribution)
Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate.norm":
54039.498 ±(99.9%) 2071.499 B/op [Average]
(min, avg, max) = (52798.857, 54039.498, 59384.800), stdev = 2034.489
CI (99.9%): [51967.999, 56110.998] (assumes normal distribution)
Compared to previous (on my Linux VM):
Iteration 4: 14.460 ops/s
·gc.alloc.rate: 6.760 MB/sec
·gc.alloc.rate.norm: 693246.857 B/op
·gc.count: ≈ 0 counts
·threads.alive: 11.000 threads
·threads.daemon: 10.000 threads
·threads.started: 11.000 threads
...
Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate":
7.175 ±(99.9%) 1.383 MB/sec [Average]
(min, avg, max) = (3.776, 7.175, 9.336), stdev = 1.358
CI (99.9%): [5.793, 8.558] (assumes normal distribution)
Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate.norm":
694515.796 ±(99.9%) 2264.247 B/op [Average]
(min, avg, max) = (693200.000, 694515.796, 699813.600), stdev = 2223.793
CI (99.9%): [692251.549, 696780.043] (assumes normal distribution)
I profiled a jmh run with JProfiler (./benchmark.sh profile) and, as far as I can tell, 99% of the allocation was related to the benchmark itself. NuProcess itself, at least on Linux where these changes reside, should be running very close to zero-allocation, with the exception of process startup, where things like the environment are cloned and the three I/O buffers are created.
@bturner I'd be interested to hear how this affects your throughput numbers.
@bturner Also fixed up the benchmark to self-calculate the CRC32.
@bturner The moral of the story seems to be that JNA is extremely inefficient by default, but can be efficient if used carefully.
Structures are extremely inefficient, with every field access throwing off Iterator instances.

And (possibly biggest impact) passing a ByteBuffer that was allocated by ByteBuffer.allocateDirect(), compared to new Memory(xxx).getByteBuffer(), across the JNA interface causes allocations of Memory$SharedMemory objects on every trip.
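A sketch of the two allocation paths being contrasted (Memory and getByteBuffer are real JNA APIs; the 64K size here is just the NuProcess buffer size mentioned later in this thread):

import java.nio.ByteBuffer;

import com.sun.jna.Memory;

public class JnaBufferSketch {
    public static void main(String[] args) {
        // Costly path: a JVM-allocated direct buffer. Per the observation
        // above, each trip across the JNA boundary allocates a fresh
        // Memory$SharedMemory wrapper for its address.
        ByteBuffer jvmDirect = ByteBuffer.allocateDirect(64 * 1024);

        // Cheap path: allocate through JNA and take a ByteBuffer view.
        // JNA already owns this address, so no per-call wrappers are made.
        Memory memory = new Memory(64 * 1024);
        ByteBuffer jnaView = memory.getByteBuffer(0, memory.size());

        System.out.println(jvmDirect.isDirect() + " " + jnaView.isDirect());
    }
}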
That's an epic improvement, @brettwooldridge! Sensational. My benchmark run now shows:
# Run complete. Total time: 00:03:51
Benchmark Mode Cnt Score Error Units
NuProcessBenchmark.cat thrpt 100 27.591 ± 0.678 ops/s
NuProcessBenchmark.cat:·gc.alloc.rate thrpt 100 4.168 ± 0.098 MB/sec
NuProcessBenchmark.cat:·gc.alloc.rate.norm thrpt 100 231161.994 ± 387.435 B/op
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space thrpt 100 4.503 ± 3.528 MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space.norm thrpt 100 254029.703 ± 198985.617 B/op
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space thrpt 100 0.025 ± 0.048 MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space.norm thrpt 100 1382.581 ± 2662.888 B/op
NuProcessBenchmark.cat:·gc.count thrpt 100 16.000 counts
NuProcessBenchmark.cat:·gc.time thrpt 100 194.000 ms
NuProcessBenchmark.cat:·threads.alive thrpt 100 23.510 ± 0.170 threads
NuProcessBenchmark.cat:·threads.daemon thrpt 100 22.510 ± 0.170 threads
NuProcessBenchmark.cat:·threads.started thrpt 100 24.000 threads
(Note that while the name is the same, this isn't quite the same benchmark as the one in the repository. For consistency I need to keep using my existing benchmark; otherwise this result wouldn't really be a relevant comparison for me.)
The throughput shows a marginal improvement, about 2 ops/s, but big enough to be statistically relevant; the real win, though, is definitely in that new allocation rate.
As a potentially relevant aside, this is the result of a similar (identical within my ability to make it so) benchmark run against jnr-process:
# Run complete. Total time: 00:03:39
Benchmark Mode Cnt Score Error Units
JnrProcessBenchmark.cat thrpt 100 28.234 ± 0.222 ops/s
JnrProcessBenchmark.cat:·gc.alloc.rate thrpt 100 4.388 ± 0.046 MB/sec
JnrProcessBenchmark.cat:·gc.alloc.rate.norm thrpt 100 229441.708 ± 515.061 B/op
JnrProcessBenchmark.cat:·gc.churn.PS_Eden_Space thrpt 100 4.554 ± 3.559 MB/sec
JnrProcessBenchmark.cat:·gc.churn.PS_Eden_Space.norm thrpt 100 239207.232 ± 187075.828 B/op
JnrProcessBenchmark.cat:·gc.churn.PS_Survivor_Space thrpt 100 0.072 ± 0.109 MB/sec
JnrProcessBenchmark.cat:·gc.churn.PS_Survivor_Space.norm thrpt 100 3699.963 ± 5587.508 B/op
JnrProcessBenchmark.cat:·gc.count thrpt 100 16.000 counts
JnrProcessBenchmark.cat:·gc.time thrpt 100 103.000 ms
JnrProcessBenchmark.cat:·threads.alive thrpt 100 16.980 ± 0.048 threads
JnrProcessBenchmark.cat:·threads.daemon thrpt 100 15.980 ± 0.048 threads
JnrProcessBenchmark.cat:·threads.started thrpt 100 17.000 threads
The allocation rates between the two are really close now, suggesting a transition over to JNR might not really have much "win" to it, from that perspective, now that NuProcess's JNA approach has been so aggressively tuned.
@bturner Thanks for the feedback! Very encouraging.
As I think about the allocation rate, I thought about pooling. Currently, when a process is started and a new instance of BasePosixProcess (LinuxProcess) is created, I allocate the three buffers for stdin/stdout/stderr. These are 64K each. When the process ends, these become garbage to be collected. At a rate of 20+ processes per second, this results in a several MB/sec allocation rate.
NuProcess could likely benefit from a pooling scheme for these buffers, whereby when the process ends, the buffers are returned to a pool, to be picked up by the next process run.
A pool could be associated with each processing thread, which would have two benefits. First, because the pool would only be accessed by a single thread, it could be non-concurrent/lock-free. Second, processing threads have a finite lifetime, so if no processes are run for 2.5s (default) the thread terminates and the pool would go with it, allowing memory to be given back during idle periods.
During busy periods, this would result in a near zero allocation rate, except for a handful of minor objects; the buffers clearly being the bulk of the memory overhead of a process.
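A minimal sketch of that per-thread pooling idea; BufferPool is a hypothetical name, not a NuProcess class:

import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical pool owned by a single processing thread. Because only the
// owning thread touches it, no locking is needed; when an idle processing
// thread terminates, the pool and its buffers become collectible with it.
final class BufferPool {
    private static final int BUFFER_SIZE = 64 * 1024; // 64K, as in NuProcess

    private final Deque<ByteBuffer> free = new ArrayDeque<>();

    ByteBuffer acquire() {
        ByteBuffer buffer = free.pollFirst();
        return buffer != null ? buffer : ByteBuffer.allocateDirect(BUFFER_SIZE);
    }

    void release(ByteBuffer buffer) {
        buffer.clear();
        free.addFirst(buffer);
    }
}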
I burned some midnight oil and implemented pooling of Memory/ByteBuffer objects. Much to my astonishment, the benchmark showed almost no difference at all in terms of allocation rate.
Either JNA or the JVM itself is playing a game behind the scenes to reuse the objects in the non-pooling case.
I'd like to measure one of the non-jmh tests with some of the heap analysis tools to eliminate jmh's own allocation from the picture.
- Disables autoRead and/or autoWrite on EpollEvent structs in ProcessEpoll, depending on how they're used (see the sketch after this list)
  - EpollEvents in the eventPool have autoRead disabled because they are used in a write-only context (calling epoll_ctl)
  - The triggeredEvent EpollEvent has both autoRead and autoWrite disabled
    - When epoll_wait returns 1, read() is called explicitly, but if it returns 0 native memory is not read
    - Before calling epoll_ctl to register for further stdin events, write() is called explicitly
- EpollEvent: added a unit test to verify the size of the struct is 12 bytes
- Removed EpollEvent.EpollData.ptr because, even though it's never used, JNA still allocates it
- Replaced Integers with ints, using Integer.MIN_VALUE as a marker when acquiring descriptors, to avoid allocating 3 per process() loop
  - Integers will still be created then
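For reference, a simplified sketch of the JNA mechanism these bullets rely on. Structure.setAutoRead/setAutoWrite and read()/write() are real JNA APIs, but the struct below is an illustrative stand-in, not NuProcess's actual EpollEvent:

import java.util.Arrays;
import java.util.List;

import com.sun.jna.Structure;

// Illustrative stand-in for Linux's epoll_event; the real layout uses a
// union for epoll_data and careful packing to reach the 12 bytes that the
// unit test mentioned above verifies.
public class EpollEventSketch extends Structure {
    public int events;
    public int fd;

    @Override
    protected List<String> getFieldOrder() {
        return Arrays.asList("events", "fd");
    }

    public static void main(String[] args) {
        EpollEventSketch event = new EpollEventSketch();

        // Write-only usage (e.g. structs passed to epoll_ctl): skip JNA's
        // automatic native->Java read-back after each native call.
        event.setAutoRead(false);

        // When synchronization is instead performed explicitly via read()
        // and write() only where needed, the automatic Java->native write
        // before each call can be skipped too.
        event.setAutoWrite(false);
    }
}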
Extra bits:
- .gitignore