Optimize epoll_wait loop to avoid memory churn.

bturner commented 6 years ago

Disabled autoRead and/or autoWrite on EpollEvent structs depending on how they're used, in ProcessEpoll
- EpollEvents in the eventPool have autoRead disabled because they are used a write-only context (Calling epoll_ctl)
- The triggeredEvent EpollEvent has both autoRead/autoWrite disabled
- When epoll_wait returns 1, read() is called explicitly, but if it returns 0 native memory is not read
- Prior to calling epoll_ctl to register for further stdin events, write() is called explicitly
Fixed memory alignment on EpollEvent and added a unit test to verify the size of the struct is 12 bytes
Removed EpollEvent.EpollData.ptr because, even though it's never used, JNA still allocates it
Replaced Integers with ints using Integer.MIN_VALUE as a marker, when acquiring descriptors, to avoid allocating 3 per process() loop
- If the process soft-exits, Integers will still be created then

Extra bits:

Added IntelliJ IDEA files to .gitignore
Updated JNA to 4.5.1

bturner commented 6 years ago

We're investigating using NuProcess in Bitbucket Server, so I put together a simple jmh benchmark to try and get a feel for its performance when pumping a large amount of data from a process (serving Git hosting operations can require pumping an enormous amount of data). My benchmark was very simple:

public class NuProcessBenchmark {

    public static void main(String[] args) throws Exception {
        System.setProperty("com.zaxxer.nuprocess.threads", "cores");

        Options options = new OptionsBuilder()
                .addProfiler(GCProfiler.class)
                //.forks(0) //Set when running with JProfiler
                .forks(2)
                .jvmArgsAppend("-Dcom.zaxxer.nuprocess.threads=cores")
                .include(NuProcessBenchmark.class.getSimpleName())
                //.measurementIterations(20) // Set when running with JProfiler
                .measurementIterations(50)
                //.threads(5) // Set when running with JProfiler
                .threads(10)
                .timeUnit(TimeUnit.SECONDS)
                .warmupIterations(1)
                .build();

        new Runner(options).run();
    }

    @Benchmark
    @Test
    public void cat() throws Exception {
        NuProcessBuilder builder = new NuProcessBuilder("cat", "/tmp/random.dat");
        builder.setProcessListener(new NuAbstractProcessHandler() {

            private final CRC32 crc32 = new CRC32();

            @Override
            public void onExit(int statusCode) {
                assertEquals(3407529827L, crc32.getValue());
            }

            @Override
            public void onStdout(ByteBuffer buffer, boolean closed) {
                crc32.update(buffer);
            }
        });

        NuProcess process = builder.start();
        assertEquals(0, process.waitFor(1L, TimeUnit.DAYS));
    }
}

(/tmp/random.dat is a 250MB file created using using head -c 250M /dev/urandom >/tmp/random.dat)

When running the benchmark, I noticed that the GC churn appeared to be extremely high, triggering heavy GC. (jmh showed ~34MB/s churn, compared to <200KB/s churn from a basically identical benchmark using our existing processutils library)

Running JProfiler (my benchmark above has some comments with the values I used when profiling), analyzing allocation hotspots showed ProcessEpoll.process, with a graph like this:

JNA is the primary culprit, triggering well over 500MB in allocations, as well as allocating a fair number of Integers.

With the changes above, profiling the same benchmark now shows:

Notice that:

The number of Integer allocations has been massively reduced
The number of JNA allocations has been cut approximately in half, except for
Pointer allocations have been entirely removed

This reduces the reported churn rate from 34MB/s to 14MB/s, which results in significantly less GC runs.

The churn rate is extremely important to Bitbucket Server because we run an enormous number of Git processes and, on larger servers, can easily have over 100 git processes running in parallel. Not all of the processes will be pumping hundreds of megabytes of data, but such processes are anything but rare (and 250MB is far from the largest; repositories that are tens of gigabytes, while not common, are not exactly rare, either).

(Edit: For reference, the benchmark was run on a Linux Mint 18.3 x86_64 VM, with 8 VCPUs (AMD Threadripper 1950X) and 16GB of RAM, using Oracle Java 8u151 and Maven 3.3.9.)

bturner commented 6 years ago

I'm not sure about the Travis failure. That doesn't look like anything I broke.

Feel free to ignore my .gitignore and pom.xml changes. I tried not to adjust whitespace or other formatting, to keep my changes down and make them easier to review.

bturner commented 6 years ago

(Also, I guess I should note that this will fix #79)

lfbayer commented 6 years ago

I love these changes, Thanks!

It looks like travis-ci no longer supports "oraclejdk7". Would you mind changing the .travis.yml to point at "openjdk7" instead so that we can get the build green?

Everything looks like it's fine on my local box here.

bturner commented 6 years ago

@lfbayer Thanks for the speedy review! Happy to tweak the Travis config

bturner commented 6 years ago

Looks like CatTest.changeCwd is failing in Travis, but I'm not sure why. Platform, maybe? I tried Oracle Java 7u80 and 8u151 and OpenJDK 8u151 and all tests passed on all three.

lfbayer commented 6 years ago

It definitely has nothing to do with any of your changes.

I think error: 1 would indicate EPERM, meaning there is some permission error.

The internets indicate that travis-ci does not give CAP_SYS_ADMIN, which would mean that this call would fail. But it isn't clear to me how the build ever succeeded if that is the case. I'm guessing now that it must be something that changed in travis-ci. But it's strange to me that we even require additional permissions. Unfortunately I'm not really familiar with unshare or CLONE_FS, maybe we're actually using it wrong.

It looks like @brettwooldridge is committing some changes that he was sitting on. I've chatted briefly about this issue with him and there are a couple things we will investigate.

brettwooldridge commented 6 years ago

If you rebase onto master your travis-ci build should work now (you're likely to have a conflict in .travis.yml).

brettwooldridge commented 6 years ago

@bturner Finally, can you create a "benchmarks" directory and include your jmh benchmark?

bturner commented 6 years ago

@brettwooldridge Sure, happy to add that in. Would you like it in a Maven profile to keep it out of the main build?

brettwooldridge commented 6 years ago

@bturner If you have no further changes, I am prepared to merge this.

bturner commented 6 years ago

Sorry, had to step away to help put my son to bed. I'm just finishing up adding the benchmark. I can open a separate pull request for it if you prefer.

brettwooldridge commented 6 years ago

@bturner No it's fine, no hurry, just wait until the pull request is ready.

bturner commented 6 years ago

Sorry for the delay, @brettwooldridge and @lfbayer . I got hung up on some details that I couldn't get quite how I wanted them.

For my benchmark, it requires Java 8 because it uses CRC32.update(ByteBuffer), which is new in 8. Other approaches skew the benchmark because they require allocating byte[]s to copy everything from the ByteBuffer.

MessageDigest has had update(ByteBuffer) since Java 5, but all of the digest algorithms are quite slow compared to a simple CRC32, and skew the benchmark as a result. I don't need a cryptographicaly-strong hash; I'm just using the CRC32 a) as something to "do" with the output and b) to perform a low-overhead check that the output was pumped "correctly"

brettwooldridge commented 6 years ago

It’s fine if the benchmark requires Java8.

brettwooldridge commented 6 years ago

@bturner Checkout these memory results after my latest refactor:

Iteration   4: 21.312 ops/s
                 ·gc.alloc.rate:      0.656 MB/sec
                 ·gc.alloc.rate.norm: 52814.261 B/op
                 ·gc.count:           ≈ 0 counts
                 ·threads.alive:      11.000 threads
                 ·threads.daemon:     10.000 threads
                 ·threads.started:    11.000 threads
...
Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate":
  0.610 ±(99.9%) 0.127 MB/sec [Average]
  (min, avg, max) = (0.411, 0.610, 0.834), stdev = 0.124
  CI (99.9%): [0.484, 0.737] (assumes normal distribution)

Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate.norm":
  54039.498 ±(99.9%) 2071.499 B/op [Average]
  (min, avg, max) = (52798.857, 54039.498, 59384.800), stdev = 2034.489
  CI (99.9%): [51967.999, 56110.998] (assumes normal distribution)

Compared to previous (on my Linux VM):

Iteration   4: 14.460 ops/s
                 ·gc.alloc.rate:      6.760 MB/sec
                 ·gc.alloc.rate.norm: 693246.857 B/op
                 ·gc.count:           ≈ 0 counts
                 ·threads.alive:      11.000 threads
                 ·threads.daemon:     10.000 threads
                 ·threads.started:    11.000 threads
...
Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate":
  7.175 ±(99.9%) 1.383 MB/sec [Average]
  (min, avg, max) = (3.776, 7.175, 9.336), stdev = 1.358
  CI (99.9%): [5.793, 8.558] (assumes normal distribution)

Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate.norm":
  694515.796 ±(99.9%) 2264.247 B/op [Average]
  (min, avg, max) = (693200.000, 694515.796, 699813.600), stdev = 2223.793
  CI (99.9%): [692251.549, 696780.043] (assumes normal distribution)

I profiled a jmh run with jProfiler ... ./benchmark.sh profile ... and as far as I can tell, 99% of the allocation was related to the benchmark itself. NuProcess itself, at least on Linux where these changes reside, should be running very close to zero-allocation; with the exception of process startup where things like the environment are cloned, and the 3 I/O buffers created.

@bturner I'd be interested to hear how this affects your throughput numbers.

brettwooldridge commented 6 years ago

@bturner Also fixed up the benchmark to self-calculate the CRC32.

brettwooldridge commented 6 years ago

@bturner The moral of the story seems to be that JNA is extremely inefficient by default, but can be efficient if used carefully.

Structures are extremely inefficient, with every field access throwing off Iterator instances.

And (possibly biggest impact) passing a ByteBuffer that was allocated by ByteBuffer.allocateDirect(), compared to new Memory(xxx).getByteBuffer(), across the JNA interface causes allocations of Memory$SharedMemory objects on every trip.

bturner commented 6 years ago

That's an epic improvement, @brettwooldridge! Sensational. My benchmark run now shows:

# Run complete. Total time: 00:03:51

Benchmark                                                 Mode  Cnt       Score        Error    Units
NuProcessBenchmark.cat                                   thrpt  100      27.591 ±      0.678    ops/s
NuProcessBenchmark.cat:·gc.alloc.rate                    thrpt  100       4.168 ±      0.098   MB/sec
NuProcessBenchmark.cat:·gc.alloc.rate.norm               thrpt  100  231161.994 ±    387.435     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space           thrpt  100       4.503 ±      3.528   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space.norm      thrpt  100  254029.703 ± 198985.617     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space       thrpt  100       0.025 ±      0.048   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space.norm  thrpt  100    1382.581 ±   2662.888     B/op
NuProcessBenchmark.cat:·gc.count                         thrpt  100      16.000                counts
NuProcessBenchmark.cat:·gc.time                          thrpt  100     194.000                    ms
NuProcessBenchmark.cat:·threads.alive                    thrpt  100      23.510 ±      0.170  threads
NuProcessBenchmark.cat:·threads.daemon                   thrpt  100      22.510 ±      0.170  threads
NuProcessBenchmark.cat:·threads.started                  thrpt  100      24.000               threads

(Note that while the name is the same this isn't quite the same benchmark as the one in the repository. For consistency I need to keep using my existing benchmark, otherwise this result wouldn't really be a relevant comparison for me.)

The throughput shows a marginal improvement, about 2 ops/s but big enough to be statistically relevant, but the real win is definitely in that new allocation rate.

As a potentially relevant aside, this is the result of a similar (identical within my ability to make it so) benchmark run against jnr-process:

# Run complete. Total time: 00:03:39

Benchmark                                                  Mode  Cnt       Score        Error    Units
JnrProcessBenchmark.cat                                   thrpt  100      28.234 ±      0.222    ops/s
JnrProcessBenchmark.cat:·gc.alloc.rate                    thrpt  100       4.388 ±      0.046   MB/sec
JnrProcessBenchmark.cat:·gc.alloc.rate.norm               thrpt  100  229441.708 ±    515.061     B/op
JnrProcessBenchmark.cat:·gc.churn.PS_Eden_Space           thrpt  100       4.554 ±      3.559   MB/sec
JnrProcessBenchmark.cat:·gc.churn.PS_Eden_Space.norm      thrpt  100  239207.232 ± 187075.828     B/op
JnrProcessBenchmark.cat:·gc.churn.PS_Survivor_Space       thrpt  100       0.072 ±      0.109   MB/sec
JnrProcessBenchmark.cat:·gc.churn.PS_Survivor_Space.norm  thrpt  100    3699.963 ±   5587.508     B/op
JnrProcessBenchmark.cat:·gc.count                         thrpt  100      16.000                counts
JnrProcessBenchmark.cat:·gc.time                          thrpt  100     103.000                    ms
JnrProcessBenchmark.cat:·threads.alive                    thrpt  100      16.980 ±      0.048  threads
JnrProcessBenchmark.cat:·threads.daemon                   thrpt  100      15.980 ±      0.048  threads
JnrProcessBenchmark.cat:·threads.started                  thrpt  100      17.000               threads

The allocation rates between the two are really close now, suggesting a transition over to JNR might not really have much "win" to it, from that perspective, now that NuProcess's JNA approach has been so aggressively tuned.

brettwooldridge commented 6 years ago

@bturner Thanks for the feedback! Very encouraging.

As I think about the allocation rate, I thought about pooling. Currently, when a process is started and a new instance of BasePosixProcess (LinuxProcess) is created, I allocate the three buffers for stdin/stdout/stderr. These are 64K each. When the process ends, these become garbage to be collected. At a rate of 20+ processes per second, this results in a several MB/sec allocation rate.

NuProcess could likely benefit from a pooling scheme for these buffers, whereby when the process ends, the buffers are returned to a pool, to be picked up by the next process run.

A pool could be associated with each processing thread, which would have two benefits. First, because the pool would only be accessed by a single thread, it could be non-concurrent/lock-free. Second, processing threads have a finite lifetime, so if no processes are run for 2.5s (default) the thread terminates and the pool would go with it, allowing memory to be given back during idle periods.

During busy periods, this would result in a near zero allocation rate, except for a handful of minor objects; the buffers clearly being the bulk of the memory overhead of a process.

brettwooldridge commented 6 years ago

I burned some midnight oil and implemented pooling of Memory/ByteBuffer objects. Much to my astonishment, the benchmark showed almost no difference at all in terms of allocation rate.

Either JNA or the JVM itself is playing a game behind the scenes to reuse the objects in the non-pooling case.

I’d like to measure one of the non-jmh tests with some of the heap analysis tools to eliminate jmh ‘s own allocation from the picture.

brettwooldridge / NuProcess

Optimize epoll_wait loop to avoid memory churn. #80