JDK12: OpenJ9 Peak State Performance half that of Hotspot

kuttyb commented 5 years ago

It appears OpenJ9 fully warmed up state performance is less than half that of Hotspot.

The test uses Eclipse Collections' MutableIntBag. It creates 100K entries and then profiles the forEach internal iterator, which visits each entry and adds it to an external counter.
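For readers without the attachment, the shape of the measured operation might look roughly like the sketch below. It is not the code from repro.tar.gz: IntBagSketch is a hypothetical stdlib-only stand-in for MutableIntBag, and the method names and the choice of microseconds as the timing unit are assumptions made for illustration.

```java
import java.util.function.IntConsumer;

class InternalIteratorSketch {

    // Hypothetical stand-in for an Eclipse Collections MutableIntBag:
    // a plain int[] that exposes an internal iterator. Not the real class.
    static final class IntBagSketch {
        private final int[] items;

        IntBagSketch(int size) {
            items = new int[size];
            for (int i = 0; i < size; i++) {
                items[i] = i;
            }
        }

        // Internal iteration: the collection drives the loop and calls back.
        void forEach(IntConsumer action) {
            for (int item : items) {
                action.accept(item);
            }
        }
    }

    // Adds every entry to an external counter through a lambda callback.
    static long sumAll(IntBagSketch bag) {
        long[] counter = {0}; // external counter captured by the lambda
        bag.forEach(v -> counter[0] += v);
        return counter[0];
    }

    // Times one full pass over the bag, in microseconds (assumed unit).
    static long timeOnePassMicros(IntBagSketch bag) {
        long start = System.nanoTime();
        long sum = sumAll(bag);
        long elapsedNanos = System.nanoTime() - start;
        if (sum < 0) {
            throw new IllegalStateException("unexpected sum"); // keeps the result live
        }
        return elapsedNanos / 1_000;
    }

    public static void main(String[] args) {
        IntBagSketch bag = new IntBagSketch(100_000);
        System.out.println("sum = " + sumAll(bag));
        System.out.println("one pass took " + timeOnePassMicros(bag) + " us");
    }
}
```

A single timed pass like this says little on its own; the percentiles reported later in the thread come from many repeated passes after a warmup phase.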

The two builds I compared are as follows (on macOS):

[1] openjdk version "12" 2019-03-19
OpenJDK Runtime Environment AdoptOpenJDK (build 12+33)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 12+33, mixed mode, sharing)

[2] openjdk version "12" 2019-03-19
OpenJDK Runtime Environment AdoptOpenJDK (build 12+33)
Eclipse OpenJ9 VM AdoptOpenJDK (build openj9-0.13.0, JRE 12 Mac OS X amd64-64-Bit Compressed References 20190320_32 (JIT enabled, AOT enabled)
OpenJ9 - caeb51f87
OMR - 33a33ff2

Repro steps:
[1] mvn exec:java -Dexec.mainClass=local.scratch.ESCollections

Tested on:
Model Name: MacBook Air
Model Identifier: MacBookAir6,2
Processor Name: Intel Core i7
Processor Speed: 1.7 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 8 GB

kuttyb commented 5 years ago

repro.tar.gz

Source code attached

kuttyb commented 5 years ago

With OpenJ9:

starting test
Warmup completed
25.0 Percentile :9401
50.0 Percentile :9522
75.0 Percentile :9942
99.99 Percentile :17121

With Hotspot:

starting test
Warmup completed
25.0 Percentile :3157
50.0 Percentile :3242
75.0 Percentile :3318
99.99 Percentile :4520
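The thread does not say how these percentiles are computed. One common choice is the nearest-rank method, sketched below under that assumption. The class name, method name, and sample values are made up for illustration and are not taken from the attached source.

```java
import java.util.Arrays;

class PercentileSketch {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0 || p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Synthetic latencies, not measurements from this issue.
        long[] samples = {40, 10, 30, 20};
        for (double p : new double[] {25.0, 50.0, 75.0, 99.99}) {
            System.out.println(p + " Percentile :" + percentile(samples, p));
        }
    }
}
```

Other definitions (for example, linear interpolation between ranks) give slightly different values, which matters most in the tail when the number of samples is small.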

kuttyb commented 5 years ago

FWIW, with the option below, the numbers appear to catch up:

MAVEN_OPTS="-Xjit:quickProfile" mexec local.scratch.ESCollections
starting test
Warmup completed
25.0 Percentile :3112
50.0 Percentile :3249
75.0 Percentile :3533
99.99 Percentile :5330

I am not fully aware of what quickProfile does, but it would appear it's helping us bridge the gap. Perhaps we are not reaching steady state in OpenJ9 fast enough despite the warmup stage.

pshipton commented 5 years ago

@vijaysun-omr @andrewcraik

andrewcraik commented 5 years ago

@kuttyb it is unlikely, based on the source, that this benchmark will reach true steady state throughput. There are some great benchmarking frameworks in the form of BumbleBench (https://github.com/AdoptOpenJDK/bumblebench) and JMH (https://openjdk.java.net/projects/code-tools/jmh/) which help you write tests that measure the performance you want to measure without being affected by the startup/rampup characteristics of the different JVM implementations. Had you considered using one of these to ensure you are really measuring steady state?

I would also note that your benchmark is using JSR292 in the form of a lambda. OpenJ9 has an open performance investigation under issue #4837 where we are working to fix a few known issues with our optimization of lambdas. I'd like to suggest you monitor #4837, and once those changes are available in a build maybe we can retry the benchmark and possibly consider factoring it into BumbleBench or JMH to make sure the performance you are measuring is what you want to measure.
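To make the JSR292 remark concrete: javac compiles a lambda expression to an invokedynamic call site, whereas an anonymous inner class is compiled to an ordinary class instantiation. The sketch below shows the same callback written both ways. The class and method names are hypothetical, and whether swapping one form for the other changes the numbers in this issue is an open question the thread does not answer.

```java
import java.util.function.IntConsumer;

class LambdaVsAnonymousSketch {

    // Minimal internal iterator over an int[] (illustrative helper).
    static void forEach(int[] items, IntConsumer action) {
        for (int item : items) {
            action.accept(item);
        }
    }

    // Callback written as a lambda: javac emits an invokedynamic
    // instruction that creates the IntConsumer at run time.
    static long sumWithLambda(int[] items) {
        long[] counter = {0};
        forEach(items, v -> counter[0] += v);
        return counter[0];
    }

    // Same callback as an anonymous inner class: javac emits a separate
    // class file and a plain object allocation, with no invokedynamic.
    static long sumWithAnonymousClass(int[] items) {
        long[] counter = {0};
        forEach(items, new IntConsumer() {
            @Override
            public void accept(int v) {
                counter[0] += v;
            }
        });
        return counter[0];
    }

    public static void main(String[] args) {
        int[] items = {1, 2, 3, 4};
        System.out.println("lambda:    " + sumWithLambda(items));
        System.out.println("anonymous: " + sumWithAnonymousClass(items));
    }
}
```

Running javap -c on the compiled class shows the invokedynamic instruction in sumWithLambda and a regular new/invokespecial sequence in sumWithAnonymousClass.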

kuttyb commented 5 years ago

@andrewcraik Thanks for the summary above. I will rewrite the benchmark using the above frameworks.

[1] The benchmark's intention is to measure the performance of an internal iterator, which appears to be a fairly common paradigm these days. Is that what is triggering invokedynamic?

For internal iterators, I would expect the type to be specialized and inlined at the call site once the code is hot enough. Is that not supported currently?

[2] Using -Xjit:quickProfile gets me to ~Hotspot numbers. Is this helping me reach steady state artificially? I picked up this arg from one of @mstoodle's performance analysis reports.

andrewcraik commented 5 years ago

@kuttyb the invokedynamic is generated whenever there is a lambda (an anonymous function using the -> syntax). The -Xjit:quickProfile option forces the compiler to do some aggressive things to get to steady state, but this 'steady state' may not be as optimal as the one the compiler would reach naturally. OpenJ9 tends to take a stepped approach to optimization: we compile a lot of code quickly early on to give great startup/rampup, and then we spend a bit more time analyzing and compiling the code that runs a lot to get great performance. In short-running benchmarks this second step may not complete. We are working to improve this with a new profiling technology called JProfiling. If you have a recent build, you could try -Xjit:enableJProfilingInProfilingCompilations, which may help without needing -Xjit:quickProfile (note this is still experimental and not something you should run any production system with, but for performance experiments it is an interesting data point). We plan to enable JProfiling by default for profiling compilations in roughly the next month or so, provided final performance testing and bug fixing work out.
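The stepped approach described above suggests that a hand-written benchmark should run the workload untimed for long enough that the later, higher-level compilations can complete before any sample is recorded. A minimal sketch of that split is below. The class name and the iteration counts are arbitrary assumptions, and nothing here guarantees that steady state has actually been reached.

```java
import java.util.Arrays;

class WarmupHarnessSketch {

    // Runs `work` warmupIters times without timing it, then measuredIters
    // times with timing. Returns one elapsed-nanoseconds sample per timed run.
    static long[] run(Runnable work, int warmupIters, int measuredIters) {
        for (int i = 0; i < warmupIters; i++) {
            work.run(); // gives the JIT time to profile and recompile
        }
        long[] samplesNanos = new long[measuredIters];
        for (int i = 0; i < measuredIters; i++) {
            long start = System.nanoTime();
            work.run();
            samplesNanos[i] = System.nanoTime() - start;
        }
        return samplesNanos;
    }

    public static void main(String[] args) {
        int[] data = new int[100_000];
        Arrays.fill(data, 1);
        long[] sink = {0}; // accumulates results so the loop is not dead code
        Runnable work = () -> {
            for (int v : data) {
                sink[0] += v;
            }
        };
        // Iteration counts are illustrative guesses, not tuned values.
        long[] samples = run(work, 5_000, 1_000);
        Arrays.sort(samples);
        System.out.println("median ns = " + samples[samples.length / 2]
                + " (sink " + sink[0] + ")");
    }
}
```

Frameworks such as JMH handle this separation, along with forking and dead-code elimination concerns, far more carefully than a sketch like this can.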

mstoodle commented 5 years ago

First off, thanks for engaging here on the issue you encountered, @kuttyb! An issue reported is always better than silent "suffering" that steers people away from using our project!

I think @andrewcraik has you mostly sorted here, or at least on a path to understanding where the steady-state performance is more likely to end up. One thing to keep in mind is that, in the OpenJ9 project, we tend to value code performance on full application scenarios more than micro benchmarks because we have historically been led down many a garden path via micro benchmark inspired "solutions" (some of which continue to survive mostly unused in our code base). Nevertheless, as you've pointed out your test boils down to a common scenario and now that you've hit on it, we should do our due diligence to make sure there isn't a real performance problem at its core.

I would also like to add to @andrewcraik's comments about the quickProfile option. I would characterize this option as a diagnostic/testing option we use to put more pressure on our recompilation mechanisms that promote methods to the "scorching" (or sometimes very hot or even hot) compilation level. The usual way to "force" the JIT to compile methods at a particular opt level actually doesn't compile a scorching method in the same way it would using default compilation heuristics (because it prevents "JIT profiling" from happening so there's no new profile data to drive the high opt compile). With quickProfile, even for shorter running tests we get closer to exercising both the compiler and the compiled code in realistic scenarios (thereby increasing our chances to find bugs that users would be more likely to experience, hopefully before they experience them :) ). It's certainly not designed to be equally capable, though, so it still has the "may not be as optimal" aspect Andrew mentioned. From a testing perspective, it's better at what it's designed for than the other alternatives :) .

High level point is that quickProfile isn't something we recommend people to use in production or even outside of diagnostic/testing scenarios, but I'm glad it provided some insight into what was going on in your case!

kuttyb commented 5 years ago

Thanks everybody for all the responses. I am in the process of rewriting my benchmark using BumbleBench and JMH. The former crashes with Hotspot, and the latter complains that OpenJ9 is not a supported VM, that compiler hints are disabled, and that the results may be completely unreliable.

Is there a good Cross-VM way of writing these microbenchmarks?

kuttyb commented 5 years ago

Here is the output from JMH. For Hotspot the numbers are fairly close to my own microbenchmark up to the 99th percentile. The OpenJ9 numbers seem to have gotten way worse with JMH.

Given both VMs have similar numbers when I use quickProfile with OpenJ9, I would imagine OpenJ9 has a problem with profiling and reaching steady state. Thoughts?

Hotspot
Percentiles, us/op:
p(0.0000) = 2879.488 us/op
p(50.0000) = 3145.728 us/op
p(90.0000) = 3534.848 us/op
p(95.0000) = 3776.512 us/op
p(99.0000) = 4317.184 us/op
p(99.9000) = 5570.560 us/op
p(99.9900) = 8164.049 us/op
p(99.9990) = 10862.592 us/op
p(99.9999) = 10862.592 us/op
p(100.0000) = 10862.592 us/op

OpenJ9
Percentiles, us/op:
p(0.0000) = 15564.800 us/op
p(50.0000) = 16236.544 us/op
p(90.0000) = 17563.648 us/op
p(95.0000) = 18481.152 us/op
p(99.0000) = 22613.852 us/op
p(99.9000) = 35271.999 us/op
p(99.9900) = 48578.219 us/op
p(99.9990) = 49545.216 us/op
p(99.9999) = 49545.216 us/op
p(100.0000) = 49545.216 us/op

kuttyb commented 5 years ago

Here are results with -Xjit:enableJProfilingInProfilingCompilations. Almost 2x better.

Short summary: with JMH, OpenJ9 in this case delivers roughly 1/3 to 1/4 of Hotspot's performance.

Percentiles, us/op:
p(0.0000) = 8847.360 us/op
p(50.0000) = 8978.432 us/op
p(90.0000) = 9879.552 us/op
p(95.0000) = 10305.536 us/op
p(99.0000) = 11564.810 us/op
p(99.9000) = 15253.504 us/op
p(99.9900) = 19377.711 us/op
p(99.9990) = 28442.624 us/op
p(99.9999) = 28442.624 us/op
p(100.0000) = 28442.624 us/op
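For reference, the slowdown factors implied by the JMH medians reported in this thread can be worked out directly. The sketch below only divides the p(50.0000) values quoted above. The class and constant names are illustrative, and comparing medians alone ignores the tail percentiles.

```java
class MedianRatioSketch {

    // p(50.0000) values in us/op, copied from the JMH output in this thread.
    static final double HOTSPOT_P50 = 3145.728;
    static final double OPENJ9_P50 = 16236.544;
    static final double OPENJ9_JPROFILING_P50 = 8978.432;

    // How many times slower `candidate` is than `baseline` (latency ratio).
    static double slowdown(double candidate, double baseline) {
        return candidate / baseline;
    }

    public static void main(String[] args) {
        System.out.printf("OpenJ9 default vs Hotspot:     %.2fx%n",
                slowdown(OPENJ9_P50, HOTSPOT_P50));
        System.out.printf("OpenJ9 JProfiling vs Hotspot:  %.2fx%n",
                slowdown(OPENJ9_JPROFILING_P50, HOTSPOT_P50));
        System.out.printf("OpenJ9 default vs JProfiling:  %.2fx%n",
                slowdown(OPENJ9_P50, OPENJ9_JPROFILING_P50));
    }
}
```

On these medians, the default OpenJ9 run is a little over 5x slower than Hotspot, the JProfiling run a little under 3x slower, and JProfiling improves on the default by about 1.8x, which matches the "almost 2x better" description.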

mstoodle commented 5 years ago

@kuttyb would you be willing to share the benchmark code that uses JMH?

I tried out your original code but am seeing much lower numbers than you (even with Hotspot) despite using a more powerful Macbook than the one you listed. Mind you, I was just trying it out so there's a bazillion other things alive on my laptop atm so that could be affecting my results :) .

kuttyb commented 5 years ago

int.tar.gz jmh.tar.gz

@mstoodle Attached jmh.tar.gz which contains the benchmark and int.tar.gz that contains the code that it is profiling (ESProfile.java)

kuttyb commented 5 years ago

Seeing a few other 2x performance differences with Hotspot. Any updates here?

DanHeidinga commented 5 years ago

ping @andrewcraik @mstoodle - any update on this?

andrewcraik commented 5 years ago

So a few comments:
a) I am interested in looking at this but I haven't yet had time; my team and I are busy with other tasks right now.
b) As stated previously, it is likely there will be some change in the performance once the fixes for #4837 are delivered. This delivery is in progress. @liqunl and @cathyzhyi are doing the work. They may try out a few prototypes to see if they affect this benchmark.
c) While the gaps on these microbenchmarks are certainly something we need to address, I would note that general experience with microbenchmarks says that while they can indicate opportunities for improvement, they do not necessarily predict the magnitude of the performance gain to be achieved in real applications by addressing the opportunity. This makes them a good tool for finding opportunities but not always a great predictor of overall application performance.

Knowing if the performance is significantly different with Java 8 would be an interesting data point since it would make much less use of MethodHandles and VarHandles which are what we are addressing with #4837.

So no progress yet, but I hope some of the other stuff we are delivering will help and we'll try to get to looking at this when we can.

eclipse-openj9 / openj9

JDK12: OpenJ9 Peak State Performance half that of Hotspot #5452
