eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

Slow application performance when -Xaggressive and/or -Xtune:virtualized JVM options are enabled #14366

dblinn opened this issue 2 years ago

dblinn commented 2 years ago

Java -version output

$ java -version
openjdk version "11.0.13" 2021-10-19
IBM Semeru Runtime Open Edition 11.0.13.0 (build 11.0.13+8)
Eclipse OpenJ9 VM 11.0.13.0 (build openj9-0.29.0, JRE 11 Linux amd64-64-Bit Compressed References 20211022_282 (JIT enabled, AOT enabled)
OpenJ9   - e1e72c497
OMR      - 299b6a2d2
JCL      - 2d83aa3b76 based on jdk-11.0.13+8)

Summary of problem

Our application, running on OpenJ9, is experiencing intermittent performance problems. We execute a series of computations on a single thread; the work performs no network or disk I/O, acquires no locks, and takes about 50 seconds on a particular dataset on our performance testing hardware. However, we periodically see this same operation on the same dataset take 50-100% longer. Profiling the CPU during this operation shows that the only busy threads other than the main application thread tend to be garbage-collection related (i.e. the "Concurrent Mark Helper" and various "GC Worker" threads).

Some of the JVM options we pass when starting our application include:

              -verbose:sizes -Xtune:virtualized -XX:+ExitOnOutOfMemoryError
              -Xaggressive -XX:+ClassRelationshipVerifier -XX:+CompactStrings
              -XX:-HeapDumpOnOutOfMemoryError -Xdump:what -Xdump:heap:none  -Xdump:java:file=/STDOUT/
              -XshowSettings

@jonoommen and @amicic recommended removing the -Xaggressive and -Xtune:virtualized options to see what impact this had, and so far it consistently results in the "fast" application performance in my test runs. Below are verbose garbage collection logs (-verbose:gc) together with resize trace logs (-Xtrace:iprint=j9mm{resize}) for a "fast" and a "slow" run.

Diagnostic files

Statistics for the "fast" run:

N   min     max     sum     mean    stddev
24  44.8633 59.2198 1236.81 51.5339 4.21769

fast_trace_logs.zip

Statistics for the "slow" run:

N   min     max     sum     mean    stddev
15  62.0912 106.604 1297.42 86.4947 18.0917

slow_trace_logs.zip

To interpret the statistics, N is the number of trials run (i.e. the number of times we executed the long-running computation in our application), min is the minimum time in seconds taken by a trial, max is the maximum, and so on.
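
For context, these numbers come from a simple timing loop around the computation. A rough sketch of that kind of harness is below; the class and method names are placeholders rather than our actual code.

    import java.util.ArrayList;
    import java.util.List;

    public class TrialTimer {
        public static void main(String[] args) {
            int trials = 24; // N in the tables below
            List<Double> seconds = new ArrayList<>();
            for (int i = 0; i < trials; i++) {
                long start = System.nanoTime();
                runComputation(); // placeholder for the ~50 second computation
                seconds.add((System.nanoTime() - start) / 1e9);
            }
            double sum = seconds.stream().mapToDouble(Double::doubleValue).sum();
            double mean = sum / trials;
            double min = seconds.stream().mapToDouble(Double::doubleValue).min().orElse(0.0);
            double max = seconds.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
            // sample standard deviation (n-1 divisor); our real harness may compute it differently
            double var = seconds.stream().mapToDouble(d -> (d - mean) * (d - mean)).sum() / (trials - 1);
            System.out.printf("N=%d min=%.4f max=%.4f sum=%.2f mean=%.4f stddev=%.5f%n",
                    trials, min, max, sum, mean, Math.sqrt(var));
        }

        private static void runComputation() {
            // stand-in for the single-threaded, I/O-free computation described above
        }
    }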

pshipton commented 2 years ago

@vijaysun-omr @mpirvu @0xdaryl fyi

dblinn commented 2 years ago

Here is some additional data after running some more experiments. The .zip files below each experiment description are the GC verbose logs and JVM trace logs captured for each run.

When I remove the -Xtune:virtualized flag, but leave -Xaggressive, the garbage collector appears well-behaved but I still get the slow performance (although not quite as slow as having both flags together).

N   min     max     sum     mean    stddev
9   59.2    99.3    701.05  77.8944 14.5956
no_tune_virtualized_slow

no_tune_virtualized_slow.zip

When I remove the -Xaggressive flag, but leave -Xtune:virtualized, the garbage collector is poorly-behaved and I also get the slow performance (although not quite as slow as having both flags together).

N   min     max     sum     mean    stddev
9   59.2    99.3    701     77.8889 14.5944
no_aggressive_slow

no_aggressive_slow.zip

When I remove both flags, the garbage collector is well-behaved and I get the "fast" performance.

N   min     max     sum     mean    stddev
10  47.7    74.3    554.85  55.485  8.23043
no_aggressive_or_tune_virtualized_fast

no_aggressive_or_tune_fast.zip

dblinn commented 2 years ago

~Also, while this is somewhat unscientific it may be helpful in diagnosing the issue - in order to get the "slow" performance with any set of flags, I have noticed that I have to let the application idle for about 30+ minutes at startup before loading the data that triggers the expensive computation. From what I have seen, if I trigger the computation sooner after startup, I seem to get the "fast" performance profile.~

Striking through the above comment because I managed to get a run where I loaded data and triggered the computation very shortly after application startup and still got the "slow" performance.

vijaysun-omr commented 2 years ago

-Xtune:virtualized does change the JVM's idle state detection/handling. @mpirvu are there options to disable that aspect of -Xtune:virtualized, or any other suggestions on how to advance this issue?

mpirvu commented 2 years ago

-Xtune:virtualized has many changes under the hood, geared towards small footprint and low CPU consumed by JIT compilation (possibly at the expense of a small throughput loss). From the GC point of view, -Xtune:virtualized implements this change:

    /* if tuned for virtualized environment, we compromise a bit of performance for lower footprint */
    if (getJavaVM()->runtimeFlags & J9_RUNTIME_TUNE_VIRTUALIZED) {
        heapFreeMinimumRatioMultiplier = 20;
    }

and also instructs the GC to trigger a collection when the JVM idles for more than ~4 minutes. This last behavior can be disabled with -XX:-IdleTuningGcOnIdle.

From the JIT point of view there are many changes implemented by -Xtune:virtualized, but the main ones are:

-Xaggressive is geared towards peak throughput. The following changes are implemented under the hood:

There is some interaction between -Xtune:virtualized and -Xaggressive: when -Xaggressive is used, the JIT changes from -Xtune:virtualized will not apply (the GC changes will still take effect).

I would start by adding -XX:-IdleTuningGcOnIdle and see if the performance improves.
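
For example, appended to the option set from the original report (everything else unchanged), the command-line options would look roughly like:

              -verbose:sizes -Xtune:virtualized -XX:+ExitOnOutOfMemoryError
              -Xaggressive -XX:+ClassRelationshipVerifier -XX:+CompactStrings
              -XX:-HeapDumpOnOutOfMemoryError -Xdump:what -Xdump:heap:none -Xdump:java:file=/STDOUT/
              -XshowSettings -XX:-IdleTuningGcOnIdle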

dblinn commented 2 years ago

Thanks very much @mpirvu for the details, that is very helpful. Given your explanation, I am a bit puzzled why I was still able to see slow performance with only the -Xaggressive option enabled.

I will try adding -XX:-IdleTuningGcOnIdle and see if the performance improves as you suggested.

mpirvu commented 2 years ago

I am a bit puzzled why I was still able to see slow performance with only the -Xaggressive option enabled.

My guess is that the extra profiling enabled by -Xaggressive adds transient overhead which causes the performance to fluctuate. Could you please generate some compilation verbose logs by adding -Xjit:verbose={compilePerformance},vlog=vlog.txt to the JVM command options and run with and without -Xaggressive to compare the two?
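
For example (the vlog file names below are just samples), the run with -Xaggressive would add:

              -Xaggressive -Xjit:verbose={compilePerformance},vlog=vlog_with_aggressive.txt

and the run with -Xaggressive removed would add:

              -Xjit:verbose={compilePerformance},vlog=vlog_without_aggressive.txt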

dblinn commented 2 years ago

I tried -XX:-IdleTuningGcOnIdle and did not see any performance improvement. I will try collecting the verbose JIT compilation logs next.

dblinn commented 2 years ago

@mpirvu how long would you expect the "transient overhead" from -Xaggressive to last? I had a test run going that ran all night (usually we do not run our performance tests this long, because we are just running the same computation over and over in a loop and don't expect the results to change very much by letting them run longer than a few hours). Performance was "slow" from 2022-01-27 02:28:03,394 until about 2022-01-27 10:01:48,395 (7.5 hours), and from then until I killed the experiment at 2022-01-27 14:25:55,977 (a little over 4 hours), performance was "fast".

Statistics for when the application was slow:

N   min     max     sum     mean    stddev
46  62.3611 114.503 4254.4  92.4869 13.2671

And when the application was "fast":

N   min     max     sum     mean    stddev
28  47.7557 65.9103 1522.06 54.3593 4.77273

mpirvu commented 2 years ago

Profiling should last seconds, not hours. I would need to see a verbose log to determine what happens in that interval, or what happens (from the JIT compilation point of view) that makes the app fast.

dblinn commented 2 years ago

@mpirvu I've collected the JIT logs although it looks like they contain some application data that I would prefer not to share in a public forum in case any of it is sensitive so I will send them to you via another channel if that is ok. I collected logs for two runs, once with -Xaggressive (and also -Xtune:virtualized and -XX:-IdleTuningGcOnIdle) that was "slow" (mean time of 81.34s), and another with all 3 of those options removed that was "fast" (mean time of 59.9s).

mpirvu commented 2 years ago

Looking at the verbose logs I see that

  1. There is a lot of processing power in terms of CPU cores and memory available to the JVM
  2. The JVM consumes less than the number of CPU cores available (does not run at saturation)
  3. Compilation queue size is quite small compared to other applications that I am used to

Based on the above there is no need to use -Xtune:virtualized. This option is a good choice in a resource constrained environment because it reduces the footprint and the CPU consumed by the JIT at the expense of peak throughput. This environment is quite generous in terms of resources, hence -Xtune:virtualized should be avoided.

-Xaggressive generates a lot more profiling compilations. The JIT spends 190 more seconds on compilation, but I don't think this is the reason for the slowdown. I'll have to parse the log to see how long methods run in profiling mode. While collecting profiling info, performance is expected to be somewhat slower.

dblinn commented 2 years ago

@mpirvu Thanks for taking the time to look at the logs. With respect to your thoughts on the environment, to give you a little background, we are running these experiments in our performance testing environment with very powerful hardware and in fairly isolated conditions to ensure that any performance issues we see are unlikely to be due to resource constraints or contention. This particular environment does not reflect the deployment conditions of our application in production, which is often on customer resources that we don't control beyond stating some minimum requirements depending on how much data they want to process.

If your recommendation is still to remove the -Xtune:virtualized, I will definitely bring that up with my team although in our testing, we have to remove both -Xtune:virtualized and -Xaggressive to reliably see an improvement in application throughput.

The JIT spends 190 more seconds for compilation, but I don't think this is the reason for the slowdown

Agreed, it seems far too little to explain the performance loss over a period of hours even if the compilation is all happening at the worst times for us.

dblinn commented 2 years ago

Another point I would like to share is that we deploy a number of different services at my company, all running on OpenJ9 with the same set of JVM options (including -Xaggressive and -Xtune:virtualized). We have only been looking at one service so far in this issue, but they all seem to be suffering this intermittent slow performance. Here is a screenshot of grafana graphs for timings of important operations performed by 4 different services when run with the -Xaggressive and -Xtune:virtualized options captured from our performance testbed.

SlowServices

All of our discussion so far has been about Service "A" in the above screenshot. When it is behaving well, we expect this service to take a little over 6 minutes, plus or minus ~10%, for the operation being timed. When we remove the -Xaggressive and -Xtune:virtualized options, here are the same numbers for Service "A" running the same code on the same data and the same hardware.

FastService

In general, for all the services in the first screenshot, the average operation timings should be near the minimum numbers in the respective graph and the maximum should be maybe 10-15% higher than the minimum, not double, as we often see. I have also tried removing -Xaggressive and -Xtune:virtualized from Service D in the screenshot above and it also restores performance to where we would expect it to be.

mpirvu commented 2 years ago

The verbose logs show a large number of profiling compilations with -Xaggressive. I wrote a script to compute how much time (elapsed time) is spent profiling and it turns out it's a lot. For instance, java/util/concurrent/ConcurrentHashMap$KeySetView.forEach(Ljava/util/function/Consumer;)V stays for more than 500 seconds in profiling. It's possible that the application wasn't doing much at the time, but still, I have my suspicions. @dblinn could you please try with -Xjit:disableJProfilingInProfilingCompilations,verbose={compilePerformance}? This will revert to the old style of profiling. Meanwhile I am going to look for a way to make profiling compilations less aggressive.

dblinn commented 2 years ago

could you please try with -Xjit:disableJProfilingInProfilingCompilations,verbose={compilePerformance}?

Do you need this only with the -Xaggressive option enabled? Or do you need it both enabled and disabled?

mpirvu commented 2 years ago

Let's change one thing at a time, so please keep both -Xaggressive and -Xtune:virtualized.
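
So the tuning-related portion of the command line for this experiment would be roughly:

              -Xaggressive -Xtune:virtualized
              -Xjit:disableJProfilingInProfilingCompilations,verbose={compilePerformance}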

dblinn commented 2 years ago

@mpirvu makes sense. I have collected the logs and uploaded them to you through a separate channel again. The timings for this run were all over the place, with performance sometimes fairly fast (51 seconds) and sometimes quite slow (115 seconds).

mpirvu commented 2 years ago

I looked at the new verbose logs and I don't see any improvement: there are quite a few methods that spend a large amount of time (500-600 sec) in profiling mode. One of them spends half an hour and that's a lot.

I have found an option to reduce the number of methods that perform profiling: bigAppThreshold=1. This says that, if the JVM loads more than 1 class, the application will be considered "big" and profiling will be done to a lesser degree, to avoid overhead. The default value is 2000 and frankly I find it hard to believe that the app considered here loads fewer than 2000 classes. Anyway, let's give this a try with: -Xjit:bigAppThreshold=1,verbose={compilePerformance} -Xaggressive -Xtune:virtualized

dblinn commented 2 years ago

OK thanks I'll give this a try.

dblinn commented 2 years ago

@mpirvu Sending you the logs separately once again. The timings for this run were all over the place. The first 1 hour 40 minutes seems more in line with the "slow" behavior, and then for the next hour or so that I let it run after that, it looks "fast".

mpirvu commented 2 years ago

For some reason the number of profiled compilations hasn't decreased at all. We are working on redefining what goes under -Xaggressive and we hope to have a build to try in a week or so.

dblinn commented 2 years ago

OK, thank you @mpirvu. In the meantime my team and I are looking into removing the -Xaggressive and -Xtune:virtualized flags, because the performance penalty we are seeing with either enabled is so high.

For our information, can you provide a bit more detail on why the large amount of time spent running profiled compilations is problematic for application performance? My understanding is that the time spent compiling methods is added to their execution time rather than being performed in the background, is that correct? Or is it done in the background, but any calls to the method made while that method is also being compiled have to wait until compilation is finished? Or does method compilation interact with method execution in some other way?