Open dblinn opened 2 years ago
@vijaysun-omr @mpirvu @0xdaryl fyi
Here is some additional data after running some more experiments. The .zip files below each experiment description are the GC verbose logs and JVM trace logs captured for each run.
When I remove the -Xtune:virtualized
flag, but leave -Xaggressive
, the garbage collector appears well-behaved but I still get the slow performance (although not quite as slow as having both flags together).
N min max sum mean stddev
9 59.2 99.3 701.05 77.8944 14.5956
When I remove the -Xaggressive
flag, but leave -Xtune:virtualized
, the garbage collector is poorly-behaved and I also get the slow performance (although not quite as slow as having both flags together).
N min max sum mean stddev
9 59.2 99.3 701 77.8889 14.5944
When I remove both flags, the garbage collector is well-behaved and I get the "fast" performance.
N min max sum mean stddev
10 47.7 74.3 554.85 55.485 8.23043
~Also, while this is somewhat unscientific it may be helpful in diagnosing the issue - in order to get the "slow" performance with any set of flags, I have noticed that I have to let the application idle for about 30+ minutes at startup before loading the data that triggering the expensive computation. From what I have seen, if I trigger the computation sooner after startup, I seem to get the "fast" performance profile.~
Striking through the above comment because I managed to get a run where I loaded data and triggered the computation very shortly after application startup and still got the "slow" performance.
-Xtune:virtualized
does have differences in what happens wrt JVM's idle state detection/handling. @mpirvu are there options to disable that aspect of -Xtune:virtualized
or any other suggestions on how to advance this issue ?
-Xtune:virtualized
has many changes under the hood, changed geared biased towards small footprint and low CPU consumed by JIT compilation (possibly at the expense of a small throughput loss).
From the GC point of view, -Xtune:virtualized
implements this change:
/* if tuned for virtualized environment, we compromise a bit of performance for lower footprint */
if (getJavaVM()->runtimeFlags & J9_RUNTIME_TUNE_VIRTUALIZED) {
heapFreeMinimumRatioMultiplier = 20;
}
and also instructs the GC to trigger a collection when the JVM idles for more than ~4 minutes. This last behavior can be disabled with -XX:-IdleTuningGcOnIdle
.
From the JIT point of view there are many changes implemented by -Xtune:virtualized
, but the main ones are:
-Xaot:forceaot
to generate as much AOT code as possible (this happens only if -Xshareclasses option is present)-Xaggressive
is geared towards peak throughput. The following changes are implemented under the hood:
There is some interaction between -Xtune:virtualized
and -Xaggressive
: when -Xaggressive
is used, the JIT changes from -Xtune:virtualized
will not apply (the GC changes will still take effect).
I would start by adding -XX:-IdleTuningGcOnIdle
and see if the performance improves.
Thanks very much @mpirvu for the details that is very helpful. Given your explanation, I am a bit puzzled why I was still able to see slow performance with only the -Xaggressive
option enabled.
I will try adding -XX:-IdleTuningGcOnIdle
and see if the performance improves as you suggested.
I am a bit puzzled why I was still able to see slow performance with only the -Xaggressive option enabled.
My guess is that the extra profiling enabled by -Xaggressive adds transient overhead which causes the performance to fluctuate. Could you please generate some compilation verbose logs by adding -Xjit:verbose={compilePerformance},vlog=vlog.txt
to the JVM command options and run with and without -Xaggressive
to compare the two?
I tried -XX:-IdleTuningGcOnIdle
and did not see any performance improvement. I will try collecting the verbose JIT compilation logs next.
@mpirvu how long would you expect the "transient overhead" from -Xaggressive
to last? I had a test run going that ran all night (usually we do not run our performance tests this long because we are just running the same computation over and over in a loop and don't expect the results to change very much by letting them run longer than a few hours) and performance was "slow" from 2022-01-27 02:28:03,394
until about 2022-01-27 10:01:48,395
(7.5 hours) and then from then until I killed the experiment at 2022-01-27 14:25:55,977
(a little over 4 hours), performance was "fast".
Statistics for when the application was slow:
N min max sum mean stddev
46 62.3611 114.503 4254.4 92.4869 13.2671
And when the application was "fast":
N min max sum mean stddev
28 47.7557 65.9103 1522.06 54.3593 4.77273
Profiling should last seconds, not hours. I would need to see a verbose log to determine what happens in that interval or what happens (from JIT compilation point of view) that makes the app be fast.
@mpirvu I've collected the JIT logs although it looks like they contain some application data that I would prefer not to share in a public forum in case any of it is sensitive so I will send them to you via another channel if that is ok. I collected logs for two runs, once with -Xaggressive
(and also -Xtune:virtualized
and -XX:-IdleTuningGcOnIdle
) that was "slow" (mean time of 81.34s), and another with all 3 of those options removed that was "fast" (mean time of 59.9s).
Looking at the verbose logs I see that
Based on the above there is no need to use -Xtune:virtualized
. This option is a good choice in a resource constrained environment because it reduces the footprint and the CPU consumed by the JIT at the expense of peak throughput. This environment is quite generous in terms of resources, hence -Xtune:virtualized
should be avoided.
-Xaggressive
generates a lot more profiling compilations. The JIT spends 190 more seconds for compilation, but I don't think this is the reason for the slowdown. I'll have to parse the log to see for how long methods run in profiling mode. While collecting profiling info the performance is expected to be somewhat slower.
@mpirvu Thanks for taking the time to look at the logs. With respect to your thoughts on the environment, to give you a little background, we are running these experiments in our performance testing environment with very powerful hardware and in fairly isolated conditions to ensure that any performance issues we see are unlikely to be due to resource constraints or contention. This particular environment does not reflect the deployment conditions of our application in production, which is often on customer resources that we don't control beyond stating some minimum requirements depending on how much data they want to process.
If your recommendation is still to remove the -Xtune:virtualized
, I will definitely bring that up with my team although in our testing, we have to remove both -Xtune:virtualized
and -Xaggressive
to reliably see an improvement in application throughput.
The JIT spends 190 more seconds for compilation, but I don't think this is the reason for the slowdown
Agreed, it seems far too little to explain the performance loss over a period of hours even if the compilation is all happening at the worst times for us.
Another point I would like to share is that we deploy a number of different services at my company, all running on OpenJ9 with the same set of JVM options (including -Xaggressive
and -Xtune:virtualized
). We have only been looking at one service so far in this issue, but they all seem to be suffering this intermittent slow performance. Here is a screenshot of grafana graphs for timings of important operations performed by 4 different services when run with the -Xaggressive
and -Xtune:virtualized
options captured from our performance testbed.
All of our discussion so far has been about Service "A" in the above screenshot. When it is behaving well, we expect this service to take e a little over 6 minutes plus or minus ~10% for the operation being timed. When we remove the -Xagressive
and -Xtune:virtualized
options, here are the same numbers for Service "A" running the same code on the same data and same hardware.
In general, for all the services in the first screenshot the average operation timings should be near the minimum numbers in the respective graph and the maximum should be maybe 10-15% higher than the minimum, not double as we often see. I have also tried removing -Xagressive
and -Xtune:virtualized
from Service D in the screenshot above and it also restores performance to where we would expect it to be.
The verbose logs show a large number of profiling compilations with -Xaggressive
. I wrote a script to compute how much time (elapsed time) is spent profiling and it turns out it's a lot. For instance, java/util/concurrent/ConcurrentHashMap$KeySetView.forEach(Ljava/util/function/Consumer;)V
stays for more than 500 seconds in profiling. It's possible that the application wasn't doing much at the time, but still, I have my suspicions.
@dblinn could you please try with -Xjit:disableJProfilingInProfilingCompilations,verbose={compilePerformance}
? This will revert to the old style of profiling. Meanwhile I am going to look for a way to make profiling compilations less aggressive.
could you please try with -Xjit:disableJProfilingInProfilingCompilations,verbose={compilePerformance}?
Do you need this only with the -Xaggressive
option enabled? Or do you need it both enabled and disabled?
Let's change one thing at a time, thus please keep both -Xaggressive and -Xtune:virtualized
@mpirvu makes sense. I have collected the logs and uploaded them to you through a separate channel again. The timings for this run were pretty all over the fast with performance sometimes fairly fast (51 seconds) and sometimes quite slow (115 seconds).
I looked at the new verbose logs and I don't see any improvement: there are quite a few methods that spend a large amount of time (500-600 sec) in profiling mode. One of them spends half an hour and that's a lot.
I have found an option to reduce the number of methods that perform profiling: bigAppThreshold=1
. This says that, if the JVM loads more than 1 class, the application will be considered "big" and profiling will be done to a lesser degree, to avoid overhead. The default value is 2000 and frankly I find it hard to believe that the app consideredred here loads less than 2000 classes. Anyway, let's give this a try with:
-Xjit:bigAppThreshold=1,verbose={compilePerformance} -Xaggressive -Xtune:virtualized
OK thanks I'll give this a try.
@mpirvu Sending you the logs separately once again. The timings for this run were all over the place. The first 1hour 40min seem more in line with the "slow" behavior and then the next hour or so I let it run after that it looks "fast".
For some reason the number of profiled compilations hasn't decreased at all. We are working on redefining what goes under -Xaggressive and we hope to have a build to try in a week or so.
OK thank you @mpirvu . In the meantime my team and I are looking into removing the -Xaggressive
and -Xtune:virtualized
flags because the performance penalty we are seeing with either enabled is so high.
For our information, can you provide a bit more detail on why the large amount of time spent running profiled compilations is problematic to application performance? My understanding is that the time spent compiling methods is added to their execution time rather than being performed in the background, is that correct? Or is it done in the background, but any calls to the method made while that method is also being compiled have to wait until compilation is finished? Or does method compilation interact with method execution in some other way?
Java -version output
Summary of problem
Our application, running on OpenJ9, is experiencing intermittent performance problems. We execute a series of computations on a single thread, that performs no network or disk I/O and acquires no locks that takes about 50 seconds on a particular dataset running on our performance testing hardware. However, we periodically see this same operation on the same dataset take 50-100% longer. Profiling the CPU during this operation shows the only busy threads during this operation other than the main application thread tend to be garbage-collection related (ie the "Concurrent Mark Helper" and various "GC Worker" threads).
Some of the JVM options we pass when starting our application include:
@jonoommen and @amicic recommended removing the
-Xaggressive
and-Xtune:virtualized
options and seeing what impact this had and it seems to always result in the "fast" application performance in my test runs so far. Below are verbose garbage collection logs (-verbose:gc
) together with resize tracelogs (-Xtrace:iprint=j9mm{resize}
) for a "fast" and "slow" run.Diagnostic files
Statistics for the "fast" run:
fast_trace_logs.zip
Statistics for the "slow" run:
slow_trace_logs.zip
To interpret the statistics,
N
is the number of trials run (ie the number of times we executed the long-running computation in our application),min
is the minimum time taken for a trial,max
is the maximum time taken, etc.