Create -Xtune:throughput option to favor throughput improvements #14246

Open mpirvu opened 2 years ago

mpirvu commented 2 years ago

This option would give users an easy way to tune the OpenJ9 JVM towards better throughput scores, possibly at the expense of other performance metrics like footprint, start-up time or ramp-up time.

Some of the changes that could be done under the covers include (a combined command line approximating them is sketched after this list):

  1. Set the initial Java heap size (-Xms) to the maximum value (-Xmx)
  2. Make inliner more aggressive with -Xjit:enableAggressiveInlining
  3. Allow bigger methods with -Xjit:acceptHugeMethods
  4. Allow more memory for the JIT compiler with -Xjit:scratchSpaceLimit=524288
  5. Do not compile at cold: -Xjit:initialOptLevel=warm
  6. Do not use AOT: -Xnoaot
  7. Do not enable noServer during start-up: -Xjit:disableSelectiveNoServer
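
For reference, until the new option exists, a user could approximate this list with explicit (and unsupported) options; a sketch, where the heap sizes are placeholders and the -Xjit sub-option spellings are taken from the list above:

   java -Xms4g -Xmx4g -Xnoaot \
        -Xjit:enableAggressiveInlining,acceptHugeMethods,scratchSpaceLimit=524288,initialOptLevel=warm,disableSelectiveNoServer \
        MyApp

Note that all the sub-options are combined into a single -Xjit string, because when -Xjit is specified multiple times only the last occurrence takes effect.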

Other changes that may not be part of the default -Xtune:throughput mode include -XX:-EnableHCR and -Xjit:scorchingSampleThreshold=3000

mpirvu commented 2 years ago

@vijaysun-omr, please chime in if you have other ideas on how throughput can be improved through tuning.

vijaysun-omr commented 2 years ago

I do feel that some way to "encourage" recompilations may need to be built into this option, i.e. if not -Xjit:scorchingSampleThreshold=3000 then we still need something else, e.g. maybe some reduction in the sampleInterval? Note that sometimes throughput depends on how quickly we ramp up for short-running applications, so we cannot go overboard with recompilations either.

vijaysun-omr commented 2 years ago

You may know the history of the -Xnoquickstart logic and whether there is anything from it to be borrowed for -Xtune:throughput. Having said that, I am interested in getting something delivered soon, with the excellent list you started with as the basis for it. So I don't really want to hold up that initial and likely most impactful delivery to pick up every piece of logic we want to include. We can always add that stuff later as we think of it.

mpirvu commented 2 years ago

-Xnoquickstart is no longer used and it had a different motivation: since Testarossa didn't have DLT at that time, it was a hack to boost performance for benchmarks that have the most important method called just once. Under the hood, -Xnoquickstart enabled count=0 for loopy methods.
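
For context, that mechanism is still reachable directly: the -Xjit:count option sets the invocation count after which a method gets compiled, and count=0 forces compilation on first invocation. A sketch (MyApp is a placeholder), noting that -Xnoquickstart applied count=0 only to loopy methods, whereas this applies it to every method:

   java -Xjit:count=0 MyApp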

mpirvu commented 2 years ago

@vijaysun-omr I wonder about the interaction between -Xaggressive and -Xtune:throughput. We could alias them, but based on the public description of -Xaggressive ("Enables performance optimizations and new platform exploitation that are expected to be the default in future releases."), I would keep the two options independent. For now, they don't clash with each other.

On the other hand, -Xquickstart does clash with -Xtune:throughput. We have two options here:

  1. -Xquickstart always wins because it's a more mature option that customers may be familiar with and have used in practice
  2. Whichever option appears last wins. This makes sense, but I wonder if customers may try to use both to get good start-up time AND throughput.

vijaysun-omr commented 2 years ago

I agree with respect to keeping -Xaggressive and -Xtune:throughput different since the latter is really more about changing the heuristics that we have internally to advantage one metric (throughput) over others, whereas the former is what you mentioned about new features expected to become default in the future.

Regarding your question about conflicting behavior between -Xquickstart and -Xtune:throughput, do we actually have a precedent in this regard? i.e. my understanding was that we don't really detect conflicts like this and would just give the user what they asked for, i.e. enable different pieces of code as per what the user asked for even if they are conflicting. For example, until now we would not disable -Xquickstart (or vice versa) if one specified acceptHugeMethods. So, one basic question is whether we need to do conflict resolution at all in the JVM, i.e. do we really expect users to misuse this new option any more than they might have misused existing options?

mpirvu commented 2 years ago

-Xquickstart disables IProfiler, sets low counts and performs all initial compilations at cold. -Xtune:throughput does the exact opposite: wants a lot of IProfiler info (and therefore large counts) and performs all initial compilations at warm. In this case I think we need to do conflict resolution and pick one of them.

vijaysun-omr commented 2 years ago

Okay, but where would this stop? Will we also do conflict resolution between -Xtune:virtualized and -Xtune:throughput? What about sub-options? I mean, what if a user specified dontDowngradeToCold and -Xquickstart at the same time?

One position is that we will process the command line left to right and set whatever internal flags are implied by the options being processed in that order (if there is a conflict, so be it). Is there a precedent for conflict resolution on the JIT side? Maybe there are precedents in the VM, like different GC policies, or specifying -Xshareclasses and -Xshareclasses:none at the same time. But in those cases it is simply not possible to implement what is being requested, whereas the difference in your case is that it is possible to implement, since it is just heuristics maybe?

mpirvu commented 2 years ago

In my current prototype I choose between -Xtune:virtualized and -Xtune:throughput based on which option appears last on the command line. This is going to be documented since it's a public/supported option. I have to choose between the two because they have antagonistic effects: -Xtune:virtualized uses less memory and less CPU than the default config, at the expense of a small throughput loss; -Xtune:throughput provides better throughput than the default config, at the expense of a footprint and CPU increase. The options set internally by -Xtune:virtualized and -Xtune:throughput can be overridden with -Xjit options, but those are not supported and not supposed to be used directly by customers. If they do and the JVM does not behave as expected, it's their fault. For supported options we want clarity as to what the user should expect from an option.

Currently, if -Xquickstart is on the command line, the JIT will ignore -Xtune:virtualized. My original question was whether we should do the same for -Xtune:throughput, or follow the more general principle: the last conflicting option on the command line wins. I am leaning towards the latter because it's more generic and easier to remember. I am also thinking about aliasing -Xquickstart to -Xtune:startup in a future PR.
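
A minimal sketch of that "last one wins" rule, using a hypothetical resolveTuneMode helper rather than the actual OpenJ9 option-processing code:

   #include <string.h>

   // Hypothetical sketch, not the real OpenJ9 parser: scan the command line
   // left to right so that the last conflicting tuning option wins.
   enum TuneMode { TUNE_DEFAULT, TUNE_QUICKSTART, TUNE_VIRTUALIZED, TUNE_THROUGHPUT };

   static TuneMode resolveTuneMode(int argc, char **argv)
      {
      TuneMode mode = TUNE_DEFAULT;
      for (int i = 1; i < argc; i++)
         {
         if (0 == strcmp(argv[i], "-Xquickstart"))
            mode = TUNE_QUICKSTART; // overridden by any later tuning option
         else if (0 == strcmp(argv[i], "-Xtune:virtualized"))
            mode = TUNE_VIRTUALIZED;
         else if (0 == strcmp(argv[i], "-Xtune:throughput"))
            mode = TUNE_THROUGHPUT;
         }
      return mode; // whatever the last matching option set
      }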

vijaysun-omr commented 2 years ago

Restricting conflict resolution to publicly supported options makes sense to me. I am fine with obeying the "last one wins" rule rather than get fancier.

In terms of -Xquickstart though, if it gets mapped to -Xtune:startup, won't we then have a separate rule for -Xtune:startup compared to all the other -Xtune options? Is it possible to drop the special rule for -Xquickstart and -Xtune:virtualized if it is not externally documented behavior?

mpirvu commented 2 years ago

I looked at the throughput of DT7 when -Xtune:throughput is specified and it doesn't look good. The problem is an excruciatingly long ramp-up due to extremely high CPU overhead (caused by compilation, of course). After a full hour, the throughput is still half of what it needs to be and the amount of CPU spent compiling is 1.8 hours (4P run, so we can compile on several threads).

I added options one by one to see which one is the most detrimental and found that enableAggressiveInlining produces the most damage. There is some relief if we keep enableAggressiveInlining but remove acceptHugeMethods,scratchSpaceLimit=524288, because some of the expensive methods will then fail to compile.

If we go ahead with the current implementation, some users are going to be disappointed. We can document the fact that -Xtune:throughput is not good for large apps, but how do you clearly define a large app? Maybe we need to dial down what enableAggressiveInlining does.

Some compilation stats for a DT7 run with -Xtune:throughput:

        Samples   MAX(ms)    TOTAL(ms)   AVG(usec)   MIN(usec)
Total   50567     129232.9   8113089.0   160442      149
cold    692       6647.6     37434.5     54096       357
warm    49662     129232.9   8073458.0   162568      149
hot     4         923.4      2090.1      522516      183560
jni     209       20.7       106.5       509         151

inv recompilations = 6
sync compilations  = 6
DLT compilations   = 23
numFailures        = 695
max_Q_SZ           = 178553

As seen above, it's the warm compilations that cause the high CPU usage. The most expensive one that succeeded took 129 seconds.

This longer run actually reveals something interesting: the number of warm compilations is larger than it needs to be. The default case has about 22K compilations, but this throughput-oriented case shows 50K compilations. I will need to write some scripts to understand why this is the case.

vijaysun-omr commented 2 years ago

Interesting, but can we detect "a large app" as we have tried to do in the past and dial down aspects that we don't believe are appropriate?

Are you able to reason about why we seem to have cold compilations despite setting the initial opt level to warm?

mpirvu commented 2 years ago

We already have a knob for "largeApp" based on the number of classes loaded, so I'll exploit that. The cold compilations are compilations that failed at warm.

mpirvu commented 2 years ago

One problem with the large app detection is that it's a binary decision. I remember that a while ago we had a SOABench regression which was caused (indirectly) by a new version of WAS that was loading a few extra classes, just enough to make the transition from a small app to a large app.
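
One way to soften that cliff, sketched purely as an illustration (the thresholds and the scaling function are made up, not OpenJ9 code), is to interpolate the tuning aggressiveness between two class-count thresholds instead of flipping it at a single one:

   // Illustrative sketch only: scale inlining aggressiveness linearly between
   // two class-count thresholds, so that loading a few extra classes causes a
   // small change in behavior instead of an abrupt small-app/large-app flip.
   double inliningAggressiveness(int numLoadedClasses)
      {
      const int SMALL_APP_CLASSES = 15000; // fully aggressive below this (made-up value)
      const int LARGE_APP_CLASSES = 25000; // fully conservative above this (made-up value)
      if (numLoadedClasses <= SMALL_APP_CLASSES)
         return 1.0;
      if (numLoadedClasses >= LARGE_APP_CLASSES)
         return 0.0;
      // linear interpolation between the two thresholds
      return (double)(LARGE_APP_CLASSES - numLoadedClasses) /
             (double)(LARGE_APP_CLASSES - SMALL_APP_CLASSES);
      }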

vijaysun-omr commented 2 years ago

How would the DT7 run perform without enableAggressiveInlining? If the problem is largely down to that one option, we know that option itself maps to several sub-options underneath, and we can add it to -Xtune:throughput later once we have tuned it better, but go ahead with the rest of your PR for now if the performance looks reasonable.

mpirvu commented 2 years ago

Without enableAggressiveInlining the performance of DT7 looks reasonable. I am determining which option under enableAggressiveInlining produces the most damage and it appears to be enableInliningDuringVPAtWarm. Without it, the throughput is sort of OK, though obviously there is an increase in CPU (more than 2X) from the other inliner changes. For some reason, if I reduce maxSizeForVPInliningAtWarm=600 to 0, I don't get the full benefit that I get from disableInliningDuringVPAtWarm.

mpirvu commented 2 years ago

SPECjbb2005 seems to degrade from enableAggressiveInlining, but also from acceptHugeMethods. xalan (from the DaCapo suite) regresses if scratchSpaceLimit=524288 is used.

The only options that are consistently helping are setting -Xms equal to -Xmx and -Xjit:dontDowngradeToCold,disableSelectiveNoServer.

mpirvu commented 2 years ago

I ran the DaCapo suite with Java 8 and the code delivered in PR #14277 and OMR https://github.com/eclipse/omr/pull/6294. Each benchmark used 20 iterations (19 for warm-up and 1 for measurement) and each experiment was repeated 10 times to compute the mean. For the benchmarks that ran successfully I see the following running times in ms (a negative delta means an improvement):

Benchmark      Default   -Xtune:throughput   Delta
avrora         2122      2134                +0.5%
eclipse        14572     14503               -0.5%
fop            213       168                 -21%
h2             3137      3137                0%
jython         1184      1206                +1.8%
luindex        503       466                 -7.4%
lusearch-fix   518       456                 -12%
pmd            567       553                 -2.5%
sunflow        1381      1277                -7.6%
xalan          413       357                 -13.8%

mpirvu commented 2 years ago

I reran jython and this time I got only a 1% regression due to -Xtune:throughput. I also computed the 95% confidence interval, which turned out to be 0.94%; thus, statistically speaking, the two configurations behave the same.
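
For reference, a sketch of how such an interval can be computed, assuming 10 repetitions and the Student t critical value for 9 degrees of freedom (a generic statistics sketch, not code from the benchmark harness):

   #include <cmath>
   #include <vector>

   // Sketch: half-width of the 95% confidence interval for the mean of the
   // measured running times. T95 is hard-coded for n = 10 runs (9 degrees of
   // freedom); use the appropriate t value for other sample sizes.
   double ciHalfWidth95(const std::vector<double> &runs)
      {
      const double T95 = 2.262; // t_{0.975,9}
      double n = (double)runs.size();
      double sum = 0.0;
      for (double r : runs)
         sum += r;
      double mean = sum / n;
      double ss = 0.0;
      for (double r : runs)
         ss += (r - mean) * (r - mean);
      double stddev = std::sqrt(ss / (n - 1.0)); // sample standard deviation
      return T95 * stddev / std::sqrt(n);        // +/- this around the mean
      }
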

vijaysun-omr commented 2 years ago

Those are very good results, and I am not worried about any of the minor regressions we see. How much of these gains comes from the -Xms == -Xmx setting and how much from everything else? I am curious, since it is probably easy to run these apps and get that.

mpirvu commented 2 years ago

(image attachment: per-option breakdown of the gains; not captured in this transcript)

vijaysun-omr commented 2 years ago

I wonder what would happen to throughput if we avoided doing the logic enabled by enableAggressiveInlining even at the opt levels higher than warm, i.e. I wonder if we can just put our faith in the frequencies as we do at warm and keep inlining within reasonable bounds for higher opt levels. Since they already use higher absolute thresholds for the inlining budget etc., I feel we would get more inlining at higher opt levels, as we want, but by avoiding the logic enabled by enableAggressiveInlining it would be more focused.

mpirvu commented 2 years ago

enableAggressiveInlining does this:

   self()->setOption(TR_RestrictInlinerDuringStartup, false);
   self()->setOption(TR_DisableInliningDuringVPAtWarm, false);
   self()->setOption(TR_DisableConservativeColdInlining, true);
   self()->setOption(TR_DisableConservativeInlining, true);
   self()->setOption(TR_InlineVeryLargeCompiledMethods, true);

   _trivialInlinerMaxSize = 100;
   _bigCalleeFreqCutoffAtWarm = 0;
   _bigCalleeFreqCutoffAtHot = 0;
   _bigCalleeThreshold = 600;
   _bigCalleeHotOptThreshold = 600;
   _bigCalleeThresholdForColdCallsAtWarm = 600;
   _bigCalleeThresholdForColdCallsAtHot = 600;
   _maxSzForVPInliningWarm = 600;

TR_RestrictInlinerDuringStartup is only enabled during -Xquickstart and applies to the trivial inliner. There is no need to disable it in -Xtune:throughput because it is already disabled (besides, -Xtune:throughput does not perform cold compilations).
TR_DisableInliningDuringVPAtWarm: as the name suggests, this does not apply to hot compilations. Experiments showed that allowing inlining during VP at warm leads to a substantial increase in CPU overhead. It should stay 'true' for now (i.e. do not inline during VP in warm bodies).
TR_DisableConservativeColdInlining only applies to cold compilations, so it is not relevant for -Xtune:throughput.
TR_DisableConservativeInlining: conservativeInlining changes "border frequencies", but it has no effect on hot/scorching compilations. conservativeInlining is used by default.
TR_InlineVeryLargeCompiledMethods is enabled by default on Z, disabled on X/P. It applies mostly to warm compilations. For hot compilations it allows the inlining of scorching bodies into hot bodies.
_trivialInlinerMaxSize only applies to cold compilations, so it's not important for -Xtune:throughput.
_bigCalleeFreqCutoffAtWarm: as the name suggests, it only applies to warm compilations. The default value is 0 anyway.
_bigCalleeFreqCutoffAtHot applies to hot and scorching compilations. The default value is 40 (the aggressive value is 0).
_bigCalleeThreshold has different default values depending on the architecture. On x86 the default value is 400. The aggressive value is 600. Only used for warm compilations.
_bigCalleeThresholdForColdCallsAtWarm: default value 400, aggressive value 600. Only applies to warm compilations.
_bigCalleeThresholdForColdCallsAtHot: default value 500, aggressive value 600. Applies to hot/scorching compilations.
_maxSzForVPInliningWarm: default value 8, aggressive value 600. Only applies to warm compilations.

So, the only knobs that affect hot/scorching compilations are TR_InlineVeryLargeCompiledMethods, _bigCalleeFreqCutoffAtHot and _bigCalleeThresholdForColdCallsAtHot.

mpirvu commented 2 years ago

I tested the effect of acceptHugeMethods plus the 3 options mentioned above that can affect hot/scorching compilations, and I can say that none of them has a definite influence on DaCapo or SPECjbb2005:

acceptHugeMethods may improve h2 by 2.4% and SPECjbb2005 by 2.4%.
inlineVeryLargeCompiledMethods may improve jython by 2.3%, and may regress avrora by 0.9% and SPECjbb2005 by 1.8%.
bigCalleeThresholdForColdCallsAtHot may improve luindex by 2.8% and may regress sunflow by 1.5%.
_bigCalleeFreqCutoffAtHot does nothing.

I say "may" because the changes in performance are statistically insignificant.

tajila commented 1 year ago

@mpirvu was this option ever documented?

mpirvu commented 1 year ago

No, because we wanted to tune it some more before telling users about it.