mpirvu opened this issue 2 years ago
@vijaysun-omr, let me know if you have other ideas on how throughput can be improved through tuning.
I do feel that some way to "encourage" recompilations may need to be built into this option, i.e. if not `-Xjit:scorchingSampleThreshold=3000`, then we still need something else, e.g. maybe some reduction in the sampleInterval?
Note that sometimes throughput depends on how quickly we ramp up for short running applications and so we cannot go overboard with recompilations either.
You may know the history of the `-Xnoquickstart` logic and whether there is anything from it to be borrowed for `-Xtune:throughput`. Having said that, I am interested in getting something delivered soon, with the excellent list you started with as the basis for it. So I don't really want to hold up that initial and likely most impactful delivery to pick up every piece of logic we want to include. We can always add that stuff later as we think of it.
`-Xnoquickstart` is no longer used and it had a different motivation: since Testarossa didn't have DLT at that time, it was a hack to boost performance for benchmarks that have the most important method called just once. Under the hood, `-Xnoquickstart` enabled `count=0` for loopy methods.
@vijaysun-omr I wonder about the interaction between `-Xaggressive` and `-Xtune:throughput`. We could alias them, but based on the public description of `-Xaggressive` ("Enables performance optimizations and new platform exploitation that are expected to be the default in future releases."), I would keep the two options independent. For now, they don't clash with each other.
On the other hand, `-Xquickstart` does clash with `-Xtune:throughput`. We have two options here: ignore `-Xtune:throughput` when `-Xquickstart` is present (as we currently do for `-Xtune:virtualized`), or let the last conflicting option on the command line win.
I agree with respect to keeping -Xaggressive and -Xtune:throughput different since the latter is really more about changing the heuristics that we have internally to advantage one metric (throughput) over others, whereas the former is what you mentioned about new features expected to become default in the future.
Regarding your question about conflicting behavior between `-Xquickstart` and `-Xtune:throughput`, do we actually have a precedent in this regard? i.e. my understanding was that we don't really detect conflicts like this and would just give the user what they asked for, i.e. enable different pieces of code as per what the user asked for, even if they are conflicting. For example, until now we would not disable `-Xquickstart` (or vice versa) if one also specified `acceptHugeMethods`. So, one basic question is whether we need to do conflict resolution at all in the JVM, i.e. do we really expect users to misuse this new option any more than they might have misused existing options?
-Xquickstart disables IProfiler, sets low counts and performs all initial compilations at cold. -Xtune:throughput does the exact opposite: wants a lot of IProfiler info (and therefore large counts) and performs all initial compilations at warm. In this case I think we need to do conflict resolution and pick one of them.
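The antagonistic settings described above could be sketched like this (a hypothetical illustration only; the field names and count values are made up, not the real OpenJ9 internals):

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of why -Xquickstart and -Xtune:throughput conflict:
// they pull the same knobs in opposite directions.
struct JitConfig {
   bool iprofilerEnabled;      // -Xtune:throughput wants lots of IProfiler info
   int initialInvocationCount; // low counts compile early; high counts profile first
   std::string initialOptLevel;
};

JitConfig quickstartConfig() {
   // fast startup: no profiling, early and cheap cold compilations
   return { false, 250, "cold" };
}

JitConfig throughputConfig() {
   // better steady-state code: profile-driven warm compilations
   return { true, 10000, "warm" };
}
```

A JVM cannot honor both configurations at once, which is why some form of conflict resolution (or "last one wins") is needed.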
Okay, but where would this stop ? Will we also do conflict resolution between -Xtune:virtualized and -Xtune:throughput ? What about sub-options ? I mean, what if a user specified dontDowngradeToCold and -Xquickstart at the same time ?
One position is that we will process the command line left to right and set whatever internal flags are implied by the options being processed in that order (if there is a conflict, so be it). Is there a precedent for conflict resolution on the JIT side ? Maybe there are precedents in the VM like different GC policies or specifying -Xshareclasses and -Xshareclasses:none at the same time. But in those cases, it is simply not possible to implement what is being requested, whereas the difference in your case is that it is possible to implement, since it is just heuristics maybe ?
In my current prototype I choose between `-Xtune:virtualized` and `-Xtune:throughput` based on which option appears last on the command line. This is going to be documented since it's a public/supported option. I have to choose between the two because they have antagonistic effects: `-Xtune:virtualized` uses less memory and less CPU than the default config, at the expense of a small throughput loss; `-Xtune:throughput` provides better throughput than the default config, at the expense of a footprint and CPU increase.
The options set internally by `-Xtune:virtualized` and `-Xtune:throughput` can be overridden with `-Xjit` options, but those are not supported and not supposed to be used directly by customers. If they do and the JVM does not behave as expected, it's their fault. For supported options we want clarity as to what the user should expect from an option.
Currently, if `-Xquickstart` is on the command line, the JIT will ignore `-Xtune:virtualized`. My original question was whether we should do the same for `-Xtune:throughput`, or follow the more general principle: the last conflicting option on the command line wins. I am leaning towards the latter because it's more generic and easier to remember.
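The "last one wins" rule amounts to plain left-to-right processing. A minimal sketch (hypothetical, not the actual OpenJ9 option parser):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: options are processed left to right, and each
// conflicting option simply overwrites the previously selected mode,
// so the last one on the command line wins.
enum class TuneMode { Default, Quickstart, Virtualized, Throughput };

TuneMode selectTuneMode(const std::vector<std::string> &args) {
   TuneMode mode = TuneMode::Default;
   for (const std::string &arg : args) {
      if (arg == "-Xquickstart")
         mode = TuneMode::Quickstart;
      else if (arg == "-Xtune:virtualized")
         mode = TuneMode::Virtualized;
      else if (arg == "-Xtune:throughput")
         mode = TuneMode::Throughput;
      // options that do not select a mode are left to their own processing
   }
   return mode;
}
```

The appeal of this scheme is that no pairwise conflict table is needed: the resolution falls out of the processing order.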
I am also thinking about aliasing `-Xquickstart` to `-Xtune:startup` in a future PR.
Restricting conflict resolution to publicly supported options makes sense to me. I am fine with obeying the "last one wins" rule rather than get fancier.
In terms of `-Xquickstart` though, if it gets mapped to `-Xtune:startup`, won't we then have a separate rule for `-Xtune:startup` compared to all the other `-Xtune` options? Is it possible to drop the special rule for `-Xquickstart` and `-Xtune:virtualized` if it is not externally documented behavior?
I looked at the throughput of DT7 when `-Xtune:throughput` is specified and it doesn't look good. The problem is an excruciatingly long rampup due to extremely high CPU overhead (caused by compilation, of course). After a full hour, the throughput is still half of what it needs to be, and the amount of CPU spent compiling is 1.8 hours (4P run, so we can compile on several threads).
I added options one by one to see which one is the most detrimental and found that `enableAggressiveInlining` produces the most damage. There is some relief if we keep `enableAggressiveInlining` but remove `acceptHugeMethods,scratchSpaceLimit=524288`, because some of the expensive methods will fail.
If we go ahead with the current implementation, some users are going to be disappointed. We can document the fact that `-Xtune:throughput` is not good for large apps, but how do you clearly define a large app? Maybe we need to dial down what `enableAggressiveInlining` does.
Some compilation stats for a DT7 run with `-Xtune:throughput`:

```
        Samples   MAX(ms)   TOTAL(ms)   AVG(usec)   MIN(usec)
Total     50567  129232.9   8113089.0      160442         149
cold        692    6647.6     37434.5       54096         357
warm      49662  129232.9   8073458.0      162568         149
hot           4     923.4      2090.1      522516      183560
jni         209      20.7       106.5         509         151

inv recompilations = 6
sync compilations  = 6
DLT compilations   = 23
numFailures        = 695
max_Q_SZ           = 178553
```
As seen above, it's the warm compilations that cause the high CPU. The most expensive one that succeeded took 129 sec.
This longer run actually reveals something interesting: the number of warm compilations is larger than it needs to be. The default case has about 22K compilations, but this throughput-oriented case shows 50K compilations. I will need to write some scripts to understand why this is the case.
Interesting, but can we detect "a large app" as we have tried to do in the past and dial down aspects that we don't believe are appropriate ?
Are you able to reason about why we seem to have cold compilations despite setting initial opt level to warm ?
We already have a knob for "largeApp" based on the number of classes loaded, so I'll exploit that. The cold compilations are compilations that failed at warm.
One problem with the large app detection is that it's a binary decision. I remember that a while ago we had a SOABench regression which was caused (indirectly) by a new version of WAS that was loading a few extra classes, just enough to make the transition from a small app to a large app.
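One way to soften that cliff would be to scale a knob gradually between the two regimes instead of flipping it. A hypothetical sketch (all thresholds and knob values here are made up for illustration, not the actual largeApp heuristic):

```cpp
#include <cassert>

// Hypothetical sketch of a gradual alternative to a binary small-app /
// large-app cutoff: the knob is interpolated linearly between two
// class-count thresholds, so a handful of extra loaded classes can no
// longer flip the heuristic on its own.
int scaledKnob(int numClassesLoaded,
               int smallAppClasses, int largeAppClasses,
               int smallAppValue, int largeAppValue) {
   if (numClassesLoaded <= smallAppClasses)
      return smallAppValue;
   if (numClassesLoaded >= largeAppClasses)
      return largeAppValue;
   // linear interpolation between the small-app and large-app regimes
   double t = static_cast<double>(numClassesLoaded - smallAppClasses) /
              static_cast<double>(largeAppClasses - smallAppClasses);
   return smallAppValue + static_cast<int>(t * (largeAppValue - smallAppValue));
}
```

With such a ramp, an app loading a few extra classes sees only a proportionally small change in the knob value rather than a regime switch.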
How would the DT7 run perform without enableAggressiveInlining ? If the problem is largely down to that one option, we know that option itself maps to several sub options underneath and we can add it to -Xtune:throughput later once we have tuned it better but go ahead with the rest of your PR for now if the performance looks reasonable.
Without `enableAggressiveInlining` the performance of DT7 looks reasonable.
I am determining which option under `enableAggressiveInlining` produces the most damage, and it appears to be `enableInliningDuringVPAtWarm`. Without it, the throughput is sort of OK, though obviously there is an increase in CPU (more than 2X) from the other inliner changes. For some reason, if I reduce `maxSizeForVPInliningAtWarm=600` to 0, I don't get the full benefit that I get from `disableInliningDuringVPAtWarm`.
SPECjbb2005 seems to degrade from `enableAggressiveInlining` but also from `acceptHugeMethods`.
xalan (from the DaCapo suite) regresses if `scratchSpaceLimit=524288` is used.
The only options that are consistently helping are `-Xms`==`-Xmx` and `-Xjit:dontDowngradeToCold,disableSelectiveNoServer`.
I ran the DaCapo suite with Java 8 and the code delivered in PR #14277 and OMR https://github.com/eclipse/omr/pull/6294. Each benchmark used 20 iterations (19 for warmup and 1 for measurement) and each experiment was repeated 10 times to compute the mean. For the benchmarks that ran successfully I see the following running times in ms (a negative delta means an improvement).
Benchmark | Default | Xtune:throughput | Delta |
---|---|---|---|
avrora | 2122 | 2134 | +0.5% |
eclipse | 14572 | 14503 | -0.5% |
fop | 213 | 168 | -21% |
h2 | 3137 | 3137 | 0% |
jython | 1184 | 1206 | +1.8% |
luindex | 503 | 466 | -7.4% |
lusearch-fix | 518 | 456 | -12% |
pmd | 567 | 553 | -2.5% |
sunflow | 1381 | 1277 | -7.6% |
xalan | 413 | 357 | -13.8% |
I reran jython and this time I got only a 1% regression due to `-Xtune:throughput`. I also computed the 95% confidence interval, which turned out to be 0.94%; thus, statistically speaking, the two configurations behave the same.
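The confidence-interval arithmetic above can be reproduced with a short helper (a sketch only: the Student-t value is hard-coded for the 10-run setup described earlier, and the sample data in the checks is made up, not the actual jython timings):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Half-width of the 95% confidence interval for the mean of n == 10
// repeated benchmark runs, using the Student-t distribution.
double ciHalfWidth95(const std::vector<double> &samples) {
   const double t9 = 2.262; // two-sided 95% t value for 9 degrees of freedom
   const double n = static_cast<double>(samples.size());
   double mean = 0.0;
   for (double s : samples)
      mean += s;
   mean /= n;
   double var = 0.0;
   for (double s : samples)
      var += (s - mean) * (s - mean);
   var /= (n - 1.0);               // unbiased sample variance
   return t9 * std::sqrt(var / n); // half-width of the interval for the mean
}
```

If the observed delta between two configurations (here about 1%) falls within the interval half-width (0.94% relative to the mean), the difference is not statistically significant at the 95% level.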
Those are very good results, and I am not worried about any of the minor regressions we see. How much of these gains come from the -Xmx == -Xms setting and how much from everything else ? I am curious, since it is probably easy to run these apps and get that.
I wonder what would happen to throughput if we avoided doing the logic enabled by `enableAggressiveInlining` even at the opt levels higher than warm, i.e. I wonder if we can just put our faith in the frequencies as we do at warm and keep inlining within reasonable bounds for higher opt levels. Since they already use higher absolute thresholds for inlining budget etc., I feel we would get more inlining at higher opt levels as we want, but by avoiding the logic enabled by `enableAggressiveInlining` it would be more focused.
`enableAggressiveInlining` does this:

```cpp
self()->setOption(TR_RestrictInlinerDuringStartup, false); // only relevant under -Xquickstart
self()->setOption(TR_DisableInliningDuringVPAtWarm, false); // allow inlining during VP in warm bodies
self()->setOption(TR_DisableConservativeColdInlining, true); // cold compilations only
self()->setOption(TR_DisableConservativeInlining, true);     // relax the "border frequencies"
self()->setOption(TR_InlineVeryLargeCompiledMethods, true);  // e.g. scorching bodies into hot bodies
_trivialInlinerMaxSize = 100;                // cold compilations only
_bigCalleeFreqCutoffAtWarm = 0;              // warm only; default is already 0
_bigCalleeFreqCutoffAtHot = 0;               // hot/scorching; default is 40
_bigCalleeThreshold = 600;                   // warm only; default is 400 on x86
_bigCalleeHotOptThreshold = 600;
_bigCalleeThresholdForColdCallsAtWarm = 600; // warm only; default is 400
_bigCalleeThresholdForColdCallsAtHot = 600;  // hot/scorching; default is 500
_maxSzForVPInliningWarm = 600;               // warm only; default is 8
```
`TR_RestrictInlinerDuringStartup` is only enabled during `-Xquickstart` and applies to the trivial inliner. There is no need to disable it in `-Xtune:throughput` because it is already disabled (besides, `-Xtune:throughput` does not perform cold compilations).
`TR_DisableInliningDuringVPAtWarm`: as the name suggests, this does not apply to hot compilations. Experiments showed that turning it off leads to a substantial increase in CPU overhead. It should stay `true` for now (i.e. do not inline during VP in warm bodies).
`TR_DisableConservativeColdInlining` only applies to cold compilations, so it is not relevant for `-Xtune:throughput`.
`TR_DisableConservativeInlining`: conservative inlining changes the "border frequencies", but it has no effect on hot/scorching compilations. Conservative inlining is used by default.
`TR_InlineVeryLargeCompiledMethods` is enabled by default on Z, disabled on X/P. It applies mostly to warm compilations. For hot compilations it allows the inlining of scorching bodies into hot bodies.
`_trivialInlinerMaxSize` only applies to cold compilations, so it's not important to `-Xtune:throughput`.
`_bigCalleeFreqCutoffAtWarm`: as the name suggests, it only applies to warm compilations. The default value is 0 anyway.
`_bigCalleeFreqCutoffAtHot` applies to hot and scorching compilations. The default value is 40 (the aggressive value is 0).
`_bigCalleeThreshold` has different default values depending on the architecture. On x86 the default value is 400; the aggressive value is 600. Only used for warm compilations.
`_bigCalleeThresholdForColdCallsAtWarm`: default value is 400, aggressive value is 600. Only applies to warm compilations.
`_bigCalleeThresholdForColdCallsAtHot`: default value is 500, aggressive value is 600. Applies to hot/scorching compilations.
`_maxSzForVPInliningWarm`: default value is 8, aggressive value is 600. Only applies to warm compilations.
So, the only knobs that affect hot/scorching compilations are: `TR_InlineVeryLargeCompiledMethods`, `_bigCalleeFreqCutoffAtHot` and `_bigCalleeThresholdForColdCallsAtHot`.
I tested the effect of `acceptHugeMethods` plus the 3 options mentioned above that can affect hot/scorching compilation, and I can say that none of them has a definite influence on DaCapo or SPECjbb2005.
`acceptHugeMethods` may improve h2 by 2.4% and SPECjbb2005 by 2.4%.
`inlineVeryLargeCompiledMethods` may improve jython by 2.3%, regress avrora by 0.9% and regress SPECjbb2005 by 1.8%.
`bigCalleeThresholdForColdCallsAtHot` may improve luindex by 2.8% and may regress sunflow by 1.5%.
`_bigCalleeFreqCutoffAtHot` does nothing.
Above I say "may" because the changes in performance are statistically insignificant.
@mpirvu was this option ever documented?
No, because we wanted to tune it some more before telling users about it.
This option would give users an easy way to tune the OpenJ9 JVM towards better throughput scores, possibly at the expense of other performance metrics like footprint, start-up time or ramp-up time.
Some of the changes that could be done under the covers include:
- `-Xjit:enableAggressiveInlining`
- `-Xjit:acceptHugeMethods`
- `-Xjit:scratchSpaceLimit=524288`
- `-Xjit:initialOptLevel=warm`
- `-Xnoaot`
- `-Xjit:disableSelectiveNoServer`

Other changes that may not be part of the default `-Xtune:throughput` mode include `-XX:-EnableHCR` and `-Xjit:scorchingSampleThreshold=3000`.