CMake Performance regressions

dnakamura commented 4 years ago

Issue for tracking perf regressions on cmake builds

	xlinux	plinux	aix	zlinux	windows
Startup	0%	0%	0%	0%	0%
Footprint	0%	0%	0%	0%	0%
Throughput	0%	*1	0%	0%	0%

Simplified so perf gains, or measurements within noise are 0

*1 - Instability in benchmark environment.

Known / Suspected Issues:

[ ] By default the jit gets compiled with -O0 (this has been patched out for the xlinux and plinux benchmarks)
[ ] Not all libraries are using split debug info
[x] Symbol visibility is not properly being handled on gcc, ie all symbols are being marked as public. This will negatively dynamic link time during program startup/ dlopen() eclipse/omr#4676 ~
[ ] We likely aren't building with identical configuration options. Need to compare ddr blobs to check for diffs
[ ] #8314

Platforms with perf >= uma:

[x] xlinux
[ ] plinux
[x] zlinux
[x] aix
[ ] windows

dnakamura commented 4 years ago

Rebuilt xlinux with patches for symbol exports (eclipse/omr#4676) and ATOMIC_FREE_JNI (#8316). These changes result in liberty startup improvement of 2.1% and a footprint growth of 1.7%. Both represent less than 0.5 standard deviations from the uma builds. Trying to figure out the aix regressions. Obviously the gnu toolchain fixes aren't relevant, and I don't expect the ATOMIC_FREE_JNI to yield 6%

dnakamura commented 4 years ago

Digging into the diffs on the aix ddr blobs, it looks like there are significant configuration differences. Looking into patching the builds and perf testing. Notable config differences:

OMR_GC_CONCURRENT_SCAVENGER
OMR_GC_HEAP_CARD_TABLE
OMR_GC_TLH_PREFETCH_FTA
OMR_THR_SPIN_WAKE_CONTROL
OMR_THR_THREE_TIER_LOCKING
OMR_THR_YIELD_ALG
J9VM_GC_SUBPOOLS_ALIAS

Also as a side not it appears that LONG_MAX, LONG_MIN, and ULONG_MAX are getting different values ( ~~likely due to a bug in eclipse/omr#4645~~ no seems unrelated)

dnakamura commented 4 years ago

Still seeing some startup regressions on AIX, however according to the ddr blob it looks like the configuration changes aren't taking effect. Digging in to see if its ana actual configuration issue, or a bug with the cmake ddr handling.

dnakamura commented 4 years ago

Looks like the aix configuration issues seems to have gone away after rebasing all my branches and doing a fresh jenkins build. I suspect a was passing a stale SHA in. Now working on resolving some other issues that seem to have popped up with the rebase

dnakamura commented 4 years ago

Still seeing regressions with the updated build on aix. Started writing a small python script to help compare compile command lines for differences

dnakamura commented 4 years ago

While crunching numbers I noticed that there were some large outliers in my baseline runs which were throwing off the numbers. With a new baseline, startup is within the noise, however footprint is up 50%. Looking into the issue, but I suspect its a trivial mistake on my part. Also working on getting a zlinux build for benchmarking

dnakamura commented 4 years ago

Wind: Fighting through some issues to get windows to build properly on jenkins (local build works fine).

Zlinux: seeing ~7% regression in start up time, running throughput tests now.

AIX: SOlved issue with debug symbols which drops foorprint regressions from ~50% down to a more reasonable 2.5%. However in doing so startup is now at a 4.7% regression. I don't think its a result of my changes, but rather just some noise in the benchmark. I believe the issue lies in some jit configuration which is passed on the command line. Doing some experimentation to see if it resolves the regression. Also looking at where our 2.5% footprint is coming from

dnakamura commented 4 years ago

After much digging I've found a number of difference on zlinux command line options. Running test on the new build now. Digging into the aix command line options

dnakamura commented 4 years ago

Zlinux startup and footprint is now within the noise, however throughtput is taking a 16% hit on cold runs, and 4% on warm runs. Looking into differences in the jit and interpreter components.

I have a windows runs, however I realized that optimizations are disabled, and the interpreter is still built with msvc. Currently have a prototype with the interpreter building with clang, and working on getting optimizations turned on (due to some quirks with cmake it looks like it is trying to force a debug build)

fjeremic commented 4 years ago

Zlinux startup and footprint is now within the noise, however throughtput is taking a 16% hit on cold runs, and 4% on warm runs. Looking into differences in the jit and interpreter components.

What benchmark are you using? It would be hard to imagine the JIT causing a throughput regression for this type of work. Even if the compilation is slower we should still be generating the same code at the end.

dnakamura commented 4 years ago

@fjeremic I'm using the liberty throughput benchmark. IIRC I had similar numbers on xlinux when the jit was built with -O0

Determined cause of windows issue. Problem is in omrsig. There is a bug in the logic we use to determine the name of the standard library, causing it to fail when you build in release /w debug symbols. Have a functioning build now and working on packaging up to get perf numbers on
Plinux perf benchmarks are still broken. There was an issue with the scripts which got resolved on tuesday, but there a networking issue which is currently being investigated

dnakamura commented 4 years ago

Zlinux: Found what I believe is the issue for zlinux. Turns out that compiler configuration on the build machines had changed, but I had my changes based on old tooling scripts. This meant that the builds were getting reconfigured and used different versions of the compiler (due to slight differences in the fallback behavior of the uma and cmake builds. I have new benchmark runs to confirm this was the only issue, however currently the server is acting up and does not allow me to access the result data. Since prior to this I went over all the options with a fine toothed comb, I am confident that this is the only remaining issue.

Pliunx: Benchmark scripts still broken. Keeping an on the issues / tickets for a resolution

Windows: Benchmarks have been run, but again I am currently unable to access the data. However the data I got from before looks suspicious so I am investigating further (20% startup improvement with 1% footprint penalty)

dnakamura commented 4 years ago

Zlinux: With the compiler change we are now seeing an improvement of troughput vs the uma build.

Windows: Issue with the perf site turned out to be a database error. Database was rolled back to last backup which means I lost my most recent perf runs. However after re-running on the same build results in segfaults. Trying to investigate, but the problem is proving difficult to reproduce locally

dnakamura commented 4 years ago

Windows: identified missing defines in the jit, and a misconfiguration of the port library. With fixes applied segfaults are gone. Throughput shows ~ -3% on cold runs, +1.4% on warm runs. However on some of the cold runs the script failed to extract throughput measurements, investigating the issue.

Zlinux: Startup test on new zlinux builds still show 0% regression

dnakamura commented 4 years ago

Still looking at the footprint issues on windows. Took a while to get the tests to reproduce on my local machine, but I have it now and am looking at doing a more detailed analysis of the footprint.

Also spent some time cleaning up code in preparation for opening PRs with the various performance related changes

dnakamura commented 4 years ago

So the windows footprint issues are traceable back to extra data in the shared classes cache. For whatever reason the cmake runs are ending with with more methods in the shared classes. Not sure exactly what our heuristics are for adding methods to the cache, or how I could have affected that. Moving on to aix until I get some answers in that department

DanHeidinga commented 4 years ago

So the windows footprint issues are traceable back to extra data in the shared classes cache. For whatever reason the cmake runs are ending with with more methods in the shared classes. Not sure exactly what our heuristics are for adding methods to the cache, or how I could have affected that.

@mpirvu @hangshao0 Can you weigh in on why we'd be seeing more methods in the SCC with a cmake-based build?

@dnakamura Can you share details on the differences you're seeing? Is this a couple of methods or a large num? Detailed output from -Xshareclasses:printStats may help track this down

hangshao0 commented 4 years ago

I will let Marius comment on the AOT part, but cmake should have no impact on the romclasses. Make sure to run -Xshareclasses:destroy before the test to clear the possible shared cache left behind by another JVM.

mpirvu commented 4 years ago

Same for AOT: there should not be any differences in the number of AOT methods based on whether cmake is used or not.

dnakamura commented 4 years ago

From uma build:

Current statistics for cache "liberty-sufp-OpenJ9-JDK11-x86-64_windows_uma-20200213": 

Cache created with:
    -Xnolinenumbers                      = false
    BCI Enabled                          = true
    Restrict Classpaths                  = false
    Feature                              = cr

Cache contains only classes with line numbers

base address                         = 0x000000C392D09000
end address                          = 0x000000C3968B0000
allocation pointer                   = 0x000000C393F0FB98

cache layer                          = 0
cache size                           = 62913952
softmx bytes                         = 62913952
free bytes                           = 29958032
Reserved space for AOT bytes         = -1
Maximum space for AOT bytes          = -1
Reserved space for JIT data bytes    = -1
Maximum space for JIT data bytes     = -1
Metadata bytes                       = 975568
Metadata % used                      = 2%
Class debug area size                = 5001216
Class debug area used bytes          = 3131738
Class debug area % used              = 62%

ROMClass bytes                       = 18901912
AOT bytes                            = 7648112
JIT data bytes                       = 64936
Zip cache bytes                      = 0
Startup hint bytes                   = 240
Data bytes                           = 363936

# ROMClasses                         = 7051
# AOT Methods                        = 3684
# Classpaths                         = 4
# URLs                               = 822
# Tokens                             = 0
# Zip caches                         = 0
# Startup hints                      = 2
# Stale classes                      = 0
% Stale classes                      = 0%

Cache is 52% full

Cache is accessible to current user = true

From cmake build:

.--------------------------
| Printing shared class cache stats:

D:/perffarm/sdks/OpenJ9-JDK11-x86-64_windows_cm-20200224/sdk/bin/../bin/java -Xshareclasses:name=liberty-sufp-OpenJ9-JDK11-x86-64_windows_cm-20200224,printStats

Current statistics for cache "liberty-sufp-OpenJ9-JDK11-x86-64_windows_cm-20200224": 

Cache created with:
    -Xnolinenumbers                      = false
    BCI Enabled                          = true
    Restrict Classpaths                  = false
    Feature                              = cr

Cache contains only classes with line numbers

base address                         = 0x000000AFE9639000
end address                          = 0x000000AFED1E0000
allocation pointer                   = 0x000000AFEA83F570

cache layer                          = 0
cache size                           = 62913952
softmx bytes                         = 62913952
free bytes                           = 28826400
Reserved space for AOT bytes         = -1
Maximum space for AOT bytes          = -1
Reserved space for JIT data bytes    = -1
Maximum space for JIT data bytes     = -1
Metadata bytes                       = 1015336
Metadata % used                      = 2%
Class debug area size                = 5001216
Class debug area used bytes          = 3131662
Class debug area % used              = 62%

ROMClass bytes                       = 18900336
AOT bytes                            = 8732584
JIT data bytes                       = 73904
Zip cache bytes                      = 0
Startup hint bytes                   = 240
Data bytes                           = 363936

# ROMClasses                         = 7050
# AOT Methods                        = 4355
# Classpaths                         = 4
# URLs                               = 821
# Tokens                             = 0
# Zip caches                         = 0
# Startup hints                      = 2
# Stale classes                      = 0
% Stale classes                      = 0%

Cache is 54% full

Cache is accessible to current user = true

DanHeidinga commented 4 years ago

@mpirvu does the data above seem outside the normal run-to-run variation you'd expect in numbers of AOT methods?

mpirvu commented 4 years ago

The difference in number of AOT methods is quite big: 3684 vs 4355. I don't think this is run-to-run variation. Do these numbers stay the same on a re-run? Is it possible that with cmake the flags passed to the JIT compiler are somewhat different or that some configuration flag has a different value?

dnakamura commented 4 years ago

Run to run variation looks to be about +/-30 for both cmake and uma builds. I don't see anything significant in the build configuration dif. There are some unimportant diffs, eg. J9VM_BUILD_DROP_TO_HURSLEY etc

mpirvu commented 4 years ago

@dnakamura Could you please generate some verbose logs with -Xjit:verbose={compilePerformance},vlog=PathToVlog from both configurations and send them to me? The runs should be done with an empty SCC (i.e. cold runs). Thanks

dnakamura commented 4 years ago

@mpirvu jitlogs.zip

mpirvu commented 4 years ago

There is a larger number of compilations in the cmake run, but I don't see a reason for it. One strange thing that happens in both cases is that the number of compilations that occur after start-up is over is larger the the number of compilations that happen after start-up. Since it happens in both cases it cannot explain the discrepancy in the number of compilations.

My suspicion is that somehow the invocation counts are different. @dnakamura could you please generate another verbose log with -Xjit:verbose={counts},vlog=PathToVlog >

dnakamura commented 4 years ago

jitlog.zip

mpirvu commented 4 years ago

These logs show the difference. In the cmake case all methods receive counts (450,125), while for UMA bootstrap methods have counts (450,125) while non-bootstrap one have counts (1000,250). Thus cmake approach will generate more compilations.

Relevant code is the following:

                  if (J9_SHARED_CACHE_DEFAULT_BOOT_SHARING(jitConfig->javaVM) &&
                      J9_ARE_ALL_BITS_SET(jitConfig->javaVM->sharedClassConfig->runtimeFlags, J9SHR_RUNTIMEFLAG_ENABLE_CACHE_NON_BOOT_CLASSES))
                     {
                     useLowerCountsForAOTCold = true;
                     }
                  else if (compInfo->getPersistentInfo()->getElapsedTime() <= (uint64_t)compInfo->getPersistentInfo()->getClassLoadingPhaseGracePeriod() &&
                           TR::Options::isQuickstartDetected() &&
                           fe->isClassLibraryMethod((TR_OpaqueMethodBlock *)method))
                     {
                     useLowerCountsForAOTCold = true;
                     }

The intent was to take the if part when using the non-default SCC. However, due to a bug which does not show wit cmake, we take the else part. The solution was delivered here: https://github.com/eclipse/openj9/pull/8668

fjeremic commented 4 years ago

So this is just a case of running with old code? i.e. it should be fixed when rebased on latest master branch?

mpirvu commented 4 years ago

Yes, the problem will not show when rebased. The footprint increase (together with the start-up improvement) is expected.

dnakamura commented 4 years ago

rebased and building now. Also going back and rebasing other platforms and making sure that it doesn't erase footprint parity we see there

dnakamura commented 4 years ago

Windows regressions have gone away. Re-ran all other platforms and they are still looking ok with the changes. Looking into the aix footprint, looks like the issue is the size of the jit library. Going over the compile/link options there to figure it out

dnakamura commented 4 years ago

Been sick the last few days, so I've had limited bandwidth. However I have identified an issue where exports were not properly identified for the jit on aix. After fixing both footprint and startup are within the noise

mleipe commented 4 years ago

Sounds like this is progressing well! Can you please ensure that you capture your performance test results for all of the platforms you've completed? There's a table at the start of the issue but it's not clear to me how current that data is, so a new table in a new comment below to capture the current point in time might be useful. Thanks!

dnakamura commented 4 years ago

I've done a full update on the table at the top and putting a copy below

	xlinux	plinux	aix	zlinux	windows
Startup	0%	0%	0%	0%	0%
Footprint	0%	0%	0%	0%	0%
Throughput	0%	*1	0%	0% *2	-0.8%

Simplified so perf gains, or measurements within noise are 0

1 - Instability in benchmark environment. 2 - Showing 15% gains. Seems worth investigating

There is a workaround for the plinux benchmark issue, unfortunately its still not fully stable making it hard to get an un-interrupted benchmark run. I've managed to successfully capture a clean baseline run, and am now fighting to get a clean cmake run.
Windows throughput regression of about 0.8%. Working on hunting it down, but given the magnitude, I suspect it will be tricky to find
zlinux is showing throughput gains of 15% which seems unusual and worth investigating. However, given that its a perf gain, it has been sitting at the bottom of my priority list

dnakamura commented 4 years ago

Digging into the issues on plinux, it appears that there are different issues. The main issue was that the the network interface on the database server would go down, resulting in timeouts while attempting to connect to the database. The new issue is a bit more complicated (but still results in timeouts while connecting to the database). The perf benchmark script has several layers of invoking subscripts on different machines (some of which with time limits). Unfortunately, because all the output from the scripts is buffered, and due to limits in the remote management interfaces used, a script exceeding its time limit is killed, and all partial output from stdout is silently discarded. My current working theory is that one of the scripts which is responsible for starting the database server is timing out. However the top level script ignores this and attempts to run the benchmark anyway, which results in timeout errors when liberty attempts to connect to the database server

dnakamura commented 4 years ago

Windows: Re-ran benchmarks to try to get a larger sample size. 0.8% regression remains, now with a 0.2% confidence interval. Identified an issue where the method handle interpreter was being built with MSVC instead of clang. Spinning up a fixed build to benchmark.
z linux - Identified an outlier in the uma runs that was dragging the scores down. Excluding the outlier brings the cmake build to within ~1% of uma builds. Working on re-running to confirm it is just the outlier, however both builds are failing about 50% of the time. Appears to be an issue with the perf machine

dnakamura commented 4 years ago

windows First run of windows benchmark came back inconclusive. Value was within noise, but the confidence interval was larger. Ran a second set of benchmarks and the value is still behind the uma builds. Went over the build options again, and found a couple of differing options in the jit. Working on getting a new build and tests

dnakamura commented 4 years ago

Zlinux numbers are in and no regressions shown there
Windows: Ive tweaked some of the benchmark parameters to try and capture some of the variability I'm seeing the results. Normally a single benchmark run consists of 1 warm up iteration with a single client, followed by 3 full perf warm up ierations, and a single measurement iteration. I've changed to 2 warm up runs (as the third warmup run has throughput numbers which are the same or below the normal measured run), and increased the number of measured runs to 3. Throughput numbers on the cmake build are still outside the noise margin, but only just. However I believe there may be an issue with the benchmark machine as the benchmark results are showing a strong time dependent variation.

dnakamura commented 4 years ago

Windows: For some background, the benchmark tool interleaves runs from the baseline jvm, and the jvm under test eg. B-T-B-T-B-T .... Note that the baseline jvm is always run first. I hypothesized that this, combined with the fact that the perf numbers showed an overall downward trend would give the baseline build a very slight edge over the test build. To test this I re-ran the benchmark with the cmake build marked as the "baseline", and the situation flips with the cmake build coming out marginally ahead of the uma build. Pooling results with a run with the uma build marked as the baseline shows the perf to be essentially identical (cmake build ahead by 0.3%)

fjeremic commented 4 years ago

Sounds like noise to me.

mleipe commented 4 years ago

Who can tell us for sure, @fjeremic ?

fjeremic commented 4 years ago

Who can tell us for sure, @fjeremic ?

The data from the previous comment. It seems to imply that interleaving the runs A-B-A-B vs. B-A-B-A yielded two different results, but with the same trend, which to me implies that A == B in terms of performance.

dnakamura commented 4 years ago

Plinux throughput numbers are in using alternate benchmark. They appear to be within noise (although the noise seems larger on this test, investigating if that is just the nature of the benchmark, or there was something going on with the machine

dnakamura commented 4 years ago

Not a lot to report today

Currently re-running plinux test to try to reduce noise margin
Cleaning up patches for prs
work arround with an issue on xlc on zos where something which is documented to work, apparently doesn't (I believe to due to a bug in their c library implementation)

dnakamura commented 4 years ago

Extra Plinux tests are in and perf is still identical, but with a much smaller noise margin now.

Initial throughput tests on osx are showing no regressions, working on getting startup/footprint measurements

dnakamura commented 4 years ago

After a little bit of hacking I've managed to get the liberty startup/footprint scripts running on osx. Looks like the cmake build there is also showing no regressions, meaning osx build is at parity, and we should be good to switch our osx builds over once cmake is installed on the build machines (ticket already open)

eclipse-openj9 / openj9

CMake Performance regressions #8183