PhilWakelin opened this issue 2 years ago
@PhilWakelin In the box folder above, the verbose log from jitlog6 is a copy of the one from jitlog5 (the cold run)
@mpirvu jitlog6 now updated with correct files.
I've rerun some startup tests for the SCC with the additional JVM options -Xtune:virtualized
and -Xaot:forceaot.
Charts attached, but in summary both options increased startup time with the SCC, although the increase with the -Xaot option was not statistically significant.
However, both options did significantly decrease CPU usage.
@PhilWakelin Could you please try a build after 2021/09/11? There is a change in there https://github.com/eclipse-openj9/openj9/pull/13489 that may help.
Also, in the vlogs I see many AOT load failures due to compilationAotClassReloFailure. These may subside with additional runs, so we need to try 2-3 warm runs in succession.
Finally, z/OS uses AOT compression, which adds a bit of overhead. Since the warm run has a very large number of AOT loads, maybe we should disable AOT compression with -Xaot:forceaot,disableAOTBytesCompression
Thanks @mpirvu, I will try out a later build to see how this impacts startup and will also try -Xaot:forceaot,disableAOTBytesCompression. All the mean startup figures are averages from 50 startup iterations, with a clean Java SCC and Liberty cache. The first 2 runs were discarded from the figures, so I don't think adding more runs will help the figures. However, the data in the JIT logs I provided was from the first 2 (normally discarded) runs. I can gather JIT logs from a later iteration if needed; let me know if you want this.
Yes please. I expect to see a reduction in the number of AOT load failures.
Another interesting experiment would be to run with SCC but without AOT (-Xnoaot). This will allow sharing of the ROMClasses and, in theory, should be faster than no SCC. If that's not the case, then we should start looking at lock contention for the SCC.
I've run some measurements in a distributed (LinTel) environment to see if Phil's zOS findings are reproducible on other platforms, and see something similar ... as the number of apps in the configuration grows, along with the number of classes loaded at startup, the benefit of using the SCC drops, and SCC use becomes a disadvantage when the class count gets very large.
My focus was on reducing the time/CPU spent jitting, but I started to think that the bulk of the overhead is elsewhere: the no-SCC case finishes in 167 sec and uses 95 sec of CPU for compilation. When using the SCC, the elapsed time doubles (300-320 sec), but the CPU spent on compilation is less than half (44 sec). A profile would be helpful in determining what to study next. If there is a lot of CPU spent in jitted code, then maybe code quality matters a lot. If there is a lot of CPU burnt in synchronization (spinning), then we should look into ways of reducing that. If there is a lot of time spent in interpretation, then accelerating the transition from interpreted to jitted may still make sense.
Based on Marius' suggestion above, I'm currently running the 100-app startup with -Xnoaot and watching it in perf top. There is heavy locking at the top of the list.
Samples: 7M of event 'cycles:ppp', 4000 Hz, Event count (approx.): 220255438418 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
18.65% libj9thr29.so [.] omrthread_spinlock_acquire
17.42% libjclse29.so [.] copyString
7.42% libj9jit29.so [.] TR_IProfiler::findOrCreateMethodEntry
2.65% [kernel] [k] sysret_check
2.15% libj9vm29.so [.] VM_BytecodeInterpreterCompressed::run
1.92% libj9jit29.so [.] jitMethodTranslated
1.75% libj9jit29.so [.] TR_IPMethodHashTableEntry::add
1.63% libj9gc29.so [.] MM_ConcurrentGC::localMark
1.31% libj9shr29.so [.] SH_ROMClassManagerImpl::locateROMClass
1.27% libj9vm29.so [.] allocateRAMClassFragmentFromFreeList
1.01% libj9shr29.so [.] ClasspathItem::compare
0.90% libj9gc29.so [.] MM_MarkingScheme::markObject
0.90% [kernel] [k] system_call_after_swapgs
0.75% libj9shr29.so [.] ClasspathItem::itemAt
@gjdeval Could you please generate a JLM report to identify the lock that is contended?
@gjdeval collected a JLM from a run with SCC enabled but AOT disabled. Here's an excerpt for the system monitors:
%MISS GETS NONREC SLOW REC TIER2 TIER3 %UTIL AVER_HTM MON-NAME
91 23 23 21 0 30627 936 0 49665 [00007FB5F80DF4B8] MM_Scavenger::freeCacheMonitor
68 1561372 1561372 1056357 0 92909924 2829445 15 246487 [00007FB5F8181708] &(tempConfig->jclCacheMutex)
15 314693 314693 47888 0 1512543 45911 2 175013 [00007FB5F80DF828] BCVD verifier
8 1288 1288 103 0 175755 5153 0 4984 [00007FB5F80DDD58] MM_GCExtensions::gcStats
4 106176 106061 4553 115 6810485 207901 3 607385 [00007FB5F80E07F8] JIT-CompilationQueueMonitor
3 238662 238662 6129 0 9361417 284958 0 0 [00007FB5F8121130] MM_MemoryPoolAddressOrderedList:_heapLock
1 10172 10172 148 0 93181 2595 0 2660 [00007FB5F8005118] MM_ParallelDispatcher::synchronize
1 8183 8183 114 0 203179 6209 0 12171 [00007FB5F80DE228] VM exclusive access
1 9424296 8399840 86596 1024456 557909404 16884653 8 23090 [00007FB5F80DEB18] VM class table
1 705 705 6 0 10711 300 0 3785 [00007FB5F80051C8] MM_WorkPackets::inputList
1 24883 24883 142 0 470369 13730 0 5678 [00007FB5F8004AE8] MM_SublistPool
1 55164 55159 304 5 876542 26609 0 151670 [00007FB5F80DE178] VM thread list
0 312239 312239 1491 0 2395582 71664 0 0 [00007FB5F811B640] MM_MemoryPoolAddressOrderedList:_heapLock
0 2856 2856 13 0 85929 2339 0 10264 [00007FB5F8004FB8] MM_ParallelDispatcher::workerThread
0 1636019 1580256 5304 55763 9006816 273924 2 30356 [00007FB5F80DF358] VM mem segment list
0 4020450 4020450 9775 0 18967315 561835 0 0 [00007FB5F80EF180] J9ModronFreeList:_lock
0 1886568 1886568 1844 0 3716645 108189 0 0 [00007FB5F80EF680] J9ModronFreeList:_lock
0 1889 1737 1 152 1473 45 0 2237 [00007FB4B803A7F8] Thread public flags mutex
0 2269 2114 1 155 1473 45 0 1093 [00007FB4B8038648] Thread public flags mutex
0 1414526 1414526 646 0 1010486 30732 0 4803 [00007FB5F8181E98] &_htMutex
0 54643 54643 17 0 47257 1424 2 726268 [00007FB5F8004148] &(PPG_mem_mem32_subAllocHeapMem32.monitor)
0 7689 3871 1 3818 1473 45 0 232493 [00007FB5F80DEA68] VM class loader blocks
0 86083 86083 16 0 31167 919 0 3253 [00007FB5F80DF1F8] VM monitor table
0 23077 23077 3 0 178636 4629 0 3605 [00007FB5F80DF408] MM_Scavenger::scanCacheMonitor
0 283913 283913 18 0 977863 11238 0 0 [00007FB5F811A9E8] MM_CopyScanCacheList:_sublists[]._cacheLock
0 638179 638179 39 0 6164954 161318 0 13125 [00007FB56C0011C8] JVM_RawMonitor
0 259087 259087 8 0 583278 5004 0 0 [00007FB5F811A738] MM_CopyScanCacheList:_sublists[]._cacheLock
0 557716 557716 14 0 21555 622 0 1011 [00007FB5F80DEC78] VM JNI frame
0 248385 248385 6 0 946128 13474 0 0 [00007FB5F811AB40] MM_CopyScanCacheList:_sublists[]._cacheLock
0 286171 286171 5 0 470113 2459 0 0 [00007FB5F811A890] MM_CopyScanCacheList:_sublists[]._cacheLock
0 760636 760636 12 0 82423 2294 0 5201 [00007FB5F8181D38] &_htMutex
0 2969028 2969028 41 0 307371 8686 0 1737 [00007FB5F81817B8] &_refreshMutex
0 2737948 2737948 17 0 100317 3048 0 117 [00007FB5F80DF2A8] VM mem segment list
0 248636 248636 1 0 112750 690 0 0 [00007FB5F811B1B0] MM_CopyScanCacheList:_sublists[]._cacheLock
0 2567726 2567726 8 0 37793 1072 0 240 [00007FB5F80DE018] JIT-PersistentAllocatorSegmentMonitor
0 1976423 1976423 5 0 129621 3578 0 2022 [00007FB5F8181FF8] &_htMutex
0 1903889 1903889 2 0 263470 7382 0 4281 [00007FB5F80E05E8] JIT-AssumptionTableMutex
0 7936782 7936782 0 0 0 0 0 494 [00007FB5F8004098] Thread global
0 6405561 6405442 0 119 0 0 0 71 [00007FB5F81805D8] Thread public flags mutex
0 5552670 5552670 0 0 116127 3139 0 618 [00007FB5F8182788] JIT-iprofilerMonitor
0 4931684 4931684 0 0 25252 670 0 34 [00007FB5F80E03D8] JIT-IProfilerPersistenceMonitor
0 4337964 4337841 0 123 0 0 0 74 [00007FB5F81803C8] Thread public flags mutex
0 2795901 2795901 0 0 6042 37 0 0 [00007FB5F807CC88] MM_PacketList:_sublists[]._lock
0 2719855 2719645 0 210 0 0 0 97 [00007FB56C0008D8] Thread public flags mutex
0 2582459 2582459 0 0 4504 126 0 59 [00007FB5F80DDEB8] JIT-PersistentAllocatorSmallBlockMonitor
0 1763987 1763987 0 0 25039 246 0 0 [00007FB5F807BFD8] MM_PacketList:_sublists[]._lock
The hottest monitors are `&(tempConfig->jclCacheMutex)` and `VM class table`. Searching the code base, `jclCacheMutex` appears in the `JavaVM` struct, but also in `J9SharedClassConfig`. In this case it's probably the latter, so we are hitting some SCC locking and class-related locking. The `BCVD verifier` lock also hits the slow path quite often. I am not sure whether `-XX:+ClassRelationshipVerifier` can improve the situation.
@hangshao0 Any comment on the `&(tempConfig->jclCacheMutex)` monitor shown in the JLM above?
I believe most of the SCC calls that enter the `jclCacheMutex` are on the finder path for classes. I've seen cases where classes are stored using the actual URL of each class file (`URL:path/packageName/ClassName.class`). If that is the case here, each class file will have a unique URL, and the benefit of caching the URL is lost: every time we will enter the `jclCacheMutex` and create/cache a URL that will never be reused. Classes should be stored using `URL:path` only (not including the package and class name).
Using `java -Xshareclasses:cacheDir=<directory>,name=<CacheName>,printStats=url` should show all the URLs.
Judging based on the number of URLs in the javacore, I think that we are hitting the case described by Hang above:
2SCLTEXTNRC Number ROMClasses = 280707
2SCLTEXTNAM Number AOT Methods = 9250
2SCLTEXTNAD Number AOT Data Entries = 1
2SCLTEXTNAH Number AOT Class Hierarchy = 4715
2SCLTEXTNAT Number AOT Thunks = 2103
2SCLTEXTNJH Number JIT Hints = 4815
2SCLTEXTNJP Number JIT Profiles = 4794
2SCLTEXTNCP Number Classpaths = 3
2SCLTEXTNUR Number URLs = 263203
@mpirvu here is a log of printstats URLs from one of the performance runs with 80 apps https://ibm.box.com/s/po2o2sjxio02l015x3q5y4ik0wh8ryn1
The logs show the problem reported by Hang above. I understand that the Liberty team is looking into a solution.
Jared has provided me with a quick-and-dirty patch of the Liberty AppClassloader which trims the URLs stored in the SCC as per Hang's guidance. The patch is not complete; it was provided just to see how much benefit we could get from improving handling of the URLs.
The patch reduces the number of URLs stored for my largest config (100 apps) by about two orders of magnitude.
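For illustration only, here is a minimal sketch of the kind of URL trimming described above. This is not Jared's actual patch, and the helper name is hypothetical; the idea is simply to store the container URL (`URL:path`) in the SCC rather than the per-class URL (`URL:path/packageName/ClassName.class`), so that all classes from the same container share one cached URL entry.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, for illustration only (not the actual Liberty patch).
final class SharedCacheUrlHelper {
    /**
     * Given the URL of an individual .class resource and the class name,
     * return the URL of the containing root (directory or jar root),
     * so that all classes from one container share a single SCC URL entry.
     */
    static URL containerUrl(URL classFileUrl, String className) throws MalformedURLException {
        String suffix = className.replace('.', '/') + ".class";
        String external = classFileUrl.toExternalForm();
        if (external.endsWith(suffix)) {
            return new URL(external.substring(0, external.length() - suffix.length()));
        }
        return classFileUrl; // unexpected shape: fall back to the original URL
    }
}
```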
before patch:
OL-210011-GA-cics-spring-multi-100-j9-828-sr6-fp36-javacores-cpus-28-thr-unl-runs-50-try-1-logs/run-20/javacore.20211031.153323.36106.0001.txt
2SCLTEXTNRC Number ROMClasses = 394351
2SCLTEXTNUR Number URLs = 332221
after patch:
usr/servers/cics-spring-multi-100/javacore.20211102.095643.869.0001.txt
2SCLTEXTNRC Number ROMClasses = 387611
2SCLTEXTNUR Number URLs = 3701
The classloader SCC patch appears to shift the main lock contention point to the AOT.
using AOT:
Startup: Avg: 185595 ms, Min: 125248 ms, Max: 339294 ms, StdDev: 83970 ms, SDev/Avg: 45.2%
Footprint: Avg: 4106 MB, Min: 4014 MB, Max: 4183 MB, StdDev: 51 MB, SDev/Avg: 1.2%
CPU usage: Avg: 1710.41 secs, Min: 802.30 secs, Max: 3814.94 secs, StdDev: 1273.91 secs, SDev/Avg: 74.48%
SU: 185595 FR: n/a FP: 4106 CPU: 1710.41 app: cics-spring-multi-100
without AOT:
Startup: Avg: 114807 ms, Min: 112549 ms, Max: 117885 ms, StdDev: 1461 ms, SDev/Avg: 1.3%
Footprint: Avg: 3840 MB, Min: 3792 MB, Max: 3873 MB, StdDev: 22 MB, SDev/Avg: 0.6%
CPU usage: Avg: 927.98 secs, Min: 918.05 secs, Max: 949.37 secs, StdDev: 7.16 secs, SDev/Avg: 0.77%
SU: 114807 FR: n/a FP: 3840 CPU: 927.98 app: cics-spring-multi-100
without AOT plus ClassRelationshipVerifier:
Startup: Avg: 102886 ms, Min: 100834 ms, Max: 104778 ms, StdDev: 1274 ms, SDev/Avg: 1.2%
Footprint: Avg: 3831 MB, Min: 3782 MB, Max: 3866 MB, StdDev: 23 MB, SDev/Avg: 0.6%
CPU usage: Avg: 875.75 secs, Min: 862.07 secs, Max: 886.58 secs, StdDev: 6.52 secs, SDev/Avg: 0.74%
SU: 102886 FR: n/a FP: 3831 CPU: 875.75 app: cics-spring-multi-100
I've provided Marius with a configuration matching the one which produced the above results (including Jared's test patch) so that he can investigate the AOT directly, without waiting for me to run tests at his direction.
A summary of experiments I performed using Jared's patch and `-XX:+ClassRelationshipVerifier` (tests were done on a Skylake machine with 8 hardware threads, loading 100 apps in Liberty):
(1) No SCC ==> 128 sec
(2) -Xnoaot ==> cold=148 sec, warm=110 sec
(3) default ==> cold=236 sec, warm=114 sec
Since perf profiles showed a relatively large number of ticks in `requestExistsInCompilationQueue()`, I investigated why we take what is supposed to be a very rare path. It turned out that we run out of code cache (256 MB) and then reject all compilation requests.
I increased the size of the code cache to 512 MB, which allowed more compilations to happen, but this increased start-up time:
(4) -Xcodecachetotal512m with AOT ==> cold=245 sec, warm=180 sec
Perf profiles showed a large amount of time in:
9.69% 1286399 [.] TR::CompilationInfo::addMethodToBeCompiled
7.96% 1056792 [.] TR::CompilationInfo::queueEntry
which happened because the compilation queue size reaches huge values: 40K-46K entries, and the JIT does a linear scan to look for duplicates and to find the appropriate insertion point based on priority (a sketch of this pattern is shown after result (5) below). The large compilation queue size (Q_SZ) is due to the large number of AOT loads (~135 K) and the amount of time it takes to perform such an AOT load. Typically an AOT load should take 40-80 usec, but in this case I see the average AOT load time around 2 ms, occasionally going over 300 ms. I suspect this happens because AOT loads need to acquire the VM-class-table lock at the end, and this lock is very contended due to the class loading activity. All in all, compilation activity cannot keep up with the rate at which application threads generate compilation requests; this translates into large queuing times, which in turn make the compilation process even slower. I have modified the code to address the two hot methods above and the new start-up times become:
(5) -Xcodecachetotal512m with AOT +hacks ==> cold=185 sec, warm=119 sec
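As a rough illustration of why the two methods above become so expensive at Q_SZ ~ 40K, here is a minimal sketch of a linked-list compilation queue that does a linear duplicate check plus a linear priority-ordered insert on every enqueue. This is illustrative Java, not OpenJ9's actual C++ implementation; the class and field names are hypothetical.

```java
import java.util.LinkedList;
import java.util.ListIterator;

// Illustrative only: not TR::CompilationInfo. Each add() walks the queue twice,
// so cost grows linearly with the number of pending entries.
final class NaiveCompileQueue {
    static final class Entry {
        final Object method;
        final int priority;
        Entry(Object method, int priority) { this.method = method; this.priority = priority; }
    }

    private final LinkedList<Entry> queue = new LinkedList<>();

    synchronized void add(Object method, int priority) {
        // O(n) duplicate scan, analogous in spirit to requestExistsInCompilationQueue()
        for (Entry e : queue) {
            if (e.method == method) {
                return; // already queued
            }
        }
        // O(n) walk to keep the queue ordered by priority (higher priority first)
        ListIterator<Entry> it = queue.listIterator();
        while (it.hasNext()) {
            if (it.next().priority < priority) {
                it.previous();
                break;
            }
        }
        it.add(new Entry(method, priority));
    }
}
```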
This is an improvement over (4), but still not as good as the case when we failed compilations due to insufficient code cache. At this point, the most expensive methods are:
24.88% 3066018 [.] omrthread_spinlock_acquire
4.79% 589870 [.] TR_RuntimeAssumptionTable::getBucketPtr
3.72% 458125 [.] jitMethodTranslated
3.67% 451755 [.] VM_BytecodeInterpreterCompressed::run
3.05% 376128 [.] TR::CompilationInfo::queueEntry
2.05% 252881 [k] do_syscall_64
1.87% 230901 [.] TR_IProfiler::findOrCreateMethodEntry
1.71% 210224 [.] MM_Scavenger::incrementalScanCacheBySlot
1.67% 205414 [.] SH_ROMClassManagerImpl::locateROMClass
1.57% 193825 [.] TR_PersistentCHTable::methodGotOverridden
0.96% 118205 [.] allocateRAMClassFragmentFromFreeList
0.86% 106273 [.] ClasspathItem::compare
0.85% 105219 [.] J9::MonitorTable::removeAndDestroy
One idea that I wanted to try is: when the compilation queue size grows too much, instead of adding another entry to the queue, replenish the invocation counter of the j9method with some value to delay its compilation. Ideally we would eliminate the chtable.commit() step from the AOT loads to speed them up, but I am not sure that is possible.
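A minimal sketch of the counter-replenish idea, assuming a simplified count-down invocation counter and a bounded queue; this is illustrative only, not the actual OpenJ9 change, and the names and limits mirror the experiment described next.

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: models "postpone instead of queue". When the queue is full,
// the method's invocation counter is pushed back up so it only becomes eligible
// for compilation again after additional invocations.
final class BoundedCompileQueue {
    private static final int QUEUE_LIMIT = 10_000; // one of the limits tried (8000/10000)
    private static final int RETRY_DELAY = 3_000;  // extra invocations before the method retries

    private final ArrayDeque<Object> queue = new ArrayDeque<>();

    /** Returns true if the method was queued, false if its counter was replenished instead. */
    synchronized boolean requestCompilation(Object method, AtomicInteger invocationCount) {
        if (queue.size() >= QUEUE_LIMIT) {
            invocationCount.addAndGet(RETRY_DELAY); // postpone the compilation request
            return false;
        }
        queue.add(method);
        return true;
    }
}
```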
I implemented the change that limits the compilation queue size to a certain value by postponing any compilation requests once the limit is reached. Postponing is done by adding 3000 to the invocation count of the methods (so methods become eligible for compilation again after 3000 more invocations). This change, however, did not improve start-up time any further. I can see that the compilation queue size stays within the chosen limit (I tried 8000 and 10000), but the total number of compilations is still very high, ~140K. In this configuration the hottest locks are:
%MISS GETS NONREC SLOW REC TIER2 TIER3 %UTIL AVER_HTM MON-NAME
63 245829 245829 153991 0 8464089 257416 40 1515242 [00007FE49409E438] BCVD verifier
13 370836 370724 46569 112 34700447 1062765 10 247274 [00007FE49409F408] JIT-CompilationQueueMonitor
3 17918798 16680340 475833 1238458 550920284 16697009 29 16528 [00007FE494007E88] VM class table
0 5788727 5788727 16313 0 134010832 3959981 17 26987 [00007FE49409F1F8] JIT-AssumptionTableMutex
I implemented another change where AOT loads don't need to acquire the CHTable mutex when they finish, but there is no improvement from it. However, if I compile AOT bodies with `-Xaot:disableCHOpts`, I see a small improvement, with performance more or less on par with the -Xnoaot case. I guess during AOT relocations we need to acquire the CHTable mutex briefly, but when `disableCHOpts` is used, that is no longer the case, putting less strain on this highly contended monitor.
It should be noted that I still see some AOT loads taking hundreds of ms, so there is another lock that delays their processing.
I achieve the best start-up time results with `-Xaot:disableCHOpts -Xjit:disableCHOpts`, which minimizes the pressure on the 'VM class table' lock from the JIT.
(6) -Xaot:disableCHOpts -Xjit:disableCHOpts ==> 103 sec
Lock statistics become:
53 245850 245850 129398 0 9703915 295209 42 951746 [00007F4DA009EFE8] BCVD verifier
9 397976 397809 37278 167 37361231 1146932 17 239792 [00007F4DA009FFB8] JIT-CompilationQueueMonitor
2 17173679 16409013 295367 764666 278602468 8360356 32 10881 [00007F4DA0007E88] VM class table
0 3328880 3328880 2727 0 27458762 809515 7 11753 [00007F4DA009FDA8] JIT-AssumptionTableMutex
Note that these options may affect throughput, so I wanted to point out that this tradeoff exists.
With the changes from PRs #14050, #14052 and #14056, the configuration with AOT has almost the same start-up as the -Xnoaot configuration, but the CPU consumed by the compilation threads is much better:
|  | -Xnoaot | With AOT (500M) |
| --- | --- | --- |
| Java11 | 107 | 140 |
| Java11+fixes | 107 | 110 |

|  | -Xnoaot | With AOT (500M) |
| --- | --- | --- |
| Java11 | 161 | 164 |
| Java11+fixes | 164 | 108 |
All these experiments included -Xscmx500m -XX:+ClassRelationshipVerifier -Xcodecachetotal512m and used the Liberty fix for the URLs. 100 apps were loaded.
@mpirvu are these figures for z/OS?
No. Intel Skylake (4 cores with HT on).
The changes for the file URLs have now been delivered in Liberty 21.0.0.12 and have shown the expected improvement in startup time benchmarks with the SCC active. Is there any further code to deliver, or can we now close this, @mpirvu @gjdeval?
There is nothing else to deliver on my side. The three OpenJ9 PRs that I mentioned in https://github.com/eclipse-openj9/openj9/issues/13807#issuecomment-985017951 are already merged.
Java -version output
java version "1.8.0_281" Java(TM) SE Runtime Environment (build 8.0.6.26 - pmz6480sr6fp26-20210226_01(SR6 FP26)) IBM J9 VM (build 2.9, JRE 1.8.0 z/OS s390x-64-Bit Compressed References 20210216_465732 (JIT enabled, AOT enabled) OpenJ9 - e5f4f96 OMR - 999051a IBM - 358762e) JCL - 20210108_01 based on Oracle jdk8u281-b09 z/OS V2.4
Summary of problem
Using the Java SCC does not scale in startup scenarios with high numbers of apps loading classes. In a scenario where multiple Spring Boot apps are loaded at startup by a Liberty server, the SCC only improves performance when ~40 or fewer apps are deployed.
Here are the results for mean startup time from 50 iterations, measuring time to start the last app (using timestamps from Liberty msg CWWKZ0001I):
1 app: 2.2s with SCC, 7.1s without
40 apps: 65.9s with SCC, 67.8s without
80 apps: 321.4s with SCC, 167.6s without
Diagnostic files
Diagnostics logs are in this folder https://ibm.box.com/s/nd57d52bt6x866npxd0tsif3f3a2v8ca