PhilWakelin opened this issue 2 years ago
@PhilWakelin In the box folder above, the verbose log from jitlog6 is a copy of the one from jitlog5 (the cold run)
@mpirvu jitlog6 now updated with correct files.
I've rerun some startup tests for the SCC with the additional JVM options -Xtune:virtualized
and -Xaot:forceaot.
Charts attached, but in summary both options increased startup time with the SCC, although the increase with the -Xaot option was not statistically significant.
However, both options did significantly decrease CPU usage.
@PhilWakelin Could you please try a build after 2021/09/11? There is a change in there https://github.com/eclipse-openj9/openj9/pull/13489 that may help.
Also, in the vlogs I see many AOT load failures due to compilationAotClassReloFailure. These may subside with additional runs, so we need to try 2-3 warm runs in succession.
Finally, z/OS uses AOT compression, which adds a bit of overhead. Since the warm run has a very large number of AOT loads, maybe we should disable AOT compression with -Xaot:forceaot,disableAOTBytesCompression
Thanks @mpirvu, I will try out a later build to see how this impacts startup and will also try -Xaot:forceaot,disableAOTBytesCompression. All the mean startup figures are averages from 50 startup iterations, with a clean Java SCC and Liberty cache. The first 2 runs were discarded from the figures, so I don't think adding more runs will help the figures. However, the data in the JIT logs I provided was from the first 2 (normally discarded) runs. I can gather JIT logs from a later iteration if needed; let me know if you want this.
Yes please. I expect to see a reduction in the number of AOT load failures.
Another interesting experiment would be to run with SCC but without AOT (-Xnoaot). This will allow sharing of the ROMClasses and, in theory, should be faster than no SCC. If that's not the case, then we should start looking at lock contention for the SCC.
I've run some measurements in a distributed (LinTel) environment to see if Phil's zOS findings are reproducible on other platforms, and see something similar ... as the number of apps in the configuration grows, along with the number of classes loaded at startup, the benefit of using the SCC drops, and SCC use becomes a disadvantage when the class count gets very large.
My focus was on reducing the time/CPU spent jitting, but I started to think that the bulk of the overhead is elsewhere: the no-SCC case finishes in 167 sec and uses 95 sec of CPU for compilation. When using the SCC, the elapsed time doubles (300-320 sec), but the CPU spent on compilation is less than half (44 sec). A profile would be helpful in determining what to study next. If there is a lot of CPU spent in jitted code, then maybe code quality matters a lot. If there is a lot of CPU burnt in synchronization (spinning), then we should look into ways of reducing that. If there is a lot of time spent in interpretation, then accelerating the transition from interpreted to jitted may still make sense.
Based on Marius' suggestion above, I'm currently running the 100-app startup with -Xnoaot and watching it in perf top. There is heavy locking at the top of the list.
Samples: 7M of event 'cycles:ppp', 4000 Hz, Event count (approx.): 220255438418 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
18.65% libj9thr29.so [.] omrthread_spinlock_acquire
17.42% libjclse29.so [.] copyString
7.42% libj9jit29.so [.] TR_IProfiler::findOrCreateMethodEntry
2.65% [kernel] [k] sysret_check
2.15% libj9vm29.so [.] VM_BytecodeInterpreterCompressed::run
1.92% libj9jit29.so [.] jitMethodTranslated
1.75% libj9jit29.so [.] TR_IPMethodHashTableEntry::add
1.63% libj9gc29.so [.] MM_ConcurrentGC::localMark
1.31% libj9shr29.so [.] SH_ROMClassManagerImpl::locateROMClass
1.27% libj9vm29.so [.] allocateRAMClassFragmentFromFreeList
1.01% libj9shr29.so [.] ClasspathItem::compare
0.90% libj9gc29.so [.] MM_MarkingScheme::markObject
0.90% [kernel] [k] system_call_after_swapgs
0.75% libj9shr29.so [.] ClasspathItem::itemAt
@gjdeval Could you please generate a JLM report to identify the lock that is contended?
@gjdeval collected a JLM from a run with SCC enabled but AOT disabled. Here's an excerpt for the system monitors:
%MISS GETS NONREC SLOW REC TIER2 TIER3 %UTIL AVER_HTM MON-NAME
91 23 23 21 0 30627 936 0 49665 [00007FB5F80DF4B8] MM_Scavenger::freeCacheMonitor
68 1561372 1561372 1056357 0 92909924 2829445 15 246487 [00007FB5F8181708] &(tempConfig->jclCacheMutex)
15 314693 314693 47888 0 1512543 45911 2 175013 [00007FB5F80DF828] BCVD verifier
8 1288 1288 103 0 175755 5153 0 4984 [00007FB5F80DDD58] MM_GCExtensions::gcStats
4 106176 106061 4553 115 6810485 207901 3 607385 [00007FB5F80E07F8] JIT-CompilationQueueMonitor
3 238662 238662 6129 0 9361417 284958 0 0 [00007FB5F8121130] MM_MemoryPoolAddressOrderedList:_heapLock
1 10172 10172 148 0 93181 2595 0 2660 [00007FB5F8005118] MM_ParallelDispatcher::synchronize
1 8183 8183 114 0 203179 6209 0 12171 [00007FB5F80DE228] VM exclusive access
1 9424296 8399840 86596 1024456 557909404 16884653 8 23090 [00007FB5F80DEB18] VM class table
1 705 705 6 0 10711 300 0 3785 [00007FB5F80051C8] MM_WorkPackets::inputList
1 24883 24883 142 0 470369 13730 0 5678 [00007FB5F8004AE8] MM_SublistPool
1 55164 55159 304 5 876542 26609 0 151670 [00007FB5F80DE178] VM thread list
0 312239 312239 1491 0 2395582 71664 0 0 [00007FB5F811B640] MM_MemoryPoolAddressOrderedList:_heapLock
0 2856 2856 13 0 85929 2339 0 10264 [00007FB5F8004FB8] MM_ParallelDispatcher::workerThread
0 1636019 1580256 5304 55763 9006816 273924 2 30356 [00007FB5F80DF358] VM mem segment list
0 4020450 4020450 9775 0 18967315 561835 0 0 [00007FB5F80EF180] J9ModronFreeList:_lock
0 1886568 1886568 1844 0 3716645 108189 0 0 [00007FB5F80EF680] J9ModronFreeList:_lock
0 1889 1737 1 152 1473 45 0 2237 [00007FB4B803A7F8] Thread public flags mutex
0 2269 2114 1 155 1473 45 0 1093 [00007FB4B8038648] Thread public flags mutex
0 1414526 1414526 646 0 1010486 30732 0 4803 [00007FB5F8181E98] &_htMutex
0 54643 54643 17 0 47257 1424 2 726268 [00007FB5F8004148] &(PPG_mem_mem32_subAllocHeapMem32.monitor)
0 7689 3871 1 3818 1473 45 0 232493 [00007FB5F80DEA68] VM class loader blocks
0 86083 86083 16 0 31167 919 0 3253 [00007FB5F80DF1F8] VM monitor table
0 23077 23077 3 0 178636 4629 0 3605 [00007FB5F80DF408] MM_Scavenger::scanCacheMonitor
0 283913 283913 18 0 977863 11238 0 0 [00007FB5F811A9E8] MM_CopyScanCacheList:_sublists[]._cacheLock
0 638179 638179 39 0 6164954 161318 0 13125 [00007FB56C0011C8] JVM_RawMonitor
0 259087 259087 8 0 583278 5004 0 0 [00007FB5F811A738] MM_CopyScanCacheList:_sublists[]._cacheLock
0 557716 557716 14 0 21555 622 0 1011 [00007FB5F80DEC78] VM JNI frame
0 248385 248385 6 0 946128 13474 0 0 [00007FB5F811AB40] MM_CopyScanCacheList:_sublists[]._cacheLock
0 286171 286171 5 0 470113 2459 0 0 [00007FB5F811A890] MM_CopyScanCacheList:_sublists[]._cacheLock
0 760636 760636 12 0 82423 2294 0 5201 [00007FB5F8181D38] &_htMutex
0 2969028 2969028 41 0 307371 8686 0 1737 [00007FB5F81817B8] &_refreshMutex
0 2737948 2737948 17 0 100317 3048 0 117 [00007FB5F80DF2A8] VM mem segment list
0 248636 248636 1 0 112750 690 0 0 [00007FB5F811B1B0] MM_CopyScanCacheList:_sublists[]._cacheLock
0 2567726 2567726 8 0 37793 1072 0 240 [00007FB5F80DE018] JIT-PersistentAllocatorSegmentMonitor
0 1976423 1976423 5 0 129621 3578 0 2022 [00007FB5F8181FF8] &_htMutex
0 1903889 1903889 2 0 263470 7382 0 4281 [00007FB5F80E05E8] JIT-AssumptionTableMutex
0 7936782 7936782 0 0 0 0 0 494 [00007FB5F8004098] Thread global
0 6405561 6405442 0 119 0 0 0 71 [00007FB5F81805D8] Thread public flags mutex
0 5552670 5552670 0 0 116127 3139 0 618 [00007FB5F8182788] JIT-iprofilerMonitor
0 4931684 4931684 0 0 25252 670 0 34 [00007FB5F80E03D8] JIT-IProfilerPersistenceMonitor
0 4337964 4337841 0 123 0 0 0 74 [00007FB5F81803C8] Thread public flags mutex
0 2795901 2795901 0 0 6042 37 0 0 [00007FB5F807CC88] MM_PacketList:_sublists[]._lock
0 2719855 2719645 0 210 0 0 0 97 [00007FB56C0008D8] Thread public flags mutex
0 2582459 2582459 0 0 4504 126 0 59 [00007FB5F80DDEB8] JIT-PersistentAllocatorSmallBlockMonitor
0 1763987 1763987 0 0 25039 246 0 0 [00007FB5F807BFD8] MM_PacketList:_sublists[]._lock
The hottest monitors are `&(tempConfig->jclCacheMutex)` and `VM class table`. Searching the code base, `jclCacheMutex` appears in the `JavaVM` struct, but also in `J9SharedClassConfig`. In this case it's probably the latter, so we are hitting some SCC locking and class-related locking. The `BCVD verifier` lock also hits the slow path quite often. I am not sure whether `-XX:+ClassRelationshipVerifier` can improve the situation.
@hangshao0 Any comment on the `&(tempConfig->jclCacheMutex)` monitor shown in the JLM above?
I believe most of the SCC calls that enter the `jclCacheMutex` are on the finder path for classes. I've seen cases where classes are stored using the actual URL of each class file (`URL:path/packageName/ClassName.class`). If that is the case here, each class file will have a unique URL, and the benefit of caching the URL is lost: every time we will enter the `jclCacheMutex` and create/cache a URL that will never be reused. Classes should be stored using `URL:path` only (not including the package and class name).
Using `java -Xshareclasses:cacheDir=<directory>,name=<CacheName>,printStats=url` should show all the URLs.
Judging based on the number of URLs in the javacore, I think that we are hitting the case described by Hang above:
2SCLTEXTNRC Number ROMClasses = 280707
2SCLTEXTNAM Number AOT Methods = 9250
2SCLTEXTNAD Number AOT Data Entries = 1
2SCLTEXTNAH Number AOT Class Hierarchy = 4715
2SCLTEXTNAT Number AOT Thunks = 2103
2SCLTEXTNJH Number JIT Hints = 4815
2SCLTEXTNJP Number JIT Profiles = 4794
2SCLTEXTNCP Number Classpaths = 3
2SCLTEXTNUR Number URLs = 263203
@mpirvu here is a log of printstats URLs from one of the performance runs with 80 apps https://ibm.box.com/s/po2o2sjxio02l015x3q5y4ik0wh8ryn1
The logs show the problem reported by Hang above. I understand that the Liberty team is looking into a solution.
Jared has provided me with a quick-and-dirty patch of the Liberty AppClassloader which trims the URLs stored in the SCC as per Hang's guidance. The patch is not complete; it was provided just to see how much benefit we could get from improving handling of the URLs.
The patch reduces the number of URLs stored for my largest config (100 apps) by about two orders of magnitude.
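For illustration only, here is a minimal sketch of the kind of URL trimming described above. This is not Jared's actual patch, and the helper name is hypothetical; the idea is simply to store the container URL (`URL:path`) in the SCC rather than the per-class URL (`URL:path/packageName/ClassName.class`), so that all classes from the same container share one cached URL entry.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, for illustration only (not the actual Liberty patch).
final class SharedCacheUrlHelper {
    /**
     * Given the URL of an individual .class resource and the class name,
     * return the URL of the containing root (directory or jar root),
     * so that all classes from one container share a single SCC URL entry.
     */
    static URL containerUrl(URL classFileUrl, String className) throws MalformedURLException {
        String suffix = className.replace('.', '/') + ".class";
        String external = classFileUrl.toExternalForm();
        if (external.endsWith(suffix)) {
            return new URL(external.substring(0, external.length() - suffix.length()));
        }
        return classFileUrl; // unexpected shape: fall back to the original URL
    }
}
```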
before patch:
OL-210011-GA-cics-spring-multi-100-j9-828-sr6-fp36-javacores-cpus-28-thr-unl-runs-50-try-1-logs/run-20/javacore.20211031.153323.36106.0001.txt
2SCLTEXTNRC Number ROMClasses = 394351
2SCLTEXTNUR Number URLs = 332221
after patch:
usr/servers/cics-spring-multi-100/javacore.20211102.095643.869.0001.txt
2SCLTEXTNRC Number ROMClasses = 387611
2SCLTEXTNUR Number URLs = 3701
The classloader SCC patch appears to shift the main lock contention point to the AOT.
using AOT:
Startup: Avg: 185595 ms, Min: 125248 ms, Max: 339294 ms, StdDev: 83970 ms, SDev/Avg: 45.2%
Footprint: Avg: 4106 MB, Min: 4014 MB, Max: 4183 MB, StdDev: 51 MB, SDev/Avg: 1.2%
CPU usage: Avg: 1710.41 secs, Min: 802.30 secs, Max: 3814.94 secs, StdDev: 1273.91 secs, SDev/Avg: 74.48%
SU: 185595 FR: n/a FP: 4106 CPU: 1710.41 app: cics-spring-multi-100
without AOT:
Startup: Avg: 114807 ms, Min: 112549 ms, Max: 117885 ms, StdDev: 1461 ms, SDev/Avg: 1.3%
Footprint: Avg: 3840 MB, Min: 3792 MB, Max: 3873 MB, StdDev: 22 MB, SDev/Avg: 0.6%
CPU usage: Avg: 927.98 secs, Min: 918.05 secs, Max: 949.37 secs, StdDev: 7.16 secs, SDev/Avg: 0.77%
SU: 114807 FR: n/a FP: 3840 CPU: 927.98 app: cics-spring-multi-100
without AOT plus ClassRelationshipVerifier:
Startup: Avg: 102886 ms, Min: 100834 ms, Max: 104778 ms, StdDev: 1274 ms, SDev/Avg: 1.2%
Footprint: Avg: 3831 MB, Min: 3782 MB, Max: 3866 MB, StdDev: 23 MB, SDev/Avg: 0.6%
CPU usage: Avg: 875.75 secs, Min: 862.07 secs, Max: 886.58 secs, StdDev: 6.52 secs, SDev/Avg: 0.74%
SU: 102886 FR: n/a FP: 3831 CPU: 875.75 app: cics-spring-multi-100
I've provided Marius with a configuration matching the one which produced the above results (including Jared's test patch) so that he can investigate the AOT directly, without waiting for me to run tests at his direction.
A summary of experiments I performed using Jared's patch and `-XX:+ClassRelationshipVerifier` (tests were done on a Skylake machine with 8 hardware threads, loading 100 apps in Liberty):
(1) No SCC ==> 128 sec
(2) -Xnoaot ==> cold=148 sec, warm=110 sec
(3) default ==> cold=236 sec, warm=114 sec
Since perf profiles showed a relatively large number of ticks in `requestExistsInCompilationQueue()`, I investigated why we take what is supposed to be a very rare path. It turned out that we run out of code cache (256 MB) and then reject all compilation requests.
I increased the size of the code cache to 512 MB, which allowed more compilations to happen, but this increased start-up time:
(4) -Xcodecachetotal512m with AOT ==> cold=245 sec, warm=180 sec
Perf profiles showed a large amount of time in:
9.69% 1286399 [.] TR::CompilationInfo::addMethodToBeCompiled
7.96% 1056792 [.] TR::CompilationInfo::queueEntry
which happened because the compilation queue size reaches huge values: 40K-46K entries, and the JIT does a linear scan to look for duplicates and to find the appropriate insertion point based on priority (a sketch of this pattern is shown after result (5) below). The large compilation queue size (Q_SZ) is due to the large number of AOT loads (~135 K) and the amount of time it takes to perform such an AOT load. Typically an AOT load should take 40-80 usec, but in this case I see the average AOT load time around 2 ms, occasionally going over 300 ms. I suspect this happens because AOT loads need to acquire the VM-class-table lock at the end, and this lock is very contended due to the class loading activity. All in all, compilation activity cannot keep up with the rate at which application threads generate compilation requests; this translates into large queuing times, which in turn make the compilation process even slower. I have modified the code to address the two hot methods above and the new start-up times become:
(5) -Xcodecachetotal512m with AOT +hacks ==> cold=185 sec, warm=119 sec
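As a rough illustration of why the two methods above become so expensive at Q_SZ ~ 40K, here is a minimal sketch of a linked-list compilation queue that does a linear duplicate check plus a linear priority-ordered insert on every enqueue. This is illustrative Java, not OpenJ9's actual C++ implementation; the class and field names are hypothetical.

```java
import java.util.LinkedList;
import java.util.ListIterator;

// Illustrative only: not TR::CompilationInfo. Each add() walks the queue twice,
// so cost grows linearly with the number of pending entries.
final class NaiveCompileQueue {
    static final class Entry {
        final Object method;
        final int priority;
        Entry(Object method, int priority) { this.method = method; this.priority = priority; }
    }

    private final LinkedList<Entry> queue = new LinkedList<>();

    synchronized void add(Object method, int priority) {
        // O(n) duplicate scan, analogous in spirit to requestExistsInCompilationQueue()
        for (Entry e : queue) {
            if (e.method == method) {
                return; // already queued
            }
        }
        // O(n) walk to keep the queue ordered by priority (higher priority first)
        ListIterator<Entry> it = queue.listIterator();
        while (it.hasNext()) {
            if (it.next().priority < priority) {
                it.previous();
                break;
            }
        }
        it.add(new Entry(method, priority));
    }
}
```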
This is an improvement over (4), but still not as good as the case when we failed compilations due to insufficient code cache. At this point, the most expensive methods are:
24.88% 3066018 [.] omrthread_spinlock_acquire
4.79% 589870 [.] TR_RuntimeAssumptionTable::getBucketPtr
3.72% 458125 [.] jitMethodTranslated
3.67% 451755 [.] VM_BytecodeInterpreterCompressed::run
3.05% 376128 [.] TR::CompilationInfo::queueEntry
2.05% 252881 [k] do_syscall_64
1.87% 230901 [.] TR_IProfiler::findOrCreateMethodEntry
1.71% 210224 [.] MM_Scavenger::incrementalScanCacheBySlot
1.67% 205414 [.] SH_ROMClassManagerImpl::locateROMClass
1.57% 193825 [.] TR_PersistentCHTable::methodGotOverridden
0.96% 118205 [.] allocateRAMClassFragmentFromFreeList
0.86% 106273 [.] ClasspathItem::compare
0.85% 105219 [.] J9::MonitorTable::removeAndDestroy
One idea that I wanted to try is: when the compilation queue size grows too much, instead of adding another entry to the queue, replenish the invocation counter of the j9method with some value to delay its compilation. Ideally we would eliminate the chtable.commit() step from the AOT loads to speed them up, but I am not sure that is possible.
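A minimal sketch of the counter-replenish idea, assuming a simplified count-down invocation counter and a bounded queue; this is illustrative only, not the actual OpenJ9 change, and the names and limits mirror the experiment described next.

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: models "postpone instead of queue". When the queue is full,
// the method's invocation counter is pushed back up so it only becomes eligible
// for compilation again after additional invocations.
final class BoundedCompileQueue {
    private static final int QUEUE_LIMIT = 10_000; // one of the limits tried (8000/10000)
    private static final int RETRY_DELAY = 3_000;  // extra invocations before the method retries

    private final ArrayDeque<Object> queue = new ArrayDeque<>();

    /** Returns true if the method was queued, false if its counter was replenished instead. */
    synchronized boolean requestCompilation(Object method, AtomicInteger invocationCount) {
        if (queue.size() >= QUEUE_LIMIT) {
            invocationCount.addAndGet(RETRY_DELAY); // postpone the compilation request
            return false;
        }
        queue.add(method);
        return true;
    }
}
```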
I implemented the change that limits the compilation queue size to a certain value by postponing any compilation requests once the limit is reached. Postponing is done by adding 3000 to the invocation count of the methods (so methods become eligible for compilation again after 3000 more invocations). This change, however, did not improve start-up time any further. I can see that the compilation queue size stays within the chosen limit (I tried 8000 and 10000), but the total number of compilations is still very high, ~140K. In this configuration the hottest locks are:
%MISS GETS NONREC SLOW REC TIER2 TIER3 %UTIL AVER_HTM MON-NAME
63 245829 245829 153991 0 8464089 257416 40 1515242 [00007FE49409E438] BCVD verifier
13 370836 370724 46569 112 34700447 1062765 10 247274 [00007FE49409F408] JIT-CompilationQueueMonitor
3 17918798 16680340 475833 1238458 550920284 16697009 29 16528 [00007FE494007E88] VM class table
0 5788727 5788727 16313 0 134010832 3959981 17 26987 [00007FE49409F1F8] JIT-AssumptionTableMutex
I implemented another change where AOT loads don't need to acquire the CHTable mutex when they finish, but there is no improvement from it. However, if I compile AOT bodies with `-Xaot:disableCHOpts`, I see a small improvement, with performance more or less on par with the -Xnoaot case. I guess during AOT relocations we need to acquire the CHTable mutex briefly, but when `disableCHOpts` is used, that is no longer the case, putting less strain on this highly contended monitor.
It should be noted that I still see some AOT loads taking hundreds of ms, so there is another lock that delays their processing.
I achieve the best start-up time results with `-Xaot:disableCHOpts -Xjit:disableCHOpts`, which minimizes the pressure on the 'VM class table' lock from the JIT.
(6) -Xaot:disableCHOpts -Xjit:disableCHOpts ==> 103 sec
Lock statistics become:
53 245850 245850 129398 0 9703915 295209 42 951746 [00007F4DA009EFE8] BCVD verifier
9 397976 397809 37278 167 37361231 1146932 17 239792 [00007F4DA009FFB8] JIT-CompilationQueueMonitor
2 17173679 16409013 295367 764666 278602468 8360356 32 10881 [00007F4DA0007E88] VM class table
0 3328880 3328880 2727 0 27458762 809515 7 11753 [00007F4DA009FDA8] JIT-AssumptionTableMutex
Note that these options may affect throughput, so I wanted to point out that this tradeoff exists.
With the changes from PRs #14050, #14052 and #14056, the configuration with AOT has almost the same start-up as the -Xnoaot configuration, but the CPU consumed by the compilation threads is much better:
|  | -Xnoaot | With AOT (500M) |
| --- | --- | --- |
| Java11 | 107 | 140 |
| Java11+fixes | 107 | 110 |

|  | -Xnoaot | With AOT (500M) |
| --- | --- | --- |
| Java11 | 161 | 164 |
| Java11+fixes | 164 | 108 |
All these experiments included -Xscmx500m -XX:+ClassRelationshipVerifier -Xcodecachetotal512m and used the Liberty fix for the URLs. 100 apps were loaded.
@mpirvu are these figures for z/OS?
No. Intel Skylake (4 cores with HT on).
The changes for the file URLs have now been delivered in Liberty 21.0.0.12 and have shown the expected improvement in startup time benchmarks with the SCC active. Is there any further code to deliver, or can we now close this, @mpirvu @gjdeval?
There is nothing else to deliver on my side. The three OpenJ9 PRs that I mentioned in https://github.com/eclipse-openj9/openj9/issues/13807#issuecomment-985017951 are already merged.
Java -version output
java version "1.8.0_281" Java(TM) SE Runtime Environment (build 8.0.6.26 - pmz6480sr6fp26-20210226_01(SR6 FP26)) IBM J9 VM (build 2.9, JRE 1.8.0 z/OS s390x-64-Bit Compressed References 20210216_465732 (JIT enabled, AOT enabled) OpenJ9 - e5f4f96 OMR - 999051a IBM - 358762e) JCL - 20210108_01 based on Oracle jdk8u281-b09 z/OS V2.4
Summary of problem
Using the Java SCC does not scale in startup scenarios with high numbers of apps loading classes. In a scenario where multiple Spring Boot apps are loaded at startup by a Liberty server, the SCC only improves performance when ~40 or fewer apps are deployed.
Here are the results for mean startup time from 50 iterations, measuring time to start the last app (using timestamps from Liberty msg CWWKZ0001I):
1 app: 2.2s with SCC, 7.1s without
40 apps: 65.9s with SCC, 67.8s without
80 apps: 321.4s with SCC, 167.6s without
Diagnostic files
Diagnostics logs are in this folder https://ibm.box.com/s/nd57d52bt6x866npxd0tsif3f3a2v8ca