Open pshipton opened 2 months ago
@dmitripivkine pls take a look
Adding this item to GC queue. OOM is triggered by Excessive GC condition. However there are 673 free regions (of 2022 total), 1m each. This does not look right, we need to look closer.
There is the problem with Balanced GC - calculation of current Eden size converged to minimum value of two regions and got stuck there. This causes PGCs triggered by Allocation Failure almost back to back up to 95% time GC utilization and Excessive GC condition latch eventually. We are investigating.
Some initial observations.
We seem to be constantly doing PGCs, and that's because Eden is only 2MBs (2 regions of 1MB). We should be expanding the Eden since the heap is 2GB and fully expanded. But for some reason we don't expand. The core logic for Eden expansion (for the case of fully expanded heap) is in calculateRecommendedEdenChangeForExpandedHeap, which has some non-trivai calculatinos. Unfortuantelly a couple of trace point in that method did not show up in the dump (even though one of them is Level 1!?)
However, there are quite a few relevant parameters in SchedulingDelegate, and most of them look reasonable, except maybe _estimatedFreeTenure of about 8MB, which seems rather low (trace points report around 700MB)
> !mm_schedulingdelegate 0x00007FFF90069338 | grep _estimatedFreeTenure
0x100: U64 _estimatedFreeTenure = 0x000000000083F133 (8646963)
20:33:12.362818623 0x0 j9mm.346 Event PGC end: workstackoverflow=0 overflowcount=0 heapFree=715128832/2120220672
20:33:12.365062888 0x0 j9mm.345 Event PGC start: globalcount=4335 nextGMPIncrement=0
20:33:12.442877284 0x0 j9mm.346 Event PGC end: workstackoverflow=0 overflowcount=0 heapFree=714080256/2120220672
20:33:12.444926051 0x0 j9mm.345 Event PGC start: globalcount=4336 nextGMPIncrement=0
GC does observe high PGC over head (more like 84-89%, but still very high and we should expand Eden)
> !mm_schedulingdelegate 0x00007FFF90069338 | grep _historicalPartialGCTime
0x128: U64 _historicalPartialGCTime = 0x0000000000000043 (67)
> !mm_schedulingdelegate 0x00007FFF90069338 | grep _averagePgcInterval
0x180: U64 _averagePgcInterval = 0x00000000000129ED (76269)
(lldb) p 67 * 1000.0 / 76269
(double) 0.87846962724042532
> !mm_schedulingdelegate 0x00007FFF90069338 | grep _partialGcOverhead
0x120: double _partialGcOverhead = 0.8466813906537427
Evidence that Eden is stuck at 2 regions:
> !mm_schedulingdelegate 0x00007FFF90069338 | grep -i edenregion
0x30: U64 _idealEdenRegionCount = 0x0000000000000002 (2)
0x38: U64 _minimumEdenRegionCount = 0x0000000000000001 (1)
0x40: U64 _edenRegionCount = 0x0000000000000002 (2)
0x108: U64 _maxEdenRegionCount = 0x00000000000005EC (1516)
0x110: U64 _minEdenRegionCount = 0x0000000000000002 (2)
Evidence that heap is fully expanded:
> !MM_GCExtensions 0x00007FFF90059500 |grep memoryMax
0x5fd8: U64 memoryMax = 0x000000007E600000 (2120220672)
> !mm_schedulingdelegate 0x00007FFF90069338 | grep -i _numberOfHeapRegions
0x58: U64 _numberOfHeapRegions = 0x00000000000007E6 (2022)
(lldb) p 2022 * 1024 * 1024
(int) 2120220672
Moving to 0.49 due problem with reproducibility, so hard to make progress. Investigation is continuing.
https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk21_j9_special.system_ppc64le_linux_testList_0/188 DaaLoadTest_all_special_5m_17
-Xcompressedrefs -Xgcpolicy:balanced -Xquickstart -Xjit
3x x 3 grinder passed
https://na.artifactory.swg-devops.com/artifactory/sys-rt-generic-local/hyc-runtimes-jenkins.swg-devops.com/Test_openjdk21_j9_special.system_ppc64le_linux_testList_0/188/system_test_output.tar.gz