eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.27k stars 721 forks source link

DaaLoadTest BigInteger.toString OutOfMemoryError #19831

Open pshipton opened 2 months ago

pshipton commented 2 months ago

https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk21_j9_special.system_ppc64le_linux_testList_0/188 DaaLoadTest_all_special_5m_17 -Xcompressedrefs -Xgcpolicy:balanced -Xquickstart -Xjit

3x x 3 grinder passed

https://na.artifactory.swg-devops.com/artifactory/sys-rt-generic-local/hyc-runtimes-jenkins.swg-devops.com/Test_openjdk21_j9_special.system_ppc64le_linux_testList_0/188/system_test_output.tar.gz

16:33:07  DLT 16:33:06.663 - Completed 100.0%. Number of tests started=910 (+0)
16:33:13  DLT stderr JVMDUMP055I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError", exception "Java heap space" at 2024/07/08 16:33:12 - please wait.

16:33:24  DLT testFailure: testIntrinsicPerformance(net.openj9.test.arithmetics.TestPerformance): Java heap space
16:33:24  DLT java.lang.OutOfMemoryError: Java heap space
16:33:24  DLT   at java.base/java.lang.Long.toString(Long.java:494)
16:33:24  DLT   at java.base/java.lang.Long.toString(Long.java:145)
16:33:24  DLT   at java.base/java.math.BigInteger.smallToString(BigInteger.java:4173)
16:33:24  DLT   at java.base/java.math.BigInteger.toString(BigInteger.java:4217)
16:33:24  DLT   at java.base/java.math.BigInteger.toString(BigInteger.java:4236)
16:33:24  DLT   at java.base/java.math.BigInteger.toString(BigInteger.java:4112)
16:33:24  DLT   at java.base/java.math.BigInteger.toString(BigInteger.java:4286)
16:33:24  DLT   at net.openj9.test.Utils.toPrecision(Utils.java:489)
16:33:24  DLT   at net.openj9.test.arithmetics.TestPerformance.getComparisionReferenceValue(TestPerformance.java:375)
16:33:24  DLT   at net.openj9.test.arithmetics.TestPerformance.testGE(TestPerformance.java:315)
16:33:24  DLT   at net.openj9.test.arithmetics.TestPerformance.testIntrinsicPerformance(TestPerformance.java:85)
pshipton commented 2 months ago

@dmitripivkine pls take a look

dmitripivkine commented 2 months ago

Adding this item to GC queue. OOM is triggered by Excessive GC condition. However there are 673 free regions (of 2022 total), 1m each. This does not look right, we need to look closer.

dmitripivkine commented 2 months ago

There is the problem with Balanced GC - calculation of current Eden size converged to minimum value of two regions and got stuck there. This causes PGCs triggered by Allocation Failure almost back to back up to 95% time GC utilization and Excessive GC condition latch eventually. We are investigating.

amicic commented 2 months ago

Some initial observations.

We seem to be constantly doing PGCs, and that's because Eden is only 2MBs (2 regions of 1MB). We should be expanding the Eden since the heap is 2GB and fully expanded. But for some reason we don't expand. The core logic for Eden expansion (for the case of fully expanded heap) is in calculateRecommendedEdenChangeForExpandedHeap, which has some non-trivai calculatinos. Unfortuantelly a couple of trace point in that method did not show up in the dump (even though one of them is Level 1!?)

However, there are quite a few relevant parameters in SchedulingDelegate, and most of them look reasonable, except maybe _estimatedFreeTenure of about 8MB, which seems rather low (trace points report around 700MB)

> !mm_schedulingdelegate 0x00007FFF90069338 | grep _estimatedFreeTenure
    0x100: U64 _estimatedFreeTenure = 0x000000000083F133 (8646963)  

20:33:12.362818623  0x0 j9mm.346             Event       PGC end: workstackoverflow=0 overflowcount=0 heapFree=715128832/2120220672
20:33:12.365062888  0x0 j9mm.345             Event       PGC start: globalcount=4335 nextGMPIncrement=0
20:33:12.442877284  0x0 j9mm.346             Event       PGC end: workstackoverflow=0 overflowcount=0 heapFree=714080256/2120220672
20:33:12.444926051  0x0 j9mm.345             Event       PGC start: globalcount=4336 nextGMPIncrement=0

GC does observe high PGC over head (more like 84-89%, but still very high and we should expand Eden)

> !mm_schedulingdelegate 0x00007FFF90069338 | grep _historicalPartialGCTime
    0x128: U64 _historicalPartialGCTime = 0x0000000000000043 (67)
> !mm_schedulingdelegate 0x00007FFF90069338 | grep _averagePgcInterval
    0x180: U64 _averagePgcInterval = 0x00000000000129ED (76269)

(lldb) p 67 * 1000.0 / 76269
(double) 0.87846962724042532

> !mm_schedulingdelegate 0x00007FFF90069338 | grep _partialGcOverhead
    0x120: double _partialGcOverhead = 0.8466813906537427

Evidence that Eden is stuck at 2 regions:

> !mm_schedulingdelegate 0x00007FFF90069338 | grep -i edenregion
    0x30: U64 _idealEdenRegionCount = 0x0000000000000002 (2)
    0x38: U64 _minimumEdenRegionCount = 0x0000000000000001 (1)
    0x40: U64 _edenRegionCount = 0x0000000000000002 (2)
    0x108: U64 _maxEdenRegionCount = 0x00000000000005EC (1516)
    0x110: U64 _minEdenRegionCount = 0x0000000000000002 (2)

Evidence that heap is fully expanded:

> !MM_GCExtensions 0x00007FFF90059500 |grep memoryMax
    0x5fd8: U64 memoryMax = 0x000000007E600000 (2120220672)

> !mm_schedulingdelegate 0x00007FFF90069338 | grep -i _numberOfHeapRegions
    0x58: U64 _numberOfHeapRegions = 0x00000000000007E6 (2022)

(lldb) p 2022 * 1024 * 1024
(int) 2120220672
dmitripivkine commented 1 week ago

Moving to 0.49 due problem with reproducibility, so hard to make progress. Investigation is continuing.