pshipton opened 4 years ago
@gita-omr can someone take a look pls.
A quick note, this test hangs on Windows as well: https://github.com/eclipse/openj9/issues/8214.
@andrewcraik fyi
Not sure the Windows problem is the same; we'd have to wait for it to re-occur and look at any diagnostic files produced (the existing information says there are no diagnostic files produced). The Windows hang is a different mode and may be related to jdwp.
Ok, we’ll start on Power.
Ran grinder 30 times and it never failed.
An update: the job is unfortunately gone by now. I took a look at the artifacts; all javacore files are empty and the system core did not show anything. I guess let's wait until it fails again...
@gita-omr Not sure why you are seeing empty javacore files, I can see content in them.
Oh, maybe I ran out of disk space! Thanks.
Did it hang again on Windows? https://github.com/eclipse/openj9/issues/8214#issuecomment-700290562
Yes, although I suspect the Windows hang is a separate issue, related to the mode. It "regularly" hangs on Windows, but not that often.
Another one https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_ppc64_aix_Nightly_lambdaLoadTest/146 LambdaLoadTest_OpenJ9_NonLinux_special_27 variation: Mode612-OSRG JVM_OPTIONS: -Xcompressedrefs -Xgcpolicy:gencon -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation
Here is some investigation from running on a local Windows 10 laptop. Builds used were:
openj9 compressed refs:
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-b01)
Eclipse OpenJ9 VM (build openj9-0.23.0, JRE 1.8.0 Windows 10 amd64-64-Bit Compressed References 20201112_897 (JIT enabled, AOT enabled)
OpenJ9 - 0394ef754
OMR - 582366ae5
JCL - b52d2ff7ee based on jdk8u275-b01)
openj9 non-compressed refs:
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-b01)
Eclipse OpenJ9 VM (build openj9-0.23.0, JRE 1.8.0 Windows 10 amd64-64-Bit 20201113_491 (JIT enabled, AOT enabled)
OpenJ9 - 0394ef754
OMR - 582366ae5
JCL - b52d2ff7ee based on jdk8u275-b01)
hotspot:
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_275-b01)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.275-b01, mixed mode)
The LambdaLoadTest runs these JUnit tests on (#cpus - 2) threads (minimum 2):
<inventory>
<!-- Add the Lambda tests -->
<junit class="net.adoptopenjdk.test.lambda.ClassLibraryLambdaTests"/>
<junit class="net.adoptopenjdk.test.lambda.TestLambdaCapture"/>
<junit class="net.adoptopenjdk.test.lambda.TestLambdaContexts"/>
<junit class="net.adoptopenjdk.test.lambda.TestLambdaDefaultsAndStatics"/>
<junit class="net.adoptopenjdk.test.lambda.TestLambdaInferredTyping"/>
<junit class="net.adoptopenjdk.test.lambda.TestLambdaJavaInterfaces"/>
<junit class="net.adoptopenjdk.test.lambda.TestLambdaLexicalScoping"/>
<junit class="net.adoptopenjdk.test.lambda.TestLambdaMethodReferences"/>
<junit class="net.adoptopenjdk.test.lambda.TestLambdaMultithreaded"/>
<junit class="net.adoptopenjdk.test.lambda.TestLambdaNested"/>
<junit class="net.adoptopenjdk.test.lambda.TestLambdaRecursive"/>
<junit class="net.adoptopenjdk.test.lambda.TestLambdasUserDefinedInterface"/>
<!-- Add the Streams tests -->
<junit class="net.adoptopenjdk.test.streams.TestParallelStreamOperations"/>
<junit class="net.adoptopenjdk.test.streams.TestStreamOperations"/>
</inventory>
If only one of the tests is included in the inventory then the test passes with a 100MB heap, except for net.adoptopenjdk.test.streams.TestParallelStreamOperations. It is this test which causes the OOM.
The test uses parallel streams to count the occurrences of various strings and letters in a test dataset of 130000 lines of text with a total of 8746601 characters.
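For context, the counting the test does looks conceptually like the following (a simplified sketch only, not the actual test code; the class and method names here are assumptions):

import java.util.List;

// Simplified sketch of the kind of parallel-stream counting the test performs
// over its ~130000-line dataset; names are illustrative only.
public class StreamCountSketch {

    // Count lines containing a given word using a parallel stream.
    static long occurrencesOfWord(List<String> lines, String word) {
        return lines.parallelStream()
                    .filter(line -> line.contains(word))
                    .count();
    }

    // Count occurrences of a given character across all lines.
    static long occurrencesOfChar(List<String> lines, char c) {
        return lines.parallelStream()
                    .flatMapToInt(String::chars)
                    .filter(ch -> ch == c)
                    .count();
    }
}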
The tests include the constant
private int MAX_NUM_CONCURRENT_TESTS = 2;
which limits the number of instances of the test that can run concurrently (if two instances are already running, the test exits immediately). So the normal case is that two instances are running for the duration of the workload, since this test also takes much longer to complete than any of the other tests in the workload.
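As a rough illustration of how such a guard behaves (a minimal sketch; apart from MAX_NUM_CONCURRENT_TESTS, the names here are assumptions rather than the actual test source):

import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of a "max two concurrent instances" guard; only
// MAX_NUM_CONCURRENT_TESTS comes from the real test, the rest is illustrative.
public class ConcurrencyGuardSketch {
    private static final int MAX_NUM_CONCURRENT_TESTS = 2;
    private static final AtomicInteger running = new AtomicInteger();

    public void testParallelStreamOperations() {
        if (running.incrementAndGet() > MAX_NUM_CONCURRENT_TESTS) {
            running.decrementAndGet();
            return; // a third instance exits immediately
        }
        try {
            // ... run the parallel stream workload here ...
        } finally {
            running.decrementAndGet();
        }
    }
}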
Results of some test runs with different heap sizes show that there is an out of memory 'boundary' where the JVM may report out of memory or may just consume all available CPU for a long period. These tests were all run with no additional -X options:
openj9 compressed refs:
-Xmx490M - test completed successfully in about 30 secs
-Xmx480M - test fails OOM (out of memory - heap space) in about 3 minutes
-Xmx450M - test did not complete within the 60 minutes allowed. During this time the java.exe process is consuming 80% of the laptop CPU (which has 12 cores).
openj9 non-compressed refs:
-Xmx780M - test completes successfully in about 30 secs
-Xmx770M - test does not complete within the 60 minutes allowed. During this time the java.exe process is consuming 80% of the CPU.
-Xmx750M - test failed OOM (out of memory - heap space) in about 2 minutes
hotspot:
-Xmx410M - test completes successfully in about 30 secs
-Xmx400M - test failed OOM (out of memory - heap space) in about 6 minutes
The test currently has no -Xmx setting, so the amount of heap will vary according to the test machine selected at run time. The fact that the JVM appears to become extremely slow as the heap becomes exhausted looks like the most likely explanation for the tests appearing to hang or time out.
An obvious fix is to set a specific 1GB heap size for these tests - I'll create a PR for that and do some extensive testing.
That will 'fix' the test case failures - but does the extreme slowing of the JVM warrant further investigation?
BTW, the LambdaLoadTests are not currently running as they should due to https://github.com/AdoptOpenJDK/openjdk-systemtest/issues/379 - the test run time is deemed to have expired as soon as the workload starts. But since this issue shows up with just two instances of the offending test case running it is still seen anyway (but there may be some test runs which 'complete' before two instances of that test have been started).
-Xmx450M - test did not complete within the 60 minutes allowed. During this time the java.exe process is consuming 80% of the laptop CPU (which has 12 cores).
Is it possible to get a verbose GC log for this case please? I think it might be an inefficient Excessive GC detection situation.
It turned out to be harder than I thought to catch a run which hangs, rather than completing successfully or going out of memory. The verbose GC log attached here was from a run with the non-compressed refs build with -Xmx790M. The test was set to time out after 10 mins, at which point it is killed. verbosegc.20201201.173427.29444.txt.zip
There is obviously a case where the Excessive GC condition is missed. We are going to work on fixing it.
As suspected by @dmitripivkine, this is indeed a case of excessive GC, but for some reason it does not lead to the appropriate OOM exception.
We have two criteria that have to be met to raise OOM: besides spending excessive time in GC, there should be no sign of progress (reclaimed memory). A theory is that the latter is not properly measured in this particular scenario.
The scenario is that a Scavenge aborts and is followed by a percolate Global GC that satisfies the allocation, but from Tenure (not from the Allocate space, as it normally should be). Since the Allocate space is empty, subsequent allocations trigger an allocation failure right away and the process repeats. Along the way, the Global GC, where the excessive GC logic is mostly maintained, appears to reclaim lots of memory, but this is bogus: the free memory before the Global GC was reported as 0%, which is just a side effect of the preceding aborted Scavenge. The free memory was actually high at the moment the AF was raised, and the Global GC in fact recovered no memory.
Example:
<exclusive-start id="23503" timestamp="2020-12-01T17:39:55.929" intervalms="596.352">
<response-info timems="0.668" idlems="0.339" threads="1" lastid="0000000003A02A00" lastname="JIT Compilation Thread-005" />
</exclusive-start>
<af-start id="23504" threadId="00000000037C8AF8" totalBytesRequested="104" timestamp="2020-12-01T17:39:55.929" intervalms="596.370" type="nursery" />
<cycle-start id="23505" type="scavenge" contextid="0" timestamp="2020-12-01T17:39:55.929" intervalms="596.380" />
<gc-start id="23506" type="scavenge" contextid="23505" timestamp="2020-12-01T17:39:55.929">
<mem-info id="23507" free="62931864" total="828375040" percent="7">
<mem type="nursery" free="0" total="207093760" percent="0">
<mem type="allocate" free="0" total="137691136" percent="0" />
<mem type="survivor" free="0" total="69402624" percent="0" />
</mem>
<mem type="tenure" free="62931864" total="621281280" percent="10">
<mem type="soa" free="56718232" total="615067648" percent="9" /> <<<< this is the right pre-GC point with lots of free memory
<mem type="loa" free="6213632" total="6213632" percent="100" />
</mem>
<remembered-set count="139186" />
</mem-info>
</gc-start>
........
aborted scavenge
.......
<cycle-end id="23511" type="scavenge" contextid="23505" timestamp="2020-12-01T17:39:56.180" /><percolate-collect id="23512" from="nursery" to="global" reason="failed tenure threshold reached" timestamp="2020-12-01T17:39:56.180"/>
<percolate-collect id="23512" from="nursery" to="global" reason="failed tenure threshold reached" timestamp="2020-12-01T17:39:56.180"/>
<cycle-start id="23513" type="global" contextid="0" timestamp="2020-12-01T17:39:56.180" intervalms="615.573" />
<gc-start id="23514" type="global" contextid="23513" timestamp="2020-12-01T17:39:56.180">
<mem type="tenure" free="6213632" total="621281280" percent="1">
<mem type="soa" free="0" total="615067648" percent="0" /> <<<<<<<<<<<<<<<<< this is a bogus pre GC-point with low free memory
<mem type="loa" free="6213632" total="6213632" percent="100" />
</mem>
<remembered-set count="139186" />
</mem-info>
</gc-start>
.......
.......
<gc-end id="23519" type="global" contextid="23513" durationms="174.455" usertimems="1937.500" systemtimems="0.000" timestamp="2020-12-01T17:39:56.354" activeThreads="12">
<mem-info id="23520" free="62931864" total="828375040" percent="7">
<mem type="nursery" free="0" total="207093760" percent="0">
<mem type="allocate" free="0" total="137691136" percent="0" />
<mem type="survivor" free="0" total="69402624" percent="0" />
</mem>
<mem type="tenure" free="62931864" total="621281280" percent="10" micro-fragmented="644464" macro-fragmented="0">
<mem type="soa" free="56718232" total="615067648" percent="9" /> <<<<<<<<<<< recovered memory high (relative to bogus pre-GC start), but really no recovered memory relative to real pre-GC start
<mem type="loa" free="6213632" total="6213632" percent="100" />
</mem>
<remembered-set count="139186" />
</mem-info>
</gc-end>
<cycle-end id="23521" type="global" contextid="23513" timestamp="2020-12-01T17:39:56.354" />
<allocation-satisfied id="23522" threadId="00000000037C8100" bytesRequested="104" />
<af-end id="23523" timestamp="2020-12-01T17:39:56.354" threadId="00000000037C8AF8" success="true" from="tenure-soa"/>
<exclusive-end id="23524" timestamp="2020-12-01T17:39:56.354" durationms="425.892" />
@jonoommen please try to reproduce this, and if the theory seems plausible, we'll work on a fix, where reclaimed memory would be measured from AF start to end, rather than Global GC start to end.
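To make the proposed change concrete, here is an illustrative sketch (in Java, not the actual OMR GC code; the threshold values are invented for the example) of the two-criteria check, with reclaimed memory measured across the whole allocation-failure window rather than just the Global GC:

// Illustrative only: not the OMR implementation, and the 0.95 / 0.03 thresholds
// are made up. The point is that "reclaimed" is computed from AF start to AF end,
// so the bogus low free-memory reading left behind by an aborted Scavenge does
// not make the Global GC look productive.
class ExcessiveGCCheckSketch {

    static boolean shouldRaiseOOM(double gcTimeFraction, // fraction of recent time spent in GC
                                  long freeAtAFStart,    // free heap when the allocation failure was raised
                                  long freeAtAFEnd,      // free heap when the allocation was satisfied
                                  long totalHeap) {
        boolean excessiveGCTime = gcTimeFraction > 0.95;            // criterion 1: excessive time in GC
        long reclaimed = Math.max(0, freeAtAFEnd - freeAtAFStart);  // progress over the full AF window
        boolean noProgress = reclaimed < totalHeap * 0.03;          // criterion 2: (almost) nothing reclaimed
        return excessiveGCTime && noProgress;
    }
}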
-Xtgc:excessivegc logs might be useful
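For anyone reproducing locally, that tracing could presumably be wired in the same way as the other JVM options in this thread, i.e. one more line added to the LoadTestProcessDefinition chain shown in the next comment (placement within the chain is a guess):

.addJvmOption("-Xtgc:excessivegc")   // excessive GC tracing suggested above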
FWIW, this is how I ran locally and added the various JVM command line options:
For Windows:
set JAVA_HOME=java_to_test
set PATH=java_to_test\bin;%PATH%
set SYSTEMTEST_HOME=git_checkout_location - e.g. c:\Users\user\git
set RESULTS_HOME=somewhere_to_put_results - e.g. c:\Users\user\stf_results
perl %SYSTEMTEST_HOME%/stf/stf.core/scripts/stf.pl -test-root="%SYSTEMTEST_HOME%/stf;%SYSTEMTEST_HOME%/openjdk-systemtest" -systemtest-prereqs="%SYSTEMTEST_HOME%/../systemtest_prereqs" -results-root="%RESULTS_HOME%/LambdaLoadTest_0" -test=LambdaLoadTest
For Unix:
export JAVA_HOME=java_to_test
export PATH=java_to_test/bin:$PATH
export SYSTEMTEST_HOME=git_checkout_location - e.g. /home/user/git
export RESULTS_HOME=somewhere_to_put_results - e.g. /home/user/stf_results
perl $SYSTEMTEST_HOME/stf/stf.core/scripts/stf.pl -test-root="$SYSTEMTEST_HOME/stf;$SYSTEMTEST_HOME/openjdk-systemtest" -systemtest-prereqs="$SYSTEMTEST_HOME/../systemtest_prereqs" -results-root="$RESULTS_HOME/LambdaLoadTest_0" -test=LambdaLoadTest
The load test definition was changed from:
LoadTestProcessDefinition loadTestInvocation = test.createLoadTestSpecification()
.addPrereqJarToClasspath(JavaProcessDefinition.JarId.JUNIT)
.addPrereqJarToClasspath(JavaProcessDefinition.JarId.HAMCREST)
.addProjectToClasspath("openjdk.test.lambdasAndStreams")
.setInactivityLimit("60m") // Since this test is run using -Xint as well, set a larger inactivity limit than default
.addSuite("lambda") // Start args for the first suite
.setSuiteThreadCount(cpuCount - 2, 2) // Leave 1 cpu for the JIT, 1 cpu for GC and set min 2
.setSuiteInventory(inventoryFile) // Point at the file which lists the tests
.setSuiteNumTests(200) // Run this many tests
.setSuiteRandomSelection(); // Randomly pick the next test each time
to
LoadTestProcessDefinition loadTestInvocation = test.createLoadTestSpecification()
.addJvmOption("-Xmx790M")
.addPrereqJarToClasspath(JavaProcessDefinition.JarId.JUNIT)
.addPrereqJarToClasspath(JavaProcessDefinition.JarId.HAMCREST)
.addProjectToClasspath("openjdk.test.lambdasAndStreams")
.setInactivityLimit("60m") // Since this test is run using -Xint as well, set a larger inactivity limit than default
.addSuite("lambda") // Start args for the first suite
.setSuiteThreadCount(cpuCount - 2, 2) // Leave 1 cpu for the JIT, 1 cpu for GC and set min 2
.setSuiteInventory(inventoryFile) // Point at the file which lists the tests
.setSuiteNumTests(200) // Run this many tests
.setSuiteRandomSelection(); // Randomly pick the next test each time
.addJvmOption("-verbose:gc")
From reproducing this and analyzing the data for various test scenarios, there were many findings, but here is a summary of what is most relevant.
Firstly, we are dealing with 2 problems here:
1. The excessive GC condition is not detected and turned into an OOM, so the JVM grinds through back-to-back GC cycles instead of failing promptly (the GC detection issue discussed above).
2. The test's heap occupancy keeps growing until the heap is exhausted, which looks like an intermittent memory leak.
The first issue can be resolved through GC changes; however, it is not the real reason for the failure. The second problem is the more serious one. From heap dump analysis and scavenger tracing, it appears that we have an intermittent memory leak with this test case, although this is not conclusive. Here are some findings:
Passing Test Case Scavenger Tracing:
Failing Test Case Scavenger Tracing:
LT stderr {SCAV: tgcScavenger OBJECT HISTOGRAM} - Tenure Age is 10
LT stderr {SCAV: | class                                         instances of age 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 in semi-space |
LT stderr {SCAV: char[]                                       0  341  0  385  425  968  1887  4260  8926  19327  47220  0  0  0  0
LT stderr {SCAV: java/lang/String                             0  341  0  385  425  968  1888  4401  8926  19327  47220  0  0  0  0
LT stderr {SCAV: net/adoptopenjdk/test/streams/support/Line   0  340  0  385  425  968  1888  4401  8926  19327  47220  0  0  0  0
And then later in the log file before OOM is hit:
LT stderr {SCAV: tgcScavenger OBJECT HISTOGRAM} - Tenure Age is 10
LT stderr {SCAV: | class                    instances of age 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 in semi-space |
LT stderr {SCAV: java/lang/Object[]         0  1340  1370  1408  1332  2091  22445  2859  3187  1720  2079   3961  2238   4317  212359   (holds net/adoptopenjdk/test/streams/support/Line objects - see dominator tree later)
LT stderr {SCAV: java/lang/String           0  1412  1370  1408  1332  2089  22446  2860  3185  1722  2079   3960  2239   4316  212354
LT stderr {SCAV: java/util/HashMap$Node     0  1333  1370  1408  1332  2091  22445  2859  3187  1720  2988  20523  8569  21644  345623
LT stderr {SCAV: char[]                     0  1408  1372  1410  1332  2088  22457  2860  3185  1722  2079   3966  2239   4382  210762
For the passing case with the same heap size, nothing remotely similar is found in the scavenger tracing, and there is no excessive tenuring for any of these object types. The same is observed with -Xint as well.
Thread stack:
at java.lang.Object.wait(JI)V (Native Method)
at java.lang.Object.wait(J)V (Object.java:218)
at java.util.concurrent.ForkJoinTask.externalInterruptibleAwaitDone()I (ForkJoinTask.java:378)
at java.util.concurrent.ForkJoinTask.get()Ljava/lang/Object; (ForkJoinTask.java:1012)
at net.adoptopenjdk.test.streams.TestParallelStreamOperations.lambda$testCommonWord$27()V (TestParallelStreamOperations.java:300(Compiled Code))
Here is a link to the line in TestParallelStreamOperations.java where the potential leak could be occurring: https://github.com/AdoptOpenJDK/openjdk-systemtest/blob/7df5b9a6be199e2acd977079562d3ae160c8c65d/openjdk.test.lambdasAndStreams/src/test.lambda/net/adoptopenjdk/test/streams/TestParallelStreamOperations.java#L300
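For reference, the pattern that stack suggests is roughly the following (a hypothetical reconstruction for illustration, not the actual TestParallelStreamOperations source): a lambda submits a parallel-stream operation to a ForkJoinPool and blocks on ForkJoinTask.get(), so each in-flight invocation keeps its task and captured data live until the result is returned.

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

// Hypothetical reconstruction of the pattern implied by the stack trace above;
// names and structure are assumptions, not the real test source.
public class ForkJoinGetSketch {

    static long countLinesContaining(List<String> lines, String word, ForkJoinPool pool)
            throws Exception {
        Callable<Long> work = () ->
                lines.parallelStream()
                     .filter(line -> line.contains(word))
                     .count();
        ForkJoinTask<Long> task = pool.submit(work);
        // The caller blocks here on ForkJoinTask.get(); while several such calls are
        // in flight, the task, the captured dataset and any intermediate results stay
        // reachable, which is one place where heap pressure could accumulate.
        return task.get();
    }
}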
Using a memory analyzer on the heap dump, here is some interesting info:
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_lambdaLoadTest/125 LambdaLoadTest_OpenJ9_NonLinux_special_23 variation: Mode688 JVM_OPTIONS: -Xcompressedrefs -Xjit:count=0 -Xgcpolicy:gencon -Xaggressive -Xconcurrentlevel0
javacores, core and other diagnostic files in the artifact https://140-211-168-230-openstack.osuosl.org/artifactory/ci-eclipse-openj9/Test/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_lambdaLoadTest/125/system_lambdaLoadTest_test_output.tar.gz
I see lots of this