eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

LambdaLoadTest hang #10638

Open pshipton opened 4 years ago

pshipton commented 4 years ago

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_lambdaLoadTest/125 LambdaLoadTest_OpenJ9_NonLinux_special_23 variation: Mode688 JVM_OPTIONS: -Xcompressedrefs -Xjit:count=0 -Xgcpolicy:gencon -Xaggressive -Xconcurrentlevel0

javacores, core and other diagnostic files in the artifact https://140-211-168-230-openstack.osuosl.org/artifactory/ci-eclipse-openj9/Test/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_lambdaLoadTest/125/system_lambdaLoadTest_test_output.tar.gz

LT  05:23:09.450 - Completed 7.0%. Number of tests started=14 (+0)
STF 05:23:29.440 - Heartbeat: Process LT  is still running
LT  05:23:29.441 - Completed 7.0%. Number of tests started=14 (+0)
STF 05:23:30.444 - **FAILED** Process LT  has timed out
STF 05:23:30.444 - Collecting dumps for: LT 

I see lots of this

3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at net/adoptopenjdk/test/lambda/TestLambdaMultithreaded$$Lambda$105/000000000000000000.execute(Bytecode PC:5(Compiled Code))
4XESTACKTRACE                at net/adoptopenjdk/test/lambda/TestLambdaMultithreaded.runSimpleMaths(TestLambdaMultithreaded.java:340(Compiled Code))
4XESTACKTRACE                at net/adoptopenjdk/test/lambda/TestLambdaMultithreaded.lambda$testLambdasWithMultipleThreads$14(TestLambdaMultithreaded.java:237(Compiled Code))
4XESTACKTRACE                at net/adoptopenjdk/test/lambda/TestLambdaMultithreaded$$Lambda$144/000000000000000000.run(Bytecode PC:12(Compiled Code))
4XESTACKTRACE                at java/lang/Thread.run(Thread.java:836(Compiled Code))
pshipton commented 4 years ago

@gita-omr can someone take a look pls.

JasonFengJ9 commented 4 years ago

A quick note: this test hangs on Windows as well https://github.com/eclipse/openj9/issues/8214.

gita-omr commented 4 years ago

A quick note, this test hangs at Windows platform as well #8214.

@andrewcraik fyi

pshipton commented 4 years ago

Not sure the Windows problem is the same; we'd have to wait for it to recur and look at any diagnostic files produced (the existing information says no diagnostic files were produced). The Windows hang uses a different mode and may be related to jdwp.

gita-omr commented 4 years ago

Ok, we’ll start on Power.

gita-omr commented 4 years ago

Ran grinder 30 times and it never failed.

gita-omr commented 4 years ago

An update: the job is unfortunately gone by now. I took a look at the artifacts and all javacore files are empty, system core did not show anything. I guess let's wait until it fails again...

pshipton commented 4 years ago

@gita-omr Not sure why you are seeing empty javacore files, I can see content in them.

gita-omr commented 4 years ago

Oh, maybe I ran out of disk space! Thanks.

gita-omr commented 4 years ago

Did it hang again on Windows? https://github.com/eclipse/openj9/issues/8214#issuecomment-700290562

pshipton commented 4 years ago

Yes, although I suspect the Windows hang is a separate issue, related to the mode. It "regularly" hangs on Windows, but not that often.

pshipton commented 4 years ago

Another one https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_ppc64_aix_Nightly_lambdaLoadTest/146 LambdaLoadTest_OpenJ9_NonLinux_special_27 variation: Mode612-OSRG JVM_OPTIONS: -Xcompressedrefs -Xgcpolicy:gencon -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation

https://140-211-168-230-openstack.osuosl.org/artifactory/ci-eclipse-openj9/Test/Test_openjdk8_j9_special.system_ppc64_aix_Nightly_lambdaLoadTest/146/system_lambdaLoadTest_test_output.tar.gz

lumpfish commented 4 years ago

Here is some investigation running on a local Windows 10 laptop. Builds used were: openj9 compressed refs:

openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-b01)
Eclipse OpenJ9 VM (build openj9-0.23.0, JRE 1.8.0 Windows 10 amd64-64-Bit Compressed References 20201112_897 (JIT enabled, AOT enabled)
OpenJ9   - 0394ef754
OMR      - 582366ae5
JCL      - b52d2ff7ee based on jdk8u275-b01)

openj9 non-compressed refs:

openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-b01)
Eclipse OpenJ9 VM (build openj9-0.23.0, JRE 1.8.0 Windows 10 amd64-64-Bit 20201113_491 (JIT enabled, AOT enabled)
OpenJ9   - 0394ef754
OMR      - 582366ae5
JCL      - b52d2ff7ee based on jdk8u275-b01)

hotspot:

openjdk version "1.8.0_275"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_275-b01)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.275-b01, mixed mode)

The LambdaLoadTest runs these JUnit tests on (#cpus - 2) threads (minimum 2):

<inventory> 
    <!-- Add the Lambda tests -->
    <junit class="net.adoptopenjdk.test.lambda.ClassLibraryLambdaTests"/>
    <junit class="net.adoptopenjdk.test.lambda.TestLambdaCapture"/>
    <junit class="net.adoptopenjdk.test.lambda.TestLambdaContexts"/>
    <junit class="net.adoptopenjdk.test.lambda.TestLambdaDefaultsAndStatics"/>
    <junit class="net.adoptopenjdk.test.lambda.TestLambdaInferredTyping"/>
    <junit class="net.adoptopenjdk.test.lambda.TestLambdaJavaInterfaces"/>
    <junit class="net.adoptopenjdk.test.lambda.TestLambdaLexicalScoping"/>
    <junit class="net.adoptopenjdk.test.lambda.TestLambdaMethodReferences"/>
    <junit class="net.adoptopenjdk.test.lambda.TestLambdaMultithreaded"/>
    <junit class="net.adoptopenjdk.test.lambda.TestLambdaNested"/>
    <junit class="net.adoptopenjdk.test.lambda.TestLambdaRecursive"/>
    <junit class="net.adoptopenjdk.test.lambda.TestLambdasUserDefinedInterface"/>

    <!-- Add the Streams tests -->
    <junit class="net.adoptopenjdk.test.streams.TestParallelStreamOperations"/>
    <junit class="net.adoptopenjdk.test.streams.TestStreamOperations"/>
</inventory>

If only one of the tests is included in the inventory, then the test passes with a 100MB heap, except for the net.adoptopenjdk.test.streams.TestParallelStreamOperations class. It is this test which causes the OOM.

The test uses parallel streams to count the occurrences of various strings and letters in a test dataset of 130000 lines of text with a total of 8746601 characters.
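
For illustration, here is a minimal sketch (assumed shape only, not the actual test source) of the kind of parallel stream counting operation the test performs:

    import java.util.Arrays;
    import java.util.List;

    // Sketch: count how often a target word appears across a dataset using a parallel stream.
    public class ParallelCountSketch {
        public static void main(String[] args) {
            List<String> lines = Arrays.asList("the quick brown fox", "jumps over the lazy dog");
            long occurrences = lines.parallelStream()
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))
                    .filter("the"::equals)
                    .count();
            System.out.println("occurrences of 'the' = " + occurrences);
        }
    }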

The tests include the constant

private int MAX_NUM_CONCURRENT_TESTS = 2;

which is used to limit the number of instances of the test that can run concurrently (if there are already two instances running, the test exits immediately). So the normal case is that there are two instances running for the duration of the load test run (since these tests also take much longer to complete than any of the other tests in the workload).
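
As a rough illustration (a sketch of the mechanism described above, not the actual test code), the guard amounts to something like:

    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of the concurrency guard: if MAX_NUM_CONCURRENT_TESTS instances are
    // already running, a newly started instance exits immediately.
    public class ConcurrentTestGuardSketch {
        private static final int MAX_NUM_CONCURRENT_TESTS = 2;
        private static final AtomicInteger running = new AtomicInteger();

        public void runTest(Runnable testBody) {
            if (running.incrementAndGet() > MAX_NUM_CONCURRENT_TESTS) {
                running.decrementAndGet();
                return; // two instances already running - exit immediately
            }
            try {
                testBody.run(); // the long-running parallel stream workload
            } finally {
                running.decrementAndGet();
            }
        }
    }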

Results of some test runs with different heap sizes show that there is an out of memory 'boundary' where the JVM may report out of memory or may just consume all available CPU for a long period. These tests were all run with no additional -X options:

openj9 compressed refs:
-Xmx490M - test completed successfully in about 30 secs
-Xmx480M - test fails OOM (out of memory - heap space) in about 3 minutes
-Xmx450M - test did not complete within the 60 minutes allowed. During this time the java.exe process is consuming 80% of the laptop CPU (which has 12 cores).

openj9 non-compressed refs:
-Xmx780M - test completes successfully in about 30 secs
-Xmx770M - test does not complete within the 60 minutes allowed. During this time the java.exe process is consuming 80% of the CPU.
-Xmx750M - test failed OOM (out of memory - heap space) in about 2 minutes

hotspot:
-Xmx410M - test completes successfully in about 30 secs
-Xmx400M - test failed OOM (out of memory - heap space) in about 6 minutes

The test currently has no -Xmx setting, so the amount of heap will vary according to the test machine selected at run time. The fact that the JVM appears to become extremely slow as the heap becomes exhausted looks like the most likely explanation for the tests appearing to hang or time out.

An obvious fix is to set a specific 1GB heap size for these tests - I'll create a PR for that and do some extensive testing.

That will 'fix' the test case failures - but does the extreme slowing of the JVM warrant further investigation?

lumpfish commented 4 years ago

BTW, the LambdaLoadTests are not currently running as they should due to https://github.com/AdoptOpenJDK/openjdk-systemtest/issues/379 - the test run time is deemed to have expired as soon as the workload starts. But since this issue shows up with just two instances of the offending test case running it is still seen anyway (but there may be some test runs which 'complete' before two instances of that test have been started).

dmitripivkine commented 4 years ago

-Xmx450M - test did not complete within the 60 minutes allowed. During this time the java.exe process is consuming 80% of the laptop CPU (which has 12 cores)

Is it possible to get a verbose GC log for this case please? I think it might be a case of inefficient excessive GC detection.

lumpfish commented 4 years ago

It turned out to be harder than I thought to catch a run which hangs, rather than completing successfully or going out of memory. The verbose GC log attached here was from a run with the non-compressed refs build with -Xmx790M. The test was set to time out after 10 mins, at which point it was killed. verbosegc.20201201.173427.29444.txt.zip

dmitripivkine commented 4 years ago

There is obviously a case where the excessive GC condition is missed. We are going to work on a fix.

amicic commented 4 years ago

As suspected by @dmitripivkine this is indeed a case of excessive GC, but for some reason not leading to appropriate OOM exceptions.

We have two criteria that have to be met to raise OOM. Besides doing GC excessively, there should be no sign of progress (reclaimed memory). A theory is that the latter is not properly measured in this particular scenario.

The scenario is that a Scavenge aborts and is followed by a percolate Global GC that satisfies the allocation, but in Tenure (not in Allocate space, as it normally should be). Since Allocate space is empty, subsequent allocations trigger an allocation failure right away and the process repeats. Along the way the Global GC, where the excessive GC logic is mostly maintained, appears to reclaim lots of memory, but this is bogus: the free memory before the Global GC was reported as 0%, which is just a side effect of the aborted Scavenge that preceded it. The free memory was actually high at the moment the AF was raised, and the Global GC in fact recovered no memory.

Example:

<exclusive-start id="23503" timestamp="2020-12-01T17:39:55.929" intervalms="596.352">
  <response-info timems="0.668" idlems="0.339" threads="1" lastid="0000000003A02A00" lastname="JIT Compilation Thread-005" />
</exclusive-start>
<af-start id="23504" threadId="00000000037C8AF8" totalBytesRequested="104" timestamp="2020-12-01T17:39:55.929" intervalms="596.370" type="nursery" />
<cycle-start id="23505" type="scavenge" contextid="0" timestamp="2020-12-01T17:39:55.929" intervalms="596.380" />
<gc-start id="23506" type="scavenge" contextid="23505" timestamp="2020-12-01T17:39:55.929">
  <mem-info id="23507" free="62931864" total="828375040" percent="7">
    <mem type="nursery" free="0" total="207093760" percent="0">
      <mem type="allocate" free="0" total="137691136" percent="0" />
      <mem type="survivor" free="0" total="69402624" percent="0" />
    </mem>
    <mem type="tenure" free="62931864" total="621281280" percent="10">
      <mem type="soa" free="56718232" total="615067648" percent="9" />           <<<< this is the right pre-GC point with lots of free memory
      <mem type="loa" free="6213632" total="6213632" percent="100" />
    </mem>
    <remembered-set count="139186" />
  </mem-info>
</gc-start>
........
aborted scavenge
.......
<cycle-end id="23511" type="scavenge" contextid="23505" timestamp="2020-12-01T17:39:56.180" /><percolate-collect id="23512" from="nursery" to="global" reason="failed tenure threshold reached" timestamp="2020-12-01T17:39:56.180"/>
<percolate-collect id="23512" from="nursery" to="global" reason="failed tenure threshold reached" timestamp="2020-12-01T17:39:56.180"/>
<cycle-start id="23513" type="global" contextid="0" timestamp="2020-12-01T17:39:56.180" intervalms="615.573" />
<gc-start id="23514" type="global" contextid="23513" timestamp="2020-12-01T17:39:56.180">
    <mem type="tenure" free="6213632" total="621281280" percent="1">
      <mem type="soa" free="0" total="615067648" percent="0" /> <<<<<<<<<<<<<<<<< this is a bogus pre GC-point with low free memory
      <mem type="loa" free="6213632" total="6213632" percent="100" />
    </mem>
    <remembered-set count="139186" />
  </mem-info>
</gc-start>
.......
.......
<gc-end id="23519" type="global" contextid="23513" durationms="174.455" usertimems="1937.500" systemtimems="0.000" timestamp="2020-12-01T17:39:56.354" activeThreads="12">
  <mem-info id="23520" free="62931864" total="828375040" percent="7">
    <mem type="nursery" free="0" total="207093760" percent="0">
      <mem type="allocate" free="0" total="137691136" percent="0" />
      <mem type="survivor" free="0" total="69402624" percent="0" />
    </mem>
    <mem type="tenure" free="62931864" total="621281280" percent="10" micro-fragmented="644464" macro-fragmented="0">
      <mem type="soa" free="56718232" total="615067648" percent="9" />        <<<<<<<<<<< recovered memory high (relative to bogus pre-GC start), but really no recovered memory relative to real pre-GC start
      <mem type="loa" free="6213632" total="6213632" percent="100" />
    </mem>
    <remembered-set count="139186" />
  </mem-info>
</gc-end>
<cycle-end id="23521" type="global" contextid="23513" timestamp="2020-12-01T17:39:56.354" />
<allocation-satisfied id="23522" threadId="00000000037C8100" bytesRequested="104" />
<af-end id="23523" timestamp="2020-12-01T17:39:56.354" threadId="00000000037C8AF8" success="true" from="tenure-soa"/>
<exclusive-end id="23524" timestamp="2020-12-01T17:39:56.354" durationms="425.892" />
amicic commented 4 years ago

@jonoommen please try to reproduce this, and if the theory seems plausible, we'll work on a fix, where reclaimed memory would be measured from AF start to end, rather than Global GC start to end.

-Xtgc:excessivegc logs might be useful

lumpfish commented 4 years ago

FWIW, this is how I ran locally and added the various JVM command line options:

  1. Build the openjdk-systemtest and stf repositories.
  2. Make sure you can run the test. For Windows:
    set JAVA_HOME=java_to_test
    set PATH=java_to_test\bin;%PATH%
    set SYSTEMTEST_HOME=git_checkout_location - e.g. c:\Users\user\git
    set RESULTS_HOME=somewhere_to_put_results - e.g. c:\Users\user\stf_results
    perl %SYSTEMTEST_HOME%/stf/stf.core/scripts/stf.pl -test-root="%SYSTEMTEST_HOME%/stf;%SYSTEMTEST_HOME%/openjdk-systemtest" -systemtest-prereqs="%SYSTEMTEST_HOME%/../systemtest_prereqs" -results-root="%RESULTS_HOME%/LambdaLoadTest_0" -test=LambdaLoadTest

    For Unix

    export JAVA_HOME=java_to_test
    export PATH=java_to_test/bin:$PATH
    export SYSTEMTEST_HOME=git_checkout_location - e.g. /home/user/git
    export RESULTS_HOME=somewhere_to_put_results - e.g. /home/user/stf_results
    perl $SYSTEMTEST_HOME/stf/stf.core/scripts/stf.pl -test-root="$SYSTEMTEST_HOME/stf;$SYSTEMTEST_HOME/openjdk-systemtest" -systemtest-prereqs="$SYSTEMTEST_HOME/../systemtest_prereqs" -results-root="$RESULTS_HOME/LambdaLoadTest_0" -test=LambdaLoadTest
  3. Add the -Xmx option to the LambdaLoadTest - in file openjdk.test.load/src/test.load/net/adoptopenjdk/stf/LambdaLoadTest.java, change
        LoadTestProcessDefinition loadTestInvocation = test.createLoadTestSpecification()
                .addPrereqJarToClasspath(JavaProcessDefinition.JarId.JUNIT)
                .addPrereqJarToClasspath(JavaProcessDefinition.JarId.HAMCREST)
                .addProjectToClasspath("openjdk.test.lambdasAndStreams")
                .setInactivityLimit("60m")                  // Since this test is run using -Xint as well, set a larger inactivity limit than default 
                .addSuite("lambda")                         // Start args for the first suite
                .setSuiteThreadCount(cpuCount - 2, 2)       // Leave 1 cpu for the JIT, 1 cpu for GC and set min 2
                .setSuiteInventory(inventoryFile)           // Point at the file which lists the tests
                .setSuiteNumTests(200)                      // Run this many tests
                .setSuiteRandomSelection();                 // Randomly pick the next test each time

    to

        LoadTestProcessDefinition loadTestInvocation = test.createLoadTestSpecification()
                .addJvmOption("-Xmx790M")
                .addPrereqJarToClasspath(JavaProcessDefinition.JarId.JUNIT)
                .addPrereqJarToClasspath(JavaProcessDefinition.JarId.HAMCREST)
                .addProjectToClasspath("openjdk.test.lambdasAndStreams")
                .setInactivityLimit("60m")                  // Since this test is run using -Xint as well, set a larger inactivity limit than default 
                .addSuite("lambda")                         // Start args for the first suite
                .setSuiteThreadCount(cpuCount - 2, 2)       // Leave 1 cpu for the JIT, 1 cpu for GC and set min 2
                .setSuiteInventory(inventoryFile)           // Point at the file which lists the tests
                .setSuiteNumTests(200)                      // Run this many tests
                .setSuiteRandomSelection();                 // Randomly pick the next test each time
    • i.e. add the JVM option to the java command (which is generated by stf at run time). Add other JVM options (e.g. -verbose:gc) in the same way: .addJvmOption("-verbose:gc").
  4. Rebuild the openjdk-systemtest repository
  5. Rerun the test
jonoommen commented 3 years ago

From reproducing this and analyzing the data for various test scenarios, there were many findings, but here is a summary of what is most relevant.

Firstly, we are dealing with 2 problems here:

  1. The excessive GC condition is not being detected, so instead of raising an OOM error the JVM keeps cycling through back-to-back GCs (the issue analyzed above).
  2. The test itself drives heap occupancy up until the heap is exhausted, which looks like it may be a leak in the test case.

The first issue can be resolved through GC changes, however it is not the real reason for failure. The latter is the more serious problem. From heap dump analysis and scavenger tracing, it appears that we have an intermittent memory leak with this test case, although this is not conclusive. Here are some findings:

Passing Test Case Scavenger Tracing:

Failing Test Case Scavenger Tracing:


LT stderr {SCAV: tgcScavenger OBJECT HISTOGRAM} - Tenure Age is 10
LT stderr {SCAV: | class instances of age 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 in semi-space |
LT stderr {SCAV: char[] 0 341 0 385 425 968 1887 4260 8926 19327 47220 0 0 0 0
LT stderr {SCAV: java/lang/String 0 341 0 385 425 968 1888 4401 8926 19327 47220 0 0 0 0
LT stderr {SCAV: net/adoptopenjdk/test/streams/support/Line 0 340 0 385 425 968 1888 4401 8926 19327 47220 0 0 0 0


And then later in the log file before OOM is hit:


LT stderr {SCAV: tgcScavenger OBJECT HISTOGRAM} - Tenure Age is 10
LT stderr {SCAV: | class instances of age 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 in semi-space |
LT stderr {SCAV: java/lang/Object[] 0 1340 1370 1408 1332 2091 22445 2859 3187 1720 2079 3961 2238 4317 212359 (holds net/adoptopenjdk/test/streams/support/Line objects - see dominator tree later)
LT stderr {SCAV: java/lang/String 0 1412 1370 1408 1332 2089 22446 2860 3185 1722 2079 3960 2239 4316 212354
LT stderr {SCAV: java/util/HashMap$Node 0 1333 1370 1408 1332 2091 22445 2859 3187 1720 2988 20523 8569 21644 345623
LT stderr {SCAV: char[] 0 1408 1372 1410 1332 2088 22457 2860 3185 1722 2079 3966 2239 4382 210762


For the passing case with the same heap size, nothing remotely similar is found in the scavenger tracing, and there is no excessive tenuring for any of these object types. The same is observed with -Xint as well.

Thread stack:

at java.lang.Object.wait(JI)V (Native Method)
at java.lang.Object.wait(J)V (Object.java:218)
at java.util.concurrent.ForkJoinTask.externalInterruptibleAwaitDone()I (ForkJoinTask.java:378)
at java.util.concurrent.ForkJoinTask.get()Ljava/lang/Object; (ForkJoinTask.java:1012)
at net.adoptopenjdk.test.streams.TestParallelStreamOperations.lambda$testCommonWord$27()V (TestParallelStreamOperations.java:300(Compiled Code))

Here is a link to the line in TestParallelStreamOperations.java where the potential leak could be occurring: https://github.com/AdoptOpenJDK/openjdk-systemtest/blob/7df5b9a6be199e2acd977079562d3ae160c8c65d/openjdk.test.lambdasAndStreams/src/test.lambda/net/adoptopenjdk/test/streams/TestParallelStreamOperations.java#L300
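
For context, that stack shape is what typically results when a parallel stream computation is submitted to a ForkJoinPool and the caller blocks in get(). A minimal sketch (not the actual test code; names and data are illustrative):

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.ForkJoinTask;

    // Sketch of the pattern implied by the stack trace: the test thread submits a
    // parallel-stream word count to a ForkJoinPool and waits on ForkJoinTask.get().
    public class CommonWordWaitSketch {
        public static void main(String[] args) throws Exception {
            List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog");
            Callable<Long> countTheWord = () -> lines.parallelStream()
                    .flatMap(line -> Arrays.stream(line.split(" ")))
                    .filter("the"::equals)
                    .count();
            ForkJoinPool pool = new ForkJoinPool();
            ForkJoinTask<Long> task = pool.submit(countTheWord);
            // The thread in the javacore is parked here, waiting for the fork/join result;
            // if the GC is thrashing, this wait can last far beyond the expected run time.
            System.out.println("count = " + task.get());
            pool.shutdown();
        }
    }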

Using a memory analyzer on the heap dump, here is some interesting info:

  1. Object Histogram:

Screen Shot 2020-12-09 at 18 17 08

  2. Leak Hunter (shows the few objects that dominate the heap):

Screen Shot 2020-12-09 at 18 12 37

  3. Object Dominator Tree - expanding on the leak hunter above and showing what objects are held in these heap-dominating objects:

Screen Shot 2020-12-09 at 21 31 09