Open pshipton opened 3 years ago
@amicic @bragaigor not sure if this is a GC problem? I'll run some GCs later to see if it can be recreated. Note the failure is in the v0.26.0-release branch.
I'm trying to reproduce it as well
5 x 5x grinders https://ci.eclipse.org/openj9/view/Test/job/Grinder/1585 https://ci.eclipse.org/openj9/view/Test/job/Grinder/1586 - passed https://ci.eclipse.org/openj9/view/Test/job/Grinder/1587 - passed https://ci.eclipse.org/openj9/view/Test/job/Grinder/1588 - passed https://ci.eclipse.org/openj9/view/Test/job/Grinder/1589 - passed
Happened to spot the following in the first one
13:15:56 DLT stderr 17:15:56.520 0x30464a00 j9mm.107 * ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK8_ppc64_aix_Release/omr/gc/base/MemoryPoolSplitAddressOrderedListBase.cpp:499: ((false && (((0) == nextFreeEntry) || (addrTop <= nextFreeEntry))))
Was the failure observed in other platforms or just in AIX PPC64 with JDK 8?
I only saw it on AIX, but if it's intermittent it may occur on other platforms but just hasn't shown up yet. I can grind another platform, do you have a preference?
That makes sense, and I would say either x86 linux 64 or PPC linux 64. (I'm also trying to reproduce in these platforms locally)
5 x 10x on xlinux https://ci.eclipse.org/openj9/view/Test/job/Grinder/1590 https://ci.eclipse.org/openj9/view/Test/job/Grinder/1591 https://ci.eclipse.org/openj9/view/Test/job/Grinder/1592 https://ci.eclipse.org/openj9/view/Test/job/Grinder/1593 https://ci.eclipse.org/openj9/view/Test/job/Grinder/1594
5 x 10x on plinux https://ci.eclipse.org/openj9/view/Test/job/Grinder/1595 https://ci.eclipse.org/openj9/view/Test/job/Grinder/1596 https://ci.eclipse.org/openj9/view/Test/job/Grinder/1597 https://ci.eclipse.org/openj9/view/Test/job/Grinder/1598 https://ci.eclipse.org/openj9/view/Test/job/Grinder/1599
These all passed.
Early to give a proper diagnosis, but so far, this looks way more likely a JIT related problem (corruption in heap, for example overextending object access etc) than a GC problem.
FYI @dmitripivkine
From the tests above it looks like this failure is only happening in AIX PPC64 with JDK 8 after all
I can not find any system core stored, even from grinder https://ci.eclipse.org/openj9/view/Test/job/Grinder/1585 with assertion in GC. Running grinders seems useless until we update them to generate cores. What is the right way to get core stored?
From the tests above it looks like this failure is only happening in AIX PPC64 with JDK 8 after all
Hard to tell given the low reproduction rate. The original problem wasn't reproduced even on AIX.
@Mesbah-Alam the grinder https://ci.eclipse.org/openj9/view/Test/job/Grinder/1585 asserted so it should have created a core file, but there are no messages about creating diagnostic files, and no diagnostics in the artifacts. Are we running this test with -Xdump:none
or similar options that disable diagnostics? I do see -dumpRequested false
, not sure if that is related.
@llxia fyi
1.DLT.stderr contains
17:15:56.520 0x30464a00 j9mm.107 * ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK8_ppc64_aix_Release/omr/gc/base/MemoryPoolSplitAddressOrderedListBase.cpp:499: ((false && (((0) == nextFreeEntry) || (addrTop <= nextFreeEntry))))
JVMDUMP039I Processing dump event "traceassert", detail "" at 2021/03/15 13:15:56 - please wait.
JVMDUMP039I Processing dump event "user", detail "" at 2021/03/15 13:50:55 - please wait.
Hello, I’ve been trying to reproduce the failure in Grinder and locally for AIX PPC64 JDK 8. So I let the test run for all day yesterday and out of 130 iterations the only suspect that I got was one test run that output the following:
DLT 17:16:51.084 - Starting thread. Suite=0 thread=5
gmake[4]: *** [autoGen.mk:30: DaaLoadTest_all_CS_5m_0] Hangup
gmake[3]: *** [/home/jenkins/temp/openn/openjdk-tests/TKG/../TKG/settings.mk:250: DaaLoadTest_all_CS_5m_0-daaLoadTest] Hangup
gmake[2]: *** [/home/jenkins/temp/openn/openjdk-tests/TKG/../TKG/settings.mk:250: DaaLoadTest_all_CS_5m_0-system] Hangup
gmake[1]: *** [settings.mk:250: DaaLoadTest_all_CS_5m_0-..] Hangup
gmake: *** [makefile:48: _DaaLoadTest_all_CS_5m_0] Hangup
But I’m not sure if that’s related to the issue at hand (this looks more like a machine problem?).
But I’m not sure if that’s related to the issue at hand (this looks more like a machine problem?).
Agreed.
I do see -dumpRequested false, not sure if that is related.
Hi @pshipton , this is not related. When there is an assertion or a crash, dumps should be created normally; -dumpRequested
flag will not prevent that.
(The -dumpRequested
flag is only used to control the STF feature that creates a set of heap, java and system dumps during a test failure. This was causing unnecessary dumps being created so we added the -dumpRequested
flag.)
STF 13:52:55.053 - Sending SIGXCPU (kill -24) to the java process to generate an OS dump
STF 13:52:55.054 - Running command: kill -24 18940050
STF 13:52:55.054 - Redirecting stderr to /home/jenkins/workspace/Grinder/openjdk-tests/TKG/output_16158285507977/DaaLoadTest_all_CS_5m_0/20210315-131551-DaaLoadTest/results/1.DLT.kill_24.stderr
STF 13:52:55.054 - Redirecting stdout to /home/jenkins/workspace/Grinder/openjdk-tests/TKG/output_16158285507977/DaaLoadTest_all_CS_5m_0/20210315-131551-DaaLoadTest/results/1.DLT.kill_24.stdout
DLT stderr JVMDUMP039I Processing dump event "user", detail "" at 2021/03/15 13:50:55 - please wait.
STF 13:52:55.063 - Monitoring Report Summary:
STF 13:52:55.063 - o Process DLT has timed out
STF 13:52:55.064 - Killing processes: DLT
STF 13:52:55.064 - o Process DLT pid 18940050 stop()
STF 13:52:56.065 - o Process DLT pid 18940050 killed
**FAILED** at step 1 (Run daa load test). Expected return value=0 Actual=1 at /home/jenkins/workspace/Grinder/openjdk-tests/TKG/../TKG/output_16158285507977/DaaLoadTest_all_CS_5m_0/20210315-131551-DaaLoadTest/execute.pl line 94.
Looking at the failure output, it may be a case of STF killing the sub-process due to timeout, causing the dumps being not generated.
STF 13:52:55.063 - o Process DLT has timed out
The test sub-process named DLT
gets launched by STF with a timeout of 35m
. When assertion happens, an attempt is made to generate the dump by issuing SIGXCPU
, but it appears that the DLT
sub-process reaches its timeout while the dump generation is taking place. So, STF just kills the whole thing - as it's designed to do.
The assert occurred at 13:15, well before the timeout occurred.
DLT 13:15:56.065 - Starting thread. Suite=0 thread=13
DLT stderr 17:15:56.520 0x30464a00 j9mm.107 * ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK8_ppc64_aix_Release/omr/gc/base/MemoryPoolSplitAddressOrderedListBase.cpp:499: ((false && (((0) == nextFreeEntry) || (addrTop <= nextFreeEntry))))
DLT stderr JVMDUMP039I Processing dump event "traceassert", detail "" at 2021/03/15 13:15:56 - please wait.
STF 13:20:54.142 - Heartbeat: Process DLT is still running
I assume there is a sub-process being run? It would be useful to (always) print the command lines for any sub-processes.
The assert occurred at 13:15, well before the timeout occurred.
True, but the dump collection started at 13:50:55.015
and the timeout occurs at 13:52:55.063
- in between these two events, we see a whole bunch of output suggesting the framework is trying to generate dump. That's why I was guessing if the timeout killed the dump generation.
I assume there is a sub-process being run?
By subprocess I meant the sub-process of the test, e.g. DLT
- whose command line does get printed by STF:
STF 13:15:54.612 - Running command: /home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java -Xgcpolicy:gencon -Xgc:concurrentScavenge -classpath /home/jenkins/workspace/Grinder/jvmtest/system/stf/stf.load/bin:/home/jenkins/workspace/Grinder/jvmtest/system/stf/stf.core/bin:/home/jenkins/workspace/Grinder/jvmtest/system/systemtest_prereqs/log4j/log4j-api.jar:/home/jenkins/workspace/Grinder/jvmtest/system/systemtest_prereqs/log4j/log4j-core.jar:/home/jenkins/workspace/Grinder/jvmtest/system/systemtest_prereqs/junit/junit.jar:/home/jenkins/workspace/Grinder/jvmtest/system/systemtest_prereqs/junit/hamcrest-core.jar:/home/jenkins/workspace/Grinder/jvmtest/system/openj9-systemtest/openj9.test.daa/bin net.adoptopenjdk.loadTest.LoadTest -resultsDir /home/jenkins/workspace/Grinder/openjdk-tests/TKG/output_16158285507977/DaaLoadTest_all_CS_5m_0/20210315-131551-DaaLoadTest/results -resultsPrefix 1.DLT. -timeLimit 5m -dumpRequested false -reportFailureLimit 1 -abortAtFailureLimit 10 -maxTotalLogFileSpace 200M -maxSingleLogSize 1/25 -suite.daa.threadCount 14 -suite.daa.inventoryFile /home/jenkins/workspace/Grinder/openjdk-tests/TKG/output_16158285507977/DaaLoadTest_all_CS_5m_0/20210315-131551-DaaLoadTest/results/1.DLT.inventory/openj9.test.load/config/inventories/daa/daaAll.xml -suite.daa.inventoryExcludeFile none -suite.daa.selection random -suite.daa.seed -1 -suite.daa.repeatCount 1 -suite.daa.thinkingTime 0ms..0ms
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_extended.system_ppc64_aix_Release_testList_2/1/ DaaLoadTest_all_CS_5m_0