Open pshipton opened 5 years ago
Trying a grinder on MauveMultiThreadLoadTest_special_15 https://ci.eclipse.org/openj9/view/Test/job/Test-Grinder/498/ osx1013-x86-3 The cores seem to be in the grinder job, so we can get them from there if needed, or easily recreate the problem.
@Mesbah-Alam @smlambert @DanHeidinga
00:35:49 LT stderr _RegisterApplication(), FAILED TO establish the default connection to the WindowServer, _CGSDefaultConnection() is NULL. 00:35:49 LT stderr 2019-09-11 00:35:46.622 java[39481:24524635] +[NSXPCSharedListener endpointForReply:withListenerName:]: an error occurred while attempting to obtain endpoint for listener 'ClientCallsAuxiliary': Connection interrupted
Looks like this stackoverflow question: https://stackoverflow.com/a/15202453/5029186
@AdamBrousseau
07:46:25 + tar -zcf system_mauveLoadTest_test_output.tar.gz ./openjdk-tests/TestConfig/test_output_15681755186622
[Pipeline] getArtifactoryServer
[Pipeline] newBuildInfo
[Pipeline] artifactoryUpload (hide)
expected to call org.jfrog.hudson.pipeline.common.types.ArtifactoryServer.upload but wound up catching artifactoryUpload; see: https://jenkins.io/redirect/pipeline-cps-method-mismatches/
08:12:12 [consumer_0] Deploying artifact: https://140-211-168-230-openstack.osuosl.org/artifactory/ci-eclipse-openj9/Test/Test_openjdk8_j9_special.system_x86-64_mac_Nightly/51/system_mauveLoadTest_test_output.tar.gz
Cancelling nested steps due to timeout
[Pipeline] }
[Pipeline] // timestamps
10:12:48 Error occurred for request PUT /artifactory/ci-eclipse-openj9/Test/Test_openjdk8_j9_special.system_x86-64_mac_Nightly/51/system_mauveLoadTest_test_output.tar.gz;build.parentNumber=46;build.parentName=Test_openjdk8_j9_special.system_x86-64_mac_Nightly;build.timestamp=1568170859606;build.name=Test_openjdk8_j9_special.system_x86-64_mac_Nightly;build.number=51 HTTP/1.1: Connection closed by remote host.
10:12:48 Attempting retry #1
10:12:48 [consumer_0] An exception occurred during execution:
10:12:48 java.lang.IllegalStateException: Connection pool shut down
Any idea how big the test tarball is? Looking at the grinder you linked the one there is 400M and takes 1 minute to zip. This one appears to take 1/2 hour just to zip. Then fails to upload after 2 hours.
@AdamBrousseau I don't care about the artifact, more concerned about the root problem https://github.com/eclipse/openj9/issues/7050#issuecomment-530508632
Not really understanding this. I tried running that StackOverflow suggestion but there's nothing in /Library/LaunchDaemons
. I think we need more help from Test in order to understand what is going on here.
The mauve tests passed when running on osx1014-x86-3. https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_x86-64_mac_Nightly/78/
Similar failures continue to show up:
LT stderr _RegisterApplication(), FAILED TO establish the default connection to the WindowServer, _CGSDefaultConnection() is NULL.
LT stderr 2019-09-23 06:30:25.447 java[61166:26134165] +[NSXPCSharedListener endpointForReply:withListenerName:]: an error occurred while attempting to obtain endpoint for listener 'ClientCallsAuxiliary': Connection interrupted
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_x86-64_mac_Nightly/143/tapResults/
@Mesbah-Alam note that as per https://github.com/eclipse/openj9/issues/7050#issuecomment-531560606 it passed when running on osx1014-x86-3. Not sure what is special about this machine. @AdamBrousseau @jdekonin
There is nothing special about this machine. It should identical to osx1014-x86-1 and osx1014-x86-2.
I wonder if it passes on osx1014 machines. I'll keep of list of where it fails. osx1011-x86-1 osx1011-x86-2 osx1013-x86-3 osx1013-x86-4 osx1014-x86-1 osx1014-x86-2
and where it passes osx1014-x86-3
The 1013 and 1011 systems should all be identical as well.
I tried grinders on osx1014-x86-1 and 2, and the tests do not pass on these machines. There is something special about osx1014-x86-3.
This seems like a machine issue or test issue. It won't block the release so it doesn't need to be in the 0.17 milestone plan, moving it out. Although I would like it to be resolved sooner rather than later since the mauve tests fail on Mac almost every night.
The mauveLoadTest tests always fail (on internal, external openj9 and adopt machines):
BUILD_LIST = system/mauveLoadTest
.@vsebe note they (used to) pass on a single machine, osx1014-x86-3
Running a Grinder on osx1014-x86-3
12:56:29 LT stderr _RegisterApplication(), FAILED TO establish the default connection to the WindowServer, _CGSDefaultConnection() is NULL.
12:56:29 LT stderr 2019-11-27 12:56:27.792 java[46488:1391486] +[NSXPCSharedListener endpointForReply:withListenerName:]: an error occurred while attempting to obtain endpoint for listener 'ClientCallsAuxiliary': Connection interrupted
12:56:48 LT 12:56:47.503 - Completed 0.2%. Number of tests started=4426
...
13:11:28 LT 13:11:27.431 - Completed 0.2%. Number of tests started=4426 (+0)
13:11:29 LT 13:11:28.383 - **POSSIBLE HANG DETECTED**
13:11:29 STF 13:11:28.773 - **FAILED** Process LT has hung
13:11:29 STF 13:11:28.773 - Collecting dumps for: LT
13:11:29 STF 13:11:28.773 - Sending SIG 3 to the java process to generate a javacore
13:11:29 STF 13:11:28.773 - Running command: kill -3 46488
...
13:13:29 **FAILED** at step 1 (Run Mauve load test). Expected return value=0 Actual=1 at /Users/jenkins/workspace/Grinder/openjdk-tests/TestConfig/../TestConfig/test_output_15748773846700/MauveMultiThreadLoadTest_special_13/20191127-125625-MauveMultiThreadLoadTest/execute.pl line 95.
13:13:29 STF 13:13:29.299 - **FAILED** execute script failed. Expected return value=0 Actual=1
...
We managed to cause the same problem RegisterApplication(), FAILED TO establish the default connection to the WindowServer, _CGSDefaultConnection() is NULL.
in a sanity.functional test by loading the awt library.
@Mesbah-Alam if the awt library is loaded from the main thread first, then this test may be able to run on Mac Java 8.
Checking the inventory files for this load test, there is no awt test that gets run in the multi-threaded version of the mauve test (They area all commented out - as "swing and awt are not thread safe").
Awt tests are run only for the single-threaded version of mauve tests -- Does this not mean that the "awt libraries" should always be loaded from the main thread first?
@pshipton
@Mesbah-Alam just trying to figure out why these tests are hanging. We might need to experiment. I know loading the awt library causes the hang. I've heard that awt needs to be loaded in the main thread in order for it to work. We could try pre-loading it, or we can try to track down why it is loaded and exclude that sub-set. Setting an -Xtrace option on System.loadLibrary() can help find the callers.
This will dump a stack trace to stderr.
"-Xtrace:none,iprint=mt,trigger=method{java/lang/System.load*,jstacktrace}"
For reference, there is the latest run https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_x86-64_mac_Nightly/545/ All the enabled mauve tests fail
06:16:55 MauveMultiThreadLoadTest_special_13
06:16:55 MauveMultiThreadLoadTest_special_14
06:16:55 MauveMultiThreadLoadTest_special_15
06:16:55 MauveMultiThreadLoadTest_special_16
06:16:55 MauveMultiThreadLoadTest_special_17
06:16:55 MauveMultiThreadLoadTest_special_18
06:16:55 MauveMultiThreadLoadTest_special_19
06:16:55 MauveMultiThreadLoadTest_special_20
06:16:55 MauveMultiThreadLoadTest_special_21
06:16:55 MauveMultiThreadLoadTest_special_22
06:16:55 MauveMultiThreadLoadTest_special_25
06:16:55 MauveMultiThreadLoadTest_special_26
06:16:55 MauveSingleInvocationLoadTest_special_13
06:16:55 MauveSingleInvocationLoadTest_special_14
06:16:55 MauveSingleInvocationLoadTest_special_15
06:16:55 MauveSingleInvocationLoadTest_special_16
06:16:55 MauveSingleInvocationLoadTest_special_17
06:16:55 MauveSingleInvocationLoadTest_special_18
06:16:55 MauveSingleInvocationLoadTest_special_19
06:16:55 MauveSingleInvocationLoadTest_special_20
06:16:55 MauveSingleInvocationLoadTest_special_21
06:16:55 MauveSingleInvocationLoadTest_special_22
06:16:55 MauveSingleInvocationLoadTest_special_25
06:16:55 MauveSingleInvocationLoadTest_special_26
Mesbah captured a stack trace for a test loading awt.
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [1] java.lang.System.loadLibrary (System.java:558)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [2] java.awt.Toolkit$3.run (Toolkit.java:1636)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [3] java.awt.Toolkit$3.run (Toolkit.java:1634)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [4] java.security.AccessController.doPrivileged (AccessController.java:678) (Compiled Code)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [5] java.awt.Toolkit.loadLibraries (Toolkit.java:1633)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [6] java.awt.Toolkit.<clinit> (Toolkit.java:1670)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [7] java.awt.Color.<clinit> (Color.java:275)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [8] gnu.testlet.java.text.AttributedCharacterIterator.getRunLimit.test1 (getRunLimit.java:51)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [9] gnu.testlet.java.text.AttributedCharacterIterator.getRunLimit.test (getRunLimit.java:41)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [10] sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [11] sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [12] sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) (Compiled Code)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [13] java.lang.reflect.Method.invoke (Method.java:498) (Compiled Code)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [14] net.adoptopenjdk.loadTest.adaptors.MauveAdaptor.executeTest (MauveAdaptor.java:74) (Compiled Code)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [15] net.adoptopenjdk.loadTest.LoadTestRunner$2.run (LoadTestRunner.java:182)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [16] java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [17] java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)
14:04:18 LT stderr 19:04:17.104 0x7b05100 j9trc_aux.1 - [18] java.lang.Thread.run (Thread.java:821)
The test causing the problem is gnu.testlet.java.text.AttributedCharacterIterator.getRunLimit.test1
Excluded getRunLimit
and hit a second one that loads awt gnu.testlet.java.text.AttributedCharacterIterator.getRunStart
:
14:34:03 LT stderr 19:34:02.842 0xa279e00 j9trc_aux.1 - [2] java.awt.Toolkit$3.run (Toolkit.java:1636)
14:34:03 LT stderr 19:34:02.842 0xa279e00 j9trc_aux.1 - [3] java.awt.Toolkit$3.run (Toolkit.java:1634)
14:34:03 LT stderr 19:34:02.842 0xa279e00 j9trc_aux.1 - [4] java.security.AccessController.doPrivileged (AccessController.java:678) (Compiled Code)
14:34:03 LT stderr 19:34:02.842 0xa279e00 j9trc_aux.1 - [5] java.awt.Toolkit.loadLibraries (Toolkit.java:1633)
14:34:03 LT stderr 19:34:02.842 0xa279e00 j9trc_aux.1 - [6] java.awt.Toolkit.<clinit> (Toolkit.java:1670)
14:34:03 LT stderr 19:34:02.842 0xa279e00 j9trc_aux.1 - [7] java.awt.Color.<clinit> (Color.java:275)
14:34:03 LT stderr 19:34:02.842 0xa279e00 j9trc_aux.1 - [8] gnu.testlet.java.text.AttributedCharacterIterator.getRunStart.test1
As I exclude more tests, more new tests are being exposed as culprits. I'm going through the process of compiling the total list of excludes. Will update this issue when I reach the bottom of this list and get a green run with no hang.
We should explore pre-loading awt in the main thread rather than excluding too many tests.
@pshipton , for pre-loading awt in the main thread, would this suffice?
I tried the above change, and was able to reproduce the error even before any thread / mauve test started:
11:11:36 GEN stderr _RegisterApplication(), FAILED TO establish the default connection to the WindowServer, _CGSDefaultConnection() is NULL.
11:11:36 GEN stderr 2019-12-18 11:11:36.150 java[95094:73409151] +[NSXPCSharedListener endpointForReply:withListenerName:]: an error occurred while attempting to obtain endpoint for listener 'ClientCallsAuxiliary': Connection interrupted
Link: Test_grinder/job/Grinder/4971/console
I'm not sure how to fix it. It was @keithc-ca who suggested that awt needs to be loaded in the main thread. You should record here the experiments you tried.
public class A1 {
static {
try {
Class.forName("java.awt.Toolkit");
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
}
public static void main (String [] args) {
System.out.print("Hello world");
}
}
public class A2 {
public static void main (String [] args) {
Thread t = new T();
t.start();
}
}
public class T extends Thread {
static {
try {
Class.forName("java.awt.Toolkit");
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
}
public void run () {
System.out.println("Mr. T");
}
}
A1 and A2 were executed using OpenJ9 and Hotspot JDK8. They both work fine - i.e. the issue is not reproduced.
@Mesbah-Alam for A2 please try moving the Class.forName
into the run() method. Doing it in the static block means it's run in the main thread.
Done. Same result -- it works fine:
public class T extends Thread{
public void run () {
try {
Class.forName("java.awt.Toolkit");
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
System.out.println("Mr. T");
}
}
Found the full set of files to be disabled for osx openj9 jdk8:
<mauve class="gnu.testlet.java.text.AttributedCharacterIterator.getRunLimit"/>
<mauve class="gnu.testlet.java.text.AttributedCharacterIterator.getRunStart"/>
<mauve class="gnu.testlet.javax.swing.ViewportLayout.minimumLayoutSize"/>
<mauve class="gnu.testlet.java.net.DatagramPacket.DatagramPacketTest2"/>
<mauve class="gnu.testlet.java.text.AttributedString.getIterator"/>
<mauve class="gnu.testlet.java.lang.reflect.Proxy.check13"/>
<mauve class="gnu.testlet.java.text.AttributedString.constructors"/>
<mauve class="gnu.testlet.java.security.KeyPairGenerator.getInstance14"/>
<mauve class="gnu.testlet.java.security.AlgorithmParameterGenerator.getInstance14"/>
<mauve class="gnu.testlet.java.lang.Class.serialization"/>
<mauve class="gnu.testlet.java.lang.reflect.Proxy.check13"/>
<mauve class="gnu.testlet.java.security.AlgorithmParameters.getInstance14"/>
<mauve class="gnu.testlet.java.text.AttributedString.addAttributes"/>
<mauve class="gnu.testlet.java.security.KeyFactory.getInstance14"/>
<mauve class="gnu.testlet.java.security.SecureRandom.TestOfPR23899"/>
<mauve class="gnu.testlet.java.util.UUID.TestAll"/>
I'll make the change in the tests so that we use a special inventory and exclude file for osx openj9 jdk8 - and continue to use the regular inventory and exclude file for the other os - jdk variations.
These tests pass on Hotspot, we need to look into why they don't work on OpenJ9. https://github.com/AdoptOpenJDK/openjdk-systemtest/pull/316/files#r360007264
Perhaps the test needs to use the -XstartOnFirstThread
option; see [1].
[1] https://github.com/eclipse/openj9/blob/master/test/functional/DDR_Test/tck_ddrext.xml#L95
JDK 8 0.36 build(mac11x64rt04
)
14:43:49 openjdk version "1.8.0_362"
14:43:49 IBM Semeru Runtime Open Edition (build 1.8.0_362-b07)
14:43:49 Eclipse OpenJ9 VM (build v0.36.0-release-c5da55ccc, JRE 1.8.0 Mac OS X amd64-64-Bit Compressed References 20230113_599 (JIT enabled, AOT enabled)
14:43:49 OpenJ9 - c5da55ccc
14:43:49 OMR - 2bb04132e
14:43:49 JCL - 1a94e3037e based on jdk8u362-b07)
[2023-01-13T20:21:17.214Z] 2023-01-13 15:21:17.137 java[38577:299094598] +[NSXPCSharedListener endpointForReply:withListenerName:]: an error occurred while attempting to obtain endpoint for listener 'ClientCallsAuxiliary': Connection interrupted
[2023-01-14T15:43:28.290Z] Cancelling nested steps due to timeout
JDK8 x86-64_mac(macx64rt4
)
08:47:25 Running test JCL_FVTest_SE80_Internal_0 ...
08:47:25 ===============================================
08:47:25 JCL_FVTest_SE80_Internal_0 Start Time: Mon Mar 25 08:47:24 2024 Epoch Time (ms): 1711370844899
08:47:25 variation: NoOptions
08:47:25 JVM_OPTIONS:
08:47:25 TESTING:
08:47:25 ...
08:47:25 ... TestNG 6.14.2 by Cédric Beust (cedric@beust.com)
08:47:25 ...
08:47:25
08:47:26
08:47:26 ** util.Date Test Harness **
08:47:26
08:47:26 2024-03-25 08:47:26.630 java[68712:219993851] XType: Using static font registry.
08:47:27 2024-03-25 08:47:26.686 java[68712:219993851] +[NSXPCSharedListener endpointForReply:withListenerName:replyErrorCode:]: an error occurred while attempting to obtain endpoint for listener 'ClientCallsAuxiliary': Connection interrupted
09:51:52 Cannot contact macx64rt4.rtp.raleigh.ibm.com: java.lang.InterruptedException
Perhaps a dup of https://github.com/eclipse/openj9/issues/4091 update: #4091 is closed as a dup of this one. Seems all Mauve sanity.system tests are excluded on osx.
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_x86-64_mac_Nightly/51 osx1014-x86-2 All mauve testing failed System dumps are created, but there is no archive attached to the job. A grinder of a single sub-test might include diagnostics