eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.27k stars 721 forks source link

zlinux JCL_Test_JITHelpers hangs on z15, blocking sanity.functional testing #8159

Closed pshipton closed 4 years ago

pshipton commented 4 years ago

The OpenJ9 zlinux machines were upgraded to z15. ~As yet unknown if~ the hang is caused by the machine changes ~(seems likely), or a code change~. There is a core file captured with gcore in team/joe/core.2278.gz

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/232 https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.functional_s390x_linux_xl_Nightly/20/ https://ci.eclipse.org/openj9/job/Test_openjdk13_j9_sanity.functional_s390x_linux_Nightly/131/

jdk8 didn't run at all due to a missing boot jdk, ~so as yet unknown if the problem also occurs on jdk8~.

The hanging process is left running on the machine, even after the jenkins job is stopped.

The z15 machines (5-8) have been disabled in jenkins, and the old machines (1-4) enabled. This problem is still blocking as we need to return the old machines shortly.

01:41:28  ===============================================
01:41:28  Running test JCL_Test_JITHelpers_1 ...
01:41:28  ===============================================
01:41:28  JCL_Test_JITHelpers_1 Start Time: Fri Dec 20 06:41:27 2019 Epoch Time (ms): 1576824087807
01:41:28  "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done";
01:41:28  JVMSHRC005I No shared class caches available
01:41:28  JVMSHRC005I No shared class caches available
01:41:28  cache cleanup done
01:41:28  variation: -Xjit:count=0 -XX:+CompactStrings
01:41:28  JVM_OPTIONS: -Xcompressedrefs -Xjit:count=0 -XX:+CompactStrings 
01:41:28  { itercnt=1; \
01:41:28  mkdir -p "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/../TKG/test_output_15768233397124/JCL_Test_JITHelpers_1"; \
01:41:28  cd "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/../TKG/test_output_15768233397124/JCL_Test_JITHelpers_1"; \
01:41:28  "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdkbinary/j2sdk-image/bin/java" -Xcompressedrefs -Xjit:count=0 -XX:+CompactStrings  --add-exports=java.base/com.ibm.jit=ALL-UNNAMED \
01:41:28  --add-opens=java.base/com.ibm.jit=ALL-UNNAMED -Xbootclasspath/a:"/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/../../jvmtest/functional/Java8andUp/TestResources.jar" \
01:41:28  -cp "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/../../jvmtest/TestConfig/resources:/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/../TKG/lib/testng.jar:/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/../TKG/lib/jcommander.jar:/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/../../jvmtest/functional/Java8andUp/GeneralTest.jar" \
01:41:28  org.testng.TestNG -d "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/../TKG/test_output_15768233397124/JCL_Test_JITHelpers_1" "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/../../jvmtest/functional/Java8andUp/testng.xml" -testnames JCL_TEST_JITHelpers \
01:41:28  -groups level.sanity \
01:41:28  -excludegroups d.*.linux_390-64_cmprssptrs,d.*.arch.390,d.*.os.linux,d.*.bits.64,d.*.generic-all; \
01:41:28  if [ $? -eq 0 ] ; then echo ""; echo "JCL_Test_JITHelpers_1""_PASSED"; echo ""; cd /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/..;  else echo ""; echo "JCL_Test_JITHelpers_1""_FAILED"; echo ""; fi; } 2>&1 | tee -a "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/../TKG/test_output_15768233397124/TestTargetResult";
01:41:41  [IncludeExcludeTestAnnotationTransformer] [INFO] exclude file is /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/openjdk-tests/TKG/../TestConfig/resources/excludes/latest_exclude_11.txt
01:41:41  ...
01:41:41  ... TestNG 6.14.2 by Cédric Beust (cedric@beust.com)
01:41:41  ...
01:41:41  
01:41:43  WARNING: An illegal reflective access operation has occurred
01:41:43  WARNING: Illegal reflective access by org.openj9.test.com.ibm.jit.Test_JITHelpers (file:/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Nightly/jvmtest/functional/Java8andUp/GeneralTest.jar) to field java.lang.String.value
01:41:43  WARNING: Please consider reporting this to the maintainers of org.openj9.test.com.ibm.jit.Test_JITHelpers
01:41:43  WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
01:41:43  WARNING: All illegal access operations will be denied in a future release
Cancelling nested steps due to timeout

OpenJ9 changes since the last good build, but the problem appears to be caused by the z15 upgrade. https://github.com/eclipse/openj9/compare/481b8a0...341b267

pshipton commented 4 years ago

@fjeremic @jdekonin

pshipton commented 4 years ago

sanity.functional for jdk8 https://ci.eclipse.org/openj9/view/Pipelines/job/Pipeline-Release-Build/135/

This build hangs the same way.

pshipton commented 4 years ago

Joe captured the following stack from the process before it was somehow killed, not sure if it is relevant.

(gdb) bt
#0  0x000003ff9728060c in ?? ()
#1  0x000003ffb762e126 in sidecarInvokeReflectMethodImpl ()
   from /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Personal/openjdkbinary/j2sdk-image/lib/compressedrefs/libj9vm29.so
#2  0x000003ffb762f532 in sidecarInvokeReflectMethod ()
   from /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Personal/openjdkbinary/j2sdk-image/lib/compressedrefs/libj9vm29.so
#3  0x000003ffb545d134 in JVM_InvokeMethod_Impl ()
   from /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Personal/openjdkbinary/j2sdk-image/lib/compressedrefs/libjclse29.so
#4  0x000003ff96d173e6 in ?? ()
#5  0x000003ffb762d9b2 in runCallInMethod ()
   from /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Personal/openjdkbinary/j2sdk-image/lib/compressedrefs/libj9vm29.so
#6  0x000003ffb76482f2 in gpProtectedRunCallInMethod ()
   from /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Personal/openjdkbinary/j2sdk-image/lib/compressedrefs/libj9vm29.so
#7  0x000003ffb7229710 in omrsig_protect ()
   from /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Personal/openjdkbinary/j2sdk-image/lib/compressedrefs/libj9prt29.so
#8  0x000003ffb769386c in gpProtectAndRun ()
   from /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Personal/openjdkbinary/j2sdk-image/lib/compressedrefs/libj9vm29.so
#9  0x000003ffb764a290 in gpCheckCallin ()
   from /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Personal/openjdkbinary/j2sdk-image/lib/compressedrefs/libj9vm29.so
#10 0x000003ffb7647b26 in callStaticVoidMethod ()
   from /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_Personal/openjdkbinary/j2sdk-image/lib/compressedrefs/libj9vm29.so
#11 0x000003ffb848654c in JavaMain (_args=<optimized out>)
    at /home/jenkins/workspace/Build_JDK11_s390x_linux_Personal/src/java.base/share/native/libjli/java.c:549
#12 0x000003ffb8087934 in start_thread (arg=0x3ffb807f910) at pthread_create.c:335
#13 0x000003ffb82edce2 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:74
pshipton commented 4 years ago

Build using the last known working openj9 sha https://ci.eclipse.org/openj9/view/Pipelines/job/Pipeline-Release-Build/138/

pshipton commented 4 years ago

Building from the last known working openj9 sha also hangs. The problem is caused by the machine update to z15.

I assumed the OMR changes had gone through a proper acceptance build, but I notice they didn't. However the latest is passing on the older zlinux machines.

fjeremic commented 4 years ago

@pshipton @jdekonin is there any way for me to get access to this machine? This will go much faster if I can just hop on and reproduce. I don't have access to a z15 locally to test with at the moment.

pshipton commented 4 years ago

For the record, Filip has access now.

fjeremic commented 4 years ago

I am able to reproduce the problem. Here is the minimum reproducible standalone test case:

import java.lang.reflect.Field;
import com.ibm.jit.JITHelpers;

public class Test {
        public static void main(String[] args) {
                test_intrinsicIndexOfStringLatin1();
        }

        private static final com.ibm.jit.JITHelpers helpers = getJITHelpers();
        private static com.ibm.jit.JITHelpers getJITHelpers() {
                try {
                        Field f = com.ibm.jit.JITHelpers.class.getDeclaredField("helpers");
                        f.setAccessible(true);
                        return (com.ibm.jit.JITHelpers) f.get(null);
                } catch (IllegalAccessException e) {
                        throw new RuntimeException(e);
                } catch (NoSuchFieldException e) {
                        throw new RuntimeException(e);
                }
        }

        public static void test_intrinsicIndexOfStringLatin1() {
                try {
                        Field valueField = String.class.getDeclaredField("value");
                        valueField.setAccessible(true);

                        Field enableCompressionField = String.class.getDeclaredField("enableCompression");
                        enableCompressionField.setAccessible(true);

                        if ((boolean)enableCompressionField.get(null)) {
                                if (helpers.intrinsicIndexOfStringLatin1(valueField.get("a"), 1, valueField.get("b"), 1, 0) != -1) {
                                        System.out.println("Failed");
                                } else {
                                        System.out.println("Passed");
                                }
                        }
                } catch (IllegalAccessException e) {
                        throw new RuntimeException(e);
                } catch (NoSuchFieldException e) {
                        throw new RuntimeException(e);
                }
        }
}

Compiling and running:

> jdk-11.0.6+8/bin/javac --add-exports=java.base/com.ibm.jit=ALL-UNNAMED --add-opens=java.base/com.ibm.jit=ALL-UNNAMED Test.java
> jdk-11.0.6+8/bin/java --add-exports=java.base/com.ibm.jit=ALL-UNNAMED --add-opens=java.base/com.ibm.jit=ALL-UNNAMED -Xcompressedrefs '-Xjit:limit={Test.test_intrinsicIndexOfStringLatin1()V}(tracefull,log=log.trace,lastoptindex=0),count=0,disableasynccompilation' -XX:+CompactStrings Test

hangs the test. We can run it under a debugger to quickly find where we are looping:

> gdb --args jdk-11.0.6+8/bin/java --add-exports=java.base/com.ibm.jit=ALL-UNNAMED --add-opens=java.base/com.ibm.jit=ALL-UNNAMED -Xcompressedrefs '-Xjit:limit={Test.test_intrinsicIndexOfStringLatin1()V}(tracefull,log=log.trace,lastoptindex=0,breakonentry),count=0,disableasynccompilation' -XX:+CompactStrings Test
...
(gdb)
0x000003ffdcdff326 in ?? ()
(gdb)
0x000003ffdcdff32a in ?? ()
(gdb)
0x000003ffdcdff330 in ?? ()
(gdb)
0x000003ffdcdff334 in ?? ()
(gdb)
0x000003ffdcdff338 in ?? ()
(gdb)
0x000003ffdcdff33e in ?? ()
(gdb)
0x000003ffdcdff34c in ?? ()
(gdb)
0x000003ffdcdff352 in ?? ()
(gdb)
0x000003ffdcdff356 in ?? ()
(gdb)
0x000003ffdcdff35a in ?? ()
(gdb)
0x000003ffdcdff35e in ?? ()
(gdb)
0x000003ffdcdff364 in ?? ()
(gdb)
0x000003ffdcdff326 in ?? ()

Looking at the log we seem to be stuck in this loop:

     0x3ffdcdff326 00000226 [     0x3ffd715fb10]                            Label L0117:
     0x3ffdcdff326 00000226 [     0x3ffd715fc00]                            Label L0105:
     0x3ffdcdff326 00000226 [     0x3ffd715fcf0] b9 e9 30 80                SGRK    GPR8,GPR0,GPR3
     0x3ffdcdff32a 0000022a [     0x3ffd715fdd0] ec 8a 00 00 10 7c          CGIJ    GPR8,Label L0106,16,BNM(mask=0xa),
     0x3ffdcdff330 00000230 [     0x3ffd715fed0] b9 e8 30 41                AGRK    GPR4,GPR1,GPR3
     0x3ffdcdff334 00000234 [     0x3ffd7160000] a7 8b ff ff                AGHI    GPR8,0xffff
     0x3ffdcdff338 00000238 [     0x3ffd71601b0] e7 08 40 08 00 37          VLL     VRF0,GPR8,#405 8(GPR4)
     0x3ffdcdff33e 0000023e [     0x3ffd71602a0] a7 f4 00 5c                BRC     NOP(0xf), Label L0107, labelTargetAddr=0x000003FFDCDFF34C
     0x3ffdcdff342 00000242 [     0x3ffd7160390]                            Label L0106:
     0x3ffdcdff342 00000242 [     0x3ffd7160540] e7 03 10 08 00 06          VL      VRF0,#406 8(GPR3,GPR1)
     0x3ffdcdff348 00000248 [     0x3ffd7160630] a7 89 00 10                LGHI    GPR8,0x10
     0x3ffdcdff34c 0000024c [     0x3ffd7160720]                            Label L0107:
     0x3ffdcdff34c 0000024c [     0x3ffd71608d0] e7 30 20 00 40 8b          VSTRS   VRF3,VRF0,VRF2,VRF4,0,0
     0x3ffdcdff352 00000252 [     0x3ffd71609c0] a7 24 00 51                BRC     MASK3(0x2), Label L0118, labelTargetAddr=0x000003FFDCDFF368
     0x3ffdcdff356 00000256 [     0x3ffd7160ab0] a7 14 00 74                BRC     MASK2(0x1), Label L0119, labelTargetAddr=0x000003FFDCDFF384
     0x3ffdcdff35a 0000025a [     0x3ffd7160ba0] b9 08 00 38                AGR     GPR3,GPR8
     0x3ffdcdff35e 0000025e [     0x3ffd7160c80] ec 37 00 00 20 64          CGRJ    GPR3,GPR7,Label L0114,BLR(mask=0x2),
     0x3ffdcdff364 00000264 [     0x3ffd7160d80] a7 f4 ff e1                BRC     NOP(0xf), Label L0105, labelTargetAddr=0x000003FFDCDFF326

The CGRJ at 0x3ffdcdff35e looks problematic as GPR7 == 0 in the entire loop, so we never actually exit the loop. This is likely because GRP8 which is the index used in the VLL instruction is 0-based, i.e. a value of 0 means "load 1 element" in the VLL instruction. I suspect the condition of the CGRJ needs to be updated to be branch if "greater than or equal" to, not just "greater than".

Although we will really need to examine this evaluator in detail to make sure we have every path covered since this is the second bug we've found in this area since the original implementation. The first bug was fixed in the non-z15 path in #6938 and as part of that change we added the test suite which caught this bug. As part of fixing this issue what I'd like to see full coverage of every possible path through that evaluator.

This can be achieved by temporarily placing dynamic debug counters and validating they are all hit during testing.

fjeremic commented 4 years ago

@VermaSh I likely won't have time to properly fix this before Monday as I'll be away after. Could you take this one over with high priority once you are back? See above comment to get started. As part of the fix I'd like to see full coverage through all paths of that evaluator.

VermaSh commented 4 years ago

All paths of the evaluator seem to be ok, verified through @fjeremic's test. The condition for CGRJ pointed out in previous comment was the only change that was needed.

DanHeidinga commented 4 years ago

@VermaSh Is there a PR with the fix available yet?

VermaSh commented 4 years ago

A PR has been created, apologies for the delay

pshipton commented 4 years ago

Confirming the problem is resolved by disabling z14 machines, enabling z15 machines, and started a build Running testing on z15 here https://ci.eclipse.org/openj9/view/Pipelines/job/Pipeline-Release-Build/142/

pshipton commented 4 years ago

As the previous build is running on z15 (and looks like it's going to pass), I've re-enabled z14. We can run with both z14 and z15 over the weekend and next week until we have to return the z14.

pshipton commented 4 years ago

Previous build passed on z15 machines.