eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.24k stars 713 forks source link

cmdLineTester_jvmtitests_hcr_OSRG_nongold_0 rc018 ppc64le segfault in internal tests #19813

Open cjjdespres opened 1 week ago

cjjdespres commented 1 week ago

Failure link

A sample, as it's been failing very frequently in the JITServer nightly tests:

https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal/991/ https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk11_j9_extended.functional_ppc64le_linux_jit_Personal/1078/

The failure in cmdLineTester_jvmtitests_hcr_OSRG_nongold_SE80_0 on ppc64le JDK8 in this test from July 4 also fails in rc018 in the same way.

Optional info

I initially noticed this in the JITServer nightly tests, but a 50x grinder without JITAAS had a 43/50 failure rate, so this is not JITServer-specific. I have only seen this failure on ppc64le so far.

Failure output (captured from console output)

        Testing: rc018
        Test start time: 2024/07/04 01:34:28 Pacific Standard Time
        Running command: "/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/jdkbinary/j2sdk-image/bin/java" -XX:+UseJITServer -Xgcpolicy:optthruput -Xdebug -Xrunjdwp:transport=dt_socket,address=8888,server=y,onthrow=no.pkg.foo,launch=echo -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation  -Xdump    -agentlib:jvmtitest=test:rc018 -cp "/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitest.jar:/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/../../testDependency/lib/asm-all.jar" com.ibm.jvmti.tests.util.TestRunner
        Time spent starting: 8 milliseconds
        Time spent executing: 8339 milliseconds
        Test result: FAILED
        Output from test:
         [OUT] *** Testing [1/1]:   testReflectRedefineAtSameTime
         [OUT] starting reflect worker threads
         [OUT] starting to populate java heap
         [ERR] Unhandled exception
         [ERR] Type=Segmentation error vmState=0x00040000
         [ERR] J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
         [ERR] Handler1=00007FFF9FD518F0 Handler2=00007FFFA41C97C0
         [ERR] R0=00007FFF9F37E780 R1=00007FFFA473C1B0 R2=00007FFF9F76EB00 R3=0000000000000000
         [ERR] R4=00000000001CFE20 R5=0000000000000000 R6=0000000000000000 R7=00007FFF85280074
         [ERR] R8=00007FFF83269B86 R9=00007FFFA00D87A0 R10=00007FFF83269B86 R11=0000000000000000
         [ERR] R12=00007FFF8318F2A8 R13=00007FFFA47468E0 R14=000000000017B700 R15=0000000000095800
         [ERR] R16=00007FFF816E0038 R17=0000000000000054 R18=0000000000000000 R19=0000000089F7B8E8
         [ERR] R20=0000000000827FE1 R21=000000000000009C R22=0000000088D0E0B0 R23=0000000048000001
         [ERR] R24=48000DC548000001 R25=00007FFFA00D8210 R26=00007FFF8318F2A8 R27=0000000000000001
         [ERR] R28=00007FFF85280074 R29=0000000000000000 R30=00007FFFA00D87A0 R31=0000000000000000
         [ERR] NIP=00007FFF9F37E824 MSR=800000000280F033 ORIG_GPR3=D00000000282056C CTR=00007FFF9EDBAD80
         [ERR] LINK=00007FFF9F37E820 XER=0000000000000000 CCR=0000000028242422 SOFTE=0000000000000001
         [ERR] TRAP=0000000000000300 DAR=0000000000000020 dsisr=0000000042000000 RESULT=0000000000000000
         [ERR] FPR0=000000000017ba00 (f: 1554944.000000, d: 7.682444e-318)
         [ERR] FPR1=4056bb7820000000 (f: 536870912.000000, d: 9.092921e+01)
         [ERR] FPR2=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR3=3fee666660000000 (f: 1610612736.000000, d: 9.500000e-01)
         [ERR] FPR4=3fce840b4ac4e4d2 (f: 1254417664.000000, d: 2.384047e-01)
         [ERR] FPR5=bfe7154748bef6c8 (f: 1220474624.000000, d: -7.213475e-01)
         [ERR] FPR6=3fe62e42fefa39ef (f: 4277811712.000000, d: 6.931472e-01)
         [ERR] FPR7=bfb1973c5a611ccc (f: 1516313856.000000, d: -6.871392e-02)
         [ERR] FPR8=6552747365745f38 (f: 1702125312.000000, d: 1.196544e+180)
         [ERR] FPR9=6552646e41666564 (f: 1097229696.000000, d: 1.192487e+180)
         [ERR] FPR10=324f5f7463656c66 (f: 1667591296.000000, d: 2.327369e-66)
         [ERR] FPR11=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR12=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR13=4020800000000000 (f: 0.000000, d: 8.250000e+00)
         [ERR] FPR14=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR15=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR16=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR17=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR18=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR19=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR20=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR21=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR22=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR23=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR24=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR25=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR26=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR27=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR28=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR29=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR30=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] FPR31=0000000000000000 (f: 0.000000, d: 0.000000e+00)
         [ERR] Module=/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_jit_Personal_testList_0/jdkbinary/j2sdk-image/lib/default/libj9jit29.so
         [ERR] Module_base_address=00007FFF9E7E0000
         [ERR] Target=2_90_20240704_1047 (Linux 4.18.0-553.5.1.el8_10.ppc64le)
         [ERR] CPU=ppc64le (4 logical CPUs) (0x1dcc50000 RAM)
         [ERR] ----------- Stack Backtrace -----------
         [ERR] _ZN3OMR9CodeCache17replaceTrampolineEP20TR_OpaqueMethodBlockPvS3_S3_b+0xe4 (0x00007FFF9F37E824 [libj9jit29.so+0xb9e824])
         [ERR] _ZN3OMR16CodeCacheManager17replaceTrampolineEP20TR_OpaqueMethodBlockPvS3_S3_S3_b+0x94 (0x00007FFF9F381044 [libj9jit29.so+0xba1044])
         [ERR] mcc_replaceTrampoline+0x58 (0x00007FFF9ED65B98 [libj9jit29.so+0x585b98])
         [ERR] _Z15ppcCodePatchingPvS_S_S_S_S_+0x310 (0x00007FFF9EDBB090 [libj9jit29.so+0x5db090])
         [ERR] _ZN3OMR9CodeCache14patchCallPointEP20TR_OpaqueMethodBlockPvS3_S3_+0x100 (0x00007FFF9F37DB50 [libj9jit29.so+0xb9db50])
         [ERR] mcc_callPointPatching_unwrapper+0x50 (0x00007FFF9ED659F0 [libj9jit29.so+0x5859f0])
         [ERR] old_slow_jitCallCFunction+0x54 (0x00007FFF9F435644 [libj9jit29.so+0xc55644])
         [ERR]  (0x00007FFF9F46CF90 [libj9jit29.so+0xc8cf90])
         [ERR] sidecarInvokeReflectMethodImpl+0x508 (0x00007FFF9FD2BD98 [libj9vm29.so+0x1bd98])
         [ERR] sidecarInvokeReflectMethod+0x3c (0x00007FFF9FD2CCEC [libj9vm29.so+0x1ccec])
         [ERR] JVM_InvokeMethod_Impl+0xa8 (0x00007FFF9E17B7B8 [libjclse29.so+0x8b7b8])
         [ERR] JVM_InvokeMethod+0x3c (0x00007FFFA46664BC [libjvm.so+0x264bc])
         [ERR] JVM_InvokeMethod+0x30 (0x00007FFFA47473B0 [libjvm.so+0x73b0])
         [ERR] Java_jdk_internal_reflect_NativeMethodAccessorImpl_invoke0+0x24 (0x00007FFF9E02DC84 [libjava.so+0xdc84])
         [ERR]  (0x00007FFF9FF6C9B8 [libj9vm29.so+0x25c9b8])
         [ERR] ffi_call_int+0xd4 (0x00007FFF9FF6C684 [libj9vm29.so+0x25c684])
         [ERR] debugBytecodeLoopCompressed+0x16f2c (0x00007FFF9FE367EC [libj9vm29.so+0x1267ec])
         [ERR]  (0x00007FFF9FED6118 [libj9vm29.so+0x1c6118])
         [ERR] runCallInMethod+0x258 (0x00007FFF9FD2B528 [libj9vm29.so+0x1b528])
         [ERR] gpProtectedRunCallInMethod+0x54 (0x00007FFF9FD55C04 [libj9vm29.so+0x45c04])
         [ERR] signalProtectAndRunGlue+0x28 (0x00007FFF9FEEA248 [libj9vm29.so+0x1da248])
         [ERR] omrsig_protect+0x3e4 (0x00007FFFA41CAC74 [libj9prt29.so+0x3ac74])
         [ERR] gpProtectAndRun+0xa8 (0x00007FFF9FEEA318 [libj9vm29.so+0x1da318])
         [ERR] gpCheckCallin+0xc4 (0x00007FFF9FD582C4 [libj9vm29.so+0x482c4])
         [ERR] callStaticVoidMethod+0x48 (0x00007FFF9FD55218 [libj9vm29.so+0x45218])
         [ERR] JavaMain+0x1210 (0x00007FFFA4A27E60 [libjli.so+0x7e60])
         [ERR] ThreadJavaMain+0x18 (0x00007FFFA4A2E0F8 [libjli.so+0xe0f8])
         [ERR] start_thread+0xf8 (0x00007FFFA49D9718 [libpthread.so.0+0x9718])
         [ERR] clone+0x74 (0x00007FFFA48BB798 [libc.so.6+0x13b798])
cjjdespres commented 1 week ago

@pshipton @tajila The earliest I've been able to find these failures is the nightly test run of June 27, which tested openj9 commits from June 26 and earlier, so I think this may be related to https://github.com/eclipse-openj9/openj9/pull/19710 and maybe https://github.com/eclipse-openj9/openj9/issues/19778. I'm not sure if the test options need to be adjusted as in https://github.com/eclipse-openj9/openj9/pull/19792 or if it's something else.

cjjdespres commented 1 week ago

The version info for the early failure in https://github.com/eclipse-openj9/openj9/issues/19813#issuecomment-2209462326 is:

# DETECTED_JAVA_VERSION=openjdk version "11.0.24-internal" 2024-07-16
# OpenJDK Runtime Environment (build 11.0.24-internal+0-adhoc.jenkins.BuildJDK11ppc64lelinuxjitPersonal)
# Eclipse OpenJ9 VM (build master-979ce920c7c, JRE 11 Linux ppc64le-64-Bit Compressed References 20240627_1784 (JIT enabled, AOT enabled)
# OpenJ9   - 979ce920c7c
# OMR      - 47a9d248db0
# JCL      - 35dab526d05 based on jdk-11.0.24+6)
tajila commented 1 week ago

@pshipton @tajila The earliest I've been able to find these failures is the nightly test run of June 27, which tested openj9 commits from June 26 and earlier, so I think this may be related to https://github.com/eclipse-openj9/openj9/pull/19710 and maybe https://github.com/eclipse-openj9/openj9/issues/19778. I'm not sure if the test options need to be adjusted as in https://github.com/eclipse-openj9/openj9/pull/19792 or if it's something else.

Ill take a look

tajila commented 1 week ago

Grinder with option added, https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/41855/ - passed

If it passes that will confirm its related to the HCR changes.

However, the crash is unexpected regardless of options specified so that will need investigation.

pshipton commented 1 week ago

Also seen in non-jitserver builds as a timeout. https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk11_j9_extended.functional_ppc64le_linux_testList_1/709/ https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk17_j9_extended.functional_ppc64le_linux_testList_0/626/

pshipton commented 1 week ago

Created https://github.com/eclipse-openj9/openj9/pull/19819 to add -XX:+EnableExtendedHCR and https://github.com/eclipse-openj9/openj9/pull/19820 for 0.46

tajila commented 1 week ago

@cjjdespres I still think the crash needs to be investigated. The extended HCR options may have just exposed an issue. Disabling extended HCR prevents redefinition from changing the class shape, any attempt to do so should result in a JVMTI error. The fact that it is crashing means an assumption was broken.

I looked at the stacktrace and I couldn't find anything suspicious, but thats probably because Im not very familiar with that code. Do you know why it is crashing? I think we still need to answer this question.

cjjdespres commented 1 week ago

I agree that this should be investigated. I think this option influences jvmtiRedefineClasses() and related functions, right? I'm also not at all familiar with that code, or with power.

tajila commented 1 week ago

@hzongaro @zl-wang Do you know who can investigate this?

zl-wang commented 1 week ago

@tajila could you briefly describe the difference between previous HCR and extended HCR? in particular, method recompilation and its implications. from the stack back-trace, it crashed trying to replace a trampoline (could be many different reasons). presumably, this is across platform, but it happened to be on ppc64le due to certain recompilations by-chance.

zl-wang commented 1 week ago

@IBMJimmyk i will ask jimmy to take a further look.

tajila commented 1 week ago

@tajila could you briefly describe the difference between previous HCR and extended HCR? in particular, method recompilation and its implications. from the stack back-trace, it crashed trying to replace a trampoline (could be many different reasons). presumably, this is across platform, but it happened to be on ppc64le due to certain recompilations by-chance.

The basic difference is that exended HCR allows one to change the class shape (ie. add/remove fields and methods). This means we dont redefine the J9Class in place, as we do with fast HCR. We create a new J9Class, then we need to do a heap walk to replace all object headers that are affected, update class heirarchies, array component type, etc. This also means that things like the vtable/jitvtable will have a new address, j9methods will also have a new address, etc.

tajila commented 1 week ago

WRT "method recompilation and its implications." I'll let @gacholio add anything I may have missed.

zl-wang commented 1 week ago

i.e. as part of the HCR hooked actions, every such-typed existing object is re-constructed?

tajila commented 1 week ago

i.e. as part of the HCR hooked actions, every such-typed existing object is re-constructed?

Yes, the J9Class in the header of affected object intances are updated to the new J9Class in the extended HCR case.

zl-wang commented 1 week ago

Yes, the J9Class in the header of affected object intances are updated to the new J9Class in the extended HCR case.

I meant more than that. since fields can be added or removed, every such object needs to be reconstructed literally. size can be different. it sounds like a newObject() being required. without new size, is the old-object allowed to continue running?

tajila commented 1 week ago

I meant more than that. since fields can be added or removed, every such object needs to be reconstructed literally. size can be different. it sounds like a newObject() being required. without new size, is the old-object allowed to continue running?

Sorry, you can add/remove static fields, but not instance fields. So only the instance header needs to be updated.