eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

Restore Intermittent failure (SEG fault) While Iterating Threads/Fetching Thread GC Env #17731

Closed RSalman closed 11 months ago

RSalman commented 1 year ago

Intermittent failure (SEG fault) found during development, which seems to have a larger underlying cause. The following code was added locally (invoked during JVM restore) to iterate over all the JVM threads and fetch their GC env:

    OMR_VMThread *walkThread = NULL;
    GC_OMRVMThreadListIterator threadListIterator(env->getOmrVM());
    while (NULL != (walkThread = threadListIterator.nextOMRVMThread())) {
        MM_EnvironmentBase *walkEnv = MM_EnvironmentBase::getEnvironment(walkThread);
        if (!walkEnv->reinitializeForRestore()) {
            return false;
        }
    }

This pattern of iterating threads is typical and found throughout the codebase. An iterated thread has always been assumed to have a valid GC Environment (i.e., NULL check has not been needed). However, during restore, this iteration may result in a crash as a thread may return a NULL GC Environment. It's seems like a thread is in a inconsistent state. I can't tell if there is an actual concern to be addressed, I'm not aware that this type of thread has caused issues elsewhere, outside of this GC Env iteration. There is a workaround to just do a NULL check, but this issue may be worth investigating as it may cause other issues later on.

This issue was initially discovered while running the CRIU sanity tests on the farm. Reproducing it locally has been strange, the script below is used. The 1000 itrs show no failure, however a failure shows up when the test is interrupted at random points (SIGINT/Ctrl+C). Need to do further investigation to reconcile how it is produced locally vs failure on farm.

#!/bin/bash
for i in {1..1000}
do
        java -Xgcthreads4 -XX:+EnableCRIUSupport  -Xgcpolicy:optthruput -cp /openj9//jvmtest/functional/cmdLineTests/criu/criu.jar org.openj9.criu.TestSingleThreadModeRestoreException 1 1

        criu restore -D ./cpData --shell-job >criuOutput
done

To reproduce, the code snippet above can be added to the end of MM_Configuration::reinitializeForRestore(MM_EnvironmentBase *env) and the test script can be run with modified VM.

[OUT] Unhandled exception
 [OUT] Type=Segmentation error vmState=0x00400004
 [OUT] J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
 [OUT] Handler1=00007F6F249525F0 Handler2=00007F6F25E66B50 InaccessibleAddress=0000000000000050
 [OUT] RDI=0000000000000040 RSI=0000000000000000 RAX=0000000000290338 RBX=00007F6F2024DDA8
 [OUT] RCX=0000000000000004 RDX=0000000000017A38 R8=0000000000000001 R9=00007F6F19AE354C
 [OUT] R10=0000000000000000 R11=0000000000000246 R12=00007F6F20046FE0 R13=00007F6F20074770
 [OUT] R14=0000000000186300 R15=0000000000000004
 [OUT] RIP=00007F6F24426DD6 GS=0000 FS=0000 RSP=00007F6EA4BF08B0
 [OUT] EFlags=0000000000010202 CS=0033 RBP=00007F6EA4BF08F0 ERR=0000000000000004
 [OUT] TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=0000000000000050
 [OUT] xmm0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [OUT] xmm1 0a0a0a0a0a0a0a0a (f: 168430096.000000, d: 2.646194e-260)
 [OUT] xmm2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [OUT] xmm3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [OUT] xmm4 00007f6f2454ff88 (f: 609550208.000000, d: 6.922617e-310)
 [OUT] xmm5 00007f6f2454f020 (f: 609546240.000000, d: 6.922617e-310)
 [OUT] xmm6 00000000be546abb (f: 3193203456.000000, d: 1.577652e-314)
 [OUT] xmm7 0000003000000020 (f: 32.000000, d: 1.018558e-312)
 [OUT] xmm8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [OUT] xmm9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [OUT] xmm10 2020000000202020 (f: 2105376.000000, d: 5.966673e-154)
 [OUT] xmm11 0000ff0000000000 (f: 0.000000, d: 1.385239e-309)
 [OUT] xmm12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [OUT] xmm13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [OUT] xmm14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [OUT] xmm15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [OUT] Module=/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_0/openjdkbinary/j2sdk-image/lib/default/libj9gc29.so
 [OUT] Module_base_address=00007F6F2428A000
 [OUT] Target=2_90_20230623_1182 (Linux 3.10.0-1160.90.1.el7.x86_64)
 [OUT] CPU=amd64 (4 logical CPUs) (0x1e8cbd000 RAM)
 [OUT] ----------- Stack Backtrace -----------
 [OUT] _ZN22MM_EnvironmentDelegate22reinitializeForRestoreEP18MM_EnvironmentBase+0x6 (0x00007F6F24426DD6 [libj9gc29.so+0x19cdd6])
 [OUT] _ZN16MM_Configuration22reinitializeForRestoreEP18MM_EnvironmentBase+0x2d2 (0x00007F6F243985B2 [libj9gc29.so+0x10e5b2])
 [OUT] _ZN24MM_ConfigurationStandard22reinitializeForRestoreEP18MM_EnvironmentBase+0x16 (0x00007F6F243D2046 [libj9gc29.so+0x148046])
 [OUT] j9gc_reinitialize_for_restore+0x7d (0x00007F6F242BA54D [libj9gc29.so+0x3054d])
 [OUT] Java_org_eclipse_openj9_criu_CRIUSupport_checkpointJVMImpl+0x1d2d (0x00007F6EA474D6CD [libj9criu29.so+0x46cd])
 [OUT] ffi_call_unix64+0x52 (0x00007F6F24B301AA [libj9vm29.so+0x2211aa])
 [OUT] ffi_call_int+0x1a2 (0x00007F6F24B2F342 [libj9vm29.so+0x220342])
 [OUT] _ZN36VM_CRIUBytecodeInterpreterCompressed3runEP10J9VMThread+0x12f55 (0x00007F6F24A87DA5 [libj9vm29.so+0x178da5])
 [OUT] criuBytecodeLoopCompressed+0xc9 (0x00007F6F24A74E39 [libj9vm29.so+0x165e39])
 [OUT]  (0x00007F6F24ABE4E2 [libj9vm29.so+0x1af4e2])

If an assertion is added to verify the env:

./run.sh: line 2: 3427552 Killed                  java -Xgcthreads4 -XX:+EnableCRIUSupport -Xgcpolicy:optthruput -cp /openj9//jvmtest/functional/cmdLineTests/criu/criu.jar org.openj9.criu.TestSingleThreadModeRestoreException 1 1
[walkEnv 0000000000000000] [walkThread 000000000025B520]
00:26:05.731 0x186100    j9mm.107    *   ** ASSERTION FAILED ** at /salman.rana/openj9-openjdk-jdk11/omr/gc/base/Configuration.cpp:562: ((false && (walkEnv != __null)))
JVMDUMP039I Processing dump event "traceassert", detail "" at 2023/06/29 17:26:05 - please wait.
JVMDUMP032I JVM requested System dump using '/debug/it/core.20230629.172605.3427552.0001.dmp' in response to an event
JVMDUMP010I System dump written to /debug/it/core.20230629.172605.3427552.0001.dmp
JVMDUMP032I JVM requested Java dump using '/debug/it/javacore.20230629.172605.3427552.0002.txt' in response to an event
JVMDUMP010I Java dump written to /debug/it/javacore.20230629.172605.3427552.0002.txt
JVMDUMP032I JVM requested Snap dump using '/debug/it/Snap.20230629.172605.3427552.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /debug/it/Snap.20230629.172605.3427552.0003.trc
JVMDUMP013I Processed dump event "traceassert", detail "".
RSalman commented 1 year ago

@dmitripivkine @amicic @tajila fyi

amicic commented 1 year ago

Would be nice to know in what state (native stack trace, or at least line of code) of the thread that has NULL pointer is. But then I know gdb will give only stack trace of the asserting thread...

dmitripivkine commented 1 year ago

Would you please upload system core somewhere, `/team' for example ? Javacore and Snap traces might be useful too.

dmitripivkine commented 1 year ago

Would you please upload system core somewhere, `/team' for example ? Javacore and Snap traces might be useful too.

ok, please don't mind. I have uploaded original core from your machine.

RSalman commented 1 year ago

To reproduce, the code snippet above can be added to the end of MM_Configuration::reinitializeForRestore(MM_EnvironmentBase *env) and the test script can be run with modified VM.

Alternatively, patch can be taken from https://github.com/eclipse/omr/pull/7038

dmitripivkine commented 1 year ago

The problem is in allocateVMThread() the sequence is:

tajila commented 12 months ago

@singh264 Please take a look at this

  1. First step is to reproduce this issue, you can ask @kangyining for help
  2. Check if adding a System.gc() call before checkpointJVMImpl solves the issue, if not we will need to do something else
singh264 commented 11 months ago

@tajila I reproduced the issue, and adding System.gc() call before checkpointJVMImpl seems to solve the issue as I do not observe the segmentation error after running the test script for 1000 iterations.

tajila commented 11 months ago

Okay, please prepare a PR with that change