eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

OpenJDK java/foreign/LibraryLookupTest crash vmState=0x0000000 #17524

Closed pshipton closed 1 year ago

pshipton commented 1 year ago

Found in the grinder for https://github.com/eclipse-openj9/openj9/issues/17506 Internal grinder (svlxtor3 - zlinux) jdk_foreign_0 java/foreign/LibraryLookupTest.java

(Internal artifacts](https://na.artifactory.swg-devops.com/artifactory/sys-rt-generic-local/hyc-runtimes-jenkins.swg-devops.com/Grinder_iteration_1/2391/openjdk_test_output.tar.gz)

12:45:33  STDOUT:
12:45:33  test LibraryLookupTest.testBadLibraryLookupName(): success
12:45:33  test LibraryLookupTest.testBadLibraryLookupPath(): success
12:45:33  test LibraryLookupTest.testLoadLibraryConfined(): success
12:45:33  test LibraryLookupTest.testLoadLibraryConfinedClosed(): success
12:45:33  test LibraryLookupTest.testLoadLibraryShared(): success
12:45:33  STDERR:
12:45:33  Unhandled exception
12:45:33  Type=Segmentation error vmState=0x00000000
12:45:33  J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
12:45:33  Handler1=000003FF84E49740 Handler2=000003FF84D319D8 InaccessibleAddress=000003FF5DE00000
12:45:33  gpr0=000003FF5E9D5C80 gpr1=000003FE00000000 gpr2=0000000000000001 gpr3=000003FF64D24268
12:45:33  gpr4=0000000000000300 gpr5=00000000000003FF gpr6=0000000001BF3800 gpr7=000003FF5DE006B0
12:45:33  gpr8=0000000001BF3920 gpr9=0000000000000000 gpr10=000003FF5E9D7900 gpr11=000003FF5E9D5C50
12:45:33  gpr12=0000000001B15E00 gpr13=000003FF8509569C gpr14=000003FF8509569C gpr15=000003FF5E9D5BB0
12:45:33  psw=000003FF5DE006B0 mask=0705000180000000 fpc=00080000 bea=000003FF8509569A
12:45:33  fpr0 000003ff80035f90 (f: 2147704832.000000, d: 2.171863e-311)
12:45:33  fpr1 489679005e9d5820 (f: 1587369984.000000, d: 4.894112e+41)
12:45:33  fpr2 000003ff84c97fa0 (f: 2227798016.000000, d: 2.171902e-311)
12:45:33  fpr3 3f336e21dbeabbfe (f: 3689593856.000000, d: 2.964814e-04)
12:45:33  fpr4 000003fef4186b10 (f: 4095240960.000000, d: 2.170703e-311)
12:45:33  fpr5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
12:45:33  fpr6 0000000000000001 (f: 1.000000, d: 4.940656e-324)
12:45:33  fpr7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
12:45:33  fpr8 0000000000041000 (f: 266240.000000, d: 1.315400e-318)
12:45:33  fpr9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
12:45:33  fpr10 000003ff5e997000 (f: 1587113984.000000, d: 2.171586e-311)
12:45:33  fpr11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
12:45:33  fpr12 0005fd142a2d7cbe (f: 707624128.000000, d: 8.328158e-309)
12:45:33  fpr13 000003fee8004058 (f: 3892330496.000000, d: 2.170603e-311)
12:45:33  fpr14 000003ff902c65e0 (f: 2418828800.000000, d: 2.171997e-311)
12:45:33  fpr15 000003feec031488 (f: 3959624960.000000, d: 2.170636e-311)
12:45:33  Target=2_90_20230531_38 (Linux 5.4.0-125-generic)
12:45:33  CPU=s390x (8 logical CPUs) (0x3ddae5000 RAM)
12:45:33  ----------- Stack Backtrace -----------
12:45:33  protectedBacktrace+0x22 (0x000003FF84D2DA8A [libj9prt29.so+0x2da8a])
12:45:33  omrsig_protect+0x3d8 (0x000003FF84D32B28 [libj9prt29.so+0x32b28])
12:45:33  omrintrospect_backtrace_thread_raw+0xbc (0x000003FF84D2DFF4 [libj9prt29.so+0x2dff4])
12:45:33  omrsig_protect+0x3d8 (0x000003FF84D32B28 [libj9prt29.so+0x32b28])
12:45:33  omrintrospect_backtrace_thread+0x84 (0x000003FF84D2D904 [libj9prt29.so+0x2d904])
12:45:33  generateDiagnosticFiles+0x7e (0x000003FF84E4963E [libj9vm29.so+0x4963e])
12:45:33  omrsig_protect+0x3d8 (0x000003FF84D32B28 [libj9prt29.so+0x32b28])
12:45:33  structuredSignalHandler+0x174 (0x000003FF84E498B4 [libj9vm29.so+0x498b4])
12:45:33  mainSynchSignalHandler+0x21e (0x000003FF84D31BF6 [libj9prt29.so+0x31bf6])
12:45:33   (0x000003FF5E9D5728 [<unknown>+0x0])
12:45:33  ---------------------------------------

From the javacore

1XMCURTHDINFO  Current thread
3XMTHREADINFO      "pool-2-thread-5" J9VMThread:0x0000000001BF3800, omrthread_t:0x000003FEF4186B10, java/lang/Thread:0x000000002DE5B070, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x28, isDaemon:false)
3XMJAVALTHRCCL            jdk/internal/loader/ClassLoaders$AppClassLoader(0x000000000E0A1E30)
3XMTHREADINFO1            (native thread ID:0x13B40E, native priority:0x5, native policy:UNKNOWN, vmstate:R, vm thread flags:0x00000000)
3XMTHREADINFO2            (native stack address range from:0x000003FF5E998000, to:0x000003FF5E9D8000, size:0x40000)
3XMCPUTIME               CPU usage total: 0.023794604 secs, current category="Application"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=4096 (0x1000)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at openj9/internal/foreign/abi/InternalDowncallHandler.invokeNative(Native Method)
4XESTACKTRACE                at openj9/internal/foreign/abi/InternalDowncallHandler.runNativeMethod(InternalDowncallHandler.java:566)
4XESTACKTRACE                at java/lang/invoke/LambdaForm$DMH/0x00000000f40ecfa0.invokeSpecial(LambdaForm$DMH)
4XESTACKTRACE                at java/lang/invoke/LambdaForm$MH/0x00000000d0004e30.invoke(LambdaForm$MH)
4XESTACKTRACE                at java/lang/invoke/LambdaForm$MH/0x00000000f416bdd0.invokeExact_MT(LambdaForm$MH)
4XESTACKTRACE                at LibraryLookupTest.callFunc(LibraryLookupTest.java:90)
4XESTACKTRACE                at LibraryLookupTest$LibraryAccess.run(LibraryLookupTest.java:166)
4XESTACKTRACE                at java/util/concurrent/ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144(Compiled Code))
4XESTACKTRACE                at java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
4XESTACKTRACE                at java/lang/Thread.runWith(Thread.java:1652)
4XESTACKTRACE                at java/lang/Thread.run(Thread.java:1639)
pshipton commented 1 year ago

Another failure in internal grinder

pshipton commented 1 year ago

@ChengJin01 pls take a look.

ChengJin01 commented 1 year ago

[1] I tried with internal Grinders x50 but nothing wrong occurred so far (also manually ran the failing test suite on zLinux many times which still ended up with nothing abnormal).

[2] The failing test case is testLoadLibrarySharedClosed (all other 5 test cases passed except this test as mentioned in the description) at https://github.com/ibmruntimes/openj9-openjdk-jdk20/blob/openj9/test/jdk/java/foreign/LibraryLookupTest.java

    @Test
    void testLoadLibrarySharedClosed() throws Throwable {
        Arena arena = Arena.openShared();
        MemorySegment addr = loadLibrary(arena.scope());
        ExecutorService accessExecutor = Executors.newCachedThreadPool();
        for (int i = 0; i < NUM_ACCESSORS ; i++) {
            accessExecutor.execute(new LibraryAccess(addr));
        }
        while (true) {
            try {
                arena.close(); <------- the expected exception was triggered by intentionally closing the scope for the downcall address. 
                break;
            } catch (IllegalStateException ex) {
                // wait for addressable parameter to be released
                Thread.onSpinWait();
            }
        }
        accessExecutor.shutdown();
        assertTrue(accessExecutor.awaitTermination(MAX_EXECUTOR_WAIT_SECONDS, TimeUnit.SECONDS));
    }

which is intended to validate the scope of the downcall address by intentionally closing the arena via arena.close() after launching the threads with the downcall address passed in. So the crash occurred when doing the native downcall at https://github.com/eclipse-openj9/openj9/blob/3f742ecbdb57da05f83956ea0dcc249901264a85/jcl/src/java.base/share/classes/openj9/internal/foreign/abi/InternalDowncallHandler.java#L753 or https://github.com/eclipse-openj9/openj9/blob/3f742ecbdb57da05f83956ea0dcc249901264a85/jcl/src/java.base/share/classes/openj9/internal/foreign/abi/InternalDowncallHandler.java#L758 with an invalid downcallAddr.address() as the related library was closed by the arena.

[3] After double-checking the previous fix for the scope check on the downcall address at https://github.com/eclipse-openj9/openj9/pull/17074, I think the only reason for the crash is that the check at at https://github.com/eclipse-openj9/openj9/pull/17074 was way earlier than the actual invocation in invokeNative(), in which case a timing issue occurred between the scope check and invokeNative(). That being said, the crash was triggered by the invalid downcall address which became obsolete between the scope check and invokeNative(). So the simple fix is to directly check the scope of the downcall address when calling invokeNative() to avoid any potential timing issue.

I will create the fix to see how it goes with the failing test suites.

pshipton commented 1 year ago

Reopen until the fix is delivered to 0.39 and 0.40.

pshipton commented 1 year ago

This doesn't appear fixed. There is a test failure and also a crash. https://openj9-jenkins.osuosl.org/job/Test_openjdk20_j9_sanity.openjdk_s390x_linux_OMR/20 jdk_foreign_1 java/foreign/LibraryLookupTest.java

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk20_j9_sanity.openjdk_s390x_linux_OMR/20/openjdk_test_output.tar.gz

07:09:17  test LibraryLookupTest.testLoadLibrarySharedClosed(): failure
07:09:17  java.lang.AssertionError: expected [true] but found [false]
07:09:17    at org.testng.Assert.fail(Assert.java:99)
07:09:17    at org.testng.Assert.failNotEquals(Assert.java:1037)
07:09:17    at org.testng.Assert.assertTrue(Assert.java:45)
07:09:17    at org.testng.Assert.assertTrue(Assert.java:55)
07:09:17    at LibraryLookupTest.testLoadLibrarySharedClosed(LibraryLookupTest.java:151)
07:09:17  STDERR:
07:09:17  Unhandled exception
07:09:17  Type=Segmentation error vmState=0x00000000
07:09:17  J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
07:09:17  Handler1=000003FFA4D49710 Handler2=000003FFA4C31998 InaccessibleAddress=000003FF5DF00000
07:09:17  gpr0=000003FF9EA7DC50 gpr1=000003FF00000000 gpr2=000003FF0431AD88 gpr3=000003FF9E64D588
07:09:17  gpr4=000003FFA0245408 gpr5=000003FFA4B87C88 gpr6=000003FF9EA7DBC8 gpr7=000003FF5DF00718
07:09:17  gpr8=000003FF04331520 gpr9=0000000000000000 gpr10=000003FF9EA7F900 gpr11=000003FF9EA7DC20
07:09:17  gpr12=000003FF9E64D588 gpr13=000003FFA4F95784 gpr14=000003FFA4F95784 gpr15=000003FF9EA7DB80
07:09:17  psw=000003FF5DF00718 mask=0705000180000000 fpc=00080000 bea=000003FFA4F95782
07:09:17  fpr0 000003ffa019a600 (f: 2686035456.000000, d: 2.172129e-311)
07:09:17  fpr1 48d085809ea7d3b8 (f: 2661798912.000000, d: 5.756900e+42)
07:09:17  fpr2 000003ff04331400 (f: 70456320.000000, d: 2.170837e-311)
07:09:17  fpr3 4005555555555555 (f: 1431655808.000000, d: 2.666667e+00)
07:09:17  fpr4 000003ff74a96430 (f: 1957258240.000000, d: 2.171769e-311)
07:09:17  fpr5 3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
07:09:17  fpr6 000003ff9e64d588 (f: 2657408512.000000, d: 2.172115e-311)
07:09:17  fpr7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
07:09:17  fpr8 0000000000041000 (f: 266240.000000, d: 1.315400e-318)
07:09:17  fpr9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
07:09:17  fpr10 000003ff9ea3f000 (f: 2661543936.000000, d: 2.172117e-311)
07:09:17  fpr11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
07:09:17  fpr12 0005fd8829477bcc (f: 692550592.000000, d: 8.330619e-309)
07:09:17  fpr13 000003ff2001b4e8 (f: 536982784.000000, d: 2.171067e-311)
07:09:17  fpr14 000003ffaa3c65e0 (f: 2856084992.000000, d: 2.172213e-311)
07:09:17  fpr15 000003ff20063248 (f: 537276992.000000, d: 2.171067e-311)
07:09:17  Target=2_90_20230607_34 (Linux 5.4.0-137-generic)
07:09:17  CPU=s390x (4 logical CPUs) (0x1f58f7000 RAM)
07:09:17  ----------- Stack Backtrace -----------
07:09:17  protectedBacktrace+0x22 (0x000003FFA4C2DA4A [libj9prt29.so+0x2da4a])
07:09:17  omrsig_protect+0x3d8 (0x000003FFA4C32AE8 [libj9prt29.so+0x32ae8])
07:09:17  omrintrospect_backtrace_thread_raw+0xbc (0x000003FFA4C2DFB4 [libj9prt29.so+0x2dfb4])
07:09:17  omrsig_protect+0x3d8 (0x000003FFA4C32AE8 [libj9prt29.so+0x32ae8])
07:09:17  omrintrospect_backtrace_thread+0x84 (0x000003FFA4C2D8C4 [libj9prt29.so+0x2d8c4])
07:09:17  generateDiagnosticFiles+0x7e (0x000003FFA4D4960E [libj9vm29.so+0x4960e])
07:09:17  omrsig_protect+0x3d8 (0x000003FFA4C32AE8 [libj9prt29.so+0x32ae8])
07:09:17  structuredSignalHandler+0x174 (0x000003FFA4D49884 [libj9vm29.so+0x49884])
07:09:17  mainSynchSignalHandler+0x21e (0x000003FFA4C31BB6 [libj9prt29.so+0x31bb6])
07:09:17   (0x000003FF9EA7D6F8 [<unknown>+0x0])
07:09:17  ---------------------------------------
ChengJin01 commented 1 year ago

I think the fix at https://github.com/eclipse-openj9/openj9/pull/17528 is still insufficient to capture the invalid downcall address especially in terms of the multi-threading environment in the failing test case.

As explained at https://github.com/eclipse-openj9/openj9/issues/17062#issuecomment-1488953229, the reason why the problem never occurred in JDK19 is that the code there at https://github.com/ibmruntimes/openj9-openjdk-jdk19/blob/6e33eff3d0105a864028ec3fd0973267842d1ffb/src/java.base/share/classes/jdk/internal/foreign/NativeMemorySegmentImpl.java#L74

    public MemoryAddress address() {
        checkValidState(); <------------- (annotated with @ForceInline in OpenJDK)
        return MemoryAddress.ofLong(unsafeGetOffset());
    }

to effectively prevent the invalid address from being used after it becomes obsolete, which was refactored & removed in JDK20.

So the only workaround is to modify the code in OpenJDK to restore checkValidState in address() to support our FFI implementation in OpenJ9.

ChengJin01 commented 1 year ago

Fundamentally speak, state should be thread-safe in checkValidStateRaw() with @ForceInline while it is not the case in isAlive() which is used in https://github.com/eclipse-openj9/openj9/pull/17528.

    public void checkValidState() {
        try {
            checkValidStateRaw();
        } catch (ScopedMemoryAccess.ScopedAccessError error) {
            throw error.newRuntimeException();
        }
    }

    @ForceInline
    public void checkValidStateRaw() {
        if (owner != null && owner != Thread.currentThread()) {
            throw WRONG_THREAD;
        }
        if (state < OPEN) { <----------------------
            throw ALREADY_CLOSED;
        }
    }

    public boolean isAlive() {
        return state >= OPEN; <-------------------
    }
ChengJin01 commented 1 year ago

@pshipton,

I am wondering whether it is feasible to add @ForceInline to validateMemScope() in OpenJ9 at https://github.com/eclipse-openj9/openj9/blob/50019a8a232edb65c3ad040f75c06e3ff367122f/jcl/src/java.base/share/classes/openj9/internal/foreign/abi/InternalDowncallHandler.java#L245 to force JIT to optimize the code in there to ensure that the state can be accessed safely ? otherwise, we would have to add code in OpenJDK to deal with the situation as explained at https://github.com/eclipse-openj9/openj9/issues/17524#issuecomment-1581403379.

pshipton commented 1 year ago

The JIT does recognize the annotation and may do something. You'd better check with the JIT team.

I don't understand why @ForceInline makes a difference, it should just change the timing.

ChengJin01 commented 1 year ago

I don't understand why @ForceInline makes a difference, it should just change the timing.

It seems that he live check on state in isAlive() viavalidateMemScope() is not guaranteed to be thread-safe in JDK20, in which case the memory of the downcall address gets released intentionally by the test thread when the address is being used by other threads created from the thread pool. That being said, the timing issue only occurs after isAlive returns true in the mean time the downcall address is being released by the test thread, in which case the downcall address becomes obsolete in the native. So we need a way to ensure that the state in OpenJDK can be accessed correctly in the multi-threading environment.

pshipton commented 1 year ago

Inlining it doesn't make variables thread safe, it just changes the timing, which may make a problem less likely to occur. To be thread safe you need to use volatile and synchronized. Or if you are talking about an object getting collected once it's out of scope, you can use java.lang.ref.Reference.reachabilityFence(Object) to keep a strong reference until you are done with the object.

ChengJin01 commented 1 year ago

Inlining it doesn't make variables thread safe, it just changes the timing, which may make a problem less likely to occur. To be thread safe you need to use volatile and synchronized.

The problem is that both state and isAlive are in OpenJDK which is out of our control from OpenJ9 perspective. That being said, the test thread doesn't rely on our method viavalidateMemScope in OpenJ9 to access state.

pshipton commented 1 year ago

I wouldn't say OpenJDK code is out of our control. We can modify it as necessary, but then we need to maintain any patches.

ChengJin01 commented 1 year ago

I wouldn't say OpenJDK code is out of our control. We can modify it as necessary, but then we need to maintain any patches.

To avoid any patches in OpenJDK, I am thinking to replace isAlive with checkValidState in OpenJ9 which is to simulate the case in JDK19 to see whether the timing issue can be fixed.

ChengJin01 commented 1 year ago

After double-checking the explanation of Arena.close() at https://download.java.net/java/early_access/jdk20/docs/api/java.base/java/lang/foreign/Arena.html#close() against the related code in OpenJDK, it should be reasonable to keep the downcall address alive by leveraging our code of SetDependency() in jcl/src/java.base/share/classes/openj9/internal/foreign/abi/InternalDowncallHandler.java to prevent the address from being released by Arena.close() till the downcall is done.

pshipton commented 1 year ago

https://openj9-jenkins.osuosl.org/job/Test_openjdk20_j9_sanity.openjdk_x86-64_linux_OMR/21/