Closed pshipton closed 1 year ago
Another failure in internal grinder
@ChengJin01 pls take a look.
[1] I tried with internal Grinders x50 but nothing wrong occurred so far (also manually ran the failing test suite on zLinux many times which still ended up with nothing abnormal).
[2] The failing test case is testLoadLibrarySharedClosed
(all other 5 test cases passed except this test as mentioned in the description) at https://github.com/ibmruntimes/openj9-openjdk-jdk20/blob/openj9/test/jdk/java/foreign/LibraryLookupTest.java
@Test
void testLoadLibrarySharedClosed() throws Throwable {
Arena arena = Arena.openShared();
MemorySegment addr = loadLibrary(arena.scope());
ExecutorService accessExecutor = Executors.newCachedThreadPool();
for (int i = 0; i < NUM_ACCESSORS ; i++) {
accessExecutor.execute(new LibraryAccess(addr));
}
while (true) {
try {
arena.close(); <------- the expected exception was triggered by intentionally closing the scope for the downcall address.
break;
} catch (IllegalStateException ex) {
// wait for addressable parameter to be released
Thread.onSpinWait();
}
}
accessExecutor.shutdown();
assertTrue(accessExecutor.awaitTermination(MAX_EXECUTOR_WAIT_SECONDS, TimeUnit.SECONDS));
}
which is intended to validate the scope of the downcall address by intentionally closing the arena via arena.close()
after launching the threads with the downcall address passed in. So the crash occurred when doing the native downcall at https://github.com/eclipse-openj9/openj9/blob/3f742ecbdb57da05f83956ea0dcc249901264a85/jcl/src/java.base/share/classes/openj9/internal/foreign/abi/InternalDowncallHandler.java#L753 or https://github.com/eclipse-openj9/openj9/blob/3f742ecbdb57da05f83956ea0dcc249901264a85/jcl/src/java.base/share/classes/openj9/internal/foreign/abi/InternalDowncallHandler.java#L758 with an invalid downcallAddr.address()
as the related library was closed by the arena.
[3] After double-checking the previous fix for the scope check on the downcall address at https://github.com/eclipse-openj9/openj9/pull/17074, I think the only reason for the crash is that the check at at https://github.com/eclipse-openj9/openj9/pull/17074 was way earlier than the actual invocation in invokeNative()
, in which case a timing issue occurred between the scope check and invokeNative()
. That being said, the crash was triggered by the invalid downcall address which became obsolete between the scope check and invokeNative()
. So the simple fix is to directly check the scope of the downcall address when calling invokeNative()
to avoid any potential timing issue.
I will create the fix to see how it goes with the failing test suites.
Reopen until the fix is delivered to 0.39 and 0.40.
This doesn't appear fixed. There is a test failure and also a crash. https://openj9-jenkins.osuosl.org/job/Test_openjdk20_j9_sanity.openjdk_s390x_linux_OMR/20 jdk_foreign_1 java/foreign/LibraryLookupTest.java
07:09:17 test LibraryLookupTest.testLoadLibrarySharedClosed(): failure
07:09:17 java.lang.AssertionError: expected [true] but found [false]
07:09:17 at org.testng.Assert.fail(Assert.java:99)
07:09:17 at org.testng.Assert.failNotEquals(Assert.java:1037)
07:09:17 at org.testng.Assert.assertTrue(Assert.java:45)
07:09:17 at org.testng.Assert.assertTrue(Assert.java:55)
07:09:17 at LibraryLookupTest.testLoadLibrarySharedClosed(LibraryLookupTest.java:151)
07:09:17 STDERR:
07:09:17 Unhandled exception
07:09:17 Type=Segmentation error vmState=0x00000000
07:09:17 J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
07:09:17 Handler1=000003FFA4D49710 Handler2=000003FFA4C31998 InaccessibleAddress=000003FF5DF00000
07:09:17 gpr0=000003FF9EA7DC50 gpr1=000003FF00000000 gpr2=000003FF0431AD88 gpr3=000003FF9E64D588
07:09:17 gpr4=000003FFA0245408 gpr5=000003FFA4B87C88 gpr6=000003FF9EA7DBC8 gpr7=000003FF5DF00718
07:09:17 gpr8=000003FF04331520 gpr9=0000000000000000 gpr10=000003FF9EA7F900 gpr11=000003FF9EA7DC20
07:09:17 gpr12=000003FF9E64D588 gpr13=000003FFA4F95784 gpr14=000003FFA4F95784 gpr15=000003FF9EA7DB80
07:09:17 psw=000003FF5DF00718 mask=0705000180000000 fpc=00080000 bea=000003FFA4F95782
07:09:17 fpr0 000003ffa019a600 (f: 2686035456.000000, d: 2.172129e-311)
07:09:17 fpr1 48d085809ea7d3b8 (f: 2661798912.000000, d: 5.756900e+42)
07:09:17 fpr2 000003ff04331400 (f: 70456320.000000, d: 2.170837e-311)
07:09:17 fpr3 4005555555555555 (f: 1431655808.000000, d: 2.666667e+00)
07:09:17 fpr4 000003ff74a96430 (f: 1957258240.000000, d: 2.171769e-311)
07:09:17 fpr5 3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
07:09:17 fpr6 000003ff9e64d588 (f: 2657408512.000000, d: 2.172115e-311)
07:09:17 fpr7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
07:09:17 fpr8 0000000000041000 (f: 266240.000000, d: 1.315400e-318)
07:09:17 fpr9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
07:09:17 fpr10 000003ff9ea3f000 (f: 2661543936.000000, d: 2.172117e-311)
07:09:17 fpr11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
07:09:17 fpr12 0005fd8829477bcc (f: 692550592.000000, d: 8.330619e-309)
07:09:17 fpr13 000003ff2001b4e8 (f: 536982784.000000, d: 2.171067e-311)
07:09:17 fpr14 000003ffaa3c65e0 (f: 2856084992.000000, d: 2.172213e-311)
07:09:17 fpr15 000003ff20063248 (f: 537276992.000000, d: 2.171067e-311)
07:09:17 Target=2_90_20230607_34 (Linux 5.4.0-137-generic)
07:09:17 CPU=s390x (4 logical CPUs) (0x1f58f7000 RAM)
07:09:17 ----------- Stack Backtrace -----------
07:09:17 protectedBacktrace+0x22 (0x000003FFA4C2DA4A [libj9prt29.so+0x2da4a])
07:09:17 omrsig_protect+0x3d8 (0x000003FFA4C32AE8 [libj9prt29.so+0x32ae8])
07:09:17 omrintrospect_backtrace_thread_raw+0xbc (0x000003FFA4C2DFB4 [libj9prt29.so+0x2dfb4])
07:09:17 omrsig_protect+0x3d8 (0x000003FFA4C32AE8 [libj9prt29.so+0x32ae8])
07:09:17 omrintrospect_backtrace_thread+0x84 (0x000003FFA4C2D8C4 [libj9prt29.so+0x2d8c4])
07:09:17 generateDiagnosticFiles+0x7e (0x000003FFA4D4960E [libj9vm29.so+0x4960e])
07:09:17 omrsig_protect+0x3d8 (0x000003FFA4C32AE8 [libj9prt29.so+0x32ae8])
07:09:17 structuredSignalHandler+0x174 (0x000003FFA4D49884 [libj9vm29.so+0x49884])
07:09:17 mainSynchSignalHandler+0x21e (0x000003FFA4C31BB6 [libj9prt29.so+0x31bb6])
07:09:17 (0x000003FF9EA7D6F8 [<unknown>+0x0])
07:09:17 ---------------------------------------
I think the fix at https://github.com/eclipse-openj9/openj9/pull/17528 is still insufficient to capture the invalid downcall address especially in terms of the multi-threading environment in the failing test case.
As explained at https://github.com/eclipse-openj9/openj9/issues/17062#issuecomment-1488953229, the reason why the problem never occurred in JDK19 is that the code there at https://github.com/ibmruntimes/openj9-openjdk-jdk19/blob/6e33eff3d0105a864028ec3fd0973267842d1ffb/src/java.base/share/classes/jdk/internal/foreign/NativeMemorySegmentImpl.java#L74
public MemoryAddress address() {
checkValidState(); <------------- (annotated with @ForceInline in OpenJDK)
return MemoryAddress.ofLong(unsafeGetOffset());
}
to effectively prevent the invalid address from being used after it becomes obsolete, which was refactored & removed in JDK20.
So the only workaround is to modify the code in OpenJDK to restore checkValidState
in address() to support our FFI implementation in OpenJ9.
Fundamentally speak, state
should be thread-safe in checkValidStateRaw()
with @ForceInline
while it is not the case in isAlive()
which is used in https://github.com/eclipse-openj9/openj9/pull/17528.
public void checkValidState() {
try {
checkValidStateRaw();
} catch (ScopedMemoryAccess.ScopedAccessError error) {
throw error.newRuntimeException();
}
}
@ForceInline
public void checkValidStateRaw() {
if (owner != null && owner != Thread.currentThread()) {
throw WRONG_THREAD;
}
if (state < OPEN) { <----------------------
throw ALREADY_CLOSED;
}
}
public boolean isAlive() {
return state >= OPEN; <-------------------
}
@pshipton,
I am wondering whether it is feasible to add @ForceInline
to validateMemScope()
in OpenJ9 at https://github.com/eclipse-openj9/openj9/blob/50019a8a232edb65c3ad040f75c06e3ff367122f/jcl/src/java.base/share/classes/openj9/internal/foreign/abi/InternalDowncallHandler.java#L245
to force JIT to optimize the code in there to ensure that the state
can be accessed safely ? otherwise, we would have to add code in OpenJDK to deal with the situation as explained at https://github.com/eclipse-openj9/openj9/issues/17524#issuecomment-1581403379.
The JIT does recognize the annotation and may do something. You'd better check with the JIT team.
I don't understand why @ForceInline
makes a difference, it should just change the timing.
I don't understand why @ForceInline makes a difference, it should just change the timing.
It seems that he live check on state
in isAlive()
viavalidateMemScope()
is not guaranteed to be thread-safe in JDK20, in which case the memory of the downcall address gets released intentionally by the test thread when the address is being used by other threads created from the thread pool. That being said, the timing issue only occurs after isAlive
returns true in the mean time the downcall address is being released by the test thread, in which case the downcall address becomes obsolete in the native. So we need a way to ensure that the state
in OpenJDK can be accessed correctly in the multi-threading environment.
Inlining it doesn't make variables thread safe, it just changes the timing, which may make a problem less likely to occur. To be thread safe you need to use volatile and synchronized. Or if you are talking about an object getting collected once it's out of scope, you can use java.lang.ref.Reference.reachabilityFence(Object) to keep a strong reference until you are done with the object.
Inlining it doesn't make variables thread safe, it just changes the timing, which may make a problem less likely to occur. To be thread safe you need to use volatile and synchronized.
The problem is that both state
and isAlive
are in OpenJDK which is out of our control from OpenJ9 perspective. That being said, the test thread doesn't rely on our method viavalidateMemScope
in OpenJ9 to access state
.
I wouldn't say OpenJDK code is out of our control. We can modify it as necessary, but then we need to maintain any patches.
I wouldn't say OpenJDK code is out of our control. We can modify it as necessary, but then we need to maintain any patches.
To avoid any patches in OpenJDK, I am thinking to replace isAlive
with checkValidState
in OpenJ9 which is to simulate the case in JDK19 to see whether the timing issue can be fixed.
After double-checking the explanation of Arena.close()
at https://download.java.net/java/early_access/jdk20/docs/api/java.base/java/lang/foreign/Arena.html#close() against the related code in OpenJDK, it should be reasonable to keep the downcall address alive by leveraging our code of SetDependency()
in jcl/src/java.base/share/classes/openj9/internal/foreign/abi/InternalDowncallHandler.java
to prevent the address from being released by Arena.close()
till the downcall is done.
Found in the grinder for https://github.com/eclipse-openj9/openj9/issues/17506 Internal grinder (svlxtor3 - zlinux) jdk_foreign_0 java/foreign/LibraryLookupTest.java
(Internal artifacts](https://na.artifactory.swg-devops.com/artifactory/sys-rt-generic-local/hyc-runtimes-jenkins.swg-devops.com/Grinder_iteration_1/2391/openjdk_test_output.tar.gz)
From the javacore