eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.27k stars 720 forks source link

jdk_foreign_0_FAILED - java/foreign/TestHandshake.java #13211

Open JasonFengJ9 opened 3 years ago

JasonFengJ9 commented 3 years ago

Failure link

https://openj9-jenkins.osuosl.org/job/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/55/consoleFull

22:09:40  openjdk version "16.0.1-internal" 2021-04-20
22:09:40  OpenJDK Runtime Environment (build 16.0.1-internal+0-adhoc.****.BuildJDK16ppc64aixNightly)
22:09:40  Eclipse OpenJ9 VM (build master-074746f7710, JRE 16 AIX ppc64-64-Bit Compressed References 20210718_55 (JIT enabled, AOT enabled)
22:09:40  OpenJ9   - 074746f7710
22:09:40  OMR      - 88427d6420f
22:09:40  JCL      - 5dd08a5cc3f based on jdk-16.0.1+9)

Rerun in Grinder

Optional info

Failure output (captured from console output)

06:14:39  ===============================================
06:14:39  java/foreign/TestHandshake.java
06:14:39  Total tests run: 6, Failures: 1, Skips: 0
06:14:39  ===============================================
06:14:39  

06:14:39  ...
06:14:39  Output overflow:
06:14:39  JT Harness has limited the test output to the text
06:14:39  at the beginning and the end, so that you can see how the
06:14:39  test began, and how it completed.
06:14:39  
06:14:39  If you need to see more of the output from the test,
06:14:39  set the system property javatest.maxOutputSize to a higher
06:14:39  value. The current value is 100000
06:14:39  ...

06:14:39  STDERR:
06:14:39  WARNING: Using incubator modules: jdk.incubator.foreign
06:14:39  java.lang.Exception: failures: 1
06:14:39    at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:96)
06:14:39    at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:54)
06:14:39    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
06:14:39    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
06:14:39    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
06:14:39    at java.base/java.lang.reflect.Method.invoke(Method.java:567)
06:14:39    at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
06:14:39    at java.base/java.lang.Thread.run(Thread.java:883)
06:14:39  
06:14:39  JavaTest Message: Test threw exception: java.lang.Exception: failures: 1

06:14:39  rerun:
06:14:39  cd /home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/TKG/output_16266607888323/jdk_foreign_0/work/scratch && \
06:14:39  DISPLAY=:0 \
06:14:39  HOME=/home/jenkins \
06:14:39  LANG=en_US \
06:14:39  PATH=/bin:/usr/bin:/usr/sbin \
06:14:39  TZ=EST5EDT \
06:14:39  CLASSPATH=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/TKG/output_16266607888323/jdk_foreign_0/work/classes/java/foreign/TestHandshake.d:/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/foreign:/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/jvmtest/openjdk/jtreg/lib/testng.jar:/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/jvmtest/openjdk/jtreg/lib/jcommander.jar:/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/jvmtest/openjdk/jtreg/lib/javatest.jar:/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/jvmtest/openjdk/jtreg/lib/jtreg.jar \
06:14:39      /home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/openjdkbinary/j2sdk-image/bin/java \
06:14:39          -Dtest.vm.opts='-ea -esa -Xmx512m --add-modules jdk.incubator.foreign -XX:+UseCompressedOops' \
06:14:39          -Dtest.tool.vm.opts='-J-ea -J-esa -J-Xmx512m -J--add-modules -Jjdk.incubator.foreign -J-XX:+UseCompressedOops' \
06:14:39          -Dtest.compiler.opts= \
06:14:39          -Dtest.java.opts= \
06:14:39          -Dtest.jdk=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/openjdkbinary/j2sdk-image \
06:14:39          -Dcompile.jdk=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/openjdkbinary/j2sdk-image \
06:14:39          -Dtest.timeout.factor=8.0 \
06:14:39          -Dtest.nativepath=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/openjdkbinary/openjdk-test-image/jdk/jtreg/native \
06:14:39          -Dtest.root=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/openjdk/openjdk-jdk/test/jdk \
06:14:39          -Dtest.name=java/foreign/TestHandshake.java \
06:14:39          -Dtest.file=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/foreign/TestHandshake.java \
06:14:39          -Dtest.src=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/foreign \
06:14:39          -Dtest.src.path=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/foreign \
06:14:39          -Dtest.classes=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/TKG/output_16266607888323/jdk_foreign_0/work/classes/java/foreign/TestHandshake.d \
06:14:39          -Dtest.class.path=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/TKG/output_16266607888323/jdk_foreign_0/work/classes/java/foreign/TestHandshake.d \
06:14:39          -Dtest.class.path.prefix=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/TKG/output_16266607888323/jdk_foreign_0/work/classes/java/foreign/TestHandshake.d:/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/foreign \
06:14:39          -Dtest.modules='jdk.incubator.foreign java.base/jdk.internal.vm.annotation java.base/jdk.internal.misc' \
06:14:39          --add-modules jdk.incubator.foreign,java.base \
06:14:39          --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED \
06:14:39          --add-exports java.base/jdk.internal.misc=ALL-UNNAMED \
06:14:39          -ea \
06:14:39          -esa \
06:14:39          -Xmx512m \
06:14:39          -XX:+UseCompressedOops \
06:14:39          -Djava.library.path=/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/openjdkbinary/openjdk-test-image/jdk/jtreg/native \
06:14:39          -Xint \
06:14:39          com.sun.javatest.regtest.agent.MainWrapper /home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/TKG/output_16266607888323/jdk_foreign_0/work/java/foreign/TestHandshake.d/testng.1.jta java/foreign/TestHandshake.java false TestHandshake
06:14:39  
06:14:39  TEST RESULT: Failed. Execution failed: `main' threw exception: java.lang.Exception: failures: 1
06:14:39  --------------------------------------------------
06:15:34  Test results: passed: 30; failed: 1
06:15:43  Report written to /home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/jvmtest/openjdk/report/html/report.html
06:15:43  Results written to /home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Nightly/aqa-tests/TKG/output_16266607888323/jdk_foreign_0/work
06:15:43  Error: Some tests failed or other problems occurred.
06:15:43  
06:15:43  jdk_foreign_0_FAILED
JasonFengJ9 commented 3 years ago

A slightly different error at JDK16 0.27 release build https://openj9-jenkins.osuosl.org/job/Test_openjdk16_j9_sanity.openjdk_x86-64_windows_Release/6/consoleFull

08:31:49  openjdk version "16.0.2-internal" 2021-07-21
08:31:49  OpenJDK Runtime Environment (build 16.0.2-internal+0-adhoc.****.buildjdk16x86-64windowsrelease)
08:31:49  Eclipse OpenJ9 VM (build openj9-0.27.0, JRE 16 Windows Server 2012 R2 amd64-64-Bit Compressed References 20210722_7 (JIT enabled, AOT enabled)
08:31:49  OpenJ9   - 1851b0074f8
08:31:49  OMR      - 9db1c87
08:31:49  JCL      - 50b0cc91164 based on jdk-16.0.2+3)

Rerun in Grinder

09:26:56  --------------------------------------------------
09:26:56  TEST: java/foreign/TestHandshake.java

09:26:56  STDERR:
09:26:56  WARNING: Using incubator modules: jdk.incubator.foreign
09:26:56  java.lang.NullPointerException
09:26:56    at java.base/jdk.internal.util.ArraysSupport.vectorizedMismatch(ArraysSupport.java)
09:26:56    at java.base/jdk.internal.misc.ScopedMemoryAccess.vectorizedMismatchInternal(ScopedMemoryAccess.java:236)
09:26:56    at java.base/jdk.internal.misc.ScopedMemoryAccess.vectorizedMismatch(ScopedMemoryAccess.java:217)
09:26:56    at jdk.incubator.foreign/jdk.internal.foreign.AbstractMemorySegmentImpl.vectorizedMismatchLargeForBytes(AbstractMemorySegmentImpl.java:203)
09:26:56    at jdk.incubator.foreign/jdk.internal.foreign.AbstractMemorySegmentImpl.mismatch(AbstractMemorySegmentImpl.java:166)
09:26:56    at TestHandshake$SegmentMismatchAccessor.doAccess(TestHandshake.java:189)
09:26:56    at TestHandshake$AbstractSegmentAccessor.run(TestHandshake.java:94)
09:26:56    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
09:26:56    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
09:26:56    at java.base/java.lang.Thread.run(Thread.java:883)
09:26:56  STATUS:Failed.`main' threw exception: java.lang.NullPointerException
JasonFengJ9 commented 3 years ago

Another one at JDK 16 0.27 release build https://openj9-jenkins.osuosl.org/job/Test_openjdk16_j9_sanity.openjdk_ppc64le_linux_Release/6/consoleFull

11:03:11  test TestHandshake.testHandshake("SegmentMismatchAccessor", TestHandshake$$Lambda$30/0x000000006ece8dc0@f523f37a): failure
11:03:11  java.lang.AssertionError: expected [true] but found [false]
11:03:11    at org.testng.Assert.fail(Assert.java:94)
11:03:11    at org.testng.Assert.failNotEquals(Assert.java:496)
11:03:11    at org.testng.Assert.assertTrue(Assert.java:42)
11:03:11    at org.testng.Assert.assertTrue(Assert.java:52)
11:03:11    at TestHandshake.testHandshake(TestHandshake.java:76)
11:03:11    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
11:03:11    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
11:03:11    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
11:03:11    at java.base/java.lang.reflect.Method.invoke(Method.java:567)
11:03:11    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85)
11:03:11    at org.testng.internal.Invoker.invokeMethod(Invoker.java:639)
11:03:11    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:821)
11:03:11    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1131)
11:03:11    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
11:03:11    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108)
11:03:11    at org.testng.TestRunner.privateRun(TestRunner.java:773)
11:03:11    at org.testng.TestRunner.run(TestRunner.java:623)
11:03:11    at org.testng.SuiteRunner.runTest(SuiteRunner.java:357)
11:03:11    at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:352)
11:03:11    at org.testng.SuiteRunner.privateRun(SuiteRunner.java:310)
11:03:11    at org.testng.SuiteRunner.run(SuiteRunner.java:259)
11:03:11    at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
11:03:11    at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
11:03:11    at org.testng.TestNG.runSuitesSequentially(TestNG.java:1185)
11:03:11    at org.testng.TestNG.runSuitesLocally(TestNG.java:1110)
11:03:11    at org.testng.TestNG.run(TestNG.java:1018)
11:03:11    at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:94)
11:03:11    at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:54)
11:03:11    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
11:03:11    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
11:03:11    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
11:03:11    at java.base/java.lang.reflect.Method.invoke(Method.java:567)
11:03:11    at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
11:03:11    at java.base/java.lang.Thread.run(Thread.java:883)

11:03:11  ===============================================
11:03:11  java/foreign/TestHandshake.java
11:03:11  Total tests run: 6, Failures: 1, Skips: 0
11:03:11  ===============================================
11:03:11  
11:03:11  STDERR:
11:03:11  WARNING: Using incubator modules: jdk.incubator.foreign
11:03:11  java.lang.Exception: failures: 1
11:03:11    at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:96)
11:03:11    at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:54)
11:03:11    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
11:03:11    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
11:03:11    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
11:03:11    at java.base/java.lang.reflect.Method.invoke(Method.java:567)
11:03:11    at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
11:03:11    at java.base/java.lang.Thread.run(Thread.java:883)
11:03:11  
11:03:11  JavaTest Message: Test threw exception: java.lang.Exception: failures: 1

11:04:20  jdk_foreign_0_FAILED
tajila commented 3 years ago

@EricYangIBM This may be related to https://github.com/eclipse-openj9/openj9/issues/13234

EricYangIBM commented 3 years ago

I believe the issue is that https://github.com/ibmruntimes/openj9-openjdk-jdk16/blob/openj9/test/jdk/java/foreign/TestHandshake.java#L242 keeps throwing and looping without ever successfully closing the segment because there is always another thread accessing that segment. I'm not sure why it only happens for the SegmentMismatchAccessor variation, I think it is because ScopedMemoryAccess.vectorizedMismatch takes longer and therefore gives fewer opportunities to close the segment.

Would having closeScope0 interrupt threads that are accessing the scope during close with an exception increase the success rate of close? I get the feeling that this is an issue with the test since it is a try until success - there is a probability that close will always happen during segment access.

Also, I am intermittently getting this failure locally. Do the release builds run the test over multiple iterations?

pshipton commented 3 years ago

Release builds don't do anything special, they just run the test.

tajila commented 3 years ago

Can you set MAX_EXECUTOR_WAIT_SECONDS to a longer timeout to see if that fixes the issue.

EricYangIBM commented 3 years ago

Can you set MAX_EXECUTOR_WAIT_SECONDS to a longer timeout to see if that fixes the issue.

Testing that right now, but I think it will. https://hyc-runtimes-jenkins.swg-devops.com/view/Test_grinder/job/Grinder/17169/console shows SegmentMismatchAccessor iterations that successfully close the segment (e.g. at 348 ms, 5049 ms for iteration 0 and 1 for the last grinder iteration). If the timeout is increased I think there will be more successful SegmentMismatchAccessor iterations.

EricYangIBM commented 3 years ago

I think the problem is that SegmentMismatchAccessor takes too long (it compares two 1 000 000 byte native memory segments) which prevents the main thread from obtaining the lock. In https://hyc-runtimes-jenkins.swg-devops.com/view/Test_grinder/job/Grinder/17198/consoleFull you can see that an accessor (thread) often holds the lock for 2000 ms. In the hotspot implementation closeScope interrupts accessor threads so the closer thread can immediately acquire the lock. Our implementation doesn't do this so we get the long hold times.

EricYangIBM commented 3 years ago

Update: Replacing https://github.com/ibmruntimes/openj9-openjdk-jdk16/blob/openj9/test/jdk/java/foreign/TestHandshake.java#L76 with

while (!accessExecutor.isTerminated()) {
    Thread.sleep(5000);
}

eventually results in the test passing (locally on x86-64 linux). I will open a PR to exclude this test and an issue to add interrupts to closeScope0.

JasonFengJ9 commented 3 years ago

The AssertionError is still observed in latest internal run Test_openjdk16_j9_sanity.openjdk_s390x_linux/72/

openjdk version "16.0.2" 2021-07-20
IBM Semeru Runtime Open Edition 16.0.2.0 (build 16.0.2+7-202108050408)
Eclipse OpenJ9 VM 16.0.2.0 (build master-9af2e37a0, JRE 16 Linux s390x-64-Bit Compressed References 20210804_68 (JIT enabled, AOT enabled)
OpenJ9   - 9af2e37a0
OMR      - 45440608c
JCL      - ee030f1f04 based on jdk-16.0.2+7)

Rerun in Grinder

[2021-08-05T07:38:25.508Z] Running test jdk_foreign_0 ...
[2021-08-05T07:38:25.508Z] ===============================================
[2021-08-05T07:38:25.508Z] jdk_foreign_0 Start Time: Thu Aug  5 00:38:25 2021 Epoch Time (ms): 1628149105425
[2021-08-05T07:38:25.508Z] "/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_s390x_linux/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_s390x_linux/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done";
[2021-08-05T07:38:26.308Z] JVMSHRC005I No shared class caches available
[2021-08-05T07:38:26.308Z] JVMSHRC005I No shared class caches available
[2021-08-05T07:38:26.308Z] cache cleanup done
[2021-08-05T07:38:26.308Z] variation: Mode150
[2021-08-05T07:38:26.308Z] JVM_OPTIONS:  -XX:+UseCompressedOops 

[2021-08-05T07:39:40.416Z] --------------------------------------------------
[2021-08-05T07:39:40.416Z] TEST: java/foreign/TestHandshake.java

[2021-08-05T07:39:40.429Z] Segment closed - delay (ms): 10392
[2021-08-05T07:39:40.429Z] test TestHandshake.testHandshake("SegmentMismatchAccessor", TestHandshake$$Lambda$30/0x0000000084e26d88@e3103462): failure
[2021-08-05T07:39:40.429Z] java.lang.AssertionError: expected [true] but found [false]
[2021-08-05T07:39:40.429Z]  at org.testng.Assert.fail(Assert.java:94)
[2021-08-05T07:39:40.429Z]  at org.testng.Assert.failNotEquals(Assert.java:496)
[2021-08-05T07:39:40.429Z]  at org.testng.Assert.assertTrue(Assert.java:42)
[2021-08-05T07:39:40.429Z]  at org.testng.Assert.assertTrue(Assert.java:52)
[2021-08-05T07:39:40.429Z]  at TestHandshake.testHandshake(TestHandshake.java:76)
[2021-08-05T07:39:40.429Z]  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[2021-08-05T07:39:40.429Z]  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
[2021-08-05T07:39:40.429Z]  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2021-08-05T07:39:40.429Z]  at java.base/java.lang.reflect.Method.invoke(Method.java:567)
[2021-08-05T07:39:40.429Z]  at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85)
[2021-08-05T07:39:40.429Z]  at org.testng.internal.Invoker.invokeMethod(Invoker.java:639)
[2021-08-05T07:39:40.429Z]  at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:821)
[2021-08-05T07:39:40.429Z]  at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1131)
[2021-08-05T07:39:40.429Z]  at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
[2021-08-05T07:39:40.429Z]  at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108)
[2021-08-05T07:39:40.429Z]  at org.testng.TestRunner.privateRun(TestRunner.java:773)
[2021-08-05T07:39:40.429Z]  at org.testng.TestRunner.run(TestRunner.java:623)
[2021-08-05T07:39:40.429Z]  at org.testng.SuiteRunner.runTest(SuiteRunner.java:357)
[2021-08-05T07:39:40.429Z]  at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:352)
[2021-08-05T07:39:40.429Z]  at org.testng.SuiteRunner.privateRun(SuiteRunner.java:310)
[2021-08-05T07:39:40.429Z]  at org.testng.SuiteRunner.run(SuiteRunner.java:259)
[2021-08-05T07:39:40.429Z]  at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
[2021-08-05T07:39:40.429Z]  at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
[2021-08-05T07:39:40.429Z]  at org.testng.TestNG.runSuitesSequentially(TestNG.java:1185)
[2021-08-05T07:39:40.429Z]  at org.testng.TestNG.runSuitesLocally(TestNG.java:1110)
[2021-08-05T07:39:40.429Z]  at org.testng.TestNG.run(TestNG.java:1018)
[2021-08-05T07:39:40.429Z]  at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:94)
[2021-08-05T07:39:40.429Z]  at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:54)
[2021-08-05T07:39:40.429Z]  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[2021-08-05T07:39:40.429Z]  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
[2021-08-05T07:39:40.429Z]  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2021-08-05T07:39:40.429Z]  at java.base/java.lang.reflect.Method.invoke(Method.java:567)
[2021-08-05T07:39:40.429Z]  at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
[2021-08-05T07:39:40.429Z]  at java.base/java.lang.Thread.run(Thread.java:883)
[2021-08-05T07:39:40.429Z] ITERATION 0

[2021-08-05T07:39:40.452Z] ===============================================
[2021-08-05T07:39:40.452Z] java/foreign/TestHandshake.java
[2021-08-05T07:39:40.452Z] Total tests run: 6, Failures: 1, Skips: 0
[2021-08-05T07:39:40.452Z] ===============================================
[2021-08-05T07:39:40.452Z] 
[2021-08-05T07:39:40.452Z] STDERR:
[2021-08-05T07:39:40.452Z] WARNING: Using incubator modules: jdk.incubator.foreign
[2021-08-05T07:39:40.452Z] java.lang.Exception: failures: 1
[2021-08-05T07:39:40.452Z]  at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:96)
[2021-08-05T07:39:40.452Z]  at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:54)
[2021-08-05T07:39:40.452Z]  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[2021-08-05T07:39:40.452Z]  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
[2021-08-05T07:39:40.452Z]  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2021-08-05T07:39:40.452Z]  at java.base/java.lang.reflect.Method.invoke(Method.java:567)
[2021-08-05T07:39:40.452Z]  at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
[2021-08-05T07:39:40.452Z]  at java.base/java.lang.Thread.run(Thread.java:883)

[2021-08-05T07:39:40.452Z] TEST RESULT: Failed. Execution failed: `main' threw exception: java.lang.Exception: failures: 1
[2021-08-05T07:39:40.452Z] --------------------------------------------------
[2021-08-05T07:39:59.221Z] Test results: passed: 30; failed: 1
[2021-08-05T07:40:05.038Z] Report written to /home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_s390x_linux/jvmtest/openjdk/report/html/report.html
[2021-08-05T07:40:05.038Z] Results written to /home/jenkins/workspace/Test_openjdk16_j9_sanity.openjdk_s390x_linux/aqa-tests/TKG/output_16281397277782/jdk_foreign_0/work
[2021-08-05T07:40:05.038Z] Error: Some tests failed or other problems occurred.
[2021-08-05T07:40:05.038Z] 
[2021-08-05T07:40:05.038Z] jdk_foreign_0_FAILED

fyi @llxia

pshipton commented 3 years ago

Seems the test was excluded after the build which failed.

JasonFengJ9 commented 3 years ago

The test in question has been excluded, sanity.openjdk is green w/ a recent JDK16 AIX build at https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk16_j9_sanity.openjdk_ppc64_aix_Personal/4/consoleFull

08:38:16  openjdk version "16.0.2-internal" 2021-07-20
08:38:16  OpenJDK Runtime Environment (build 16.0.2-internal+0-adhoc.jenkins.BuildJDK16ppc64aixPersonal)
08:38:16  Eclipse OpenJ9 VM (build master-cd8db74f1d4, JRE 16 AIX ppc64-64-Bit Compressed References 20210819_35 (JIT enabled, AOT enabled)
08:38:16  OpenJ9   - cd8db74f1d4
08:38:16  OMR      - ac3ab4821a5
08:38:16  JCL      - 44316ee515b based on jdk-16.0.2+7)

Moving the issue to 0.29.

babsingh commented 2 years ago

Runtimes for TestHandShake: https://github.com/eclipse-openj9/openj9/issues/15204#issuecomment-1155347134

ExceutorService:

Step 1: Launch 10 x Threads {
    outer: while (MemorySegment.session().isAlive()) {}
        try {
                doAccess(MemorySegment)
        } catch (IllegalStateException) {
                //suspending - elapased ... ms
        Thread.sleep X ms
        //resuming - elapased ... ms
        continue outer
        }
    }
    //terminated - elapsed ... ms
}

Step 2: Thread.sleep Y ms

Step 3: Launch Thread {
    while (true) {
        try {
                MemorySession.close()
        } catch (IllegalStateException) {
                Thread.onSpinWait
            }
    }
    //Segment closed - elapsed ... ms
}

From the perspective of the new impl:

In order to remove the INFINITE loop, the test needs to restrict the number of times the memory segment is accessed in Step 1 while still creating collisions between memory segment accesses (Step 1: doAccess) and closeScope (Step 3).

There are also alternatives: i) remove timeouts for the test; or ii) exclude the test since it is technically invalid due to the INFINITE loop.

ChengJin01 commented 1 year ago

The issue with vectorizedMismatch() is still detected in JDK19 on Windows as follows:

...
ITERATION 3
Started first thread: "Accessor #" + id ; elapsed (ms): 44
Starting handshaker with delay set to 126 millis
STDERR:
java.lang.NullPointerException
        at java.base/jdk.internal.util.ArraysSupport.vectorizedMismatch(Apport.java:126)
java.lang.NullPointerException
        at java.base/jdk.internal.misc.ScopedMemoryAccess.vectorizedMismarnal(ScopedMemoryAccess.java:232)
        at java.base/jdk.internal.misc.ScopedMemoryAccess.vectorizedMismapedMemoryAccess.java:213)
        at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.vectosmatchLargeForBytes(AbstractMemorySegmentImpl.java:208)
        at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.mismatractMemorySegmentImpl.java:171)
        at TestHandshake$SegmentMismatchAccessor.doAccess(TestHandshake.j)
        at TestHandshake$AbstractSegmentAccessor.run(TestHandshake.java:1
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThlExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(TolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1573)
STATUS:Failed.`main' threw exception: java.lang.NullPointerException
        at java.base/jdk.internal.util.ArraysSupport.vectorizedMismatch(Apport.java:126)
        at java.base/jdk.internal.misc.ScopedMemoryAccess.vectorizedMismarnal(ScopedMemoryAccess.java:232)
        at java.base/jdk.internal.misc.ScopedMemoryAccess.vectorizedMismapedMemoryAccess.java:213)
        at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.vectosmatchLargeForBytes(AbstractMemorySegmentImpl.java:208)
        at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.mismatractMemorySegmentImpl.java:171)
        at TestHandshake$SegmentMismatchAccessor.doAccess(TestHandshake.j)
        at TestHandshake$AbstractSegmentAccessor.run(TestHandshake.java:1
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThlExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(TolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1573)
STATUS:Failed.`main' threw exception: java.lang.NullPointerException

Given this issue has not yet been resolved via https://github.com/eclipse-openj9/openj9/issues/15204, it still needs to be excluded for the moment. So far, this issue is never spotted on other platforms except Windows.

pshipton commented 6 months ago

@ChengJin01 is there any update on this? Will it be resolved in 0.44 or should it move out?

ChengJin01 commented 6 months ago

@ChengJin01 is there any update on this? Will it be resolved in 0.44 or should it move out?

@jdmpapin, I notice the issue related to vectorizedMismatch at https://github.com/eclipse-openj9/openj9/issues/15204 was resolved via https://github.com/eclipse-openj9/openj9/pull/16662 in 0.43. If so, there should be no problem with this test suite, am I correct?

jdmpapin commented 6 months ago

I'm not familiar enough at the moment to understand the reason for the change in behaviour described in https://github.com/eclipse-openj9/openj9/pull/15161#issuecomment-1143061836, whereby previously closeScope() would cause accessors to fail, but now accessors will cause close() to fail instead. I'm guessing this was done to match an upstream change, even though https://github.com/eclipse-openj9/openj9/issues/13211#issuecomment-888491766 suggests that HotSpot matches our old behaviour. (If not, then we have a difference in behaviour that introduces accidental infinite loops :open_mouth:)

Based on https://github.com/eclipse-openj9/openj9/issues/13211#issuecomment-1156886243, my understanding is that with a faster vectorizedMismatch this test will have fewer spurious failures - perhaps almost none in practice - but it remains fundamentally broken in that an infinite loop is still possible.

pshipton commented 6 months ago

I ran a 5x grinder on jdk17,jdk21,jdk22 on each AIX, Windows, pLinux, zLinux. Two of these failed. Windows jdk22 crashed. I'll move this issue into the Java 22 milestone to investigate the crash.

https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/3340 - plinux jdk17, iteration 2 The failure isn't shown in the test output due to an Output overflow.

https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/3339 - Windows jdk22, iteration 1

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Grinder/3339/openjdk_test_output.tar.gz

14:51:13  Type=Segmentation error vmState=0x00000000
14:51:13  Windows_ExceptionCode=c0000005 J9Generic_Signal=00000004 ExceptionAddress=00007FFBF3D5E6E8 ContextFlags=0010005f
14:51:13  Handler1=00007FFBF3E20CF0 Handler2=00007FFBF40AABA0 InaccessibleReadAddress=0000020793C54838
14:51:13  RDI=0000003A787FF858 RSI=0000003A787FF860 RAX=0000020793C54838 RBX=0000003A787FF860
14:51:13  RCX=0000000000387100 RDX=000000000034FEB8 R8=000000000008B430 R9=000000000000000F
14:51:13  R10=0000000000000000 R11=0000020780085F7A R12=0000003A787FF850 R13=0000003A787FF848
14:51:13  R14=0000000000387100 R15=0000000000000000
14:51:13  RIP=00007FFBF3D5E6E8 RSP=0000003A787FF370 RBP=0000000000387100 EFLAGS=0000000000010246
14:51:13  FS=0053 ES=002B DS=002B
14:51:13  Unhandled exception
14:51:13  Type=Segmentation error vmState=0x00000000
14:51:13  Windows_ExceptionCode=c0000005 J9Generic_Signal=00000004 ExceptionAddress=00007FFBF3D5E6E8 ContextFlags=0010005f
14:51:13  Handler1=00007FFBF3E20CF0 Handler2=00007FFBF40AABA0 InaccessibleReadAddress=0000020793D3FCA8
14:51:13  RDI=0000003A788FF458 RSI=0000003A788FF460 RAX=0000020793D3FCA8 RBX=0000003A788FF460
14:51:13  RCX=000000000035D900 RDX=00000000003591B8 R8=0000000000000001 R9=0000000000000018
14:51:13  R10=0000000000000000 R11=0000020780085F7A R12=0000003A788FF450 R13=0000003A788FF448
14:51:13  R14=000000000035D900 R15=0000000000000000
14:51:13  RIP=00007FFBF3D5E6E8 RSP=0000003A788FEF70 RBP=000000000035D900 EFLAGS=0000000000010246
14:51:13  FS=0053 ES=002B DS=002B
14:51:13  Unhandled exception
14:51:13  Type=Segmentation error vmState=0x00000000
14:51:13  Windows_ExceptionCode=c0000005 J9Generic_Signal=00000004 ExceptionAddress=00007FFBF3D5E6E8 ContextFlags=0010005f
14:51:13  Handler1=00007FFBF3E20CF0 Handler2=00007FFBF40AABA0 InaccessibleReadAddress=0000020793B7AE90
14:51:13  RDI=0000003A789FF328 RSI=0000003A789FF330 RAX=0000020793B7AE90 RBX=0000003A789FF330
14:51:13  RCX=000000000034AE00 RDX=000000000035C208 R8=0000000000000001 R9=0000000000000018
14:51:13  R10=0000000000000000 R11=0000020780085F7A R12=0000003A789FF320 R13=0000003A789FF318
14:51:13  R14=000000000034AE00 R15=0000000000000000
14:51:13  RIP=00007FFBF3D5E6E8 RSP=0000003A789FEE40 RBP=000000000034AE00 EFLAGS=0000000000010246
14:51:13  FS=0053 ES=002B DS=002B
14:51:13  Unhandled exception
14:51:13  Type=Segmentation error vmState=0x00000000
14:51:13  Windows_ExceptionCode=c0000005 J9Generic_Signal=00000004 ExceptionAddress=00007FFBF3D5E6E8 ContextFlags=0010005f
14:51:13  Handler1=00007FFBF3E20CF0 Handler2=00007FFBF40AABA0 InaccessibleReadAddress=0000020793F24B78
14:51:13  RDI=0000003A78AFF6C8 RSI=0000003A78AFF6D0 RAX=0000020793F24B78 RBX=0000003A78AFF6D0
14:51:13  RCX=0000000000355C00 RDX=0000000000360EC8 R8=0000000000000001 R9=0000000000000018
14:51:13  R10=0000000000000000 R11=0000020780085F7A R12=0000003A78AFF6C0 R13=0000003A78AFF6B8
14:51:13  R14=0000000000355C00 R15=0000000000000000
14:51:13  RIP=00007FFBF3D5E6E8 RSP=0000003A78AFF1E0 RBP=0000000000355C00 EFLAGS=0000000000010246
14:51:13  FS=0053 ES=002B DS=002B
14:51:13  XMM0 0000020780085f7a (f: 2148032256.000000, d: 1.102377e-311)
14:51:13  XMM1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM6 0000000000360ef0 (f: 3542768.000000, d: 1.750360e-317)
14:51:13  XMM7 0000000000360ef0 (f: 3542768.000000, d: 1.750360e-317)
14:51:13  XMM8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  Module=C:\Users\jenkins\workspace\Grinder\jdkbinary\j2sdk-image\bin\default\j9vm29.dll
14:51:13  Module_base_address=00007FFBF3D50000 Offset_in_DLL=000000000000e6e8
14:51:13  Target=2_90_20240223_22 (Windows Server 2019 10.0 build 17763)
14:51:13  CPU=amd64 (4 logical CPUs) (0x3fff77000 RAM)
14:51:13  ----------- Stack Backtrace -----------
14:51:13  XMM0 0000020780085f7a (f: 2148032256.000000, d: 1.102377e-311)
14:51:13  XMM1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM6 000000000035c230 (f: 3523120.000000, d: 1.740653e-317)
14:51:13  XMM7 000000000035c230 (f: 3523120.000000, d: 1.740653e-317)
14:51:13  XMM8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  Module=C:\Users\jenkins\workspace\Grinder\jdkbinary\j2sdk-image\bin\default\j9vm29.dll
14:51:13  Module_base_address=00007FFBF3D50000 Offset_in_DLL=000000000000e6e8
14:51:13  Target=2_90_20240223_22 (Windows Server 2019 10.0 build 17763)
14:51:13  CPU=amd64 (4 logical CPUs) (0x3fff77000 RAM)
14:51:13  ----------- Stack Backtrace -----------
14:51:13  XMM0 0000020780085f7a (f: 2148032256.000000, d: 1.102377e-311)
14:51:13  XMM1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM6 00000000003591e0 (f: 3510752.000000, d: 1.734542e-317)
14:51:13  XMM7 00000000003591e0 (f: 3510752.000000, d: 1.734542e-317)
14:51:13  XMM8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  Module=C:\Users\jenkins\workspace\Grinder\jdkbinary\j2sdk-image\bin\default\j9vm29.dll
14:51:13  Module_base_address=00007FFBF3D50000 Offset_in_DLL=000000000000e6e8
14:51:13  Target=2_90_20240223_22 (Windows Server 2019 10.0 build 17763)
14:51:13  CPU=amd64 (4 logical CPUs) (0x3fff77000 RAM)
14:51:13  ----------- Stack Backtrace -----------
14:51:13  XMM0 0000020780085f7a (f: 2148032256.000000, d: 1.102377e-311)
14:51:13  XMM1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM6 000000000034fee0 (f: 3473120.000000, d: 1.715949e-317)
14:51:13  XMM7 000000000034fee0 (f: 3473120.000000, d: 1.715949e-317)
14:51:13  XMM8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:51:13  Module=C:\Users\jenkins\workspace\Grinder\jdkbinary\j2sdk-image\bin\default\j9vm29.dll
14:51:13  Module_base_address=00007FFBF3D50000 Offset_in_DLL=000000000000e6e8
14:51:13  Target=2_90_20240223_22 (Windows Server 2019 10.0 build 17763)
14:51:13  CPU=amd64 (4 logical CPUs) (0x3fff77000 RAM)
14:51:13  ----------- Stack Backtrace -----------
14:51:13  (0x00007FFBF3D5E6E8 [j9vm29+0xe6e8])
14:51:13  (0x0000003A78AFF6C0)
14:51:13  (0x000000000008BAB0)
14:51:13  (0x0000000000360F10)
14:51:13  (0x0000003A78AFF6D0)
14:51:13  (0x0000003A78AFF2C0)
14:51:13  (0x0000000100000001)
14:51:13  (0x000000000000000A)
14:51:13  (0x0000003A78AFF380)
14:51:13  (0x0000000000000002)
14:51:13  (0x0000003A78AFF6D0)
14:51:13  (0x0000003A78AFF6C8)
14:51:13  (0x0000000000360EF0)
14:51:13  (0x0000003A78AFF6B8)
14:51:13  (0x0000003A78AFF6E0)
14:51:13  (0x0000003A78AFF6C0)
14:51:13  ---------------------------------------
ChengJin01 commented 6 months ago

It looks like the crash above at https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Grinder/3339/openjdk_test_output.tar.gz was still related to vectorizedMismatch as follows:

1XMCURTHDINFO  Current thread
3XMTHREADINFO      "pool-16-thread-3" J9VMThread:0x000000000034AE00, omrthread_t:0x0000020793AF5498, java/lang/Thread:0x00000000FFE4D300, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x5A, isDaemon:false)
3XMJAVALTHRCCL            jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000E004D948)
3XMTHREADINFO1            (native thread ID:0x4D4, native priority:0x5, native policy:UNKNOWN, vmstate:R, vm thread flags:0x00000020)
3XMCPUTIME               CPU usage total: 1.093750000 secs, user: 1.015625000 secs, system: 0.078125000 secs, current category="Application"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at jdk/internal/misc/Unsafe.getLong(Native Method)
4XESTACKTRACE                at jdk/internal/misc/Unsafe.getLongUnaligned(Unsafe.java:5540)
4XESTACKTRACE                at jdk/internal/util/ArraysSupport.vectorizedMismatch(ArraysSupport.java:130)
4XESTACKTRACE                at jdk/internal/misc/ScopedMemoryAccess.vectorizedMismatchInternal(ScopedMemoryAccess.java:232)
4XESTACKTRACE                at jdk/internal/misc/ScopedMemoryAccess.vectorizedMismatch(ScopedMemoryAccess.java:213)
4XESTACKTRACE                at jdk/internal/foreign/AbstractMemorySegmentImpl.vectorizedMismatchLargeForBytes(AbstractMemorySegmentImpl.java:222)
4XESTACKTRACE                at jdk/internal/foreign/AbstractMemorySegmentImpl.mismatch(AbstractMemorySegmentImpl.java:695)
4XESTACKTRACE                at java/lang/foreign/MemorySegment.mismatch(MemorySegment.java:2571)
4XESTACKTRACE                at jdk/internal/foreign/AbstractMemorySegmentImpl.mismatch(AbstractMemorySegmentImpl.java:289)
4XESTACKTRACE                at TestHandshake$SegmentMismatchAccessor.doAccess(TestHandshake.java:232)
4XESTACKTRACE                at TestHandshake$AbstractSegmentAccessor.run(TestHandshake.java:106)
4XESTACKTRACE                at java/util/concurrent/ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
4XESTACKTRACE                at java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
4XESTACKTRACE                at java/lang/Thread.runWith(Thread.java:1595)
4XESTACKTRACE                at java/lang/Thread.run(Thread.java:1582)
ChengJin01 commented 6 months ago

Hi @jdmpapin, any input/update as to the test crash above related to vectorizedMismatch?

ChengJin01 commented 6 months ago

As discussed with Devin offline, the crash shouldn't happen whatever it is supposed to be given the intention of the test suite is to verify the lifetime of segment which is only related to the scopedMemoryAccess. Assuming the crash is unrelated to JIT, the only reason I can image is that the problem might still come from scopedMemoryAccess in which case the native access with unsafe occurred after the underlying memory segment was closed (so the crash was caused by the invalid access to the native memory).

ChengJin01 commented 6 months ago

Hi @babsingh, could you take a look at the dumps at https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Grinder/3339/openjdk_test_output.tar.gz to how this happened via ScopedMemoryAccess given part of related native code exists in OpenJ9?

babsingh commented 6 months ago

could you take a look at the dumps

Taking a look ...

babsingh commented 6 months ago

In JDK22, the implementation of ScopedMemoryAccess.closeScope0 has changed; see https://github.com/ibmruntimes/openj9-openjdk-jdk22/commit/3dd9ec970a375e480e278f746cf84a1b52f1063e. The new implementation resolves the following issue: https://bugs.openjdk.org/browse/JDK-8319782.

In the Windows crash, memory operations are performed while the memory session/scope is closed. Memory operations should either be aborted by throwing an exception or not performed if the meomory session/scope is closed. To resolve the crash, we will need to update ScopedMemoryAccess.closeScope0 as per https://github.com/ibmruntimes/openj9-openjdk-jdk22/commit/3dd9ec970a375e480e278f746cf84a1b52f1063e.

ChengJin01 commented 5 months ago

@pshipton, could you help run grinders to see whether the crash disappears given the fix at https://github.com/eclipse-openj9/openj9/pull/19167 was merged?

pshipton commented 5 months ago

Windows 5x x 3 https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/3406/ - passed 10x x 3 https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/3410/ - some failures, see next comment plinux 10x x 3 https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/3409/ - passed AIX 10x x 3 https://openj9-jenkins.osuosl.org/job/Grinder/3412 - passed zlinux 10x x 3 https://openj9-jenkins.osuosl.org/job/Grinder/3413 - passed

pshipton commented 5 months ago

https://openj9-jenkins.osuosl.org/job/Grinder_iteration_0/455

18:51:42  JVMCDRT000E Unable to locate JIT stack map - aborting VM
18:51:42  JVMCDRT001E Method: jdk/internal/util/ArraysSupport.vectorizedMismatch(Ljava/lang/Object;JLjava/lang/Object;JII)I (00000000000AB030)
18:51:42  JVMCDRT002E Failing PC: 00007FF8E8640586 (offset 00000000000000DE), metaData = 000001D9135BB068
18:51:42  22:51:07.615 0x3a8f00j9codertvm(j9ji.110    *   ** ASSERTION FAILED ** at f:\users\jenkins\workspace\build_jdk22_x86-64_windows_nightly\openj9\runtime\codert_vm\jswalk.c:534: ((0 ))

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Grinder_iteration_0/455/openjdk_test_output.tar.gz


https://openj9-jenkins.osuosl.org/job/Grinder_iteration_1/456

18:36:26  ITERATION 4
18:36:26  Started first thread: "Accessor #" + id ; elapsed (ms): 0
18:36:26  Starting handshaker with delay set to 265 millis
18:36:26  test TestHandshake.testHandshake("SegmentCopyAccessor", TestHandshake$$Lambda/0x0000000070c23750@bc203ac): failure
18:36:26  java.lang.AssertionError: expected [true] but found [false]
18:36:26    at org.testng.Assert.fail(Assert.java:99)
18:36:26    at org.testng.Assert.failNotEquals(Assert.java:1037)
18:36:26    at org.testng.Assert.assertTrue(Assert.java:45)
18:36:26    at org.testng.Assert.assertTrue(Assert.java:55)
18:36:26    at TestHandshake.testHandshake(TestHandshake.java:86)
ChengJin01 commented 5 months ago

The javacore at https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Grinder_iteration_0/455/openjdk_test_output.tar.gz indicates ArraysSupport.vectorizedMismatch was still in JIT as follows (the assertion failure occurred in jitWalkFrame at openj9\runtime\codert_vm\jswalk.c:534):

1XMCURTHDINFO  Current thread
3XMTHREADINFO      "pool-16-thread-4" J9VMThread:0x00000000003A8F00, omrthread_t:0x000001D96EC686A0, java/lang/Thread:0x00000000FFE4B1E0, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x61, isDaemon:false)
3XMJAVALTHRCCL            jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000E0030988)
3XMTHREADINFO1            (native thread ID:0xA04, native priority:0x5, native policy:UNKNOWN, vmstate:R, vm thread flags:0x00100020)
3XMCPUTIME               CPU usage total: 0.265625000 secs, user: 0.203125000 secs, system: 0.062500000 secs, current category="Application"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=4096 (0x1000)
1INTERNAL                    Unable to obtain lock context information
3XMTHREADINFO3           Java callstack:
---> 4XESTACKTRACE                at jdk/internal/util/ArraysSupport.vectorizedMismatch(ArraysSupport.java(Compiled Code))
4XESTACKTRACE                at jdk/internal/misc/ScopedMemoryAccess.vectorizedMismatchInternal(ScopedMemoryAccess.java:232)
4XESTACKTRACE                at jdk/internal/misc/ScopedMemoryAccess.vectorizedMismatch(ScopedMemoryAccess.java:213)
4XESTACKTRACE                at jdk/internal/foreign/AbstractMemorySegmentImpl.vectorizedMismatchLargeForBytes(AbstractMemorySegmentImpl.java:222)
4XESTACKTRACE                at jdk/internal/foreign/AbstractMemorySegmentImpl.mismatch(AbstractMemorySegmentImpl.java:695)
4XESTACKTRACE                at java/lang/foreign/MemorySegment.mismatch(MemorySegment.java:2571)
4XESTACKTRACE                at jdk/internal/foreign/AbstractMemorySegmentImpl.mismatch(AbstractMemorySegmentImpl.java:289)
4XESTACKTRACE                at TestHandshake$SegmentMismatchAccessor.doAccess(TestHandshake.java:232)
4XESTACKTRACE                at TestHandshake$AbstractSegmentAccessor.run(TestHandshake.java:106(Compiled Code))
4XESTACKTRACE                at java/util/concurrent/ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144(Compiled Code))
4XESTACKTRACE                at java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
4XESTACKTRACE                at java/lang/Thread.runWith(Thread.java:1595)
4XESTACKTRACE                at java/lang/Thread.run(Thread.java:1582)

@jdmpapin, does it mean the work with ArraysSupport.vectorizedMismatch is not yet ready at https://github.com/eclipse-openj9/openj9/issues/16717 or anything associated with the fix at https://github.com/eclipse-openj9/openj9/pull/16662?

jdmpapin commented 5 months ago

I don't think that off-heap is in use (see #14005), so #16717 should be irrelevant

I'm pretty sure the transformation from #16662 is not involved. If it were, then the caller (vectorizedMismatchInternal) would also have to be compiled. Additionally, when the transformation replaces the call with (mainly) arraycmplen, the replacement IL doesn't have a GC point, so we shouldn't be walking the stack while running it. It's possible that there could be an unexpected segfault during arraycmplen, which could be taken as the failure of an implicit null check and trigger stack walking, but if that were to happen, then ArraysSupport.vectorizedMismatch wouldn't appear on the stack in the javacore

jdmpapin commented 5 months ago

Is it possible anymore to get the build that was running in https://openj9-jenkins.osuosl.org/job/Grinder_iteration_0/455?

pshipton commented 5 months ago

It was started from https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/3410, which used the nightly build for jdk22. Since it ran Mar 19, the nightly should have been from the 18th. The version output shows 20240318_38 so the build is at https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Build_JDK22_x86-64_windows_Nightly/38/

18:19:06  openjdk version "22-internal" 2024-03-19
18:19:06  OpenJDK Runtime Environment (build 22-internal-adhoc.jenkins.buildjdk22x86-64windowsnightly)
18:19:06  Eclipse OpenJ9 VM (build master-cffd23660ae, JRE 22 Windows Server 2012 R2 amd64-64-Bit Compressed References 20240318_38 (JIT enabled, AOT enabled)
18:19:06  OpenJ9   - cffd23660ae
18:19:06  OMR      - 1bf2ef421ca
18:19:06  JCL      - a7e9975f4b5 based on jdk-22+36)
jdmpapin commented 4 months ago

The assertion failure appears not to be due to a JIT bug. Rather, it seems the memory to be compared got unmapped while the comparison was still running. Details follow

Looking at the source of the methods on the Java stack, the values of the parameters a, aOffset, b, bOffset, and length all come from AbstractMemorySegmentImpl.mismatch():

i = AbstractMemorySegmentImpl.vectorizedMismatchLargeForBytes(srcImpl.sessionImpl(), dstImpl.sessionImpl(),
        srcImpl.unsafeGetBase(), srcImpl.unsafeGetOffset() + srcFromOffset,
        dstImpl.unsafeGetBase(), dstImpl.unsafeGetOffset() + dstFromOffset,
        bytes);

The corresponding frame from !stackslots has only two O-slots:

<3a8f00> Bytecode frame: bp = 0x00000000003C5C00, sp = 0x00000000003C5BF0, pc = 0x000001D90070FEE4, cp = 0x00000000003885C0, arg0EA = 0x00000000003C5CB0, flags = 0x0000000000000000
<3a8f00>        Method: jdk/internal/foreign/AbstractMemorySegmentImpl.mismatch(Ljava/lang/foreign/MemorySegment;JJLjava/lang/foreign/MemorySegment;JJ)J !j9method 0x000000000038A328
<3a8f00>        Bytecode index = 148
<3a8f00>        Using local mapper
<3a8f00>        Locals starting at 0x00000000003C5CB0 for 0x0000000000000016 slots
...
<3a8f00>                O-Slot: t10[0x00000000003C5C60] = 0x00000000FFE30A20
<3a8f00>                O-Slot: t11[0x00000000003C5C58] = 0x00000000FFE4B168
...

both of which are instances of NativeMemorySegmentImpl with length 1 million:

> !j9object 0x00000000FFE30A20
!J9Object 0x00000000FFE30A20 {
        struct J9Class* clazz = !j9class 0x38BD00 // jdk/internal/foreign/NativeMemorySegmentImpl
        Object flags = 0x00000000;
        I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
        J length = 0x00000000000F4240 (offset = 4) (jdk/internal/foreign/AbstractMemorySegmentImpl)
        Z readOnly = 0x00000000 (offset = 16) (jdk/internal/foreign/AbstractMemorySegmentImpl)
        Ljdk/internal/foreign/MemorySessionImpl; scope = !fj9object 0xffe309e8 (offset = 12) (jdk/internal/foreign/AbstractMemorySegmentImpl)
        J min = 0x000001D913C732E0 (offset = 20) (jdk/internal/foreign/NativeMemorySegmentImpl)
}
> !j9object 0x00000000FFE4B168
!J9Object 0x00000000FFE4B168 {
        struct J9Class* clazz = !j9class 0x38BD00 // jdk/internal/foreign/NativeMemorySegmentImpl
        Object flags = 0x00000000;
        I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
        J length = 0x00000000000F4240 (offset = 4) (jdk/internal/foreign/AbstractMemorySegmentImpl)
        Z readOnly = 0x00000000 (offset = 16) (jdk/internal/foreign/AbstractMemorySegmentImpl)
        Ljdk/internal/foreign/MemorySessionImpl; scope = !fj9object 0xffe309e8 (offset = 12) (jdk/internal/foreign/AbstractMemorySegmentImpl)
        J min = 0x000001D914644350 (offset = 20) (jdk/internal/foreign/NativeMemorySegmentImpl)
}

Here are the implementations of unsafeGetBase() and unsafeGetOffset() for NativeMemorySegmentImpl:

@Override
public long unsafeGetOffset() {
    return min;
}

@Override
public Object unsafeGetBase() {
    return null;
}

so a and b should both be null and vectorizedMismatch() should compare 1 million bytes of native memory at aOffset=0x1d913c732e0 and bOffset=0x1d914644350 (or vice versa, but later we'll see that it's this way around)

The core file has nowhere near 1 million bytes mapped starting from this aOffset. In fact there's only about 3kB mapped there. And at bOffset there's no memory mapped whatsoever

These were in fact the arguments received by the JIT body:

0x3c5a48: 0x0000000000000000    a = null
0x3c5a40: 0x00007ff8fc614380    --
0x3c5a38: 0x000001d913c732e0    aOffset
0x3c5a30: 0x0000000000000000    b = null
0x3c5a28: 0x0000000000000000    --
0x3c5a20: 0x000001d914644350    bOffset
0x3c5a18: 0x00000000000f4240    length = 1 million
0x3c5a10: 0x0000000000000000    log2ArrayIndexScale = 0, i.e. comparing bytes
0x3c5a08: 0x00007ff8fc6143b0    returnFromJIT1

(Slots marked -- are just the long arguments' extra slots.)

From https://github.com/eclipse-openj9/openj9/issues/13211#issuecomment-2008301196 the failing PC is 0x7ff8e8640586, which is here in the JIT body of vectorizedMismatch():

0x7ff8e8640582      mov rax, rbx
0x7ff8e8640585      mov r8, qword ptr [rbx + rdi]   <--- this instr addr + 1 byte
0x7ff8e864058a      mov rcx, r8

I don't know why the PC is inside the instruction, but I believe this load tried to access unmapped memory and that as a result we're trying to throw NPE as though for an implicit null check failure. I see that walkState->walkSP points to a reference to an instance of NullPointerException. To find the address for the load, here are the values of rbx and rdi from !stackslots:

<3a8f00>        JIT-Resolve-RegisterMap[0x0000001D9F47F798] = U64(0x0000000000000000) (jit_rbx)
<3a8f00>        JIT-Resolve-RegisterMap[0x0000001D9F47F7B0] = U64(0x000001D913D484B8) (jit_rdi)

so the address would have been 0x1d913d484b8, which is aOffset + 872,920. So this is the address from which to load here:

long av = U.getLongUnaligned(a, aOffset + bi);

on the iteration where wi is 109,115. And this address is unmapped in the core file, like earlier addresses closer to aOffset are

So it looks like the backing memory was valid when vectorizedMismatch() started and it got unmapped concurrently

The assertion failed during stack walking just because this load instruction is not a GC point and there are no other GC points prior to it in the JIT body

babsingh commented 4 months ago

@gacholio re: https://github.com/eclipse-openj9/openj9/pull/19167. The async exception is correctly set, but it doesn't trigger in the below case where the top frame is JIT compiled. Do we need to decompile the top frame?


// state = 0xFFFFFFFF (CLOSED = -1), MemorySession is closed
> !fj9object 0xffe309e8
!J9Object 0x00000000FFE309E8 {
        ...
    I state = 0xFFFFFFFF (offset = 12) (jdk/internal/foreign/MemorySessionImpl)

// Async exception is set
// publicFlags has J9_PUBLIC_FLAGS_CLOSE_SCOPE
// scopedError points to the correct error that should be thrown
> !j9vmthread 0x003a8f00
        ...
    0x98: U64 publicFlags = 0x0000000000100020 (1048608)
    0xa70: class J9Object* scopedError = !j9object 0x00000000FFF60928 // jdk/internal/misc/ScopedMemoryAccess$ScopedAccessError

// Top frame is JIT compiled
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at jdk/internal/util/ArraysSupport.vectorizedMismatch(ArraysSupport.java(Compiled Code))
4XESTACKTRACE                at jdk/internal/misc/ScopedMemoryAccess.vectorizedMismatchInternal(ScopedMemoryAccess.java:232)
tajila commented 4 months ago

@gacholio re: https://github.com/eclipse-openj9/openj9/pull/19167. The async exception is correctly set, but it doesn't trigger in the below case where the top frame is JIT compiled. Do we need to decompile the top frame?

There has to be an asyncCheck fo a thread to respond to the J9_PUBLIC_FLAGS_CLOSE_SCOPE event. Given that the test is operating over a large data set, it is concievable that the async event was triggered while the thread was still operating on the data set. Given that the vectorizedMismatch is compiled, there needs to be asynchecks inserted to ensure that the thread can respond (somewhat immdiately) after the event is sent.

@jdmpapin do you know if vectorizedMismatch detects async events? I know for things like loops, async checks are typically inserted into backwards branches.

jdmpapin commented 4 months ago

This vectorizedMismatch() should contain asynccheck as usual. If we had compiled the caller, we might have replaced the call with arraycmplen, which doesn't do any asynccheck. However, about this:

it is concievable that the async event was triggered while the thread was still operating on the data set. Given that the vectorizedMismatch is compiled, there needs to be asynchecks inserted to ensure that the thread can respond (somewhat immdiately) after the event is sent.

It sounds as though the event is sent and then (possibly after some delay?) the memory is unmapped without any further synchronization. If so, then threads responding more promptly to the event would reduce the failure rate, but there would still be a race condition. It's not safe to unmap the memory until we know that every thread has seen the event and stopped accessing it (or at least every thread that could be accessing it, if we have a way to narrow that down)

tajila commented 4 months ago

@babsingh does the spec require this behaviour? I think we can work around it by still setting scopeFound = JNI_TRUE until the thread responds to the event. This removes the race condition Devin mentioned above.

/* Skip since an exception is already pending to be thrown.*/
if (NULL != walkThread->scopedError) {
    continue;
}

If so, then threads responding more promptly to the event would reduce the failure rate,

I think we would still require the ability for a thread to detect this event. I think there is an annotation that specifies if a method (a future calls) needs to be able to detect the event.

babsingh commented 4 months ago

I think we can work around it by still setting scopeFound = JNI_TRUE until the thread responds to the event.

In JDK22+,closeScope0 returns void, and it is only called once; unlike the prior impl where it returned boolean and was invoked repeatedly until false was returned.

To add a delay, we will need to loop within closeScope0. Not sure if it is guaranteed that all threads will respond to the event. If not guaranteed, then the wait/delay can turn into a hang.

jdmpapin commented 4 months ago

I think we would still require the ability for a thread to detect this event.

Sure, no thread can run for too long before reaching an asynccheck. But too long is loosely defined, and necessarily so. We can't guarantee that a thread will always asynccheck within any fixed amount of time (at least due to preemptive scheduling). If we wait for the thread to see the event before unmapping, then an occasional delay in checking for events will just cause a corresponding delay in unmapping the memory

I think there is an annotation that specifies if a method (a future calls) needs to be able to detect the event.

I'd be curious to see the annotation. I'm having a hard time conceptualizing what it would mean. No matter how many times a method checks for events, it's always possible for the event we care about to arise after the last check, and without some other synchronization to establish ordering, that situation can't be distinguished from one where the method just skipped checking. In any case, vectorizedMismatch() only has @IntrinsicCandidate

babsingh commented 4 months ago

[1] As per AsyncCheckInsertion.cpp#L126-L131, the JIT doesn't add async checks for involuntary OSR. @jdmpapin Are async checks in the JIT OSR safe points?

More Context for [1]

@gacholio mentioned that

the JIT won't throw an exception from AsyncHandler; it can only throw them from the method entry async check, not from the async checks inserted into loops etc

We are considering to OSR from the async handler code to throw the async exception.

[2] Is exclusive VM access a OSR safe point? Decompiling the top jitted frame in closeScope0 will result in re-entering the interpreter and throwing the async exception.

jdmpapin commented 4 months ago

An asynccheck is an OSR point, yes. In compilations using OSR, we generate them more or less as normal (plus the needed bookkeeping), but once we've finished translating bytecode to IL, we generally don't add new ones later on. The reason for that depends on how OSR is used in the compilation.

The asynccheck insertion pass linked above is not how we usually generate asynccheck. It's a transformation that generates an extra asynccheck just before return in certain cases. It's not needed to ensure that asynccheck runs frequently enough. Rather, it helps the method get seen by JIT sampling. (This might be the only place we add an extra asynccheck when using fear analysis. It's safe because after asynccheck we'll just return, and the return doesn't expect any OSR assumptions to hold)

We are considering to OSR from the async handler code to throw the async exception.

This isn't always possible. We'd have to compile everything with involuntary OSR. And it wouldn't get to the root of the problem anyway, at least if it really is the kind of race condition I described earlier, since that's not related to the JIT in the first place. Even in the interpreter, we could be running vectorizedMismatch() (or whatever else accesses this memory), and we could be within the VM's implementation of some Unsafe access method, just about to perform the access, and the thread could be preempted and starved for an arbitrary length of time. Then when it finally resumes, the memory could have been unmapped. Longer and longer delays of this kind get more and more unlikely, but synchronization (of some kind) is the only way to guarantee that things happen in the correct order, i.e. that the thread stops trying to access the memory strictly before it gets unmapped

the JIT won't throw an exception from AsyncHandler; it can only throw them from the method entry async check, not from the async checks inserted into loops etc

This still really complicates things though... :thinking:

Do we know something about what threads might be accessing the memory and/or what code they could be running? I'm not very worried about waiting for vectorizedMismatch() in particular to finish, but if any thread could be accessing the memory from any code, then asynccheck would need to become an exception point everywhere, lest some (probably unrelated) thread get stuck in a loop and refuse to throw forever. It wouldn't be as bad as compiling everything with involuntary OSR, but I imagine we'd lose some performance by allowing every asynccheck to throw

If we're lucky, the annotation that @tajila mentioned might help here

jdmpapin commented 4 months ago

We are considering to OSR from the async handler code to throw the async exception.

[...] it wouldn't get to the root of the problem anyway [...] synchronization is the only way [...]

Sorry, a more charitable reading would be that you're considering using OSR precisely to allow for the synchronization that I was talking about. It would almost certainly be better for us to do that by allowing asynccheck to throw than to make sure that OSR is always possible. And if we could restrict the places where OSR would need to be possible at asynccheck, then we could similarly restrict the places where asynccheck is allowed to throw instead

tajila commented 4 months ago

If we're lucky, the annotation that @tajila mentioned might help here

The @Scoped annotation indicates that the current frame, and any frame above may be using a scope. Its a way to signal that the thread will need to check if the scope is still alive.

here is the definition of vectorizedMismatch

    @ForceInline @Scoped
    private int vectorizedMismatchInternal(MemorySessionImpl aSession, ...
jdmpapin commented 4 months ago

Thanks Tobi

I'd like to clarify "above" since stacks are sometimes described in either direction. I imagine in this case it means any of the transitive callees

here is the definition of vectorizedMismatch

What version is this? I was looking at the JDK22 definition because I had been debugging a JDK22 crash. This is the tip of the openj9 branch at the moment

babsingh commented 4 months ago

What version is this?

It's in the JDK21+ extensions repo:

AsyncMessageHandler.cpp#L72-L76: closeScope0 relies upon the @Scoped annotation, and each thread also uses this annotation to decide whether to throw or just clear (without throwing) the async exception.

jdmpapin commented 4 months ago

Quoting myself again...

at least if it really is the kind of race condition I described earlier

I don't think it is :sweat_smile:

I found where we set J9_PUBLIC_FLAGS_CLOSE_SCOPE in closeScope0. We're holding exclusive access and it looks like we send the event only to threads currently using the particular scope in a @Scoped frame. This is the synchronization I wasn't aware of. When the thread gets VM access back, it will definitely be able to see the event, and in this situation it makes sense that it should throw instead of resuming what it was doing before it paused. So the only problem is what @gacholio mentioned:

the JIT won't throw an exception from AsyncHandler; it can only throw them from the method entry async check, not from the async checks inserted into loops etc

Interestingly, this means that we ran into this problem only because vectorizedMismatch() does do asynccheck (without throwing)

I'm thinking about how to make sure we throw, ideally without allowing every asynccheck everywhere to throw

babsingh commented 4 months ago

Just to recap, we have discussed the below approaches: [1] Wait after releasing exclusive VM access in closeScope0 until all walkThread->scopedError are NULL. A hang might be possible if async events are not processed. [2] Check if walkThread->scopedError is set during the asyncheck and perform an OSR. This will result in an interpreter re-entry, and the interpreter will throw the exception during reentry. https://github.com/eclipse-openj9/openj9/issues/13211#issuecomment-2060054749: a concern was that we have to compile everything with involuntary OSR. [3] If exclusive VM access is an OSR safe point, we can decompile the top jitted frame in closeScope0. After closeScope0 releases exclusive VM access, the impacted thread will re-enter the interpreter and the async exception will be thrown.

babsingh commented 4 months ago

We couldn't find the historical reason on why the JIT is unable to throw an exception from async checks. @jdmpapin Would you know why? Will it be possible for the JIT to throw exceptions during async checks?

jdmpapin commented 4 months ago

About [1]: Just to confirm that a hang is a real possibility, the async event might not be processed because the JIT guarantees that we'll always reach an asynccheck "soon," but not necessarily one that will throw

[2] and [3] would both require involuntary OSR to be possible at any relevant asyncheck in the JIT-compiled code.

I'll add one more, which is what I'm considering: [4] Treat (some sufficient subset of) asynccheck trees as exception points in the JIT and make the async handler throw the pending exception (if any) when called from those points

gacholio commented 4 months ago

Having two async check JIT helpers is certainly feasible.

jdmpapin commented 4 months ago

I don't think that it would have been difficult to allow them to throw. I expect (based on intuition alone, no evidence...) that the historical reason would have been to give the optimizer more freedom by getting rid of exception control flow paths

One wrinkle that's occurring to me now is that when we're doing voluntary OSR with fear point analysis, we can create an OSR guard following an asynccheck. If the asynccheck is allowed to throw, then while we're paused there maybe we could patch OSR guards and also receive an async exception at the same time, and then we could end up in an exception handler in the same JIT body. So it might take a bit of doing to take care of that, but hopefully not too much considering that the same scenario is definitely already possible with calls (i.e. our OSR assumptions get invalidated while the callee is running, and then the callee throws)