eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

JDK21 serviceability_jvmti_j9_0_FAILED serviceability/jvmti/vthread/VThreadEventTest/VThreadEventTest.java timed out #18712

Closed JasonFengJ9 closed 7 months ago

JasonFengJ9 commented 9 months ago

Failure link

Created from https://github.com/eclipse-openj9/openj9/issues/18675#issuecomment-1880983707 as per https://github.com/eclipse-openj9/openj9/issues/18675#issuecomment-1881681970

From an internal build(win19x86-svl-rt4-1):

13:30:36  openjdk version "21.0.1" 2023-10-17 LTS
13:30:36  IBM Semeru Runtime Open Edition 21.0.1.0-m3 (build 21.0.1+12-LTS)
13:30:36  Eclipse OpenJ9 VM 21.0.1.0-m3 (build v0.42.0-release-69b6ceb69, JRE 21 Windows Server 2019 amd64-64-Bit Compressed References 20231017_74 (JIT enabled, AOT enabled)
13:30:36  OpenJ9   - 69b6ceb69
13:30:36  OMR      - 494d6eb66
13:30:36  JCL      - 5846adac994 based on jdk-21.0.1+12)

Rerun in Grinder - Change TARGET to run only the failed test targets.

Optional info

Failure output (captured from console output)


13:40:34  ===============================================
13:40:35  Running test serviceability_jvmti_j9_0 ...
13:40:35  ===============================================
13:40:36  serviceability_jvmti_j9_0 Start Time: Sun Jan  7 10:40:35 2024 Epoch Time (ms): 1704652835766
13:40:36  variation: Mode150
13:40:36  JVM_OPTIONS:  -XX:+UseCompressedOops -Xverbosegclog 

14:32:36  TEST: serviceability/jvmti/vthread/VThreadEventTest/VThreadEventTest.java

14:32:36  TEST RESULT: Error. Program `C:\Users\jenkins\workspace\Test_openjdk21_j9_extended.openjdk_x86-64_windows_testList_5\jdkbinary\j2sdk-image\bin\java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 1199454ms).
14:32:36  --------------------------------------------------
14:32:36  Test results: passed: 155; error: 1

14:33:05  -----------------------------------
14:33:05  serviceability_jvmti_j9_0_FAILED

50x grinder - 37/50 failed

babsingh commented 9 months ago

@fengxue-IS Can you take a look at this failure? This is targeted for the 0.43 Jan release. I am looking at another 0.43 failure: #18675. I won't be able to look at both of them. I have included my current findings below.

If you run the grinder with the changes from https://github.com/babsingh/aqa-tests/commits/debug_18712, then only VThreadEventTest will be run for serviceability_jvmti_j9_0 (TARGET). The timeout framework only outputs the Java stack trace. @llxia Is there a test option to get a system core file for the timeout? In this case, I don't see a way to use -Xdump to get the system core file.

The timeout failure is only seen on Windows. On Linux x64, the test passes all the time.

The hang happens in VThreadEventTest.java#L175-L179:

        for (int sleepNo = 1; threadEndCount() < THREAD_CNT; sleepNo++) {
            Thread.sleep(100);
            if (sleepNo % 100 == 0) { // 10 sec period of waiting
                log("main: waited seconds: " + sleepNo/10);
            }
        }

The test enables the following JVMTI events: VirtualThreadEnd, VirtualThreadMount, and VirtualThreadUnmount, which are triggered from the virtualThread* methods in javanextvmi.cpp. Then, it launches a number of virtual threads, and expects the following number of events:

VirtualThreadEnd cnt: 18 (expected: 18)
VirtualThreadMount cnt: 14 (expected: 14)
VirtualThreadUnmount cnt: 22 (expected: 22)

The hang happens in the above loop because 18 VirtualThreadEnd events don't occur.

The RI also experiences a similar issue: https://bugs.openjdk.org/browse/JDK-8322206. In their case, the count for VirtualThreadMount and VirtualThreadUnmount are off by 1.

In JVM_VirtualThreadEnd, we unconditionally trigger the VirtualThreadEnd event hook. So, this can be either related to how JVM_VirtualThreadEnd is invoked from the VirtualThread class or a test issue, in which case, we can either fix or disable the test. For the former scenario, we use RI's VirtualThread class so the problem might also exist in the RI.

fyi @tajila @pshipton

fengxue-IS commented 9 months ago

I will look into this. Is this failure machine dependent or can be reproduced on any windows machine?

babsingh commented 9 months ago

Is this failure machine dependent or can be reproduced on any windows machine?

Potentially, it can occur on any Windows machine. As per https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37171 (50x grinder), it happened on all the five Windows machines.

fengxue-IS commented 9 months ago

After inspecting the test code, it specifically had a warning that test require machine with 8+ Processors, I suspect this have to due with machine resources issue as most of our test machine are 4core. In the grinder run, the only machine where all runs passed is win16x64rt-1-8.fyre.ibm.com which have 8CPU

The test generates 18 virtual threads and 8 of those are using a spin loop to wait for a specific test condition, this would compete for resources with the main test thread trying to load a native library which must be completed within 20000ms.

Im testing with 8CPU machine to see if same error exist.

fengxue-IS commented 9 months ago

https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37326/ https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37327/ Grinder runs using 8CPU windows2016 machine

babsingh commented 9 months ago

https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37326/ https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37327/

@fengxue-IS The above grinders are running the entire serviceability_jvmti_j9_0 test suite. If you run the grinders with the changes from https://github.com/babsingh/aqa-tests/commits/debug_18712 (ADOPTOPENJDK_REPO/ADOPTOPENJDK_BRANCH), then only VThreadEventTest will be run for serviceability_jvmti_j9_0 (TARGET).

fengxue-IS commented 9 months ago

Launched 10x on https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37331

https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37326/ https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37327/

@fengxue-IS The above grinders are running the entire serviceability_jvmti_j9_0 test suite. If you run the grinders with the changes from https://github.com/babsingh/aqa-tests/commits/debug_18712 (ADOPTOPENJDK_REPO/ADOPTOPENJDK_BRANCH), then only VThreadEventTest will be run for serviceability_jvmti_j9_0 (TARGET).

fengxue-IS commented 9 months ago

10/10 passed on new grinder, I will wait for the other 2 grinder to pass before confirming this as a machine issue.

NOTE: currently our test infrastructure cannot specify machine spec when running test, so either we modify the test to not run when machine have less than 8 CPU or exclude the test by default (IMO this is a poorly written test with all the spin loop in the first place).

JasonFengJ9 commented 9 months ago

Is there a test option to get a system core file for the timeout?

https://github.com/eclipse-openj9/openj9/issues/15536#issuecomment-1183590767 stated The recent change only adds the -Xdump:system:events=user to the command line to ensure system core gets generated when STF kills a hanging process with kill -3. The functionality of STF sending a kill -3 at the event of a hang has not been touched though.

It would be nice to have such a feature in OpenJDK tests as well, created an internal issue.

pshipton commented 9 months ago

We already have this feature for OpenJDK tests, see https://github.com/adoptium/aqa-tests/blob/bde66741902409de2e9e068bd43b2195ff5644fa/openjdk/openjdk.mk#L186

On a timeout it should run jstack, and create a core and javacore.

The serviceability_jvmti_j9 playlist is missing $(TIMEOUT_HANDLER), I will create a PR shortly.

pshipton commented 9 months ago

Created https://github.com/adoptium/aqa-tests/pull/4961

JasonFengJ9 commented 9 months ago

15x grinder w/ https://github.com/adoptium/aqa-tests/pull/4961 merged.

JasonFengJ9 commented 9 months ago

The timed out failure was reproduced, its result file openjdk_test_output.tar.gz captured a core file.

15:26:14  STDERR:
15:26:14  JVMDUMP034I User requested System dump using 'C:\Users\jenkins\workspace\Grinder_iteration_2\aqa-tests\TKG\output_17050883661014\serviceability_jvmti_j9_0\work\scratch\2_44\core.20240112.122544.12704.0001.dmp' through com.ibm.jvm.Dump.systemDumpToFile
15:26:14  JVMDUMP010I System dump written to C:\Users\jenkins\workspace\Grinder_iteration_2\aqa-tests\TKG\output_17050883661014\serviceability_jvmti_j9_0\work\scratch\2_44\core.20240112.122544.12704.0001.dmp
15:26:14  JVMDUMP034I User requested Java dump using 'C:\Users\jenkins\workspace\Grinder_iteration_2\aqa-tests\TKG\output_17050883661014\serviceability_jvmti_j9_0\work\scratch\2_44\javacore.20240112.122610.12704.0002.txt' through com.ibm.jvm.Dump.javaDumpToFile
15:26:14  JVMDUMP010I Java dump written to C:\Users\jenkins\workspace\Grinder_iteration_2\aqa-tests\TKG\output_17050883661014\serviceability_jvmti_j9_0\work\scratch\2_44\javacore.20240112.122610.12704.0002.txt

FYI @babsingh

babsingh commented 9 months ago

I will wait for the other 2 grinder to pass before confirming this as a machine issue. https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37326/ https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37327/

The two grinders have passed. It confirms that the test needs at least 8 processors as per the warning in VThreadEventTest.java#L137-L139. Moving the issue to the 0.44 release since this is a test issue. I am in favour of updating the test. The test is valid if it is run on a machine with 8+ processors so we should continue to run the test, but return immediately after the warning message when the test is run on a machine with less than 8 processors. @fengxue-IS Let us know if this update is inaccurate.