Closed JasonFengJ9 closed 7 months ago
@fengxue-IS Can you take a look at this failure? This is targeted for the 0.43 Jan release. I am looking at another 0.43 failure: #18675. I won't be able to look at both of them. I have included my current findings below.
If you run the grinder with the changes from https://github.com/babsingh/aqa-tests/commits/debug_18712, then only VThreadEventTest
will be run for serviceability_jvmti_j9_0
(TARGET
). The timeout framework only outputs the Java stack trace. @llxia Is there a test option to get a system core file for the timeout? In this case, I don't see a way to use -Xdump
to get the system core file.
The timeout failure is only seen on Windows. On Linux x64, the test passes all the time.
The hang happens in VThreadEventTest.java#L175-L179:
for (int sleepNo = 1; threadEndCount() < THREAD_CNT; sleepNo++) {
Thread.sleep(100);
if (sleepNo % 100 == 0) { // 10 sec period of waiting
log("main: waited seconds: " + sleepNo/10);
}
}
The test enables the following JVMTI events: VirtualThreadEnd
, VirtualThreadMount
, and VirtualThreadUnmount
, which are triggered from the virtualThread*
methods in javanextvmi.cpp. Then, it launches a number of virtual threads, and expects the following number of events:
VirtualThreadEnd cnt: 18 (expected: 18)
VirtualThreadMount cnt: 14 (expected: 14)
VirtualThreadUnmount cnt: 22 (expected: 22)
The hang happens in the above loop because 18 VirtualThreadEnd
events don't occur.
The RI also experiences a similar issue: https://bugs.openjdk.org/browse/JDK-8322206. In their case, the count for VirtualThreadMount
and VirtualThreadUnmount
are off by 1.
In JVM_VirtualThreadEnd
, we unconditionally trigger the VirtualThreadEnd
event hook. So, this can be either related to how JVM_VirtualThreadEnd
is invoked from the VirtualThread
class or a test issue, in which case, we can either fix or disable the test. For the former scenario, we use RI's VirtualThread
class so the problem might also exist in the RI.
fyi @tajila @pshipton
I will look into this. Is this failure machine dependent or can be reproduced on any windows machine?
Is this failure machine dependent or can be reproduced on any windows machine?
Potentially, it can occur on any Windows machine. As per https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37171 (50x grinder), it happened on all the five Windows machines.
After inspecting the test code, it specifically had a warning that test require machine with 8+ Processors, I suspect this have to due with machine resources issue as most of our test machine are 4core. In the grinder run, the only machine where all runs passed is win16x64rt-1-8.fyre.ibm.com which have 8CPU
The test generates 18 virtual threads and 8 of those are using a spin loop to wait for a specific test condition, this would compete for resources with the main test thread trying to load a native library which must be completed within 20000ms.
Im testing with 8CPU machine to see if same error exist.
https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37326/ https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37327/ Grinder runs using 8CPU windows2016 machine
https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37326/ https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37327/
@fengxue-IS The above grinders are running the entire serviceability_jvmti_j9_0
test suite. If you run the grinders with the changes from https://github.com/babsingh/aqa-tests/commits/debug_18712 (ADOPTOPENJDK_REPO/ADOPTOPENJDK_BRANCH
), then only VThreadEventTest
will be run for serviceability_jvmti_j9_0
(TARGET
).
Launched 10x on https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37331
https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37326/ https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37327/
@fengxue-IS The above grinders are running the entire
serviceability_jvmti_j9_0
test suite. If you run the grinders with the changes from https://github.com/babsingh/aqa-tests/commits/debug_18712 (ADOPTOPENJDK_REPO/ADOPTOPENJDK_BRANCH
), then onlyVThreadEventTest
will be run forserviceability_jvmti_j9_0
(TARGET
).
10/10 passed on new grinder, I will wait for the other 2 grinder to pass before confirming this as a machine issue.
NOTE: currently our test infrastructure cannot specify machine spec when running test, so either we modify the test to not run when machine have less than 8 CPU or exclude the test by default (IMO this is a poorly written test with all the spin loop in the first place).
Is there a test option to get a system core file for the timeout?
https://github.com/eclipse-openj9/openj9/issues/15536#issuecomment-1183590767 stated The recent change only adds the -Xdump:system:events=user to the command line to ensure system core gets generated when STF kills a hanging process with kill -3. The functionality of STF sending a kill -3 at the event of a hang has not been touched though.
It would be nice to have such a feature in OpenJDK tests as well, created an internal issue.
We already have this feature for OpenJDK tests, see https://github.com/adoptium/aqa-tests/blob/bde66741902409de2e9e068bd43b2195ff5644fa/openjdk/openjdk.mk#L186
On a timeout it should run jstack, and create a core and javacore.
The serviceability_jvmti_j9 playlist is missing $(TIMEOUT_HANDLER)
, I will create a PR shortly.
The timed out failure was reproduced, its result file openjdk_test_output.tar.gz captured a core file.
15:26:14 STDERR:
15:26:14 JVMDUMP034I User requested System dump using 'C:\Users\jenkins\workspace\Grinder_iteration_2\aqa-tests\TKG\output_17050883661014\serviceability_jvmti_j9_0\work\scratch\2_44\core.20240112.122544.12704.0001.dmp' through com.ibm.jvm.Dump.systemDumpToFile
15:26:14 JVMDUMP010I System dump written to C:\Users\jenkins\workspace\Grinder_iteration_2\aqa-tests\TKG\output_17050883661014\serviceability_jvmti_j9_0\work\scratch\2_44\core.20240112.122544.12704.0001.dmp
15:26:14 JVMDUMP034I User requested Java dump using 'C:\Users\jenkins\workspace\Grinder_iteration_2\aqa-tests\TKG\output_17050883661014\serviceability_jvmti_j9_0\work\scratch\2_44\javacore.20240112.122610.12704.0002.txt' through com.ibm.jvm.Dump.javaDumpToFile
15:26:14 JVMDUMP010I Java dump written to C:\Users\jenkins\workspace\Grinder_iteration_2\aqa-tests\TKG\output_17050883661014\serviceability_jvmti_j9_0\work\scratch\2_44\javacore.20240112.122610.12704.0002.txt
FYI @babsingh
I will wait for the other 2 grinder to pass before confirming this as a machine issue. https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37326/ https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/37327/
The two grinders have passed. It confirms that the test needs at least 8 processors as per the warning in VThreadEventTest.java#L137-L139. Moving the issue to the 0.44 release since this is a test issue. I am in favour of updating the test. The test is valid if it is run on a machine with 8+ processors so we should continue to run the test, but return immediately after the warning message when the test is run on a machine with less than 8 processors. @fengxue-IS Let us know if this update is inaccurate.
Failure link
Created from https://github.com/eclipse-openj9/openj9/issues/18675#issuecomment-1880983707 as per https://github.com/eclipse-openj9/openj9/issues/18675#issuecomment-1881681970
From an internal build(
win19x86-svl-rt4-1
):Rerun in Grinder - Change TARGET to run only the failed test targets.
Optional info
Failure output (captured from console output)
50x grinder - 37/50 failed