adoptium / aqa-tests

Home of test infrastructure for Adoptium builds
https://adoptium.net/aqavit
Apache License 2.0
131 stars 310 forks source link

jdk_jfr tests fail on jdk8 hotspot xLinux #3253

Open Haroon-Khel opened 2 years ago

Haroon-Khel commented 2 years ago

The following tests from jdk_jfr in the extended openjdk test suite fail on xLinux

https://ci.adoptopenjdk.net/job/Test_openjdk8_hs_extended.openjdk_x86-64_linux_testList_1/30/testReport/

jdk/jfr/event/compiler/TestCompilerCompile.java.TestCompilerCompile

[2022-01-19T14:19:55.882Z] TEST RESULT: Error. Program 
`/home/jenkins/workspace/Test_openjdk8_hs_extended.openjdk_x86-64_linux_testList_1/openjdkbinary/j2sdk-image/bin/java' 
timed out (timeout set to 960000ms, elapsed time including timeout handling was 960365ms).

jdk/jfr/event/runtime/TestShutdownEvent.java.TestShutdownEvent

[2022-01-19T14:27:29.508Z] TEST RESULT: Error. Program 
`/home/jenkins/workspace/Test_openjdk8_hs_extended.openjdk_x86-64_linux_testList_1/openjdkbinary/j2sdk-image/bin/java' 
timed out (timeout set to 960000ms, elapsed time including timeout handling was 960366ms).

jdk/jfr/event/runtime/TestShutdownEvent.java.TestShutdownEvent has been excluded on aarch64 linux but not yet for xLinux https://github.com/adoptium/aqa-tests/pull/3007

smlambert commented 2 years ago

Deep history shows this test passes occasionally. Can we verify the status of the jdk_jfr tests from the October release on this platform? If we hit the same issues with last release (and reran to see it passing), we will do the same with this release. Try grinding on the machine its shown to pass in the deep history view (test-docker-ubuntu2010-x64-3). If passes, what is special about that machine in relation to what this test is doing (look at test code). If fails, run with ITERATIONS=10 to see what is the pass rate.

We need to take extra action to try and figure out root cause (test / infra / product) and raise an issue in the appropriate location.

If test issue: We should determine if this is a 'flaky' test (JBS report for test material issue)

If infra issue: one that needs a particular machine setup and raise/fix infra requirements (adoptium/infrastructure issue).

If intermittent product issue: JBS report for JFR issue

Haroon-Khel commented 2 years ago

Rerunning on the machine it last passed on, test-docker-ubuntu2010-x64-3, https://ci.adoptopenjdk.net/job/Grinder/3263/console

jiekang commented 2 years ago

@Haroon-Khel @smlambert

The deep history log provides a dump when the timeout occurs: https://trss.adoptium.net/output/test?id=61e848cba7df1a2924028e9a

TestCompilerCompiles has:

[2022-01-19T14:46:27.450Z] "MainThread" #9 prio=5 os_prio=0 tid=0x00007f0e2c12a800 nid=0x158efa waiting on condition [0x00007f0df70ee000]
[2022-01-19T14:46:27.450Z]    java.lang.Thread.State: TIMED_WAITING (sleeping)
[2022-01-19T14:46:27.450Z]  at java.lang.Thread.sleep(Native Method)
[2022-01-19T14:46:27.450Z]  at jdk.test.lib.Utils.waitForCondition(Utils.java:530)
[2022-01-19T14:46:27.450Z]  at jdk.test.lib.Utils.waitForCondition(Utils.java:499)
[2022-01-19T14:46:27.450Z]  at jdk.jfr.event.compiler.TestCompilerCompile.doTest(TestCompilerCompile.java:85)
[...]
[2022-01-19T14:46:27.450Z] Warning: 'NoSuchMethodError' on register of sun.hotspot.WhiteBox::NMTNewArena(J)J
[2022-01-19T14:46:27.450Z] Warning: 'NoSuchMethodError' on register of sun.hotspot.WhiteBox::NMTFreeArena(J)V
[2022-01-19T14:46:27.450Z] Warning: 'NoSuchMethodError' on register of sun.hotspot.WhiteBox::NMTArenaMalloc(JJ)V

which is the test waiting for the request to compile the method via WhiteBox.

TestShutdownEvent has:

[2022-01-19T14:54:31.488Z] "MainThread" #9 prio=5 os_prio=0 tid=0x00007fd7dc12a000 nid=0x159ece in Object.wait() [0x00007fd787db3000]
[2022-01-19T14:54:31.488Z]    java.lang.Thread.State: WAITING (on object monitor)
[2022-01-19T14:54:31.488Z]  at java.lang.Object.wait(Native Method)
[2022-01-19T14:54:31.488Z]  - waiting on <0x00007fd7c309e3a0> (a java.lang.UNIXProcess)
[2022-01-19T14:54:31.488Z]  at java.lang.Object.wait(Object.java:502)
[2022-01-19T14:54:31.488Z]  at java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
[2022-01-19T14:54:31.488Z]  - locked <0x00007fd7c309e3a0> (a java.lang.UNIXProcess)
[2022-01-19T14:54:31.488Z]  at jdk.test.lib.process.ProcessTools.getOutput(ProcessTools.java:98)
[2022-01-19T14:54:31.488Z]  at jdk.test.lib.process.OutputAnalyzer.<init>(OutputAnalyzer.java:48)
[2022-01-19T14:54:31.488Z]  at jdk.test.lib.process.ProcessTools.executeProcess(ProcessTools.java:412)
[2022-01-19T14:54:31.488Z]  at jdk.jfr.event.runtime.TestShutdownEvent.runSubtest(TestShutdownEvent.java:82)

where it's waiting on the code to actually start up the subprocess. However the stdout has:

[2022-01-19T14:54:31.493Z] ...
[2022-01-19T14:54:31.493Z] Output overflow:
[2022-01-19T14:54:31.493Z] JT Harness has limited the test output to the text
[2022-01-19T14:54:31.493Z] at the beginning and the end, so that you can see how the
[2022-01-19T14:54:31.493Z] test began, and how it completed.
[2022-01-19T14:54:31.493Z] 
[2022-01-19T14:54:31.493Z] If you need to see more of the output from the test,
[2022-01-19T14:54:31.493Z] set the system property javatest.maxOutputSize to a higher
[2022-01-19T14:54:31.493Z] value. The current value is 100000
[2022-01-19T14:54:31.493Z] ...

So we can't see exactly which subprocess was stuck.

I'd consider them both test/infra issues rather than product issues.

smlambert commented 2 years ago

Thanks for this assessment @jiekang - based on it and the fact that this test passes intermittently on some machines (test-docker-ubuntu2010-x64-3 has both passed and failed before which also potentially indicates infra/test issues), we will not block the release on these failures, but indeed we should figure out which subprocess is stuck and why...

@Haroon-Khel - could it be the case that there are some processes laying around that then block some of this tests subprocesses from running? and when we do a cleanup of the machines (static docker containers) we see 1 passing run before the machine is back to a state where this test fails again.

Haroon-Khel commented 2 years ago

could it be the case that there are some processes laying around that then block some of this tests subprocesses from running? and when we do a cleanup of the machines (static docker containers) we see 1 passing run before the machine is back to a state where this test fails again.

I'll take a look

Haroon-Khel commented 2 years ago

On test-docker-ubuntu2010-x64-3 there was a rogue Test_openjdk17_hs_extended.openjdk_x86-64_linux_testList_1 process from jan 18 still running.

Also, I noticed that test-docker-ubuntu2010-x64-3 is actually test-docker-fedora35-x64-1; I replaced test-docker-ubuntu2010-x64-3 with test-docker-ubuntu2110-x64-1 a few weeks back, and I must have forgotten to remove the jenkins node for test-docker-ubuntu2010-x64-3, and then when setting up test-docker-fedora35-x64-1 I gave it the same port on the dockerhost. That may explain why the test no longer passes on test-docker-ubuntu2010-x64-3, and so does hint towards this issue being a machine specific problem

Anyway, rerunning on test-docker-fedora35-x64-1 https://ci.adoptopenjdk.net/job/Grinder/3291/console