eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

testSCCMLTests1_openj9_1 failed with openjdk11 Aarch64 #10741

Open LongyuZhang opened 3 years ago

LongyuZhang commented 3 years ago

Failure link:

https://ci.adoptopenjdk.net/job/Test_openjdk11_j9_sanity.functional_aarch64_linux/1

Test 26 of testSCCMLTests1_openj9_1 timed out on openjdk11 AArch64.

Failure output (captured from console output)

21:24:09  Testing: Test 26: CMVC 168131 : Create a non persistent cache
21:24:09  Test start time: 2020/09/29 01:24:08 Coordinated Universal Time
21:24:09  Running command: "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_aarch64_linux/openjdkbinary/j2sdk-image/bin/java"  -Xcompressedrefs -Xjit -Xgcpolicy:gencon  -Xshareclasses:name=ShareClassesCMLTests,nonpersistent -version
21:24:09  Time spent starting: 1 milliseconds
21:34:08  ***[TEST INFO 2020/09/29 01:34:08] ProcessKiller detected a timeout after 600000 milliseconds!***
21:34:08  INFO: getUnixPID() has failed indicating this is not a UNIX System.'Debug on timeout' is currently only supported on Linux.
21:34:08  
21:34:08  
Cancelling nested steps due to timeout
06:34:24  Sending interrupt signal to process
06:34:26  Time spent executing: 33016381 milliseconds
06:34:26  Test result: FAILED

hangshao0 commented 3 years ago

Is this failure intermittent or consistent? Does it fail on Java 8?

hangshao0 commented 3 years ago

FYI @knn-k

LongyuZhang commented 3 years ago

Is this failure intermittent or consistent? Does it fail on Java 8?

The jdk11 AArch64 pipeline was only just enabled and has had a single build so far. I tested with a personal build and it passed, so I think the failure is intermittent. The JDK 8 nightly has not been enabled yet because of limited machine resources, but a JDK 8 personal build also passed.

llxia commented 3 years ago

@LongyuZhang Could you try using the same SDK as the test build (OpenJDK Runtime Environment Openj9 (build 11.0.9+8-202009282343))? The Grinder runs you have use an older version from the Adopt API (OpenJDK Runtime Environment Openj9 (build 11.0.9+8-202009252344)).

LongyuZhang commented 3 years ago

@LongyuZhang Could you try using the same SDK as the test build (OpenJDK Runtime Environment Openj9 (build 11.0.9+8-202009282343))? The Grinder runs you have use an older version from the Adopt API (OpenJDK Runtime Environment Openj9 (build 11.0.9+8-202009252344)).

@llxia Thanks for the reminder. I have updated the SDK to the same version as the nightly build and tested on multiple machines. Only the machine used by the nightly build (test-aws-ubuntu1804-armv8-1) failed, again at Test 26 (I cancelled it manually after it had hung for 1 hour). The other machines (test-packet-ubuntu1604-armv8-2, test-aws-rhel76-armv8-2, test-aws-rhel76-armv8-4, test-packet-ubuntu1604-armv8-1) all passed the test; see https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/4002 - 4005 . So it looks like a machine issue.

pshipton commented 3 years ago

FYI https://openj9.slack.com/archives/C8312LCV9/p1602551615048900

There’s a pthread_cond_signal bug affecting glibc 2.27. Worth being aware of this if unexplained deadlocks are occurring: https://sourceware.org/bugzilla/show_bug.cgi?id=25847

andrew-m-leonard commented 3 years ago

Re-occurred on GA jdk-11.0.9+11_openj9-0.23.0 : https://ci.adoptopenjdk.net/job/Test_openjdk11_j9_sanity.functional_aarch64_linux/19/consoleFull

LongyuZhang commented 3 years ago

Hi @andrew-m-leonard, I discussed the testSCCMLTests1_openj9_1 failure you mentioned above (https://ci.adoptopenjdk.net/job/Test_openjdk11_j9_sanity.functional_aarch64_linux/19/consoleFull) with @llxia. It fails in tests 57-63, which is not the same as the Test 26 failure in this issue. Tests 57-63 were newly enabled by Hang Shao’s PR and have already passed in nightly builds #17 and #18.

Builds #19, #20 and #22 that you mentioned failed because they were not triggered by the nightly pipeline, so they hit the known issue "Functional testing uses wrong test material in release testing". Lan has a WIP PR for that issue, which has not been merged yet.

andrew-m-leonard commented 3 years ago

This happened twice last night, aarch64 and Windows: https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1579#issuecomment-726595987 https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1579#issuecomment-726630670

Can we get more debug output added to getUnixPid(), please? How can that fail?

andrew-m-leonard commented 3 years ago

See: https://github.com/eclipse/openj9/issues/11177

pshipton commented 3 years ago

Can we get more debug output added to getUnixPid(), please? How can that fail?

FYI As I recall, getUnixPid() is a hack that reaches into the implementation using reflection to find the pid, as there was no API to get it in Java 8. I believe Java 11 does provide an API.

knn-k commented 3 years ago

Has the test server (test-aws-ubuntu1804-armv8-1) been rebooted recently? If not, could it be rebooted so we can see whether the failure disappears?

andrew-m-leonard commented 3 years ago

@pshipton The logic here https://github.com/eclipse/openj9/blob/efdb86514d722cf83747d9d8badc449fe6121658/test/functional/cmdline_options_tester/src/Test.java#L415 will only work on jdk8 as UNIXProcess.java does not exist in jdk11+. For jdk11+ it should use the jdk11 API to get the pid.
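As a rough illustration of the pid lookup described above, here is a minimal sketch, not the actual Test.java code. It tries `Process.pid()` first, which is public API from Java 9 onwards, and falls back to reading a private `pid` field reflectively, which is assumed to be what the Java 8 Unix implementation class (`UNIXProcess`) exposes. The class name `PidLookup` and the `-1` "unknown" return value are hypothetical choices for this example.

```java
import java.lang.reflect.Field;

final class PidLookup {

    /** Returns the OS pid of a child process, or -1 if it cannot be determined. */
    static long pidOf(Process proc) {
        try {
            // Java 9+: Process.pid() is public API. It is looked up reflectively
            // here only so that the same source still compiles on Java 8.
            return (Long) Process.class.getMethod("pid").invoke(proc);
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Not available (Java 8) or not supported: try the fallback below.
        }
        try {
            // Java 8 on Unix (assumption): the implementation class keeps the pid
            // in a private field named "pid". This is an internal detail, not API.
            Field pidField = proc.getClass().getDeclaredField("pid");
            pidField.setAccessible(true);
            return ((Number) pidField.get(proc)).longValue();
        } catch (ReflectiveOperationException | RuntimeException e) {
            return -1; // e.g. Windows on Java 8, or a locked-down runtime
        }
    }

    public static void main(String[] args) throws Exception {
        String java = System.getProperty("java.home") + "/bin/java";
        Process proc = new ProcessBuilder(java, "-version")
                .redirectErrorStream(true)
                .start();
        System.out.println("child pid: " + pidOf(proc));
        proc.waitFor();
    }
}
```

With this shape, a failed lookup can be logged as "pid unavailable" rather than being reported as "not a UNIX system", which is what made the original message misleading on Linux.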

The failure on jdk8 Windows is likely because the ProcessKiller logic failed to kill the process. Windows processes can be stubborn about being killed, and we are seeing many orphaned testcase processes on Windows. We are looking at adding some post-testcase process cleanup to avoid this. It would be beneficial in this situation, though, if proc.waitFor() did not wait forever for the process to finish, because it won't! So maybe add some arbitrary timeout, such as 30 minutes? https://github.com/eclipse/openj9/blob/efdb86514d722cf83747d9d8badc449fe6121658/test/functional/cmdline_options_tester/src/Test.java#L227 As it stands, it is causing the whole build pipeline to hang all night!
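The bounded wait suggested above could look roughly like the following sketch. It uses `Process.waitFor(long, TimeUnit)` and `Process.destroyForcibly()`, both available since Java 8. The class name `BoundedWait`, the `TIMED_OUT` sentinel and the 10-second grace period are hypothetical choices for illustration and are not taken from Test.java.

```java
import java.util.concurrent.TimeUnit;

final class BoundedWait {

    /** Sentinel returned when the process did not exit within the timeout. */
    static final int TIMED_OUT = Integer.MIN_VALUE;

    /**
     * Waits up to the given timeout for the process to exit on its own.
     * If it does not, the process is killed and TIMED_OUT is returned,
     * so the caller never blocks indefinitely.
     */
    static int waitOrKill(Process proc, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (proc.waitFor(timeout, unit)) {
            return proc.exitValue(); // exited by itself within the timeout
        }
        proc.destroyForcibly();
        // Bounded grace period (arbitrary) so a stubborn process cannot
        // turn the cleanup itself into another endless wait.
        proc.waitFor(10, TimeUnit.SECONDS);
        return TIMED_OUT;
    }

    public static void main(String[] args) throws Exception {
        String java = System.getProperty("java.home") + "/bin/java";
        Process proc = new ProcessBuilder(java, "-version")
                .redirectErrorStream(true)
                .start();
        // The 30-minute figure from the comment above; any finite bound works.
        int result = waitOrKill(proc, 30, TimeUnit.MINUTES);
        System.out.println(result == TIMED_OUT ? "timed out" : "exit code " + result);
    }
}
```

The key design point is that both waits are finite: even if the kill does not take effect, as reported for some Windows processes, the test harness moves on and can report the orphaned process instead of stalling the pipeline.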

pshipton commented 3 years ago

Created https://github.com/eclipse/openj9/issues/11196 for the getUnixPid() issue.

pshipton commented 3 years ago

Created https://github.com/eclipse/openj9/issues/11197 for the cmdlinetests waiting forever.