eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.27k stars 720 forks source link

JTReg jdk11-m2 Failure: TestJlmRemoteMemoryAuth_0 #9046

Open M-Davies opened 4 years ago

M-Davies commented 4 years ago

Failure link

Optional info

Failure output (captured from console output)

STF 16:44:08.898 - +------ Step 5 - Wait for the processes to complete
STF 16:44:08.899 - | Wait for processes to meet expectations
STF 16:44:08.899 - |   Processes: [LT1, CL1]
STF 16:44:08.899 - |
STF 16:44:08.899 - Monitoring processes: CL1 LT1
CL1 j> 2020/03/30 16:44:09.614 ServerURL=service:jmx:rmi:///jndi/rmi://localhost:1234/jmxrmi
CL1 j> 2020/03/30 16:44:15.507 Monitored VM not ready at Mar 30, 2020, 4:44:15 PM (attempt 0).
CL1 j> 2020/03/30 16:44:15.509 Wait 5 secs and trying again...
CL1 j> 2020/03/30 16:44:18.157 Connection established!
CL1 j> 2020/03/30 16:44:18.811 Starting to write data
CL1 ...j> 2020/03/30 16:44:51.548 Exiting as JVM we are connected to has finished
CL1 stderr Exception in thread "main" java.lang.AssertionError
CL1 stderr  at org.junit.Assert.fail(Assert.java:86)
CL1 stderr  at org.junit.Assert.fail(Assert.java:95)
CL1 stderr  at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaProxy(MemoryProfiler.java:208)
CL1 stderr  at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:110)
STF 16:44:51.784 - **FAILED** Process CL1 ended with exit code (1) and not the expected exit code/s (0)
STF 16:44:51.784 - Monitoring Report Summary:
STF 16:44:51.785 -   o Process CL1 ended with exit code (1) and not the expected exit code/s (0)
STF 16:44:51.785 -   o Process LT1 is still running as expected
STF 16:44:51.792 - Killing processes: CL1 LT1
STF 16:45:02.796 -   o Process LT1 killed
STF 16:45:02.796 -   o Process CL1 is not running
**FAILED** at step 5 (Wait for the processes to complete). Expected return value=0 Actual=1 at /home/jenkins/workspace/Test_openjdk11_j9_sanity.system_aarch64_linux_xl/openjdk-tests/TKG/../TKG/test_output_15855859707042/TestJlmRemoteMemoryAuth_0/20200330-164402-TestJlmRemoteMemoryAuth/execute.pl line 175.
STF 16:45:02.894 - **FAILED** execute script failed. Expected return value=0 Actual=1
pshipton commented 4 years ago

@knn-k

knn-k commented 4 years ago

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.system_aarch64_linux_Personal/2/ was the last time I saw this failure. Possibly an intermittent problem.

M-Davies commented 4 years ago

yeah it seems intermittent. Couldn't reproduce in a singular grinder https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2721/

knn-k commented 4 years ago

I ran this test 10 times manually on cent7-aarch64-1, and saw no failures. It takes 11 minutes to run a single round of this test on the test server.

M-Davies commented 4 years ago

I managed to recreate this.

Again, its on test-packet-armv8-ubuntu-16-04 (See https://github.com/eclipse/openj9/issues/9047#issuecomment-608324711 and https://github.com/eclipse/openj9/issues/9048#issuecomment-607897163)

knn-k commented 4 years ago

I ran this test 20 times on test-packet-armv8-ubuntu-16-04, but I couldn't reproduce the failure. I used the following version:

OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+9-202004062324)
Eclipse OpenJ9 VM AdoptOpenJDK (build master-956678929, JRE 11 Linux aarch64-64-Bit Compressed References 20200406_263 (JIT enabled, AOT enabled)
OpenJ9   - 956678929
OMR      - 9d422b0b0
JCL      - 634f577760 based on jdk-11.0.7+9)
knn-k commented 4 years ago

I managed to reproduce this on cent7-aarch64-1.

knn-k commented 4 years ago

The failure was recreated on cent7-aarch64-1 and on test-packet-armv8-ubuntu-16-04, but it takes long time and the rate is around 1/10 - 1/20.

I found this test failed with UnmarshalException caused by SSLException in the CL1 process shown below:

Exception in thread "main" java.lang.AssertionError: UnmarshalException:
Error unmarshaling return header; nested exception is:
        javax.net.ssl.SSLException: Tag mismatch!
        at org.junit.Assert.fail(Assert.java:88)
        at net.adoptopenjdk.test.jlm.resources.MemoryData.writeData(MemoryData.java:480)
Exception in thread "main" java.lang.AssertionError: UnmarshalException:
Error unmarshaling return header; nested exception is:
        javax.net.ssl.SSLException: Received fatal alert: bad_record_mac
        at org.junit.Assert.fail(Assert.java:88)
        at net.adoptopenjdk.test.jlm.resources.MemoryData.writeData(MemoryData.java:480)

My colleague and I ran the test repeatedly on those test servers with the following options:

  1. -Xint on cent7-aarch64-1 (16 runs)
  2. -Djdk.nativeCrypto=false on test-packet-armv8-ubuntu-16-04 (50+ runs)

We couldn't recreate failures with either of these options. JIT or native OpenSSL support may be the root cause, but it is difficult to conclude from these results.

knn-k commented 4 years ago

Both servers above have 96 CPU cores and OpenSSL 1.0.2. We used the following OpenJ9 build, which contains the fix for multi-threading with OpenSSL 1.0.2 https://github.com/ibmruntimes/openj9-openjdk-jdk11/pull/283 :

openjdk version "11.0.7" 2020-04-14
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+9-202004091659)
Eclipse OpenJ9 VM AdoptOpenJDK (build master-c6b594b6c, JRE 11 Linux aarch64-64-Bit Compressed References 20200409_277 (JIT enabled, AOT enabled)
OpenJ9   - c6b594b6c
OMR      - dba60e491
JCL      - 58d6fa1862 based on jdk-11.0.7+9)

I haven't been able to recreate this failure on a quad-core machine with OpenSSL 1.1.1.

pshipton commented 4 years ago

https://ci.eclipse.org/openj9/view/Test/job/Grinder/796 0.20.0 XL cent7-aarch64-1 TestJlmRemoteMemoryAuth_0

0xdaryl commented 4 years ago

This issue is difficult to reproduce (only been seen on a large core machine) and may be related to the version of OpenSSL (for example, we have not seen this problem on the machine on which it failed running with OpenSSL 1.1.1). We'll leave it open for a few more days to try and collect more data, but given that AArch64 is still early access there is a good chance this will be deferred.

knn-k commented 4 years ago

9769 reports a failure with SSLException: Tag mismatch! with JDK8 OpenJ9 v0.20.0 on amd64.

pshipton commented 4 years ago

0.21.0 m2 build https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.system_aarch64_linux_Personal/5/

Error unmarshaling return header; nested exception is: 
    javax.net.ssl.SSLException: Tag mismatch!
    at org.junit.Assert.fail(Assert.java:88)
    at net.adoptopenjdk.test.jlm.resources.MemoryData.writeData(MemoryData.java:480)
    at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaServer(MemoryProfiler.java:251)
    at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:112)
pshipton commented 4 years ago

0.21.0 m2 build https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.system_aarch64_linux_xl_Personal/2

CL2 stderr Exception in thread "main" java.lang.AssertionError: UnmarshalException: 
CL2 stderr Error unmarshaling return header; nested exception is: 
CL2 stderr  javax.net.ssl.SSLException: Received fatal alert: bad_record_mac
CL2 stderr  at org.junit.Assert.fail(Assert.java:88)
CL2 stderr  at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.recordMemoryStats(MemoryProfiler.java:307)
CL2 stderr  at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaServer(MemoryProfiler.java:255)
CL2 stderr  at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:112)
knn-k commented 4 years ago

I ran the test on cent7-aarch64-1 50 times with -Djdk.nativeDigest=false, and the exception was reproduced as follows. OpenSSL version on the machine is 1.0.2k:

CL1 stderr MessageDigest load - Native crypto library disabled.
CL1 stderr Warning: Native ChaCha20 load failed. Need OpenSSL 1.1.0 or above for ChaCha20 support. Using Java crypto implementation
CL1 stderr CipherCore Load - using native crypto library.
CL1 stderr RSACore load - using Native crypto library.

CL2 stderr Error unmarshaling return header; nested exception is:
CL2 stderr      javax.net.ssl.SSLException: Tag mismatch!
CL2 stderr      at org.junit.Assert.fail(Assert.java:88)
CL2 stderr      at net.adoptopenjdk.test.jlm.resources.MemoryData.writeData(MemoryData.java:480)
CL2 stderr      at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaServer(MemoryProfiler.java:251)
CL2 stderr      at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:112)
DanHeidinga commented 4 years ago

Moving this forward as we've completed the milestone 2 builds for 0.21.0 and it's too late to put this in.

lumpfish commented 3 years ago

This is occurring intermittently on AdoptOpenJDK nightly builds - e.g. https://ci.adoptopenjdk.net/job/Test_openjdk11_j9_sanity.system_aarch64_linux_xl/207/consoleFull

I have gone through the most recent runs and only found it occurring on machines test-packet-ubuntu1604-armv8-1 and test-packet-ubuntu1604-armv8-2 - i.e. Ubuntu 16.04 machines. I did not see any failures on the other AdoptOpenJDK test machines running rhel7.6

pshipton commented 3 years ago

Do we need this in the milestone plan? I'll remove it.

0xdaryl commented 3 years ago

Based on the investigations so far, the strong suspicion on this one is that it is an OpenSSL issue. We haven't found any evidence to suggest it is an OpenJ9 problem, hence deferring is the recommended approach.

One useful datapoint would be the version of OpenSSL found on the Ubuntu 16.04 boxes at AdoptOpenJDK and the RHL 7.6 boxes.

0xdaryl commented 3 years ago

The AArch64 builds use --with-openssl=system despite fetching the latest OpenSSL version 1.1.1i. I'm wondering if this failure can be reproduced if we change the build to use --with-openssl=fetched to ensure we're using the latest regardless of the machine we're running on and what version of OpenSSL happens to be installed there.

pshipton commented 3 years ago

Perhaps so. ~The reason this option isn't used now is so the OS or user is responsible for providing openssl updates. Otherwise Adopt is on the hook in case some out of band update is required. Also, the only way to get an update for a bundled library is by getting the latest version from Adopt, or (unsupported) manually copying the latest library to an older build.~

sxa commented 3 years ago

Just to add something extra to this - the problem appears to only be shown on the ThunderX aarch64 systems (Not sure which OpenJ9 have but if it's the 96-core ones then it's likely to be ThunderX) so it's possible that a fix will be out of our control if it's hardware related. I may try upgrading one of ours to a later level (I'll do it under but I'm not overly confident that it will really change anything.

Current infrastructure issue at adoptopenjdk is https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1897 - it may ultimately by a wontfix but this infromation at least defines the scope of the issue.

pshipton commented 3 years ago

Re-reading Daryl's comment https://github.com/eclipse/openj9/issues/9046#issuecomment-744566908 I realized I misunderstood the options and my earlier comment was bogus. What is the openssl version available on the build system? Considering it only fails on one target, it's likely the fault of the openssl installed on the target, but I'm not sure of the side affects of compiling against an out-dated openssl version.

sxa commented 3 years ago

As per recent updates on the infrastructure issue I would not assume it's related to the version that OpenJ9 is being linked with as I'm able to reproduce the same failures using the latest openssl111i codebase outside OpenJ9 on multiple ThunderX systems.