eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.27k stars 720 forks source link

TestHashTablePerformance_2 failure with JITServer: collisions have been handled too slow expected [true] but found [false] #19610

Open cjjdespres opened 4 months ago

cjjdespres commented 4 months ago

Failure link

https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk11_j9_sanity.functional_x86-64_linux_jit_Personal/1678/

Optional info

Failure output (captured from console output)

# OpenJDK Runtime Environment (build 11.0.24-internal+0-adhoc..BuildJDK11x86-64linuxjitPersonal)
# Eclipse OpenJ9 VM (build master-099aaee, JRE 11 Linux amd64-64-Bit Compressed References 20240602_1863 (JIT enabled, AOT enabled)
# OpenJ9   - 099aaee
# OMR      - 0f74862
# JCL      - 76593ed based on jdk-11.0.24+5)
# DETECTED_RELEASE_INFO=IMPLEMENTOR="Eclipse OpenJ9"
# JAVA_RUNTIME_VERSION="11.0.24-internal+0-adhoc..BuildJDK11x86-64linuxjitPersonal"
# JAVA_VERSION="11.0.24"
# JAVA_VERSION_DATE="2024-07-16"

        ===============================================
        ===============================================
        TestHashTablePerformance_2 Start Time: Sun Jun  2 03:13:48 2024 Epoch Time (ms): 1717323228467
        variation: Mode351
        JVM_OPTIONS: -XX:+UseJITServer -Xgcpolicy:metronome -Xcompressedrefs 

        TEST SETUP:
        JVMSHRC005I No shared class caches available
        JVMSHRC005I No shared class caches available
        cache cleanup done

        TESTING:
        [IncludeExcludeTestAnnotationTransformer] [INFO] EXCLUDE_FILE environment variable: /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/aqa-tests/TKG/../TestConfig/resources/excludes/latest_exclude_11.txt
        [IncludeExcludeTestAnnotationTransformer] [INFO] Processing exclude file: /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/aqa-tests/TKG/../TestConfig/resources/excludes/latest_exclude_11.txt
        ...
        ... TestNG 6.14.2 by Cédric Beust (cedric@beust.com)
        ...

        FAILED: testHashTablePerformance
        java.lang.AssertionError: !!! Test failed: collisions have been handled too slow expected [true] but found [false]
            at org.testng.Assert.fail(Assert.java:96)
            at org.testng.Assert.failNotEquals(Assert.java:776)
            at org.testng.Assert.assertTrue(Assert.java:44)
            at j9vm.test.hashTable.HashTablePerformanceTest.testHashTablePerformance(HashTablePerformanceTest.java:47)
            at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.base/java.lang.reflect.Method.invoke(Method.java:572)
            at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
            at org.testng.internal.Invoker.invokeMethod(Invoker.java:580)
            at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:716)
            at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:988)
            at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
            at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109)
            at org.testng.TestRunner.privateRun(TestRunner.java:648)
            at org.testng.TestRunner.run(TestRunner.java:505)
            at org.testng.SuiteRunner.runTest(SuiteRunner.java:455)
            at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:450)
            at org.testng.SuiteRunner.privateRun(SuiteRunner.java:415)
            at org.testng.SuiteRunner.run(SuiteRunner.java:364)
            at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
            at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:84)
            at org.testng.TestNG.runSuitesSequentially(TestNG.java:1208)
            at org.testng.TestNG.runSuitesLocally(TestNG.java:1137)
            at org.testng.TestNG.runSuites(TestNG.java:1049)
            at org.testng.TestNG.run(TestNG.java:1017)
            at org.testng.TestNG.privateMain(TestNG.java:1354)
            at org.testng.TestNG.main(TestNG.java:1323)
cjjdespres commented 4 months ago

Attn @mpirvu.

mpirvu commented 4 months ago

Could you please add some context. Is this a new failure that started recently? is this a new test? Does it happen only Java11, or is it seen with other versions of Java as well?

cjjdespres commented 4 months ago

It started failing recently - I looked at that particular configuration (JDK 11, x86_64) and saw one failure here that I had missed, but nothing before that. I haven't seen this on other platforms or other Java versions yet.

The test does not appear to be recent - I'm not sure exactly where it's defined, but I saw it mentioned in an issue from 2018.

cjjdespres commented 4 months ago

A 100x grinder using the nightly build from May 15 passed. Version information:

10:52:18  openjdk version "11.0.24-internal" 2024-07-16
10:52:18  OpenJDK Runtime Environment (build 11.0.24-internal+0-adhoc..BuildJDK11x86-64linuxjitPersonal)
10:52:18  Eclipse OpenJ9 VM (build master-40817e8, JRE 11 Linux amd64-64-Bit Compressed References 20240515_1845 (JIT enabled, AOT enabled)
10:52:18  OpenJ9   - 40817e8
10:52:18  OMR      - b9a6ccc
10:52:18  JCL      - 795d069 based on jdk-11.0.24+2)

The earliest failure I saw was the May 31st failure that I noted in https://github.com/eclipse-openj9/openj9/issues/19610#issuecomment-2147606051, and that had version information:

# OpenJDK Runtime Environment (build 11.0.24-internal+0-adhoc..BuildJDK11x86-64linuxjitPersonal)
# Eclipse OpenJ9 VM (build master-ff7726e, JRE 11 Linux amd64-64-Bit Compressed References 20240531_1861 (JIT enabled, AOT enabled)
# OpenJ9   - ff7726e
# OMR      - 6463b76
# JCL      - bea3662 based on jdk-11.0.24+5)

This test completes quite quickly, so I can probably bisect the failure and narrow it down to a single day.

cjjdespres commented 4 months ago

Some mixed results from grinders:

There might have been a change between May 31 and June 3 that made this more likely to occur. I'll run more tests, but at least for the May 15-31 test period (so changes from May 14-30) the failure is more intermittent, if present at all.

cjjdespres commented 4 months ago

I repeated the bisection with 500x grinders.

The latest successful version (0/500 failures, May 27 nightly run)

OpenJDK Runtime Environment (build 11.0.24-internal+0-adhoc..BuildJDK11x86-64linuxjitPersonal)
Eclipse OpenJ9 VM (build master-ce04759, JRE 11 Linux amd64-64-Bit Compressed References 20240527_1857 (JIT enabled, AOT enabled)
OpenJ9   - ce04759
OMR      - e32d21b
JCL      - b3e5f08 based on jdk-11.0.24+4)

and the earliest failing version (2/500 failures, May 28 nightly run)

openjdk version "11.0.24-internal" 2024-07-16
OpenJDK Runtime Environment (build 11.0.24-internal+0-adhoc..BuildJDK11x86-64linuxjitPersonal)
Eclipse OpenJ9 VM (build master-4ff10c9, JRE 11 Linux amd64-64-Bit Compressed References 20240528_1858 (JIT enabled, AOT enabled)
OpenJ9   - 4ff10c9
OMR      - 01374a4
JCL      - b3e5f08 based on jdk-11.0.24+4)

The ranges are ce04759...4ff10c9 and https://github.com/eclipse/omr/compare/e32d21b...01374a4

cjjdespres commented 3 months ago

In the range given in https://github.com/eclipse-openj9/openj9/issues/19610#issuecomment-2152982016, there is the PR https://github.com/eclipse-openj9/openj9/pull/19545 that re-enables the IProfiler for JITServer clients. If that is the cause, then IProfiler being disabled may have masked the appearance of the failure. The PR https://github.com/eclipse-openj9/openj9/pull/19355 (merged in bfaeffa721246eeed84345ce1dea45ad264385d7 on May 7) was what disabled the IProfiler. The commit e92c6946102af608ff0711987f538331e4a9a290 re-enabled it, so that commit can be cherry-picked to re-enable IProfiler at other times in the repo history.

Some preliminary bisection results, with e92c6946102af608ff0711987f538331e4a9a290 cherry-picked onto the indicated openj9 commit:

That possibly narrows it down to 2da7e0a283c65b1ce458621737c2f0f61bfcc3ca...ec6dc8e0b598e870d1c70388e0a1d4df45ea118e and https://github.com/eclipse/omr/compare/dd4c5d5459a1e2a890f18b68d5c6b7b105b658d0...e32d21b256728bc90598eb68fddfebbf52f8b740.

The failure rate in grinder seems to be pretty low, which makes me less confident in the lower bound of these ranges. It's possible that the initial 11/100 failure grinder I linked to in https://github.com/eclipse-openj9/openj9/issues/19610#issue-2331885111 was simply very unlucky in terms of number of failures, as the rate has usually been 2/500 in subsequent testing.


This did fail again in https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal/1019/.

cjjdespres commented 3 months ago

I'm also not sure where this test is defined. Edit - this is an internal test.

cjjdespres commented 3 months ago

I got a failure (1/475, though one iteration had an infra failure halfway through) with 0e57b7a83a824e024f7c5f4e8f6eba69a85401d8 and https://github.com/eclipse/omr/commit/5faa3ec5fa4166271c7c9ac24acdb3eb508fee3b, and got 0/500 failures with 0fcba08e6f9e21add71098f96c65abd61147473d and https://github.com/eclipse/omr/commit/5faa3ec5fa4166271c7c9ac24acdb3eb508fee3b. (EDIT - those builds were based on the indicated commit, with https://github.com/eclipse-openj9/openj9/commit/e92c6946102af608ff0711987f538331e4a9a290 cherry-picked on top).

That's the range 0fcba08e6f9e21add71098f96c65abd61147473d...0e57b7a83a824e024f7c5f4e8f6eba69a85401d8, so it's possible it's https://github.com/eclipse-openj9/openj9/pull/19523, but I'm still not very confident about the lower bound on the range. I'll grind a recent build to see the current failure rate, and also a build with those changes reverted.

mpirvu commented 3 months ago

The test seems to measure speed, though I might be wrong. If that's the case, is this a matter of JITServer performance rather than a functional bug?

cjjdespres commented 3 months ago

Yes, the comments in the test code seem to indicate that it's a performance test. There's a TestHashTableCorrectness test defined in the same directory, which seems like the correctness-testing counterpart to TestHashTablePerformance.

cjjdespres commented 3 months ago

I used a build starting with https://github.com/eclipse/omr/commit/befee22 and 293f2b7812795bdf5ca611bc832bf5a1dc3d052a (two recent-ish commits), but with 0cb4aa5951855c28b6e6fc61e3d8570b6ff220a7 reverted, and still got failures in this test, so I have no idea what the problem is.