eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.23k stars 712 forks source link

OpenJDK java/util/concurrent/SynchronousQueue/Fairness - Error: fair=true i=6 j=7 #18771

Open pshipton opened 5 months ago

pshipton commented 5 months ago

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_s390x_linux_Release_testList_0/10/ jdk_util_1 java/util/concurrent/SynchronousQueue/Fairness.java

19:29:58  java.lang.Error: fair=true i=6 j=7
19:29:58  
19:29:58    at Fairness.testFairness(Fairness.java:64)
19:29:58    at Fairness.main(Fairness.java:73)
19:29:58    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
19:29:58    at java.base/java.lang.reflect.Method.invoke(Method.java:586)
19:29:58    at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:333)
19:29:58    at java.base/java.lang.Thread.run(Thread.java:1595)
pshipton commented 5 months ago

https://openj9-jenkins.osuosl.org/job/Grinder/3181/ - jdk_util testing https://openj9-jenkins.osuosl.org/job/Grinder/3182/ - specific tests

pshipton commented 5 months ago

The grinder isn't finished, but this failure has already been repeated a number of times. https://openj9-jenkins.osuosl.org/job/Grinder_iteration_0/413 https://openj9-jenkins.osuosl.org/job/Grinder_iteration_2/371 https://openj9-jenkins.osuosl.org/job/Grinder_iteration_0/414/

The failure is repeatable when running just the single test.

@tobi someone needs to look at this.

keithc-ca commented 5 months ago

I think the test is broken: There's nothing in the code that prioritizes threads started earlier over those started later.

It is possible that the test is in a state where all the started threads are immediately after the call to lock.unlock() (but before the call to q.put(I)) before CPU cores are available to advance any of them. When processors are available, they can be scheduled in any order, providing no guarantee that the values are queued in order of ascending value.

By artificially inserting delays between those two calls (longer for smaller-numbered threads) the test can be made to fail at will.

keithc-ca commented 5 months ago

In fact, even if all threads call Thread.sleep(0) the test fails regularly.

pshipton commented 5 months ago

It's not a new test, and the test hasn't changed in years. It's not failing on other platforms, and we haven't seen it fail previously. I'm wondering what changed, and why it only fails on zlinux now.

keithc-ca commented 5 months ago

There are changes to SynchronousQueue between the openj9 and openj9-staging branches. In particular, this commit makes reference to that test failing

8300663: java/util/concurrent/SynchronousQueue/Fairness.java failed with "Error: fair=true i=0 j=1"

I'll dig deeper to see if I can find an explanation there.

keithc-ca commented 5 months ago

In https://bugs.java.com/bugdatabase/view_bug?bug_id=8300663 they seem to agree that testing fairness is difficult (emphasis mine):

I strongly suspect that any test trying to assert for fairness in this case will have to establish a total order, and as such total order is not likely to exist in any real-world use-case my bet is that fairness in this case (synchronous queue) is not observable.

pshipton commented 5 months ago

If we're going to ignore the test failure, the test needs to be excluded. I'd start with just on zlinux unless we start to see it fail elsewhere.

keithc-ca commented 5 months ago

The code for SynchronousQueue is the same in jdk21, jdk22 and jdknext. I just reproduced the problem with jdknext on xLinux. I suggest it should be excluded on all platforms.

pshipton commented 5 months ago

ok

keithc-ca commented 5 months ago

See https://github.com/adoptium/aqa-tests/pull/4987.

pshipton commented 5 months ago

For the record, it's not failing on the zlinux 21.0.2 Temurin build. https://openj9-jenkins.osuosl.org/job/Grinder/3184/ (10x on 5 machines)

14:47:33  openjdk version "21.0.2-beta" 2024-01-16
14:47:33  OpenJDK Runtime Environment Temurin-21.0.2+13-202401172331 (build 21.0.2-beta+13-ea)
14:47:33  OpenJDK 64-Bit Server VM Temurin-21.0.2+13-202401172331 (build 21.0.2-beta+13-ea, mixed mode, sharing)
pshipton commented 5 months ago

Pls create a PR for https://github.com/adoptium/aqa-tests/tree/v1.0.0-release as well.

keithc-ca commented 5 months ago

Pls create a PR for https://github.com/adoptium/aqa-tests/tree/v1.0.0-release as well.

Done: https://github.com/adoptium/aqa-tests/pull/4988

keithc-ca commented 5 months ago

The test waits for the newly launched thread to leave the RUNNING state; normally that happens by parking the thread:

3XMTHREADINFO      "supplier#6" J9VMThread:0x0000000000228400, omrthread_t:0x00007F5D1C7AF070, java/lang/Thread:0x00000000FFEB90A0, state:P, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1C, isDaemon:false)
3XMJAVALTHRCCL            jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000258F20C8)
3XMTHREADINFO1            (native thread ID:0x1B6F9, native priority:0x5, native policy:UNKNOWN, vmstate:P, vm thread flags:0x00020081)
3XMTHREADINFO2            (native stack address range from:0x00007F5CFC003000, to:0x00007F5CFC083000, size:0x80000)
3XMCPUTIME               CPU usage total: 0.002249038 secs, current category="Application"
3XMTHREADBLOCK     Parked on: java/util/concurrent/SynchronousQueue$Transferer@0x00000000258FB378 Owned by: <unknown>
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=4096 (0x1000)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at jdk/internal/misc/Unsafe.park(Native Method)
4XESTACKTRACE                at java/util/concurrent/locks/LockSupport.park(LockSupport.java:371)
4XESTACKTRACE                at java/util/concurrent/LinkedTransferQueue$DualNode.await(LinkedTransferQueue.java:458)
4XESTACKTRACE                at java/util/concurrent/LinkedTransferQueue.xfer(LinkedTransferQueue.java:613)
4XESTACKTRACE                at java/util/concurrent/SynchronousQueue.xfer(SynchronousQueue.java:233)
4XESTACKTRACE                at java/util/concurrent/SynchronousQueue.put(SynchronousQueue.java:264)
4XESTACKTRACE                at Fairness.unlockAndPut(Fairness.java:73)
4XESTACKTRACE                at Fairness.access$1(Fairness.java:70)
4XESTACKTRACE                at Fairness$1.run(Fairness.java:122)

Sometimes, however, there's contention for Thread.interruptLock between getState() and interrupted(). When the new thread loses, it becomes BLOCKED:

3XMTHREADINFO      "supplier#8" J9VMThread:0x000000000022FE00, omrthread_t:0x00007F5D1C7B8810, java/lang/Thread:0x00000000FFEC48A0, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1E, isDaemon:false)
3XMJAVALTHRCCL            jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000258F20C8)
3XMTHREADINFO1            (native thread ID:0x1B6FB, native priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000081)
3XMTHREADINFO2            (native stack address range from:0x00007F5CC42FF000, to:0x00007F5CC437F000, size:0x80000)
3XMCPUTIME               CPU usage total: 0.000705496 secs, current category="Application"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at java/lang/Thread.getAndClearInterrupt(Thread.java:1801)
5XESTACKTRACE                   (entered lock: java/lang/Object@0x00000000FFEC4928, entry count: 1)
4XESTACKTRACE                at java/lang/Thread.interrupted(Thread.java:1755)
4XESTACKTRACE                at java/util/concurrent/SynchronousQueue.put(SynchronousQueue.java:263)
4XESTACKTRACE                at Fairness.unlockAndPut(Fairness.java:73)
4XESTACKTRACE                at Fairness.access$1(Fairness.java:70)
4XESTACKTRACE                at Fairness$1.run(Fairness.java:122)

The call to Thread.interrupted() is a recent addition to SynchronousQueue.put() which explains why the test started failing.

Thread is modified for OpenJ9: interrupted() has synchronization (in getAndClearInterrupt()) that is not present upstream. It's not immediately obvious that that synchronization is necessary; perhaps it can be removed which would improve performance and allow the test to pass.

pshipton commented 4 months ago

Are any changes for this expected in the next couple of weeks, or should this be moved forward to the next release?

keithc-ca commented 4 months ago

The changes to Thread.getAndClearInterrupt() that were made in https://github.com/ibmruntimes/openj9-openjdk-jdk/pull/746 may have resolved this for jdk23+. I think we should back-port that change to jdk22 and jdk21 which should fix this for those versions as well.

keithc-ca commented 4 months ago

As we discovered in testing the related pull requests, removing the synchronization that breaks this test reintroduces a more serious, functional problem. I think the synchronization should be restored in the head stream until a solution that doesn't need it is discovered.

keithc-ca commented 4 months ago

See https://github.com/ibmruntimes/openj9-openjdk-jdk/pull/748 which restores that synchronization;

I think we should move this forward to the next release. I don't think it could cause any defensible real-world problems.