Open pshipton opened 5 months ago
https://openj9-jenkins.osuosl.org/job/Grinder/3181/ - jdk_util testing https://openj9-jenkins.osuosl.org/job/Grinder/3182/ - specific tests
The grinder isn't finished, but this failure has already been repeated a number of times. https://openj9-jenkins.osuosl.org/job/Grinder_iteration_0/413 https://openj9-jenkins.osuosl.org/job/Grinder_iteration_2/371 https://openj9-jenkins.osuosl.org/job/Grinder_iteration_0/414/
The failure is repeatable when running just the single test.
@tobi someone needs to look at this.
I think the test is broken: There's nothing in the code that prioritizes threads started earlier over those started later.
It is possible that the test is in a state where all the started threads are immediately after the call to lock.unlock()
(but before the call to q.put(I)
) before CPU cores are available to advance any of them. When processors are available, they can be scheduled in any order, providing no guarantee that the values are queued in order of ascending value.
By artificially inserting delays between those two calls (longer for smaller-numbered threads) the test can be made to fail at will.
In fact, even if all threads call Thread.sleep(0)
the test fails regularly.
It's not a new test, and the test hasn't changed in years. It's not failing on other platforms, and we haven't seen it fail previously. I'm wondering what changed, and why it only fails on zlinux now.
There are changes to SynchronousQueue
between the openj9 and openj9-staging branches. In particular, this commit makes reference to that test failing
8300663: java/util/concurrent/SynchronousQueue/Fairness.java failed with "Error: fair=true i=0 j=1"
I'll dig deeper to see if I can find an explanation there.
In https://bugs.java.com/bugdatabase/view_bug?bug_id=8300663 they seem to agree that testing fairness is difficult (emphasis mine):
I strongly suspect that any test trying to assert for fairness in this case will have to establish a total order, and as such total order is not likely to exist in any real-world use-case my bet is that fairness in this case (synchronous queue) is not observable.
If we're going to ignore the test failure, the test needs to be excluded. I'd start with just on zlinux unless we start to see it fail elsewhere.
The code for SynchronousQueue
is the same in jdk21, jdk22 and jdknext. I just reproduced the problem with jdknext on xLinux. I suggest it should be excluded on all platforms.
ok
For the record, it's not failing on the zlinux 21.0.2 Temurin build. https://openj9-jenkins.osuosl.org/job/Grinder/3184/ (10x on 5 machines)
14:47:33 openjdk version "21.0.2-beta" 2024-01-16
14:47:33 OpenJDK Runtime Environment Temurin-21.0.2+13-202401172331 (build 21.0.2-beta+13-ea)
14:47:33 OpenJDK 64-Bit Server VM Temurin-21.0.2+13-202401172331 (build 21.0.2-beta+13-ea, mixed mode, sharing)
Pls create a PR for https://github.com/adoptium/aqa-tests/tree/v1.0.0-release as well.
Pls create a PR for https://github.com/adoptium/aqa-tests/tree/v1.0.0-release as well.
The test waits for the newly launched thread to leave the RUNNING
state; normally that happens by parking the thread:
3XMTHREADINFO "supplier#6" J9VMThread:0x0000000000228400, omrthread_t:0x00007F5D1C7AF070, java/lang/Thread:0x00000000FFEB90A0, state:P, prio=5
3XMJAVALTHREAD (java/lang/Thread getId:0x1C, isDaemon:false)
3XMJAVALTHRCCL jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000258F20C8)
3XMTHREADINFO1 (native thread ID:0x1B6F9, native priority:0x5, native policy:UNKNOWN, vmstate:P, vm thread flags:0x00020081)
3XMTHREADINFO2 (native stack address range from:0x00007F5CFC003000, to:0x00007F5CFC083000, size:0x80000)
3XMCPUTIME CPU usage total: 0.002249038 secs, current category="Application"
3XMTHREADBLOCK Parked on: java/util/concurrent/SynchronousQueue$Transferer@0x00000000258FB378 Owned by: <unknown>
3XMHEAPALLOC Heap bytes allocated since last GC cycle=4096 (0x1000)
3XMTHREADINFO3 Java callstack:
4XESTACKTRACE at jdk/internal/misc/Unsafe.park(Native Method)
4XESTACKTRACE at java/util/concurrent/locks/LockSupport.park(LockSupport.java:371)
4XESTACKTRACE at java/util/concurrent/LinkedTransferQueue$DualNode.await(LinkedTransferQueue.java:458)
4XESTACKTRACE at java/util/concurrent/LinkedTransferQueue.xfer(LinkedTransferQueue.java:613)
4XESTACKTRACE at java/util/concurrent/SynchronousQueue.xfer(SynchronousQueue.java:233)
4XESTACKTRACE at java/util/concurrent/SynchronousQueue.put(SynchronousQueue.java:264)
4XESTACKTRACE at Fairness.unlockAndPut(Fairness.java:73)
4XESTACKTRACE at Fairness.access$1(Fairness.java:70)
4XESTACKTRACE at Fairness$1.run(Fairness.java:122)
Sometimes, however, there's contention for Thread.interruptLock
between getState()
and interrupted()
. When the new thread loses, it becomes BLOCKED
:
3XMTHREADINFO "supplier#8" J9VMThread:0x000000000022FE00, omrthread_t:0x00007F5D1C7B8810, java/lang/Thread:0x00000000FFEC48A0, state:R, prio=5
3XMJAVALTHREAD (java/lang/Thread getId:0x1E, isDaemon:false)
3XMJAVALTHRCCL jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000258F20C8)
3XMTHREADINFO1 (native thread ID:0x1B6FB, native priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000081)
3XMTHREADINFO2 (native stack address range from:0x00007F5CC42FF000, to:0x00007F5CC437F000, size:0x80000)
3XMCPUTIME CPU usage total: 0.000705496 secs, current category="Application"
3XMHEAPALLOC Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3 Java callstack:
4XESTACKTRACE at java/lang/Thread.getAndClearInterrupt(Thread.java:1801)
5XESTACKTRACE (entered lock: java/lang/Object@0x00000000FFEC4928, entry count: 1)
4XESTACKTRACE at java/lang/Thread.interrupted(Thread.java:1755)
4XESTACKTRACE at java/util/concurrent/SynchronousQueue.put(SynchronousQueue.java:263)
4XESTACKTRACE at Fairness.unlockAndPut(Fairness.java:73)
4XESTACKTRACE at Fairness.access$1(Fairness.java:70)
4XESTACKTRACE at Fairness$1.run(Fairness.java:122)
The call to Thread.interrupted()
is a recent addition to SynchronousQueue.put()
which explains why the test started failing.
Thread
is modified for OpenJ9: interrupted()
has synchronization (in getAndClearInterrupt()
) that is not present upstream. It's not immediately obvious that that synchronization is necessary; perhaps it can be removed which would improve performance and allow the test to pass.
Are any changes for this expected in the next couple of weeks, or should this be moved forward to the next release?
The changes to Thread.getAndClearInterrupt()
that were made in https://github.com/ibmruntimes/openj9-openjdk-jdk/pull/746 may have resolved this for jdk23+. I think we should back-port that change to jdk22 and jdk21 which should fix this for those versions as well.
As we discovered in testing the related pull requests, removing the synchronization that breaks this test reintroduces a more serious, functional problem. I think the synchronization should be restored in the head stream until a solution that doesn't need it is discovered.
See https://github.com/ibmruntimes/openj9-openjdk-jdk/pull/748 which restores that synchronization;
I think we should move this forward to the next release. I don't think it could cause any defensible real-world problems.
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_s390x_linux_Release_testList_0/10/ jdk_util_1 java/util/concurrent/SynchronousQueue/Fairness.java