Closed pshipton closed 2 months ago
@hzongaro what looks like a regression. @ymanton is it related to https://github.com/eclipse-openj9/openj9/pull/19613 ?
@ymanton is it related to https://github.com/eclipse-openj9/openj9/pull/19613 ?
At first glance, doesn't appear to be. The 3 methods we've recognized for #19556 and #19613 aren't involved.
One of the failing tests is https://github.com/eclipse-aqavit/openj9-systemtest/blob/0cc1cdb8e33b41de4478f8f07d53a461905e5f49/openj9.test.daa/src/test.daa/net/openj9/test/arithmetics/TestArithmetics.java#L53-L72 which creates random number strings and calls the BigInteger(string)
constructor, which is not the constructor we recognize, nor does that constructor call the other two BigInteger methods we've recognized AFAICT.
The other test is hitting the same constructor.
The error is that the random number string is outside the range that BigInteger supports, but the largest number the first test creates is only 16 digits, which BigInteger should be able to represent.
what looks like a regression.
Investigating. . . .
I haven't seen this in 0.46 builds so far, but I've targeted 0.46 for now since there have been many changes double delivered.
JDK22 s390x_linux(sles12s390x-svl-rt7-1
)
[2024-06-08T20:17:46.497Z] variation: Mode610
[2024-06-08T20:17:46.497Z] JVM_OPTIONS: -Xcompressedrefs -Xjit -Xgcpolicy:gencon
[2024-06-08T20:18:26.272Z] DLT testFailure: testSign(net.openj9.test.arithmetics.TestValidityChecking): BigInteger would overflow supported range
[2024-06-08T20:18:26.272Z] DLT java.lang.ArithmeticException: BigInteger would overflow supported range
[2024-06-08T20:18:26.272Z] DLT at java.base/java.math.BigInteger.reportOverflow(BigInteger.java:1220)
[2024-06-08T20:18:26.272Z] DLT at java.base/java.math.BigInteger.<init>(BigInteger.java:573)
[2024-06-08T20:18:26.272Z] DLT at java.base/java.math.BigInteger.<init>(BigInteger.java:721)
[2024-06-08T20:18:26.272Z] DLT at net.openj9.test.Utils.getBigInteger(Utils.java:64)
[2024-06-08T20:18:26.272Z] DLT at net.openj9.test.arithmetics.TestArithmeticComparisonBase.getWrongFormatPDHelper2(TestArithmeticComparisonBase.java:131)
[2024-06-08T20:18:26.272Z] DLT at net.openj9.test.arithmetics.TestValidityChecking.testSign(TestValidityChecking.java:226)
[2024-06-08T20:18:26.272Z] DLT at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
[2024-06-08T20:18:26.272Z] DLT at java.base/java.lang.reflect.Method.invoke(Method.java:586)
[2024-06-08T20:18:26.272Z] DLT at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
[2024-06-08T20:18:26.272Z] DLT at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
[2024-06-08T20:18:26.273Z] DLT at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.Suite.runChild(Suite.java:128)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.Suite.runChild(Suite.java:27)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
[2024-06-08T20:18:26.273Z] DLT at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
[2024-06-08T20:18:26.273Z] DLT at net.adoptopenjdk.loadTest.adaptors.JUnitAdaptor.executeTest(JUnitAdaptor.java:130)
[2024-06-08T20:18:26.273Z] DLT at net.adoptopenjdk.loadTest.LoadTestRunner$2.run(LoadTestRunner.java:182)
[2024-06-08T20:18:26.273Z] DLT at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
[2024-06-08T20:18:26.273Z] DLT at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
[2024-06-08T20:18:26.273Z] DLT at java.base/java.lang.Thread.run(Thread.java:1582)
[2024-06-08T20:18:26.273Z] DLT testFinished: testSign(net.openj9.test.arithmetics.TestValidityChecking)
[2024-06-08T20:18:50.949Z] DaaLoadTest_daa1_5m_1_FAILED
I was able to reproduce this locally, 7/10 with TKG_ITERATIONS=10. Trying to narrow things down.
Interestingly it failed in the nightly build (Jason's is a nightly as well), so it's not a failure we'd only see in the weekly builds. Meaning the change that caused it is likely a more recent one. https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.system_s390x_linux_Nightly_testList_0/741 https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.system_s390x_linux_Nightly_testList_2/741 https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.system_s390x_linux_Nightly_testList_2/742 https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.system_s390x_linux_Nightly_testList_0/245/ https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_extended.system_s390x_linux_Nightly_testList_1/245
Changes from the previous nightly build that didn't fail.
https://github.com/eclipse-openj9/openj9/compare/cc375ba4845...6fae8827bc5
Also it does seem to be specific to Z.
@VermaSh, @rmnattas, may I ask you to take a look at this issue? Looking at a log file for a failing case, I see TREE VERIFICATION ERROR
and BLOCK VERIFICATION ERROR
in a scorching compilation of BigInteger.<init>(Ljava/lang/String;I)V
after Idiom Recognition. For instance, before Idiom Recognition, n3674n
has rc=1
:
n3639n BBStart <block_208> (freq 1997) (in loop 930) [ 0x3ff2c756130] bci=[19,1,47] rc=0 vc=15771 vn=- li=-1 udi=- nc=0
n3678n ificmplt --> block_200 BBStart at n3668n () [ 0x3ff2c756d60] bci=[19,1,47] rc=0 vc=15771 vn=- li=-1 udi=- nc=2 flg=0x20
n3674n iload cursor<auto slot 3>[#404 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff2c756c20] bci=[19,0,47] rc=1 vc=15771 vn=- li=- udi=1492 nc=0 flg=0x1000
n3675n iconst 0 (X==0 X>=0 X<=0 ) [ 0x3ff2c756c70] bci=[19,1,47] rc=1 vc=15771 vn=- li=- udi=- nc=0 flg=0x302
n3673n BBEnd </block_208> ===== [ 0x3ff2c756bd0] bci=[19,1,47] rc=0 vc=15771 vn=- li=-1 udi=- nc=0
after Idiom Recognition, it was a refcount of 2:
n3639n BBStart <block_208> (freq 1997) (in loop 930) [ 0x3ff2c756130] bci=[19,1,47] rc=0 vc=16095 vn=- li=- udi=- nc=0
n3678n ificmplt --> block_200 BBStart at n3668n () [ 0x3ff2c756d60] bci=[19,1,47] rc=0 vc=16095 vn=- li=- udi=- nc=2 flg=0x20
n3674n iload cursor<auto slot 3>[#404 Auto] [flags 0x3 0x0 ] (cannotOverflow ) [ 0x3ff2c756c20] bci=[19,0,47] rc=2 vc=16095 vn=- li=- udi=1209 nc=0 flg=0x1000
n3675n iconst 0 (X==0 X>=0 X<=0 ) [ 0x3ff2c756c70] bci=[19,1,47] rc=1 vc=16095 vn=- li=- udi=- nc=0 flg=0x302
n3673n BBEnd </block_208> ===== [ 0x3ff2c756bd0] bci=[19,1,47] rc=0 vc=16095 vn=- li=- udi=- nc=0
TREE VERIFICATION ERROR -- node [ 0x3ff2c756c20] ref count is 2 and should be 1
One of the changes in the range mentioned in @pshipton's previous comment was in Idiom Recognition: commit a8295da97d41181047db882efa05542a8e71e557
As the suspected cause of the failure wasn't delivered to the v0.46.0-release branch, I've reassigned this to the 0.48 release.
see TREE VERIFICATION ERROR and BLOCK VERIFICATION ERROR
I should have mentioned that those verification errors appear in logs for a build — https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Build_JDK17_s390x_linux_Personal/858/OpenJ9-JDK17-s390x_linux-20240608-080546.tar.gz — that exhibits the unexpected Exception, while an earlier build for which the test passes — https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Build_JDK17_s390x_linux_Nightly/763/OpenJ9-JDK17-s390x_linux-20240607-202826.tar.gz — does not show those verification errors.
looking into the failure
In the meantime we can revert this one commit https://github.com/eclipse-openj9/openj9/pull/19697 to have builds passing. https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/41359/ Testing with the revert.
The commit need to be squashed with an upcoming commit (still doesn't fix the issue) in addition to a fix to this issue.
Testing out a fix for the failure. Old code was duplicating base array node duplicate through createLoad()
. This wasn't copied over in the off-heap change which I suspect is what lead to the BLOCK VERIFICATION errors. I'll create a PR for the fix as soon as I am done testing it.
TREE VERIFICATION ERROR -- node [ 0x3ff2c756c20] ref count is 2 and should be 1
TREE VERIFICATION ERROR -- node [ 0x3ff2c756e00] ref count is 3 and should be 1
TREE VERIFICATION ERROR -- node [ 0x3fea9963e30] ref count is 2 and should be 1
TREE VERIFICATION ERROR -- node [ 0x3fea9963c00] ref count is 3 and should be 1
BLOCK VERIFICATION ERROR -- node [ 0x3ff2c756c20] accessed outside of its (extended) basic block: 1 time(s)
BLOCK VERIFICATION ERROR -- node [ 0x3ff2c756e00] accessed outside of its (extended) basic block: 2 time(s)
BLOCK VERIFICATION ERROR -- node [ 0x3fea9963e30] accessed outside of its (extended) basic block: 1 time(s)
BLOCK VERIFICATION ERROR -- node [ 0x3fea9963c00] accessed outside of its (extended) basic block: 2 time(s)
Thanks @VermaSh . @pshipton / @rmnattas We are waiting for Shubham's tests to finish before opening up the PR. DO you still want to go ahead with the reverting the problematic change ?
Would be better to just merge the fix, there's also the other related commit https://github.com/eclipse-openj9/openj9/commit/f3241fca1af209d99e5e2d851debea6fae5857fb to merge with it too. It's already in another WIP PR I have but if @VermaSh would grab it and include it in his PR I can remove it from mine. Or I can take his fix and include it in my PR.
We've gone a week without reverting, we can wait for the fix.
WIP PR with fix and @rmnattas's commit https://github.com/eclipse-openj9/openj9/pull/19724
The failure is still there https://hyc-runtimes-jenkins.swg-devops.com/job/Test_openjdk21_j9_sanity.system_s390x_linux_Personal_testList_2/45/consoleFull. Continuing the investigation.
Haven't been able to reproduce the failure consistently on my dev machine so launched a grinder to collect diagnostic files.
Looking into the core dump, I don't see anything suspicious so far. Will post another update after I have had a chance to do a deeper dive into the core dump.
The string that was passed in to the constructor[1] was 00
[2] so we should have returned at cursor == len
[3]. cursor
was 0 but should have been incremented to 2, same value as len
, in [4]. I have launched another grinder to get trace log to better understand some of the instructions.
[1] https://github.com/ibmruntimes/openj9-openjdk-jdk/blob/ff40ef0fed223100852dca63a72804b4b03ad48d/src/java.base/share/classes/java/math/BigInteger.java#L567-L575 "Big Integer constructor"
[3] https://github.com/ibmruntimes/openj9-openjdk-jdk21/blob/bdd0f758a673263028fd2fb07eb31be1099e68fe/src/java.base/share/classes/java/math/BigInteger.java#L518 "cursor == len condition"
[4] https://github.com/ibmruntimes/openj9-openjdk-jdk21/blob/bdd0f758a673263028fd2fb07eb31be1099e68fe/src/java.base/share/classes/java/math/BigInteger.java#L513-L516 "loop to skip leading zeros"
I didn't notice anything suspicious in trace log of BigInteger constructor from the failing. The trees and generated instructions look ok for the while loop. The string being passed in looks good as well, however, I wasn't able to verify the value of the index cursor
used to iterate over characters in the string. I am planning to compare scorching logs from failing and passing log to see if we are doing anything different on the failing run.
Comparing the two logs, only difference I saw was that in the failing run we inline java/lang/CharacterData.digit(II)I
and java/lang/CharacterData.of(I)Ljava/lang/CharacterData;
into BigInteger constructor, but the trees are pretty much the same. So planning to see if removing off-heap commits helps identify the change which caused the failure.
For future reference
Machine used for this investigation: rtj-ubu24s390x-svl-test-6lslg-1.fyre.ibm.com
cmd:
> make _DaaLoadTest_daa1_special_5m_29 \
TEST_JDK_HOME=$prototype_build BUILD_LIST=system TKG_ITERATIONS=20 \
EXTRA_OPTIONS='-Xnoaot -Xnoclassgc -Xjit:verbose=\{compileStart\|compileEnd\|inlining\},vlog=verbose.vlog,\{java/math/BigInteger.\<init\>\(Ljava/lang/String\;I\)V\}\(traceFull,traceCG,log=method.log\)'
Opened a WIP PR with the fix: https://github.com/eclipse-openj9/openj9/pull/19810
We should have returned from[1] but didn't because we used an incorrect cursor
value after the last iteration of[2]. We enter the constructor with array of three 0s and after looking at the 3rd element jump to block_2651
from n28666n
. But since the isub
hasn't been stored back to cursor
auto slot, we must subtract -3 from cursor again in block_2651
. We write the new cursor
value to the auto slot(#404
) after each iteration but always uses the initial value of 0 from n3674n
to calculate the new value. So when we add 3 to #404
in n30193n
we end up with 5 as the new cursor
value[3], value greater than array length. This is why we don't enter [1] and end up throwing an exception later on instead. With the fix, the slot used in isub
(n30178n
) of the exit block is the same as the loop n28647n
[4].
[1] java.base/share/classes/java/math/BigInteger.java#L561-L565 "if check"
[2] java.base/share/classes/java/math/BigInteger.java#L556-L559 "skip leading zeros"
Thanks @VermaSh for the summary , it is an interesting manifestation of incorrect reference count set on nodes, highlights the importance of testing the changes with paranoidOptCheck
in development stage.
https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_special.system_s390x_linux_Personal_testList_2/134 DaaLoadTest_daa1_special_5m_29
-Xgcpolicy:gencon -Xgc:dynamicBreadthFirstScanOrdering
https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_special.system_s390x_linux_Personal_testList_3/134 DaaLoadTest_daa1_special_5m_22
-XX:+UseCompressedOops -Xjit -Xgcpolicy:gencon -Xaggressive
Changes from last build (a week) https://github.com/eclipse-openj9/openj9/compare/099aaee5c9...6fae8827bc5 https://github.com/eclipse-openj9/openj9-omr/compare/0f74862420bcf4...361c9df886c