Closed pshipton closed 1 year ago
The previous grinder failed 5/6, passing only once in mode 1. Another 2 runs in mode 0 passed https://openj9-jenkins.osuosl.org/job/Grinder/1973/ - p8-java1-ibm07. Another 5 runs in mode 0 passed https://openj9-jenkins.osuosl.org/job/Grinder/1973/ - p8-java1-ibm03. Another 3 runs in mode 0 passed https://openj9-jenkins.osuosl.org/job/Grinder/1976 - p8-java1-ibm09. 7/20 failures, 1 in mode 0, 6 in mode 1 https://openj9-jenkins.osuosl.org/job/Grinder/1977/ - p8-java1-ibm09 11/20 failures, 6 in mode 0, 5 in mode 1 https://openj9-jenkins.osuosl.org/job/Grinder/1982/ - p8-java1-ibm09
Trying again with -Xint https://openj9-jenkins.osuosl.org/job/Grinder/1972/ - p8-java1-ibm03 - this passed, 3x mode 0, 1 Trying more runs in https://openj9-jenkins.osuosl.org/job/Grinder/1975/ - p8-java1-ibm11 - passed, 5x mode 0, 1 passed https://openj9-jenkins.osuosl.org/job/Grinder/1978/ - p8-java1-ibm09 - 10x mode 0, 1 (20 runs) passed https://openj9-jenkins.osuosl.org/job/Grinder/1979/ - p8-java1-ibm09 - 10x mode 0, 1 (20 runs) passed https://openj9-jenkins.osuosl.org/job/Grinder/1981/ - p8-java1-ibm09 - 30x mode 0, 1 (60 runs)
@0xdaryl fyi. I can't repeat this with -Xint so JIT should take the first look.
@zl-wang : could you assign someone to investigate this problem seen on AIX please?
@bhavanisn please take a look
@zl-wang Testcase works when run with TR_disableDirectMove=true. Without this option VSX instructions are used for conversions.
When TR_disableDirectMove=true:
0x7b41ec5c8e44 {java/lang/Float.floatToFloat16} +18 40000ec0 lfs f0, 0x40(r14)
0x7b41ec5c8e48 {java/lang/Float.floatToFloat16} +19 900020fc fmr f1, f0
0x7b41ec5c8e4c {java/lang/Float.floatToFloat16} +20 38002ed0 stfs f1, 0x38(r14)
0x7b41ec5c8e50 {java/lang/Float.floatToFloat16} +21 3800ce83 lwz r30, 0x38(r14)
0x7b41ec5c8e54 {java/lang/Float.floatToFloat16} +22 0000c057 rlwinm r0, r30, 0, 0, 0
0x7b41ec5c8e58 {java/lang/Float.floatToFloat16} +23 70861d7c srawi r29, r0, 0x10
0x7b41ec5c8e5c {java/lang/Float.floatToFloat16} +24 0000032c cmpwi r3, 0
0x7b41ec5c8e60 {java/lang/Float.floatToFloat16} +25 10018240 bne 0x7b41ec5c8f70 C>> +93
When TR_disableDirectMove is unset:
0x731910373744 {java/lang/Float.floatToFloat16} +18 30000ec0 lfs f0, 0x30(r14)
0x731910373748 {java/lang/Float.floatToFloat16} +19 900020fc fmr f1, f0
**0x73191037374c {java/lang/Float.floatToFloat16} +20 2c0c20f0 xscvdpspn vs1, f1
0x731910373750 {java/lang/Float.floatToFloat16} +21 900a20f0 xxspltw vs1, vs1, 0
0x731910373754 {java/lang/Float.floatToFloat16} +22 e6003e7c mfvsrwz r30, f1**
0x731910373758 {java/lang/Float.floatToFloat16} +23 0000c057 rlwinm r0, r30, 0, 0, 0
0x73191037375c {java/lang/Float.floatToFloat16} +24 70861d7c srawi r29, r0, 0x10
0x731910373760 {java/lang/Float.floatToFloat16} +25 0000032c cmpwi r3, 0
0x731910373764 {java/lang/Float.floatToFloat16} +26 10018240 bne 0x731910373874 C>> +94
the problem is with xscvdpspn (convert a SP-value in DP-format to SP-format) instruction: it automatically/silently converts a SNaN to a QNaN (or possibly Infinity if there is no 1-bit in the top 23bits of the mantissa). the results in "error" looked like that happening as well (the QNaN bit is set --- the 6th bit).
It looks like a legitimate conversion to me, as long as there is a conversion involved. So, the question is: should the test case expect bit by bit identical conversion? or there is no conversion expected at all (i.e. that conversion is a side effect of code generation instead of test case)?
Another question: why didn't it fail on Linux?
why didn't it fail on Linux?
The failure was intermittent. We haven't ramped up to test jdk20 regularly yet. Possibly it occurs on Linux but hasn't yet been seen.
Trying a grinder https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/1997/
As per the grinder, it does fail on plinux.
The OpenJDK issues that added/modified this test are: https://bugs.openjdk.org/browse/JDK-8289551 https://bugs.openjdk.org/browse/JDK-8293991 https://bugs.openjdk.org/browse/JDK-8289552
8293991 excluded the test on x under some conditions
* @requires (os.arch != "x86" & os.arch != "i386") | vm.opt.UseSSE == "null" | vm.opt.UseSSE > 0
8289552 added some Hotspot options including DisableIntrinsic:
* @run main/othervm -XX:-TieredCompilation -XX:CompileThresholdScaling=0.1 Binary16ConversionNaN
* @run main/othervm -XX:+UnlockDiagnosticVMOptions
* -XX:DisableIntrinsic=_float16ToFloat,_floatToFloat16 Binary16ConversionNaN
@pshipton it sounds like it should be disabled on p as well.
@bhavanisn the xxspltw instruction is completely unnecessary in this case at least, since xscvdpspn already populates the needed part of the register. try to get rid of it for as many cases as you can see. at lower priority of course.
@pshipton Yes it does fail on linux too. I am debugging it on plinux.
Can we control the behavior via a java command line option, similar in behavior to TR_disableDirectMove=true
so we can still run the test and have it pass?
@pshipton Unfortunately I am not seeing any other option other than the env variable.
unfortunately, it is currently only through that env-var. unless we creates an option to replace that env-var explicitly ... by the way, directMove has 3-8X performance advantage on that conversion, depending on the specific hardware.
@llxia can we separate this test out so it can be run on P with the environment variable set.
I have a PR to separate Binary16ConversionNaN out and run with export TR_disableDirectMove=true
. However, the test still failed intermittently.
https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/2010/
https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/2006/
13:22:54 export TR_disableDirectMove=true; \
13:22:54 "/home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java" -Xmx512m -jar "/home/jenkins/workspace/Grinder/aqa-tests/TKG/../../jvmtest/openjdk/jtreg/lib/jtreg.jar" \
...
Please let me know if I missed anything.
if both approaches can fail, the intermittent behaviour must be related to the timing of relevant method's JIT compilation. So, -Xjit:count=0 (no matter with export or not) must fail it consistently. Could you verify it?
@bhavanisn could you debug to verify xscvdpspn comes out with QNaN when the input is a SNaN? set breakpoints at +19 and +21 respectively. SNaN: the 12th bit is 0 in fpr0(since in DP format); QNaN: the 9th bit is 1 in vsr1 (since in SP format). Bit counted from 0 in IBM way (i.e. the 0th bit is the MSB --- most significant bit).
re https://github.com/eclipse-openj9/openj9/issues/16660#issuecomment-1428557631, yes, with -Xjit:count=0 and TR_disableDirectMove=true, it failed consistently - 10/10 failed in both mode https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/2018/
then, @bhavanisn this is a failure due to a completely different reason. but it is good that it can fail consistently. back to square 1.
@llxia Ran your branch against my branch with the fix and it passed. https://hyc-runtimes-jenkins.swg-devops.com/view/Test_grinder/job/Grinder/31132/console
Run on ppc64_aix with the fix and passed: https://hyc-runtimes-jenkins.swg-devops.com/view/OpenJ9%20-%20Personal/job/Pipeline-Build-Test-Personal/15532/
PR created: https://github.com/eclipse/omr/pull/6897
Thanks @bhavanisn . Since this failure is intermittent, could you run it with multiple iterations? i.e., ITERATIONS=10
@llxia Ran a pipeline with TKG_ITERATIONS=10 and ITERATIONS=10 which passed. https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/31162/
https://github.com/eclipse/omr/pull/6897 is merged. To double-check, do we still want to deliver PR https://github.com/adoptium/aqa-tests/pull/4329/ to separate Binary16ConversionNaN out and run with export TR_disableDirectMove=true on aix and plinux?
I think no. If the test is passing now we don't need to take any further action. We'll confirm if the testing is passing in nightly builds after the OMR change promotes.
The change seems to have been promoted manually and will go through nightly builds tonight.
Although the problem was exposed by a new test in jdk20, I assume the problem is generic and we should deliver it to the 0.38 release.
i haven't looked into this in details, but will ask here: is there Float16 support in pre-java20 releases, relevant to this bug? then, we can decide if this should be double-delivered into 0.38.
Ok, this is only applicable to jdk20, I didn't realize there were new APIs added to Float.
Closing as resolved for jdk20+ by https://github.com/eclipse/omr/pull/6897
https://openj9-jenkins.osuosl.org/job/Test_openjdk20_j9_sanity.openjdk_ppc64_aix_OpenJDK20/1 - p8-java1-ibm07 jdk_lang_0 java/lang/Float/Binary16ConversionNaN.java
This is a new test in jdk20.