eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 720 forks source link

jdk20 OpenJDK java/lang/Float/Binary16ConversionNaN Roundtrip failure on NaN value #16660

Closed pshipton closed 1 year ago

pshipton commented 1 year ago

https://openj9-jenkins.osuosl.org/job/Test_openjdk20_j9_sanity.openjdk_ppc64_aix_OpenJDK20/1 - p8-java1-ibm07 jdk_lang_0 java/lang/Float/Binary16ConversionNaN.java

This is a new test in jdk20.

10:49:57  STDOUT:
10:49:57  Roundtrip failure on NaN value fded    got back ffed
10:49:57  Roundtrip failure on NaN value 7dee    got back 7fee
10:49:57  Roundtrip failure on NaN value fdee    got back ffee
10:49:57  Roundtrip failure on NaN value 7def    got back 7fef
10:49:57  Roundtrip failure on NaN value fdef    got back ffef
10:49:57  Roundtrip failure on NaN value 7df0    got back 7ff0
10:49:57  Roundtrip failure on NaN value fdf0    got back fff0
10:49:57  Roundtrip failure on NaN value 7df1    got back 7ff1
10:49:57  Roundtrip failure on NaN value fdf1    got back fff1
10:49:57  Roundtrip failure on NaN value 7df2    got back 7ff2
10:49:57  Roundtrip failure on NaN value fdf2    got back fff2
10:49:57  Roundtrip failure on NaN value 7df3    got back 7ff3
10:49:57  Roundtrip failure on NaN value fdf3    got back fff3
10:49:57  Roundtrip failure on NaN value 7df4    got back 7ff4
10:49:57  Roundtrip failure on NaN value fdf4    got back fff4
10:49:57  Roundtrip failure on NaN value 7df5    got back 7ff5
10:49:57  Roundtrip failure on NaN value fdf5    got back fff5
10:49:57  Roundtrip failure on NaN value 7df6    got back 7ff6
10:49:57  Roundtrip failure on NaN value fdf6    got back fff6
10:49:57  Roundtrip failure on NaN value 7df7    got back 7ff7
10:49:57  Roundtrip failure on NaN value fdf7    got back fff7
10:49:57  Roundtrip failure on NaN value 7df8    got back 7ff8
10:49:57  Roundtrip failure on NaN value fdf8    got back fff8
10:49:57  Roundtrip failure on NaN value 7df9    got back 7ff9
10:49:57  Roundtrip failure on NaN value fdf9    got back fff9
10:49:57  Roundtrip failure on NaN value 7dfa    got back 7ffa
10:49:57  Roundtrip failure on NaN value fdfa    got back fffa
10:49:57  Roundtrip failure on NaN value 7dfb    got back 7ffb
10:49:57  Roundtrip failure on NaN value fdfb    got back fffb
10:49:57  Roundtrip failure on NaN value 7dfc    got back 7ffc
10:49:57  Roundtrip failure on NaN value fdfc    got back fffc
10:49:57  Roundtrip failure on NaN value 7dfd    got back 7ffd
10:49:57  Roundtrip failure on NaN value fdfd    got back fffd
10:49:57  Roundtrip failure on NaN value 7dfe    got back 7ffe
10:49:57  Roundtrip failure on NaN value fdfe    got back fffe
10:49:57  Roundtrip failure on NaN value 7dff    got back 7fff
10:49:57  Roundtrip failure on NaN value fdff    got back ffff
10:49:57  STDERR:
10:49:57  java.lang.RuntimeException: 37 errors
10:49:57    at Binary16ConversionNaN.main(Binary16ConversionNaN.java:51)
10:49:57    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
pshipton commented 1 year ago

Grinder https://openj9-jenkins.osuosl.org/job/Grinder/1971/ - p8-java1-ibm09

pshipton commented 1 year ago

The previous grinder failed 5/6, passing only once in mode 1. Another 2 runs in mode 0 passed https://openj9-jenkins.osuosl.org/job/Grinder/1973/ - p8-java1-ibm07. Another 5 runs in mode 0 passed https://openj9-jenkins.osuosl.org/job/Grinder/1973/ - p8-java1-ibm03. Another 3 runs in mode 0 passed https://openj9-jenkins.osuosl.org/job/Grinder/1976 - p8-java1-ibm09. 7/20 failures, 1 in mode 0, 6 in mode 1 https://openj9-jenkins.osuosl.org/job/Grinder/1977/ - p8-java1-ibm09 11/20 failures, 6 in mode 0, 5 in mode 1 https://openj9-jenkins.osuosl.org/job/Grinder/1982/ - p8-java1-ibm09

Trying again with -Xint https://openj9-jenkins.osuosl.org/job/Grinder/1972/ - p8-java1-ibm03 - this passed, 3x mode 0, 1 Trying more runs in https://openj9-jenkins.osuosl.org/job/Grinder/1975/ - p8-java1-ibm11 - passed, 5x mode 0, 1 passed https://openj9-jenkins.osuosl.org/job/Grinder/1978/ - p8-java1-ibm09 - 10x mode 0, 1 (20 runs) passed https://openj9-jenkins.osuosl.org/job/Grinder/1979/ - p8-java1-ibm09 - 10x mode 0, 1 (20 runs) passed https://openj9-jenkins.osuosl.org/job/Grinder/1981/ - p8-java1-ibm09 - 30x mode 0, 1 (60 runs)

pshipton commented 1 year ago

@0xdaryl fyi. I can't repeat this with -Xint so JIT should take the first look.

0xdaryl commented 1 year ago

@zl-wang : could you assign someone to investigate this problem seen on AIX please?

zl-wang commented 1 year ago

@bhavanisn please take a look

bhavanisn commented 1 year ago

@zl-wang Testcase works when run with TR_disableDirectMove=true. Without this option VSX instructions are used for conversions.

When TR_disableDirectMove=true:

0x7b41ec5c8e44 {java/lang/Float.floatToFloat16} +18                       40000ec0 lfs       f0, 0x40(r14)
0x7b41ec5c8e48 {java/lang/Float.floatToFloat16} +19                       900020fc fmr       f1, f0
0x7b41ec5c8e4c {java/lang/Float.floatToFloat16} +20                       38002ed0 stfs      f1, 0x38(r14)
0x7b41ec5c8e50 {java/lang/Float.floatToFloat16} +21                       3800ce83 lwz       r30, 0x38(r14)
0x7b41ec5c8e54 {java/lang/Float.floatToFloat16} +22                       0000c057 rlwinm    r0, r30, 0, 0, 0
0x7b41ec5c8e58 {java/lang/Float.floatToFloat16} +23                       70861d7c srawi     r29, r0, 0x10
0x7b41ec5c8e5c {java/lang/Float.floatToFloat16} +24                       0000032c cmpwi     r3, 0
0x7b41ec5c8e60 {java/lang/Float.floatToFloat16} +25                       10018240 bne       0x7b41ec5c8f70 C>> +93 

When TR_disableDirectMove is unset:

0x731910373744 {java/lang/Float.floatToFloat16} +18                       30000ec0 lfs       f0, 0x30(r14)
0x731910373748 {java/lang/Float.floatToFloat16} +19                       900020fc fmr       f1, f0
**0x73191037374c {java/lang/Float.floatToFloat16} +20                       2c0c20f0 xscvdpspn vs1, f1
0x731910373750 {java/lang/Float.floatToFloat16} +21                       900a20f0 xxspltw   vs1, vs1, 0
0x731910373754 {java/lang/Float.floatToFloat16} +22                       e6003e7c mfvsrwz   r30, f1**
0x731910373758 {java/lang/Float.floatToFloat16} +23                       0000c057 rlwinm    r0, r30, 0, 0, 0
0x73191037375c {java/lang/Float.floatToFloat16} +24                       70861d7c srawi     r29, r0, 0x10
0x731910373760 {java/lang/Float.floatToFloat16} +25                       0000032c cmpwi     r3, 0
0x731910373764 {java/lang/Float.floatToFloat16} +26                       10018240 bne       0x731910373874 C>> +94
zl-wang commented 1 year ago

the problem is with xscvdpspn (convert a SP-value in DP-format to SP-format) instruction: it automatically/silently converts a SNaN to a QNaN (or possibly Infinity if there is no 1-bit in the top 23bits of the mantissa). the results in "error" looked like that happening as well (the QNaN bit is set --- the 6th bit).

It looks like a legitimate conversion to me, as long as there is a conversion involved. So, the question is: should the test case expect bit by bit identical conversion? or there is no conversion expected at all (i.e. that conversion is a side effect of code generation instead of test case)?

Another question: why didn't it fail on Linux?

pshipton commented 1 year ago

why didn't it fail on Linux?

The failure was intermittent. We haven't ramped up to test jdk20 regularly yet. Possibly it occurs on Linux but hasn't yet been seen.

Trying a grinder https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/1997/

pshipton commented 1 year ago

As per the grinder, it does fail on plinux.

pshipton commented 1 year ago

The OpenJDK issues that added/modified this test are: https://bugs.openjdk.org/browse/JDK-8289551 https://bugs.openjdk.org/browse/JDK-8293991 https://bugs.openjdk.org/browse/JDK-8289552

8293991 excluded the test on x under some conditions

 * @requires (os.arch != "x86" & os.arch != "i386") | vm.opt.UseSSE == "null" | vm.opt.UseSSE > 0

8289552 added some Hotspot options including DisableIntrinsic:

 * @run main/othervm -XX:-TieredCompilation -XX:CompileThresholdScaling=0.1 Binary16ConversionNaN
 * @run main/othervm -XX:+UnlockDiagnosticVMOptions
 * -XX:DisableIntrinsic=_float16ToFloat,_floatToFloat16 Binary16ConversionNaN
zl-wang commented 1 year ago

@pshipton it sounds like it should be disabled on p as well.

@bhavanisn the xxspltw instruction is completely unnecessary in this case at least, since xscvdpspn already populates the needed part of the register. try to get rid of it for as many cases as you can see. at lower priority of course.

bhavanisn commented 1 year ago

@pshipton Yes it does fail on linux too. I am debugging it on plinux.

pshipton commented 1 year ago

Can we control the behavior via a java command line option, similar in behavior to TR_disableDirectMove=true so we can still run the test and have it pass?

bhavanisn commented 1 year ago

@pshipton Unfortunately I am not seeing any other option other than the env variable.

zl-wang commented 1 year ago

unfortunately, it is currently only through that env-var. unless we creates an option to replace that env-var explicitly ... by the way, directMove has 3-8X performance advantage on that conversion, depending on the specific hardware.

pshipton commented 1 year ago

@llxia can we separate this test out so it can be run on P with the environment variable set.

llxia commented 1 year ago

I have a PR to separate Binary16ConversionNaN out and run with export TR_disableDirectMove=true. However, the test still failed intermittently. https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/2010/ https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/2006/

13:22:54  export TR_disableDirectMove=true; \
13:22:54    "/home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java" -Xmx512m -jar "/home/jenkins/workspace/Grinder/aqa-tests/TKG/../../jvmtest/openjdk/jtreg/lib/jtreg.jar" \
...

Please let me know if I missed anything.

zl-wang commented 1 year ago

if both approaches can fail, the intermittent behaviour must be related to the timing of relevant method's JIT compilation. So, -Xjit:count=0 (no matter with export or not) must fail it consistently. Could you verify it?

zl-wang commented 1 year ago

@bhavanisn could you debug to verify xscvdpspn comes out with QNaN when the input is a SNaN? set breakpoints at +19 and +21 respectively. SNaN: the 12th bit is 0 in fpr0(since in DP format); QNaN: the 9th bit is 1 in vsr1 (since in SP format). Bit counted from 0 in IBM way (i.e. the 0th bit is the MSB --- most significant bit).

llxia commented 1 year ago

re https://github.com/eclipse-openj9/openj9/issues/16660#issuecomment-1428557631, yes, with -Xjit:count=0 and TR_disableDirectMove=true, it failed consistently - 10/10 failed in both mode https://openj9-jenkins.osuosl.org/view/Test/job/Grinder/2018/

zl-wang commented 1 year ago

then, @bhavanisn this is a failure due to a completely different reason. but it is good that it can fail consistently. back to square 1.

bhavanisn commented 1 year ago

@llxia Ran your branch against my branch with the fix and it passed. https://hyc-runtimes-jenkins.swg-devops.com/view/Test_grinder/job/Grinder/31132/console

Run on ppc64_aix with the fix and passed: https://hyc-runtimes-jenkins.swg-devops.com/view/OpenJ9%20-%20Personal/job/Pipeline-Build-Test-Personal/15532/

PR created: https://github.com/eclipse/omr/pull/6897

llxia commented 1 year ago

Thanks @bhavanisn . Since this failure is intermittent, could you run it with multiple iterations? i.e., ITERATIONS=10

bhavanisn commented 1 year ago

@llxia Ran a pipeline with TKG_ITERATIONS=10 and ITERATIONS=10 which passed. https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/31162/

llxia commented 1 year ago

https://github.com/eclipse/omr/pull/6897 is merged. To double-check, do we still want to deliver PR https://github.com/adoptium/aqa-tests/pull/4329/ to separate Binary16ConversionNaN out and run with export TR_disableDirectMove=true on aix and plinux?

pshipton commented 1 year ago

I think no. If the test is passing now we don't need to take any further action. We'll confirm if the testing is passing in nightly builds after the OMR change promotes.

pshipton commented 1 year ago

The change seems to have been promoted manually and will go through nightly builds tonight.

pshipton commented 1 year ago

Although the problem was exposed by a new test in jdk20, I assume the problem is generic and we should deliver it to the 0.38 release.

zl-wang commented 1 year ago

i haven't looked into this in details, but will ask here: is there Float16 support in pre-java20 releases, relevant to this bug? then, we can decide if this should be double-delivered into 0.38.

pshipton commented 1 year ago

Ok, this is only applicable to jdk20, I didn't realize there were new APIs added to Float.

pshipton commented 1 year ago

Closing as resolved for jdk20+ by https://github.com/eclipse/omr/pull/6897