eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.27k stars 721 forks source link

JDK11 OpenJ9 Gives Non-deterministic Results #17033

Open connglli opened 1 year ago

connglli commented 1 year ago

Java version

openjdk version "11.0.19-internal" 2023-04-18
OpenJDK Runtime Environment (build 11.0.19-internal+0-adhoc..openj9-openjdk-jdk11)
Eclipse OpenJ9 VM (build master-f14707f, JRE 11 Linux amd64-64-Bit Compressed References 20230303_000000 (JIT enabled, AOT enabled)
OpenJ9   - f14707f
OMR      - 342b647
JCL      - a8cc681 based on jdk-11.0.19+4)

Javac version

javac 11.0.19-internal

Code and summary of the problem

The following test makes OpenJ9 (with above version) generate different result every time you run it.

import java.net.Socket;

class T {
  int a;
  long z;

  void foo(boolean b, int c) {
    c *= --c;
    Socket t = new Socket();
    String[][] q = {};
    String p = "-000-0";
    for (int k = 395; k < 5172; k += 1) {
      z += c;
      try {
        if (q[a][1].equals(p)) {}
      } catch (Throwable x) {
      } finally {
      }
    }
  }

  void too(int i, int k) {
    int j = 1;
    while (++j < 104) foo(true, k);
  }

  public static void main(String[] g) {
    T t = new T();
    t.too(t.a, t.a);
    System.out.println(t.z);
  }
}

Results of OpenJ9: every time you run it, OpenJ9 gives a different result

> ./OpenJ9/jdk11/bin/java T
277066
> ./OpenJ9/jdk11/bin/java T
291397
> ./OpenJ9/jdk11/bin/java T
286620
> ./OpenJ9/jdk11/bin/java T
296174

The correct answer is 0:

> ./OpenJ9/jdk11/bin/java -Xint T
296174
> ./HotSpot/jdk11u/bin/java T
0

Diagnosis

I have reported a similar bug in #15306, but I reported this bug in a new thread because (1) this one looks different from 15306, (2) 15306 is marked as fixed, and (3) this one is generated in a very recent development trunk.

I have tried a little bit, and noticed that the non-determinisn starts from optlevel=hot with method T.foo(). See below

> ./OpenJ9/jdk11/bin/java '-Xjit:exclude={T.foo(ZI)V}' T   
0

> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=noopt)' T
0

> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=warm)' T
0

> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot)' T
458592

> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=veryhot)' T
453815

> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=scorching)' T
463369
pshipton commented 1 year ago

@0xdaryl @hzongaro fyi

hzongaro commented 1 year ago

@connglli, I haven't been able to reproduce the failure. I'll keep trying, but in case I am unable, may I ask you to try producing a log file with the following command invocation?

./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,traceFull,log=issue17033.log)' T

connglli commented 1 year ago
> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,traceFull,log=issue17033.log)' T
210188

@hzongaro See issue17033.tar.gz

hzongaro commented 1 year ago

See issue17033.tar.gz

@connglli, thank you for uploading the log file. May I ask you to try reproducing the failure with this invocation? You can try it with a variety of optLevel settings, if you have the opportunity.

/OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optLevel=hot,disableDynamicLoopTransfer)' T
pshipton commented 1 year ago

I took the latest build and I'm able to reproduce it often, although I couldn't with 11.0.18.

I can reproduce it with '-Xjit:{T.foo(ZI)V}(optLevel=scorching,disableDynamicLoopTransfer)' and '-Xjit:{T.foo(ZI)V}(optLevel=hot,disableDynamicLoopTransfer)'. It seems to stop when I add -Xnoaot.

pshipton commented 1 year ago

I destroyed my shared cache for jdk-11.0.18, and on the 9th run the problem started to occur.

connglli commented 1 year ago

@hzongaro See the following logs. Hope they can help!

For optlevel=hot: see issue17033.hot.2223504.66143.20230329.130223.2223504.log

> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,disableDynamicLoopTransfer,traceFull,log=issue17033.hot)' T
353498

For optlevel=veryhot: see issue17033.veryhot.2223590.66155.20230329.130235.2223590.log

> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=veryhot,disableDynamicLoopTransfer,traceFull,log=issue17033.veryhot)' T
176749

For optlevel=scorching: see issue17033.scorching.2223674.66166.20230329.130246.2223674.log

> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=scorching,disableDynamicLoopTransfer,traceFull,log=issue17033.scorching)' T
324836
hzongaro commented 1 year ago

Thanks, Peter @pshipton . I was eventually able to reproduce the failure. I've found that once the failure occurs, I can reproduce it with lastOptIndex=206, but not with lastOptIndex=205:

$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,lastOptIndex=205)' T
0
$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,lastOptIndex=206)' T
458592
$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,lastOptIndex=205)' T
0
$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,lastOptIndex=206)' T
458592

That appears to correspond to Store Sinking:

$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,disableStoreSinking)' T
0
$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot)' T
463369

I'll continue to investigate

connglli commented 1 year ago

Hi Henry @hzongaro, could you please explain the lastOptIndex to me?

hzongaro commented 1 year ago

Hi, @connglli. The lastOptIndex option is short for "last optimization index" - it's used in attempting to narrow down which optimization might be causing a failure. No optimizations that are not absolutely required will run after the optimization that corresponds to that particular optimization number. As each optimization in a compilation is attempted, the optimization index is incremented, so it isn't immediately obvious which optimization 206 might be. I had to check a trace log to determine it was Store Sinking in this case.

You can read more about it in the Problem Determination guide. It's not something that should be relied upon outside of the context of investigating failures.

hzongaro commented 1 year ago

Looking at the transformations that are happening in Store Sinking, beforehand we see this in the IL:

n26n      treetop                                                                             [0x7f129b005240] bci=[-1,1,8] rc=0 vc=2376 vn=- li=- udi=- nc=1
n21n        iload  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )           [0x7f129b0050b0] bci=[-1,0,8] rc=2 vc=2376 vn=- li=50 udi=52 nc=0 flg=0x1000
n25n      istore  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ]                              [0x7f129b0051f0] bci=[-1,1,8] rc=0 vc=2376 vn=- li=14 udi=1 nc=1
n24n        iadd                                                                              [0x7f129b0051a0] bci=[-1,1,8] rc=1 vc=2376 vn=- li=- udi=- nc=2
n22n          iload  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [0x7f129b005100] bci=[-1,1,8] rc=1 vc=2376 vn=- li=51 udi=53 nc=0 flg=0x1000
n23n          iconst -1 (X!=0 X<=0 )                                                          [0x7f129b005150] bci=[-1,1,8] rc=1 vc=2376 vn=- li=- udi=- nc=0 flg=0x204
n29n      istore  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ]                              [0x7f129b005330] bci=[-1,6,8] rc=0 vc=2376 vn=- li=15 udi=2 nc=1
n28n        imul                                                                              [0x7f129b0052e0] bci=[-1,5,8] rc=1 vc=2376 vn=- li=- udi=- nc=2
n21n          ==>iload
n27n          iload  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [0x7f129b005290] bci=[-1,4,8] rc=1 vc=2376 vn=- li=52 udi=54 nc=0 flg=0x1000

That corresponds to this assignment at the start of T.foo:

 c *= --c;

After Store Sinking, we see this - the istore at node n29n has been replaced with a treetop

 n26n      treetop                                                                             [0x7f129b005240] bci=[-1,1,8] rc=0 vc=2585 vn=- li=- udi=- nc=1
n21n        iload  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )           [0x7f129b0050b0] bci=[-1,0,8] rc=2 vc=2585 vn=- li=- udi=52 nc=0 flg=0x1000
n25n      istore  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ]                              [0x7f129b0051f0] bci=[-1,1,8] rc=0 vc=2584 vn=- li=- udi=1 nc=1
n24n        iadd                                                                              [0x7f129b0051a0] bci=[-1,1,8] rc=1 vc=2584 vn=- li=- udi=- nc=2
n22n          iload  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [0x7f129b005100] bci=[-1,1,8] rc=1 vc=2584 vn=- li=- udi=53 nc=0 flg=0x1000
n23n          iconst -1 (X!=0 X<=0 )                                                          [0x7f129b005150] bci=[-1,1,8] rc=1 vc=2584 vn=- li=- udi=- nc=0 flg=0x204
n29n      treetop    <<<                                                                      [0x7f129b005330] bci=[-1,6,8] rc=0 vc=2583 vn=- li=- udi=2 nc=1
n28n        imul                                                                              [0x7f129b0052e0] bci=[-1,5,8] rc=1 vc=2583 vn=- li=- udi=- nc=2
n21n          ==>iload
n27n          iload  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [0x7f129b005290] bci=[-1,4,8] rc=1 vc=2583 vn=- li=- udi=54 nc=0 flg=0x1000
hzongaro commented 1 year ago

Digging into things a bit more, the problem lies with the way Store Sinking has propagated the istore from node n29n to later blocks. Before Store Sinking, the two stores to #419 appear in block_126, which is highlighted in the attached PDF file cfg.prestoresink.issue17033.pdf. After Store Sinking, the second store to #419 has been propagated to blocks 121, 123, 125, 169 and 170, which are highlighted in this second attached PDF file cfg.poststoresink.issue17033.pdf

Blocks 169 and 170 are predecessors of the loop that appears in the method. The problem is that the stores that have been propagated to those later blocks look like this:

n3714n    istore  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ]                              [0x7f70175f78a0] bci=[-1,6,8] rc=0 vc=2583 vn=- li=- udi=2 nc=1
n3715n      imul                                                                              [0x7f70175f78f0] bci=[-1,5,8] rc=1 vc=2583 vn=- li=- udi=- nc=2
n3716n        iload  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [0x7f70175f7940] bci=[-1,0,8] rc=1 vc=2585 vn=- li=- udi=52 nc=0 flg=0x1000
n3717n        iload  <parm 2 I>[#419  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [0x7f70175f7990] bci=[-1,4,8] rc=1 vc=2583 vn=- li=- udi=54 nc=0 flg=0x1000

That is, they multiply by itself the value of #419 that was stored by node n25n above, whereas the original multiplication at node n28n used a commoned node n21n as one operand and the result of the store from n25n as the other operand.

0xdaryl commented 1 year ago

@hzongaro : should this still be targeted to 0.40? Please advise.

hzongaro commented 1 year ago

@0xdaryl, I haven't been able to spend time on it. Although Peter mentioned above that he wasn't able to reproduce the failure with 11.0.18, I don't see any changes in Store Sinking that would indicate that this is a recent regression. I think it's safe to defer it beyond 0.40.

0xdaryl commented 1 year ago

Thanks. Moving to 0.41 (I mis-spoke about the release in my earlier comment).

hzongaro commented 1 year ago

No further progress - moving to the 0.43 release.

hzongaro commented 12 months ago

Following is a slightly modified version of the test case with which I can more reliably reproduce the failure:

import java.net.Socket;

class Issue17033c {
  int a = 1;
  long z;

  void foo(boolean b, int c, int d, int e, int f, int g, int h) {
    c += e + h + (e = c + d) + (h = f + g) + e + h;
    Socket t = new Socket();
    String[][] q = {};
    String p = "-000-0";
    if (b) {
       for (int k = 395; k < 5172; k += 1) {
         z += c;
         try {
           if (q[a][1].equals(p)) {}
         } catch (Throwable thr) {
         } finally {
         }
       }
    }
  }

  void too(int i, int b) {
    int j = 1;
    while (++j < 104) foo(true, i, i, i, i, i, i);
  }

  public static void main(String[] g) {
    Issue17033c t = new Issue17033c();
    t.too(1, 1);
    System.out.println(t.z);
  }
}
$ java -Xint Issue17033c
5359794
$ java -Xjit:{Issue17033c.foo*}\(optLevel=hot,optDetails,traceGeneralStoreSinking,traceLiveness,log=issue17033c.log\) -XX:-EnableHCR Issue17033c
6095452

Before general store sinking:

n29n      treetop                                                                             [       0x125d60fc0] bci=[-1,10,8] rc=0 vc=1500 vn=- li=- udi=- nc=1
n24n        iadd                                                                              [       0x125d60e30] bci=[-1,5,8] rc=2 vc=1500 vn=- li=- udi=- nc=2
n22n          iload  <parm 4 I>[#353  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [       0x125d60d90] bci=[-1,1,8] rc=1 vc=1500 vn=- li=19 udi=13 nc=0 flg=0x1000
n23n          iload  <parm 7 I>[#356  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [       0x125d60de0] bci=[-1,3,8] rc=1 vc=1500 vn=- li=20 udi=14 nc=0 flg=0x1000
n28n      istore  <parm 4 I>[#353  Parm] [flags 0x40000103 0x0 ]                              [       0x125d60f70] bci=[-1,10,8] rc=0 vc=1500 vn=- li=9 udi=1 nc=1
n27n        iadd                                                                              [       0x125d60f20] bci=[-1,8,8] rc=2 vc=1500 vn=- li=- udi=- nc=2
n25n          iload  <parm 2 I>[#351  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [       0x125d60e80] bci=[-1,6,8] rc=1 vc=1500 vn=- li=21 udi=15 nc=0 flg=0x1000
n26n          iload  <parm 3 I>[#352  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [       0x125d60ed0] bci=[-1,7,8] rc=1 vc=1500 vn=- li=- udi=16 nc=0 flg=0x1000
n34n      istore  <parm 7 I>[#356  Parm] [flags 0x40000103 0x0 ]                              [       0x125d61150] bci=[-1,19,8] rc=0 vc=1500 vn=- li=10 udi=2 nc=1
n33n        iadd                                                                              [       0x125d61100] bci=[-1,17,8] rc=2 vc=1500 vn=- li=- udi=- nc=2
n31n          iload  <parm 5 I>[#354  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [       0x125d61060] bci=[-1,13,8] rc=1 vc=1500 vn=- li=- udi=17 nc=0 flg=0x1000
n32n          iload  <parm 6 I>[#355  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [       0x125d610b0] bci=[-1,15,8] rc=1 vc=1500 vn=- li=- udi=18 nc=0 flg=0x1000
n42n      istore  <parm 2 I>[#351  Parm] [flags 0x40000103 0x0 ]                              [       0x125d613d0] bci=[-1,29,8] rc=0 vc=1500 vn=- li=11 udi=3 nc=1
n41n        iadd                                                                              [       0x125d61380] bci=[-1,28,8] rc=1 vc=1500 vn=- li=- udi=- nc=2
n40n          iadd                                                                            [       0x125d61330] bci=[-1,27,8] rc=1 vc=1500 vn=- li=- udi=- nc=2
n38n            iadd                                                                          [       0x125d61290] bci=[-1,24,8] rc=1 vc=1500 vn=- li=- udi=- nc=2
n36n              iadd                                                                        [       0x125d611f0] bci=[-1,21,8] rc=1 vc=1500 vn=- li=- udi=- nc=2
n30n                iadd                                                                      [       0x125d61010] bci=[-1,12,8] rc=1 vc=1500 vn=- li=- udi=- nc=2
n27n                  ==>iadd
n24n                  ==>iadd
n33n                ==>iadd
n37n              iload  <parm 4 I>[#353  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )     [       0x125d61240] bci=[-1,22,8] rc=1 vc=1500 vn=- li=22 udi=19 nc=0 flg=0x1000
n39n            iload  <parm 7 I>[#356  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )       [       0x125d612e0] bci=[-1,25,8] rc=1 vc=1500 vn=- li=23 udi=20 nc=0 flg=0x1000
n21n          iload  <parm 2 I>[#351  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [       0x125d60d40] bci=[-1,0,8] rc=1 vc=1500 vn=- li=24 udi=21 nc=0 flg=0x1000

After general store sinking:

n1703n    istore  <parm 2 I>[#351  Parm] [flags 0x40000103 0x0 ]                              [       0x1260fc410] bci=[-1,29,8] rc=0 vc=1593 vn=- li=- udi=3 nc=1
n1704n      iadd                                                                              [       0x1260fc460] bci=[-1,28,8] rc=1 vc=1593 vn=- li=- udi=- nc=2
n1705n        iadd                                                                            [       0x1260fc4b0] bci=[-1,27,8] rc=1 vc=1593 vn=- li=- udi=- nc=2
n1706n          iadd                                                                          [       0x1260fc500] bci=[-1,24,8] rc=1 vc=1593 vn=- li=- udi=- nc=2
n1707n            iadd                                                                        [       0x1260fc550] bci=[-1,21,8] rc=1 vc=1593 vn=- li=- udi=- nc=2
n1708n              iadd                                                                      [       0x1260fc5a0] bci=[-1,12,8] rc=1 vc=1593 vn=- li=- udi=- nc=2
n1709n                iadd                                                                    [       0x1260fc5f0] bci=[-1,8,8] rc=1 vc=1595 vn=- li=- udi=- nc=2
n1710n                  iload  <parm 2 I>[#351  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )  [       0x1260fc640] bci=[-1,6,8] rc=1 vc=1595 vn=- li=- udi=15 nc=0 flg=0x1000
n1711n                  iload  <parm 3 I>[#352  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )  [       0x1260fc690] bci=[-1,7,8] rc=1 vc=1595 vn=- li=- udi=16 nc=0 flg=0x1000
n1712n                iadd                                                                    [       0x1260fc6e0] bci=[-1,5,8] rc=1 vc=1596 vn=- li=- udi=- nc=2
n1713n                  iload  <parm 4 I>[#353  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )  [       0x1260fc730] bci=[-1,1,8] rc=1 vc=1596 vn=- li=- udi=13 nc=0 flg=0x1000
n1714n                  iload  <parm 7 I>[#356  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )  [       0x1260fc780] bci=[-1,3,8] rc=1 vc=1596 vn=- li=- udi=14 nc=0 flg=0x1000
n1715n              iadd                                                                      [       0x1260fc7d0] bci=[-1,17,8] rc=1 vc=1594 vn=- li=- udi=- nc=2
n1716n                iload  <parm 5 I>[#354  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )  [       0x1260fc820] bci=[-1,13,8] rc=1 vc=1594 vn=- li=- udi=17 nc=0 flg=0x1000
n1717n                iload  <parm 6 I>[#355  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )  [       0x1260fc870] bci=[-1,15,8] rc=1 vc=1594 vn=- li=- udi=18 nc=0 flg=0x1000
n1718n            iload  <parm 4 I>[#353  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )     [       0x1260fc8c0] bci=[-1,22,8] rc=1 vc=1593 vn=- li=- udi=19 nc=0 flg=0x1000
n1719n          iload  <parm 7 I>[#356  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )       [       0x1260fc910] bci=[-1,25,8] rc=1 vc=1593 vn=- li=- udi=20 nc=0 flg=0x1000
n1720n        iload  <parm 2 I>[#351  Parm] [flags 0x40000103 0x0 ] (cannotOverflow )         [       0x1260fc960] bci=[-1,0,8] rc=1 vc=1593 vn=- li=- udi=21 nc=0 flg=0x1000

I think what's happening is that general store sinking doesn't take into account the possibility that a store to a local might occur between the point at which a commoned load first appears and the candidate for store sinking where the commoned load is actually used.

In this example, n24n uses #353 and #356. After that, there are stores to those two local symbols at n28n and n34n, respectively. Finally, n24n is used in the tree of the candidate for store sinking at n42n.

After store sinking, n28n and n34n remain in their original block, and all the references to #353 and #356 in n1703n use the values loaded from those symbols after they've been stored to.

hzongaro commented 10 months ago

I have a fix in progress for this that I hope to complete for the 0.44 release.

hzongaro commented 4 months ago

I haven't managed to progress my prototype fix for this problem, but I'd like to try to get this fixed. Moving it to the 0.48 release.