eclipse-omr / omr

Eclipse OMR™ Cross platform components for building reliable, high performance language runtimes
http://www.eclipse.org/omr
Other
948 stars 396 forks source link

Re-enable byteswap support on x86 and Power #5642

Open fjeremic opened 4 years ago

fjeremic commented 4 years ago

There were issues spotted since #5623 on x86 and Power.

On 32-bit x86 we seem to have problems when we use register pairs in that the byte swap is not correctly performed. Here is what we generate:

------------------------------
 n387n    (  0)  treetop                                                                              [0xd3d83870] bci=[1,76,641] rc=0 vc=319 vn=- li=22 udi=- nc=1
 n386n    (  1)    lbits2d (in FPR_0214) ()                                                           [0xd3d83840] bci=[1,76,641] rc=1 vc=319 vn=- li=22 udi=36928 nc=1 flg=0x20
 n702n    (  0)      lbyteswap (in GPR_0213:GPR_0212)                                                 [0xd3d87380] bci=[1,73,641] rc=0 vc=319 vn=- li=22 udi=36192 nc=1
 n701n    (  0)        lloadi  <generic int shadow>[#422  Shadow] [flags 0x603 0x0 ] (in GPR_0213:GPR_0212)  [0xd3d87350] bci=[1,73,641] rc=0 vc=319 vn=- li=22 udi=36192 nc=1
 n700n    (  0)          aiadd (internalPtr sharedMemory )                                            [0xd3d87320] bci=[1,73,641] rc=0 vc=319 vn=- li=22 udi=- nc=2 flg=0x8000
 n383n    (  0)            ==>aload (in &GPR_0208) (X!=0 sharedMemory )
 n699n    (  0)            iadd                                                                       [0xd3d872f0] bci=[1,76,641] rc=0 vc=319 vn=- li=22 udi=3 nc=2
 n698n    (  0)              iconst 8 (in GPR_0211)                                                   [0xd3d872c0] bci=[1,76,641] rc=0 vc=319 vn=- li=22 udi=35616 nc=0
 n384n    (  0)              ==>iconst 0 (in GPR_0209) (X==0 X>=0 X<=0 )
------------------------------

 [0xd3eb8b70]   mov     GPR_0211, 0x00000008    # MOV4RegImm4
 [0xd3eb8c20]   mov     GPR_0212, dword ptr [GPR_0211+1*&GPR_0208]              # L4RegMem, SymRef  <generic int shadow>[#422  Shadow] [flags 0x603 0x0 ]
 [0xd3eb8d10]   mov     GPR_0213, dword ptr [GPR_0211+1*&GPR_0208+0x4]          # L4RegMem, SymRef  <generic int shadow>[#490  Shadow +4] [flags 0x603 0x0 ]
 [0xd3eb8db0]   bswap   GPR_0213:GPR_0212(GPR_0213:GPR_0212)                    # BSWAP4Reg
 [0xd3eb8ee0]   mov     dword ptr [vfp], GPR_0212               # S4MemReg, SymRef  <auto slot -1>[#492  Auto] [flags 0x4 0x0 ]
 [0xd3eb8fa0]   mov     dword ptr [vfp+0x4], GPR_0213           # S4MemReg, SymRef  <auto slot -1>[#494  Auto +4] [flags 0x4 0x0 ]
 [0xd3eb9090]   movsd   FPR_0214, qword ptr [vfp]               # MOVSDRegMem, SymRef  <auto slot -1>[#496  Auto] [flags 0x4 0x0 ]

...
...

0xd4f2b2b6 00000272 [0xd3eb8b70] b8 08 00 00 00             mov eax, 0x00000008 # MOV4RegImm4
0xd4f2b2bb 00000277 [0xd3eb8c20] 8b 54 38 00                mov edx, dword ptr [eax+1*edi]              # L4RegMem, SymRef  <generic int shadow>[#422  Shadow] [flags 0x603 0x0 ]
0xd4f2b2bf 0000027b [0xd3eb8d10] 8b 44 38 04                mov eax, dword ptr [eax+1*edi+0x4]          # L4RegMem, SymRef  <generic int shadow>[#490  Shadow +4] [flags 0x603 0x0 ]
0xd4f2b2c3 0000027f [0xd3eb8db0] 0f cf                      bswap       edi                     # BSWAP4Reg
0xd4f2b2c5 00000281 [0xd3eb8ee0] 89 54 24 34                mov dword ptr [esp+0x34], edx               # S4MemReg, SymRef  <auto slot -1>[#492  Auto +52] [flags 0x4 0x0 ]
0xd4f2b2c9 00000285 [0xd3eb8fa0] 89 44 24 38                mov dword ptr [esp+0x38], eax               # S4MemReg, SymRef  <auto slot -1>[#494  Auto +56] [flags 0x4 0x0 ]
0xd4f2b2cd 00000289 [0xd3eb9090] c5 fb 10 44 24 34          movsd       xmm0, qword ptr [esp+0x34]              # MOVSDRegMem, SymRef  <auto slot -1>[#496  Auto +52] [flags 0x4 0x0
0xd4f2b2d3 0000028f [0xd3eb9590] c5 fb 11 44 24 2c          movsd       qword ptr [esp+0x2c], xmm0              # MOVSDMemReg, SymRef  <auto slot 5>[#497  Auto +44] [flags 0x6 0x0

This is incorrect as we are not byte swapping the full 8-byte value. In fact it seems to use a totally incorrect register edi where as the 64-bit value got loaded into a register pair edx:eax.

There also appears to be an issue with 64-bit x86 which I'm currently tracking down.

For Power there are compile time segfaults encountered. The jitdump shows the following:

------------------------------
 n29717n  (  0)  lRegStore gr28                                                                       [    0x7ad923f63890] bci=[1,76,344] rc=0 vc=7 vn=- li=200 udi=- nc=1
 n8165n   (  3)    lbyteswap ()                                                                       [    0x7ad92a80e7e0] bci=[1,76,344] rc=3 vc=7 vn=- li=200 udi=- nc=1 flg=0x20
 n24908n  (  1)      lloadi  <generic int shadow>[#1371  Shadow] [flags 0x603 0x0 ]                   [    0x7ad923505960] bci=[1,73,344] rc=1 vc=7 vn=- li=200 udi=- nc=1
 n24907n  (  1)        aladd (X>=0 internalPtr sharedMemory )                                         [    0x7ad923505910] bci=[1,73,344] rc=1 vc=7 vn=- li=200 udi=- nc=2 flg=0x810
 n387n    (  2)          ==>newarray (in &GPR_    0x7ad922eccde0) (highWordZero Unsigned X!=0 allocationCanBeRemoved sharedMemory )
 n24922n  (  1)          ==>lconst 8 (highWordZero X!=0 X>=0 )
------------------------------
------------------------------
 n29717n  (  0)  lRegStore gr28                                                                       [    0x7ad923f63890] bci=[1,76,344] rc=0 vc=7 vn=- li=200 udi=- nc=1
 n8165n   (  2)    lbyteswap (in GPR_    0x7ad922ea42d0) ()                                           [    0x7ad92a80e7e0] bci=[1,76,344] rc=2 vc=7 vn=- li=200 udi=17104 nc=1 flg=0
 n24908n  (  1)      lloadi  <generic int shadow>[#1371  Shadow] [flags 0x603 0x0 ]                   [    0x7ad923505960] bci=[1,73,344] rc=1 vc=7 vn=- li=200 udi=- nc=1
 n24907n  (  0)        aladd (X>=0 internalPtr sharedMemory )                                         [    0x7ad923505910] bci=[1,73,344] rc=0 vc=7 vn=- li=200 udi=- nc=2 flg=0x810
 n387n    (  1)          ==>newarray (in &GPR_    0x7ad922eccde0) (highWordZero Unsigned X!=0 allocationCanBeRemoved sharedMemory )
 n24922n  (  0)          ==>lconst 8 (highWordZero X!=0 X>=0 )
------------------------------

 [    0x7ad922ea4420]   73      li      GPR_    0x7ad922ea43b0, 0000000000000008
 [    0x7ad922ea44c0]   76      ldbrx   GPR_    0x7ad922ea42d0, [&GPR_    0x7ad922eccde0, GPR_    0x7ad922ea43b0]               # SymRef  <generic int shadow>[#1371  Shadow] [flags

============================================================
; Live regs: GPR=3 FPR=0 CCR=0 VRF=0 VSX_SCALAR=0 VSX_VECTOR=0 {GPR_    0x7ad922ea42d0, &GPR_    0x7ad922eccde0, GPR_    0x7ad922ecb280}
------------------------------
 n8183n   (  0)  lstore  <auto slot 5>[#400  Auto] [flags 0x4 0x0 ]                                   [    0x7ad92a80ed80] bci=[-1,27,23] rc=0 vc=7 vn=- li=200 udi=- nc=1
 n8165n   (  2)    ==>lbyteswap (in GPR_    0x7ad922ea42d0) ()
------------------------------
------------------------------
 n8183n   (  0)  lstore  <auto slot 5>[#400  Auto] [flags 0x4 0x0 ]                                   [    0x7ad92a80ed80] bci=[-1,27,23] rc=0 vc=7 vn=- li=200 udi=- nc=1
 n8165n   (  2)    ==>lbyteswap (in GPR_    0x7ad922ea42d0) ()
------------------------------

 [    0x7ad922ea48b0]   73      ld      GPR_    0x7ad922ea47d0, [&GPR_    0x7ad922eccde0, 8]            # SymRef  <generic int shadow>[#1371  Shadow] [flags 0x603 0x0 ]
 [    0x7ad922ea4a30]   27      stdbrx  [gr14, GPR_    0x7ad922ea49c0], GPR_    0x7ad922ea47d0          # SymRef  <auto slot 5>[#400  Auto] [flags 0x4 0x0 ]

============================================================
; Live regs: GPR=2 FPR=0 CCR=0 VRF=0 VSX_SCALAR=0 VSX_VECTOR=0 {GPR_    0x7ad922ea42d0, GPR_    0x7ad922ecb280}
------------------------------
 n382n    (  0)  BBEnd </block_200> =====                                                             [    0x7ad92a096720] bci=[-1,27,23] rc=0 vc=7 vn=- li=200 udi=- nc=1
 n29721n  (  1)    GlRegDeps ()                                                                       [    0x7ad923f639d0] bci=[-1,27,23] rc=1 vc=7 vn=- li=200 udi=- nc=3 flg=0x20
 n29718n  (  1)      PassThrough gr28                                                                 [    0x7ad923f638e0] bci=[1,76,344] rc=1 vc=7 vn=- li=200 udi=- nc=1
 n8165n   (  2)        ==>lbyteswap (in GPR_    0x7ad922ea42d0) ()
 n29719n  (  1)      PassThrough gr27                                                                 [    0x7ad923f63930] bci=[-1,0,14] rc=1 vc=7 vn=- li=200 udi=- nc=1
 n383n    (  1)        ==>lload (in GPR_    0x7ad922ecb280) (cannotOverflow )
 n29720n  (  1)      PassThrough gr26                                                                 [    0x7ad923f63980] bci=[-1,6,15] rc=1 vc=7 vn=- li=200 udi=- nc=1
 n387n    (  0)        ==>newarray (in &GPR_    0x7ad922eccde0) (highWordZero Unsigned X!=0 allocationCanBeRemoved sharedMemory )
------------------------------

There are two issues here. Node n24908n does not seem to get its reference count properly decremented following evaluation. We encounter a segfault when we try to do a decReferenceCount on node n387n because it's reference count when evaluating glRegDeps is 0 which should not be the case. It seems following evaluation of n8183n we somehow decrement the reference count of n387n. There is also a problem with the refcount of n8165n as its reference count does not seem to be decremented following evaluation of n8183n.

The backtrace on Power is:

p> 20201031-22:21:14 <gdb> #12 <signal handler called>
p> 20201031-22:21:14 <gdb> #13 0x00003fffa38353f8 in OMR::CodeGenerator::decReferenceCount(TR::Node*) () from /bluebird/builds/bld_458652/sdk/xl6480/jre/lib/ppc64le/compressedrefs/libj9jit29.so
p> 20201031-22:21:14 <gdb> #14 0x00003fffa3cc5ce4 in OMR::Power::TreeEvaluator::GlRegDepsEvaluator(TR::Node*, TR::CodeGenerator*) () from /bluebird/builds/bld_458652/sdk/xl6480/jre/lib/ppc64le/compressedrefs/libj9jit29.so
p> 20201031-22:21:14 <gdb> #15 0x00003fffa3835504 in OMR::CodeGenerator::evaluate(TR::Node*) () from /bluebird/builds/bld_458652/sdk/xl6480/jre/lib/ppc64le/compressedrefs/libj9jit29.so
p> 20201031-22:21:14 <gdb> #16 0x00003fffa3cc684c in OMR::Power::TreeEvaluator::BBEndEvaluator(TR::Node*, TR::CodeGenerator*) () from /bluebird/builds/bld_458652/sdk/xl6480/jre/lib/ppc64le/compressedrefs/libj9jit29.so
p> 20201031-22:21:14 <gdb> #17 0x00003fffa3835504 in OMR::CodeGenerator::evaluate(TR::Node*) () from /bluebird/builds/bld_458652/sdk/xl6480/jre/lib/ppc64le/compressedrefs/libj9jit29.so
p> 20201031-22:21:14 <gdb> #18 0x00003fffa3479964 in J9::CodeGenerator::doInstructionSelection() () from /bluebird/builds/bld_458652/sdk/xl6480/jre/lib/ppc64le/compressedrefs/libj9jit29.so
p> 20201031-22:21:14 <gdb> #19 0x00003fffa3845998 in OMR::CodeGenPhase::performInstructionSelectionPhase(TR::CodeGenerator*, TR::CodeGenPhase*) () from /bluebird/builds/bld_458652/sdk/xl6480/jre/lib/ppc64le/compressedrefs/libj9jit29.so
p> 20201031-22:21:14 <gdb> #20 0x00003fffa3843b14 in OMR::CodeGenPhase::performAll() () from /bluebird/builds/bld_458652/sdk/xl6480/jre/lib/ppc64le/compressedrefs/libj9jit29.so
p> 20201031-22:21:14 <gdb> #21 0x00003fffa38414b8 in OMR::CodeGenerator::generateCode() () from /bluebird/builds/bld_458652/sdk/xl6480/jre/lib/ppc64le/compressedrefs/libj9jit29.so
p> 20201031-22:21:14 <gdb> #22 0x00003fffa385bc8c in OMR::Compilation::compile() () from /bluebird/builds/bld_458652/sdk/xl6480/jre/lib/ppc64le/compressedrefs/libj9jit29.so

These issues come from the OpenJ9 project. I'll add notes now on how to reproduce.

fjeremic commented 4 years ago

Both issues can be reproduced with OpenJ9 latest nightly driver.

For x86: Test.java.zip

bin/java '-Xjit:limit={Test.test*}(count=1,traceFull,log=log.trace,lastOptIndex=44,lastOptSubIndex=2),disableasynccompilation' Test

For Power: Test.Utils.zip

bin/java -cp ./junit-4.11.jar:. -Xjit:count=0 Test
fjeremic commented 4 years ago

@aviansie-ben could you please help take a look at the Power issues noted above? It seems the refcounts are all messed up somehow.

aviansie-ben commented 4 years ago

Oops, it looks like the byteswapped store optimization on Power wasn't correctly checking that the *byteswap node wasn't commoned. I'd actually encountered this issue and was sure I'd fixed it before submitting the PR, but it looks like the fix ended up on the wrong branch by mistake. :sweat_smile: That issue should be fixed by #5645.