eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 720 forks source link

Getting cache line size doesn't work on plinux with gcc 10 #15056

Closed pshipton closed 2 years ago

pshipton commented 2 years ago

This is related to https://github.com/eclipse-openj9/openj9/issues/14799

OMR code is doing the following to get the cache line size on plinux. When compiled with gcc-9 + and -O3 it doesn't work. I can fix it by adding a pragma to use -O0 to compile the code. Looking for thoughts before I proceed with that solution. @zl-wang @keithc-ca

The pragma is:

#if __GNUC__ >= 9
#pragma GCC push_options
#pragma GCC optimize ("O0")
#endif
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(int argc, char **argv) {
    int32_t ppcCacheLineSize;

    int  i;
    char buf[1024];
    memset(buf, 255, 1024);

    __asm__ __volatile__(
        "dcbz 0, %0"
        : /* no outputs */
        :"r"((void *) &buf[512]));

    for (i = 0, ppcCacheLineSize = 0; i < 1024; i++) {
        if (buf[i] == 0) {
            ppcCacheLineSize++;
        }
    }
    printf("cacheline %d\n", ppcCacheLineSize);
    return 0;
}
gcc-10 -o test -O3 test.c
./test
cacheline 0

objdump -drwC test shows:

0000000000000740 <main>:
 740:   02 00 4c 3c     addis   r2,r12,2
 744:   c0 77 42 38     addi    r2,r2,30656
 748:   a6 02 08 7c     mflr    r0
 74c:   00 04 a0 38     li      r5,1024
 750:   ff 00 80 38     li      r4,255
 754:   10 00 01 f8     std     r0,16(r1)
 758:   91 fb 21 f8     stdu    r1,-1136(r1)
 75c:   68 00 61 38     addi    r3,r1,104
 760:   f0 8f 2d e9     ld      r9,-28688(r13)
 764:   68 04 21 f9     std     r9,1128(r1)
 768:   00 00 20 39     li      r9,0
 76c:   35 ff ff 4b     bl      6a0 <00000048.plt_call.memset@@GLIBC_2.17>
 770:   18 00 41 e8     ld      r2,24(r1)
 774:   68 02 21 39     addi    r9,r1,616
 778:   ec 4f 00 7c     dcbz    0,r9
 77c:   fe ff 82 3c     addis   r4,r2,-2
 780:   00 00 a0 38     li      r5,0
 784:   e8 8b 84 38     addi    r4,r4,-29720
 788:   01 00 60 38     li      r3,1
 78c:   75 ff ff 4b     bl      700 <00000048.plt_call.__printf_chk@@GLIBC_2.17>
 790:   18 00 41 e8     ld      r2,24(r1)
 794:   68 04 21 e9     ld      r9,1128(r1)
 798:   f0 8f 4d e9     ld      r10,-28688(r13)
 79c:   79 52 29 7d     xor.    r9,r9,r10
 7a0:   00 00 40 39     li      r10,0
 7a4:   18 00 82 40     bne     7bc <main+0x7c>
 7a8:   70 04 21 38     addi    r1,r1,1136
 7ac:   00 00 60 38     li      r3,0
 7b0:   10 00 01 e8     ld      r0,16(r1)
 7b4:   a6 03 08 7c     mtlr    r0
 7b8:   20 00 80 4e     blr
 7bc:   65 ff ff 4b     bl      720 <00000048.plt_call.__stack_chk_fail@@GLIBC_2.17>
 7c0:   18 00 41 e8     ld      r2,24(r1)
 7c4:   00 00 00 00     .long 0x0
 7c8:   00 00 00 01     .long 0x1000000
 7cc:   80 00 00 00     .long 0x80

It does work with -O0 objdump -drwC test shows:

00000000000008dc <main>:
 8dc:   02 00 4c 3c     addis   r2,r12,2
 8e0:   24 76 42 38     addi    r2,r2,30244
 8e4:   a6 02 08 7c     mflr    r0
 8e8:   10 00 01 f8     std     r0,16(r1)
 8ec:   f8 ff e1 fb     std     r31,-8(r1)
 8f0:   71 fb 21 f8     stdu    r1,-1168(r1)
 8f4:   78 0b 3f 7c     mr      r31,r1
 8f8:   78 1b 69 7c     mr      r9,r3
 8fc:   60 00 9f f8     std     r4,96(r31)
 900:   e6 01 09 7c     mtfprwz f0,r9
 904:   6c 00 3f 39     addi    r9,r31,108
 908:   ae 4f 00 7c     stfiwx  f0,0,r9
 90c:   f0 8f 2d e9     ld      r9,-28688(r13)
 910:   78 04 3f f9     std     r9,1144(r31)
 914:   00 00 20 39     li      r9,0
 918:   78 00 3f 39     addi    r9,r31,120
 91c:   00 04 a0 38     li      r5,1024
 920:   ff 00 80 38     li      r4,255
 924:   78 4b 23 7d     mr      r3,r9
 928:   b9 fd ff 4b     bl      6e0 <00000019.plt_call.memset@@GLIBC_2.17>
 92c:   18 00 41 e8     ld      r2,24(r1)
 930:   78 00 3f 39     addi    r9,r31,120
 934:   00 02 29 39     addi    r9,r9,512
 938:   ec 4f 00 7c     dcbz    0,r9
 93c:   00 00 20 39     li      r9,0
 940:   74 00 3f 91     stw     r9,116(r31)
 944:   00 00 20 39     li      r9,0
 948:   70 00 3f 91     stw     r9,112(r31)
 94c:   34 00 00 48     b       980 <main+0xa4>
 950:   76 00 3f e9     lwa     r9,116(r31)
 954:   80 04 5f 39     addi    r10,r31,1152
 958:   14 4a 2a 7d     add     r9,r10,r9
 95c:   f8 fb 29 89     lbz     r9,-1032(r9)
 960:   00 00 29 2c     cmpdi   r9,0
 964:   10 00 82 40     bne     974 <main+0x98>
 968:   70 00 3f 81     lwz     r9,112(r31)
 96c:   01 00 29 39     addi    r9,r9,1
 970:   70 00 3f 91     stw     r9,112(r31)
 974:   74 00 3f 81     lwz     r9,116(r31)
 978:   01 00 29 39     addi    r9,r9,1
 97c:   74 00 3f 91     stw     r9,116(r31)
 980:   74 00 3f 81     lwz     r9,116(r31)
 984:   ff 03 09 2c     cmpwi   r9,1023
 988:   c8 ff 81 40     ble     950 <main+0x74>
 98c:   72 00 3f e9     lwa     r9,112(r31)
 990:   78 4b 24 7d     mr      r4,r9
 994:   fe ff 62 3c     addis   r3,r2,-2
 998:   68 8c 63 38     addi    r3,r3,-29592
 99c:   25 fd ff 4b     bl      6c0 <00000019.plt_call.printf@@GLIBC_2.17>
 9a0:   18 00 41 e8     ld      r2,24(r1)
 9a4:   00 00 20 39     li      r9,0
 9a8:   78 04 5f e9     ld      r10,1144(r31)
 9ac:   f0 8f 0d e9     ld      r8,-28688(r13)
 9b0:   79 42 4a 7d     xor.    r10,r10,r8
 9b4:   00 00 00 39     li      r8,0
 9b8:   0c 00 82 41     beq     9c4 <main+0xe8>
 9bc:   65 fd ff 4b     bl      720 <00000019.plt_call.__stack_chk_fail@@GLIBC_2.17>
 9c0:   18 00 41 e8     ld      r2,24(r1)
 9c4:   78 4b 23 7d     mr      r3,r9
 9c8:   90 04 3f 38     addi    r1,r31,1168
 9cc:   10 00 01 e8     ld      r0,16(r1)
 9d0:   a6 03 08 7c     mtlr    r0
 9d4:   f8 ff e1 eb     ld      r31,-8(r1)
 9d8:   20 00 80 4e     blr
 9dc:   00 00 00 00     .long 0x0
 9e0:   00 00 00 01     .long 0x1000000
 9e4:   80 01 00 01     .long 0x1000180
 9e8:   00 00 00 60     nop
 9ec:   00 00 00 60     nop
zl-wang commented 2 years ago

it looks like dcbz side-effect not visible to the optimizer, such that buf array is still considered not modified. as a result, the whole loop of counting the line size is ripped out.

I am wondering a fix: 1) the cache line size has been 128 for the past 20 years, and is not expected to change anytime soon. Also, OpenJ9 has been assuming 64bit processor (supported) which all had/have 128-byte cache line. We could fix it by replacing a 128 constant there; 2) add something about dcbz side-effect to be visible to the optimizer (i will need to ask exactly what to add there).

Any preference? @pshipton

@bhavani-sn-ibm your gcc issue might be related to this as well.

zl-wang commented 2 years ago

possible objection to 128 constant replacement: in case, we need to support 32-bit embedded processors with 32bit build.

pshipton commented 2 years ago

Yes, it seems better to continue calculating than to return a constant. If nothing else, adding the pragma will fix it.

FYI your ping to @bhavani-sn-ibm didn't work.

keithc-ca commented 2 years ago

Adding the 'clobbers' parameter fixes it (otherwise the optimizer can assume buf[0] = 255).

    __asm__ __volatile__(
        "dcbz 0, %0"
        : /* no outputs */
        :"r"((void *) &buf[512])
        : "memory");
keithc-ca commented 2 years ago

For the record, I tested with this:

$ gcc-10 --version
gcc-10 (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
keithc-ca commented 2 years ago

I'll open an OMR PR shortly.

pshipton commented 2 years ago

Thanks. Note there is a second copy in j9memclr.cpp

keithc-ca commented 2 years ago

See https://github.com/eclipse/omr/pull/6515, which I will extend with a fix for j9memclr.cpp as well.