clangupc / clang-upc

Clang UPC Front-End
https://clangupc.github.io/
Other
16 stars 5 forks source link

freebsd 386 - test_int_precision tests fail #99

Open nenadv opened 7 years ago

nenadv commented 7 years ago

Clang UPC 3.9 exhibits a check failure of some values in the array.

test_int_precision.c 78 upc_forall (i = 0; i < SIZE; i++; &clone2[i]) { 79 DTYPE volatile d = (DTYPE)(1.123 i); / see BUPC bug 168 */ 80 if (clone2[i] != d) 81 err[MYTHREAD] += 1; 82 }

clone2[] array is 3200 in size, and checking fail for indexes 1000, 2000, and 3000. Somehow the value of 'd' is off by 1 (e.g. 1122 instead of 1123). Value inside shared clone2 is correct.

I created a smaller test that fails too, but only for the upc_forall loop. The for loop works as expected.

    for (i = MYTHREAD; i < SIZE; i += THREADS) {
            int volatile d = (int)(1.123 * i);
            if (clone2[i] != d)
                    printf ("for clone2 [%d] %d: %d (%d)\n", MYTHREAD, i, clone2[i], d);
    }
    upc_forall (i = 0; i < SIZE; i++; &clone2[i]) {
            int volatile d = (int)(1.123 * i);
            if (clone2[i] != d)
                    printf ("forall clone2 [%d] %d: %d (%d)\n", MYTHREAD, i, clone2[i], d);
    }

It seems that this happens only on freebsd, 32 bit version. And only if compiled with optimization.

nenadv commented 7 years ago

I looked at the IR code and asm. I am attaching my test program, as well as IR for the optimized version.

IR code has three sets of code (I placed '-- [123]' at the beginning of each section) where multiply/conversion is done. Somehow, upc_forall creates two sets of almost identical code but only one of them is being executed (at least this is the case for running with only one thread, confirmed by debugging). Non optimized version (that works) creates only one set of code for upc_forall. You can simple confirm this by looking for printf statements, there are three of them.

Asm code is different for calculating fmul/convert then a simple loop.

bug-freebsd.upc.txt bad.ll.txt

swatanabe commented 7 years ago

AMDG

On 04/24/2017 10:08 AM, Nenad Vukicevic wrote:

Asm code is different for calculating fmul/convert then a simple loop.

Different how?

In Christ, Steven Watanabe

nenadv commented 7 years ago

I enclosed an asm code that is relevant, with some comments. Summary is like this:

I tried to duplicate with the C code but was not able to create exact sequence.

compare-result.txt

gary-funck commented 7 years ago

On 04/24/17 20:09:36, Nenad Vukicevic wrote:

  • upc_forloop calculates ref value by multiplying FP value of index and 1.123, and adding 1.0 to the FP value of index in each iteration.

Adding 1.0 and then truncating to 'int' implements the ceiling function?

nenadv commented 7 years ago

I checked the code on Fedora Core 23. It uses SSE (xmm registers), while freebsd uses FP x87 (st registers). Also, the code on FC23 looks like the regular 'for loop' on freebsd. Even though, both of the machines are VMs on the same host, freebsd is probably using different machine target.

Is there a way to ask clang for the default machine attribute?

swatanabe commented 7 years ago

AMDG

On 04/25/2017 03:55 PM, Nenad Vukicevic wrote:

I checked the code on Fedora Core 23. It uses SSE (xmm registers), while freebsd uses FP x87 (st registers). Also, the code on FC23 looks like the regular 'for loop' on freebsd. Even though, both of the machines are VMs on the same host, freebsd is probably using different machine target.

Right. x87 can have weird behavior like this. I think the best way to fix this is to avoid relying on the vagaries of the floating point implementation by changing 1.23 to 1.230000001, so that a 1 ulp difference won't cause an error.

Is there a way to ask clang for the default machine attribute?

The options should be the same as gcc: e.g. -march=native

In Christ, Steven Watanabe

nenadv commented 7 years ago

I think the problem is caused by the scheduler where instructions are moved into the region where lower computation precision is used.

Here is the code for non optimized (good) version (which is similar to the 'for' loop).


GOOD (-O0)

    fildl  0xcc(%esp)
    fldl   0x8053f20
[ multiply with 1.123 ]
    fmulp  %st,%st(1)
[ change FP control word]
    fnstcw 0xb6(%esp)
    mov    0xb6(%esp),%dx
    movw   $0xc7f,0xb6(%esp)
    fldcw  0xb6(%esp)
[ new FP control word - 0xc7f
  precision bits (8 and 9) - 00 - 24 bits, single precission
  rounding bits (10 and 11) - 10 - toward 0 ]
    mov    %dx,0xb6(%esp)
[ convert to integer ]
    fistpl 0xd0(%esp)
[ restore FP control word]
    fldcw  0xb6(%esp)

BAD (-O3)
    fstp   %st(0)
    mov    0x80538e4,%edx
    fnstcw 0xc(%esp)
    movzwl 0xc(%esp),%eax
    movw   $0xc7f,0xc(%esp)
    fldcw  0xc(%esp)
[ new FP control word - 24 bits, single precision ]
    mov    %ax,0xc(%esp)
    fstl   0x6c(%esp)
[ multiplay with 1.123 ]
    fmull  0x8051f80
    fistpl 0x58(%esp)
[ restore FP control word ]
    fldcw  0xc(%esp)
'''

Original FP control word is 0x127F which is double precision and in the good case all multiplications are done with double precision.