Missed optimizations for comparison of doubles in loops

Quuxplusone commented 6 years ago


Bugzilla Link	PR37313
Status	NEW
Importance	P enhancement
Reported by	David Bolvansky (david.bolvansky@gmail.com)
Reported on	2018-05-02 08:26:44 -0700
Last modified on	2018-05-03 16:13:26 -0700
Version	6.0
Hardware	PC Linux
CC	codeman.consulting@gmail.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

Hello,

double test(double *arr, int size) {
    double m = INFINITY;
    int flag = 1;
    int i;
    double a;

    for (i=0; i<size;++i) {
        a = arr[i];
        if (a <= m ) {
            m = a;
            flag = 0;
        }
        ++i;
    }
    if (flag) {
        return 0.0;
    }
    return m;
}

ASM outputs:
https://godbolt.org/g/iSjSGW

Seems like Clang generates worse code because LLVM missed same patterns..

Quuxplusone commented 6 years ago

This is an unfortunate consequence of the x86 ISA. You have to choose which of
these you like better:
https://godbolt.org/g/oBLyEz

If you trust your branch predictor, then obviously the version with a branch is
better.

But what about if you add "-mattr=avx" to that compilation? Now you have
"vblendvpd" to make the select less ugly (but might be worse latency depending
on uarch).

But we're not done yet...change the comparison predicate to:
        if (a < m ) {

Now, LLVM produces "minsd". Is that better than gcc?

In summary: bad architecture and uarch implementation makes the compiler's job
hard. You'll have to measure the actual perf for all the options on different
CPUs  and with different data sets to know which version of the code is better.

Quuxplusone commented 6 years ago

Oki, so simple test program...

double test(double *arr, int size, double p) {
    double m = INFINITY;
    int flag = 1;
    int i;
    double a;

    for (i=0; i<size;++i) {
        a = arr[i];
        if (a <= m ) {
            m = a;
            flag = 0;
        }
        ++i;
    }
    if (flag) {
        return 0.0;
    }
    return p + m;
}

int main(int argc, char **argv) {
    size_t len = 50000;
    double arr[50000] = {0.0,};
    srand(time(NULL));

    for (unsigned i = 0; i < len; ++i)
            arr[i] = rand();

    double res = 1;
    for (unsigned i = 0; i < 80000 * argc; ++i)
        res = test(arr, len, res);

    return res;
}

Results:

xbolva00@xbolva00:~/opt$ gcc -O3 i.c
xbolva00@xbolva00:~/opt$ time ./a.out 5

real    0m2,466s
user    0m2,465s
sys 0m0,001s
xbolva00@xbolva00:~/opt$ time ./a.out 5

real    0m2,422s
user    0m2,418s
sys 0m0,004s
xbolva00@xbolva00:~/opt$ clang -O3 i.c
xbolva00@xbolva00:~/opt$ time ./a.out 5

real    0m7,865s
user    0m7,861s
sys 0m0,004s
xbolva00@xbolva00:~/opt$ time ./a.out 5

real    0m7,871s
user    0m7,866s
sys 0m0,004s

Quuxplusone commented 6 years ago

with avx

xbolva00@xbolva00:~/opt$ clang  -Xclang -disable-O0-optnone -S -emit-llvm i.c
xbolva00@xbolva00:~/opt$ opt -O3 -mattr=avx i.ll -S > opt.ll
xbolva00@xbolva00:~/opt$ llc -O3 -mattr=avx opt.ll
xbolva00@xbolva00:~/opt$ clang opt.s -o dr
xbolva00@xbolva00:~/opt$ time ./dr 5

real    0m6,576s
user    0m6,572s
sys 0m0,004s

Quuxplusone commented 6 years ago

and results for case for "if (a < m )"...

xbolva00@xbolva00:~/opt$ clang -O3 i.c
xbolva00@xbolva00:~/opt$ time ./a.out 5

real    0m4,576s
user    0m4,576s
sys 0m0,000s
xbolva00@xbolva00:~/opt$ gcc -O3 i.c
xbolva00@xbolva00:~/opt$ time ./a.out 5

real    0m2,442s
user    0m2,442s
sys 0m0,000s

Quuxplusone commented 6 years ago

That looks compelling so far, but you must specify exactly which CPU you are
testing on too.

For your test case, there's a more general possibility:

  %12 = fcmp ugt double %11, %9
  %13 = select i1 %12, double %9, double %11
  %14 = select i1 %12, i32 %8, i32 0

When you have 2 or more selects that use the same condition value, should we
convert them into branch + phi?

Quuxplusone commented 6 years ago

Tested on:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               60
Model name:          Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
Stepping:            3
CPU MHz:             1404.472
CPU max MHz:         3600,0000
CPU min MHz:         800,0000
BogoMIPS:            5188.11
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2
ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb
invpcid_single pti retpoline spec_ctrl tpr_shadow vnmi flexpriority ept vpid
fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat
pln pts

Quuxplusone commented 6 years ago

(In reply to Sanjay Patel from comment #5)
> When you have 2 or more selects that use the same condition value, should we
> convert them into branch + phi?

You might also invert that question: which pass thought it was a good idea to
form 2 selects from the branch + phis?

Most likely answer: SimplifyCFG.

Quuxplusone commented 6 years ago

Yes, SimplifyCFG is good for this branch + phi case..

Quuxplusone commented 6 years ago

Note that if you enable -ffast-math the clang code becomes close to the GCC output.

https://godbolt.org/g/Ab67Lc

I suspect they're possibly enabling it at -O3 in GCC?

Just another angle to look at.

Quuxplusone commented 6 years ago

We should check it too whether is possible to move some optimizations from "-ffast-math" safely.

This case seems like the right candidate because it helps quite a lot (as we can see from time measures) and there is nothing which could break after this change, or am I wrong?

Experts in this area are needed here :)

Quuxplusone commented 6 years ago

With -ffast-math:

xbolva00@xbolva00:~/opt$ time ./a.out 5

real    0m4,258s
user    0m4,251s
sys 0m0,000s
xbolva00@xbolva00:~/opt$ time ./a.out 5

real    0m4,393s
user    0m4,384s
sys 0m0,004s

Still, we miss branching here, which helps a lot.

Quuxplusone commented 6 years ago

llvm mca for code compiled by gcc (1)
https://pastebin.com/raw/VebrMbCF

and by clang + fast-math (2)
https://pastebin.com/raw/u7FSndj9

no fast-math (3)
https://pastebin.com/raw/kPm3R6fm

almost 2x more cycles when comparing (1) with (3).

Quuxplusone commented 6 years ago

(In reply to David Bolvansky from comment #10)
> We should check it too whether is possible to move some optimizations from
> "-ffast-math" safely.
>
> This case seems like the right candidate because it helps quite a lot (as we
> can see from time measures) and there is nothing which could break after
> this change, or am I wrong?
>
> Experts in this area are needed here :)

You're wrong. :)

-ffast-math (or more accurately -ffinite-math-only...although I'm not sure if
that will just uncover other bugs) changes the semantics of the compare:

 %12 = fcmp fast ugt double %11, %9
 %13 = select i1 %12, double %9, double %11

This is "minsd" only if we can ignore NAN values. By default, you can't, so
we're forced to produce the 'blendv' (or its pre-AVX equivalent) when we're
operating with strict FP semantics. See the Intel manual for exactly what is
possible with x86 instructions.

I think you'd be better off investigating whether we should limit the select
creation in SimplifyCFG (or reverse it in some later pass). That's the bigger
issue.

Quuxplusone commented 6 years ago

So here in this case GCC is wrong? Since ignores NaN?

Quuxplusone commented 6 years ago

"Select" issue.. I can help and test your patches if any, even on SPEC2006.

Quuxplusone commented 6 years ago

(In reply to David Bolvansky from comment #14)
> So here in this case GCC is wrong? Since ignores NaN?

It doesn't produce minsd in the example that Gordon posted (even if you add -
ffast-math to that compilation):
https://godbolt.org/g/Ab67Lc

Quuxplusone commented 6 years ago

Another interesting case is when we change it to "double m = NAN;"

https://godbolt.org/g/AhxoqX

Quuxplusone / LLVMBugzillaTest

Missed optimizations for comparison of doubles in loops #36286