
Instcombine shouldn't turn x86_sse2_storel_dq into a full-width store #2475

Closed Quuxplusone closed 15 years ago

Quuxplusone commented 16 years ago
Bugzilla Link PR2296
Status RESOLVED FIXED
Importance P normal
Reported by Eli Friedman (efriedma@quicinc.com)
Reported on 2008-05-08 00:33:56 -0700
Last modified on 2008-07-16 02:31:13 -0700
Version unspecified
Hardware PC Linux
CC evan.cheng@apple.com, llvm-bugs@lists.llvm.org

Obvious bug: visitCallInst shouldn't turn x86_sse2_storel_dq into a full-width store, because the intrinsic only stores the low 64 bits of the vector, so the two are not equivalent.

The patch is trivial, but I lost track of which testcase ran into this issue. Filing it here so it doesn't get lost.
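
For reference, a hand-written sketch (not part of the original report) of the mismatch, in the typed-pointer IR used elsewhere in this thread; the function names are illustrative:

; What the intrinsic means: store only the low 64 bits of %y.
define void @low64(<2 x i64>* %x, <2 x i64> %y) nounwind {
entry:
    %p = bitcast <2 x i64>* %x to i64*
    %lo = extractelement <2 x i64> %y, i32 0
    store i64 %lo, i64* %p
    ret void
}

; What the old instcombine produced instead: a full 128-bit store,
; which also overwrites the 8 bytes following the destination.
define void @full128(<2 x i64>* %x, <2 x i64> %y) nounwind {
entry:
    store <2 x i64> %y, <2 x i64>* %x
    ret void
}

The second form writes all 16 bytes, which is wrong whenever only the low 8 bytes of the destination are valid, as with the 8-byte global in the testcase below.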

Quuxplusone commented 15 years ago
Ah, found it; here's a C testcase:

#include <emmintrin.h>
void x(__m128i* x, __m128i y) {_mm_storel_epi64(x,y);}
Quuxplusone commented 15 years ago
I get this code with both llvm-gcc and gcc:

$ llvm-gcc t.c -S -o - -O3 -fomit-frame-pointer

_x:
    movl    4(%esp), %eax
    movq    %xmm0, (%eax)
    ret

Are you seeing something else?
Quuxplusone commented 15 years ago
Ah, this testcase repros it for me:

#include <emmintrin.h>
double G __attribute__((aligned(16)));
void x(__m128i y) {_mm_storel_epi64((__m128i*)&G, y);}

$ llvm-gcc-4.2 t.c -S -o - -O3 -fomit-frame-pointer -static
_x:
    movaps  %xmm0, _G
    ret

gcc:

_x:
    subl    $12, %esp
    movq    %xmm0, _G
    addl    $12, %esp
    ret
Quuxplusone commented 15 years ago
Ok, well it should probably be lowered by instcombine to one of these two
functions:

target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
target triple = "i386-apple-darwin8"
@G = common global double 0.000000e+00, align 16        ; <double*> [#uses=1]

define void @x(<2 x i64>* %x, <2 x i64> %y) nounwind  {
entry:
    %a = extractelement <2 x i64> %y, i32 0
    %b = bitcast i64 %a to double
    store double %b, double* @G
    ret void
}

define void @y(<2 x i64>* %x, <2 x i64> %y) nounwind  {
entry:
    %Y = bitcast <2 x i64> %y to <2 x double>
    %a = extractelement <2 x double> %Y, i32 0
    store double %a, double* @G, align 16
    ret void
}

It should probably be the second one, as the first one generates really bad
code.  This also exposes a codegen deficiency, as the second one doesn't
compile into movq anymore:

_x:
    subl    $44, %esp
    movaps  %xmm0, (%esp)
    movl    4(%esp), %eax
    movl    %eax, _G+4
    movl    (%esp), %eax
    movl    %eax, _G
    addl    $44, %esp
    ret
...
_y:
    movlpd  %xmm0, _G
    ret

If movq is better than movlpd, then this is a missed optimization.
Quuxplusone commented 15 years ago

In the short term, it seems best to just remove the 'x86_sse2_storel_dq' handling from instcombine. Evan, what do you think?

Quuxplusone commented 15 years ago
(In reply to comment #4)
> It should probably be the second one, as the first one generates really bad
> code.

Doesn't the second one run into issues with SNaNs?  Or are we assuming that the
extracted double never touches the 387 unit?
Quuxplusone commented 15 years ago

AFAIK movlpd is at least as fast as movq. Also, it's not really extracting a double; it just stores the lower 64 bits of the XMM register.
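
A hand-written sketch of that reading (not from the original thread; the name @storel is illustrative), in the same typed-pointer IR as above: the low 64 bits are moved as an i64, so no floating-point value is ever formed, and the SNaN question doesn't arise as long as codegen keeps the value out of the x87 unit.

define void @storel(i64* %p, <2 x i64> %y) nounwind {
entry:
    %lo = extractelement <2 x i64> %y, i32 0    ; raw low 64 bits of the XMM value
    store i64 %lo, i64* %p                      ; plain 64-bit integer store, no FP interpretation
    ret void
}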

Quuxplusone commented 15 years ago
This is fixed.
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20080714/065092.html

It now compiles to:
_x:
        movq    %xmm0, _G
        ret