Closed: Quuxplusone closed this issue 15 years ago
Ah, found it; here's a C testcase:
#include <emmintrin.h> /* SSE2; declares _mm_storel_epi64 */
void x(__m128i* x, __m128i y) {_mm_storel_epi64(x,y);}
I get this code with both llvm-gcc and gcc:
$ llvm-gcc t.c -S -o - -O3 -fomit-frame-pointer
_x:
movl 4(%esp), %eax
movq %xmm0, (%eax)
ret
Are you seeing something else?
Ah, this testcase repros it for me:
#include <emmintrin.h> /* SSE2; declares _mm_storel_epi64 */
double G __attribute__((aligned(16)));
void x(__m128i y) {_mm_storel_epi64((__m128i*)&G, y);}
$ llvm-gcc-4.2 t.c -S -o - -O3 -fomit-frame-pointer -static
_x:
movaps %xmm0, _G
ret
gcc:
_x:
subl $12, %esp
movq %xmm0, _G
addl $12, %esp
ret
Ok, well it should probably be lowered by instcombine to one of these two
functions:
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
target triple = "i386-apple-darwin8"
@G = common global double 0.000000e+00, align 16 ; <double*> [#uses=1]
define void @x(<2 x i64>* %x, <2 x i64> %y) nounwind {
entry:
%a = extractelement <2 x i64> %y, i32 0
%b = bitcast i64 %a to double
store double %b, double* @G
ret void
}
define void @y(<2 x i64>* %x, <2 x i64> %y) nounwind {
entry:
%Y = bitcast <2 x i64> %y to <2 x double>
%a = extractelement <2 x double> %Y, i32 0
store double %a, double* @G, align 16
ret void
}
It should probably be the second one, as the first one generates really bad
code. This also exposes a codegen deficiency, as the second one doesn't
compile into movq anymore:
_x:
subl $44, %esp
movaps %xmm0, (%esp)
movl 4(%esp), %eax
movl %eax, _G+4
movl (%esp), %eax
movl %eax, _G
addl $44, %esp
ret
...
_y:
movlpd %xmm0, _G
ret
If movq is better than movlpd, then this is a missed optimization.
In the short term, it seems best to just remove 'x86_sse2_storel_dq' from instcombine. Evan, what do you think?
(In reply to comment #4)
> It should probably be the second one, as the first one generates really bad
> code.
Doesn't the second one run into issues with SNaNs? Or are we assuming that the
extracted double never touches the 387 unit?
AFAIK movlpd is at least as fast as movq. Also, it's not really extracting a double; it just stores the low 64 bits of the XMM register.
This is fixed.
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20080714/065092.html
It now compiles to:
_x:
movq %xmm0, _G
ret
Obvious bug: visitCallInst shouldn't turn x86_sse2_storel_dq into a store because it's not equivalent to a full-width store.
Trivial patch, but I lost track of which testcase ran into this issue. Putting it here so it won't get lost.