Closed: Quuxplusone closed this issue 15 years ago
Ah, found it; here's a C testcase:
#include <emmintrin.h> /* SSE2; declares _mm_storel_epi64 */
void x(__m128i* x, __m128i y) {_mm_storel_epi64(x,y);}
I get this code with both llvm-gcc and gcc:
$ llvm-gcc t.c -S -o - -O3 -fomit-frame-pointer
_x:
movl 4(%esp), %eax
movq %xmm0, (%eax)
ret
Are you seeing something else?
Ah, this testcase repros it for me:
#include <emmintrin.h> /* SSE2; declares _mm_storel_epi64 */
double G __attribute__((aligned(16)));
void x(__m128i y) {_mm_storel_epi64((__m128i*)&G, y);}
$ llvm-gcc-4.2 t.c -S -o - -O3 -fomit-frame-pointer -static
_x:
movaps %xmm0, _G
ret
gcc:
_x:
subl $12, %esp
movq %xmm0, _G
addl $12, %esp
ret
Ok, well it should probably be lowered by instcombine to one of these two
functions:
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
target triple = "i386-apple-darwin8"
@G = common global double 0.000000e+00, align 16 ; <double*> [#uses=1]
define void @x(<2 x i64>* %x, <2 x i64> %y) nounwind {
entry:
%a = extractelement <2 x i64> %y, i32 0
%b = bitcast i64 %a to double
store double %b, double* @G
ret void
}
define void @y(<2 x i64>* %x, <2 x i64> %y) nounwind {
entry:
%Y = bitcast <2 x i64> %y to <2 x double>
%a = extractelement <2 x double> %Y, i32 0
store double %a, double* @G, align 16
ret void
}
It should probably be the second one, as the first one generates really bad
code. This also exposes a codegen deficiency, as the second one doesn't
compile into movq anymore:
_x:
subl $44, %esp
movaps %xmm0, (%esp)
movl 4(%esp), %eax
movl %eax, _G+4
movl (%esp), %eax
movl %eax, _G
addl $44, %esp
ret
...
_y:
movlpd %xmm0, _G
ret
If movq is better than movlpd, then this is a missed optimization.
In the short term, it seems best to just remove 'x86_sse2_storel_dq' from instcombine. Evan, what do you think?
(In reply to comment #4)
> It should probably be the second one, as the first one generates really bad
> code.
Doesn't the second one run into issues with SNaNs? Or are we assuming that the
extracted double never touches the 387 unit?
AFAIK movlpd is at least as fast as movq. Also, it's not really extracting a double; it just stores the low 64 bits of the XMM register.
This is fixed.
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20080714/065092.html
It now compiles to:
_x:
movq %xmm0, _G
ret
Obvious bug: visitCallInst shouldn't turn x86_sse2_storel_dq into a store because it's not equivalent to a full-width store.
Trivial patch, but I lost track of which testcase ran into this issue. Putting it here so it won't get lost.