Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

[SSE] Scalar intrinsics should use IntrinsicsInfo #23448

Open Quuxplusone opened 9 years ago

Quuxplusone commented 9 years ago
Bugzilla Link PR23449
Status NEW
Importance P normal
Reported by Ahmed Bougacha (ahmed@bougacha.org)
Reported on 2015-05-07 17:48:11 -0700
Last modified on 2015-06-20 10:53:00 -0700
Version trunk
Hardware PC All
CC ahmed@bougacha.org, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, quentin.colombet@gmail.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR23349

We shouldn't need dedicated patterns (and instructions !) for those.

This isn't as straightforward as packed intrinsics, because of types, but is still worth doing and not too complicated.

Quuxplusone commented 9 years ago

Ahmed - thanks for filing this (and hopefully fixing it!).

I thought about this with http://reviews.llvm.org/D9301 and again while working on bug 21507 ( http://reviews.llvm.org/D9504 ), but I didn't see a quick fix.

Quuxplusone commented 9 years ago
Yes, this might be trickier than I thought ;)

The few builtins still backed by intrinsics (e.g. maxss) are easy enough to
lower (just new IntrinsicsInfo opcodes should be enough?), but the scalar
patterns used to catch stuff like (movss/blend (op)) are the hard part, I think:

     def : Pat<(v4f32 (X86Movss (v4f32 VR128:$dst), (v4f32 (scalar_to_vector
           (Op (f32 (vector_extract (v4f32 VR128:$dst), (iPTR 0))),
           FR32:$src))))),
       (!cast<I>(OpcPrefix#SSrr_Int) v4f32:$dst,
           (COPY_TO_REGCLASS FR32:$src, VR128))>;

Because of the type difference, it's not obviously writable as SSrr (which
defines (f32 FR32)).  We can hack around that;  what do you guys think of
something like:

     def : Pat<(v4f32 (X86Movss (v4f32 VR128:$dst), (v4f32 (scalar_to_vector
           (Op (f32 (vector_extract (v4f32 VR128:$dst), (iPTR 0))),
           FR32:$src))))),
           (v4f32 (COPY_TO_REGCLASS
             (f32 (!cast<I>(OpcPrefix#SSrr)
               (COPY_TO_REGCLASS VR128:$dst, FR32),
               FR32:$src)),
             VR128))>;

I haven't looked too hard into the other users of _Int, but since the only
difference between _Int and non-_Int is VR vs FR, the same COPY tricks should
be enough everywhere.

I'm not sure the tricks are legal though:  copying a VR128 into an FR32
"discards" the high lanes, but I can't think of a way the coalescer and friends
could exploit that.
Quuxplusone commented 9 years ago

Nevermind, after discussing this with Matthias, I see that won't work (because of the spiller).

I'm out of ideas!

Quuxplusone commented 9 years ago
I (thankfully!) have not looked at x86 tablegen in a while, but one idea that I
had considered to reduce the pattern bloat was combining/lowering the vector
versions:

    // vector math op with insert via movss
    def : Pat<(v4f32 (X86Movss (v4f32 VR128:$dst),
          (Op (v4f32 VR128:$dst), (v4f32 VR128:$src)))),
      (!cast<I>(OpcPrefix#SSrr_Int) v4f32:$dst, v4f32:$src)>;

into scalar math ops...because it's possible that the scalar op could execute
faster than its vector op version on some micro-arch.

This would actually increase our reliance on those *_Int...but maybe there's
some inverse transform (scalar op -> vector op) that could be used?
Quuxplusone commented 9 years ago
I'm not sure I understand what you mean by "vector" / "scalar" op:  both the
scalar and the MOVSS/ADDSSrr_Int versions will lower into the exact same code
(a simple addss).

But I agree with combining this away into scalar code: we talked about this a
little more with Quentin & Matthias, and I think changing FR32 to be "anonymous
subregisters" of VR128 would be very helpful.  It'd let us do:

     def : Pat<(v4f32 (X86Movss (v4f32 VR128:$dst), (v4f32 (scalar_to_vector
           (Op (f32 (vector_extract (v4f32 VR128:$dst), (iPTR 0))),
           FR32:$src))))),
           (v4f32 (INSERT_SUBREG (VR128 undef),
             (f32 (!cast<I>(OpcPrefix#SSrr)
               (COPY_TO_REGCLASS VR128:$dst, FR32),
               FR32:$src))))>;