Quuxplusone / LLVMBugzillaTest


spilling general registers through xmm registers #46843

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR47874
Status NEW
Importance P normal
Reported by Jeff Roberts (jeffr@radgametools.com)
Reported on 2020-10-16 00:11:46 -0700
Last modified on 2020-10-16 15:41:31 -0700
Version 11.0
Hardware PC Windows NT
CC craig.topper@gmail.com, efriedma@quicinc.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, pengfei.wang@intel.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
We just updated to clang 11 (from 9), and I'm seeing a lot of this kind of
codegen, when spilling registers to the stack:

 66 0f 6e c1                    movd    %ecx, %xmm0
 66 0f 7f 85 60 fe ff ff        movdqa  %xmm0, -416(%rbp)
 66 0f 6e c6                    movd    %esi, %xmm0
 66 0f 7f 85 70 fe ff ff        movdqa  %xmm0, -400(%rbp)

What's going on here?  Why aren't we just copying ecx and esi directly to those
spill locations?
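For comparison, a direct spill of each 32-bit GPR would be a single plain store, something like this (a sketch of the expected codegen, not taken from the actual build):

 89 8d 60 fe ff ff              movl    %ecx, -416(%rbp)
 89 b5 70 fe ff ff              movl    %esi, -400(%rbp)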
Quuxplusone commented 4 years ago
Could you provide a small reproducer?
Are these just spills, or do they also need to clear the upper bits 127:32?
Quuxplusone commented 4 years ago
(In reply to Pengfei Wang from comment #1)
> Could you provide a small reproducer?
> Are these just spills, or do they also need to clear the upper bits 127:32?

It's in the middle of an enormous function (to trigger the spill), sadly.

It does (much later) load all 128 bits from that address, but then does a bunch
of *scalar* SSE on it, so bits 127:32 are never used.
Quuxplusone commented 4 years ago

Are you able to share the enormous function?

Quuxplusone commented 4 years ago
(In reply to Craig Topper from comment #3)
> Are you able to share the enormous function?

I don't think so - it's most of the entire guts of our codec.  LLVM 11 seems to
inline a LOT more functions at the expense of size, so I think I'd have to send
the entire file.  If I add even one or two noinlines, the problem goes away.

Separately, what were the inline-threshold changes from 9 to 11?
Quuxplusone commented 4 years ago
Synthetic testcase with the described behavior:

#include <emmintrin.h>
void a(__m128 *z, __m128 *z2, int f, int n) {
    for (int i = 0; i < n; ++i) {
        asm("":::"xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6",
                 "xmm7","xmm8","xmm9","xmm10","xmm11","xmm12","xmm13",
                 "xmm14","xmm15");
        z[i] = _mm_add_ss(z[i], _mm_set_ss(__builtin_bit_cast(float, n)));
    }
}

I have no idea if that's anything close to the original code, though.
Quuxplusone commented 4 years ago

That code is very different, but yeah, the emitted code is very similar: the weird promotion to xmm, and then reloading just the original 32 bits...

Quuxplusone commented 4 years ago

(I like the trick to force a spill, btw, heh...)