Bugzilla Link | PR35295 |
Status | RESOLVED FIXED |
Importance | P enhancement |
Reported by | Serguei Katkov (serguei.katkov@azul.com) |
Reported on | 2017-11-14 00:16:44 -0800 |
Last modified on | 2017-11-27 02:25:23 -0800 |
Version | trunk |
Hardware | PC Windows NT |
CC | craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also | PR35299 |
This looks like a missing opcode in X86ISelLowering::combineTruncatedArithmetic. Adding Simon.
I do not know whether it is important, but the same behavior is observable for a short array as well...
Also, the code for mul generated by icc is better than the code generated by clang... I guess the reason is close to this sub, but I'm not sure.
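(For reference, the loops being discussed look roughly like this. This is a reconstruction from the IR quoted later in the thread, not the actual 35295.c attachment, so the details may differ.)
/* Hypothetical reconstruction of the reduced test; not the original attachment. */
void sub_loop(char *dst, const char *src, int k, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[i] - k;   /* the sub that should be narrowed to i8 and vectorized */
}
void mul_loop(char *dst, const char *src, int k, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[i] * k;   /* the analogous mul case mentioned above */
}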
Hopefully, this will be solved with an instcombine fix like the one for bug 35299:
https://rise4fun.com/Alive/Ymp
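(The proof above is just the fact that truncation distributes over subtraction; a quick illustrative check in C, not part of the original report:)
#include <assert.h>
int main(void) {
  /* (unsigned char)(x - k) == (unsigned char)x - (unsigned char)k (mod 256),
     so the wide sub can be done in i8 after truncating both operands. */
  unsigned x = 1000, k = 300;
  assert((unsigned char)(x - k) ==
         (unsigned char)((unsigned char)x - (unsigned char)k));
  return 0;
}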
There's a hole in our optimization logic here. It's caused by the
'shouldChangeType()' limitation that we've placed on instcombine. It doesn't
affect x86 on this test because x86 has legal (from a datalayout perspective)
i8 and i16.
So we are able to narrow the scalar ops in these loops for x86 (pre-
vectorization) and that produces optimal code (once I've applied my sub fix):
for.body.lr.ph:
; 'k' is truncated before the loop
%0 = trunc i32 %k to i8
...
; ...and splatted before the loop
%broadcast.splatinsert8 = insertelement <16 x i8> undef, i8 %0, i32 0
%broadcast.splat9 = shufflevector <16 x i8> %broadcast.splatinsert8, <16 x i8> undef, <16 x i32> zeroinitializer
; ...and used as a loop invariant in the narrow vector sub.
%3 = sub <16 x i8> %wide.load, %broadcast.splat9
---------------------------------------------------------------------------
But on something like aarch64:
./clang 35295.c -S -o - -emit-llvm -O2 -fno-unroll-loops -target aarch64
This is all in the vector body:
%2 = trunc i32 %k to i8
%3 = insertelement <16 x i8> undef, i8 %2, i32 0
%4 = shufflevector <16 x i8> %3, <16 x i8> undef, <16 x i32> zeroinitializer
%5 = sub <16 x i8> %wide.load, %4
------------------------------------------------------------------------------
It's possible that some backend pass will fix this, but I don't like that we
cripple target-independent instcombine transforms based on the datalayout.
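(For reference, shouldChangeType() keys off the native integer widths in the datalayout string; quoting these from memory, so the exact strings may vary by release:)
x86-64:  ...-n8:16:32:64-...   ; i8/i16 are native, so instcombine will narrow to them
aarch64: ...-n32:64-...        ; i8/i16 are not native, so the narrowing is blocked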
(In reply to Serguei Katkov from comment #3)
> Also, the code for mul generated by icc is better than the code generated by
> clang... I guess the reason is close to this sub, but I'm not sure.
Yes - this is very similar to bug 35299.
This should be fixed in IR with:
https://reviews.llvm.org/rL318404
We might still want to add ISD::SUB to x86's list of narrowable binops in
combineTruncatedArithmetic(), but I think we should open another bug if that's
needed.
Also, someone looking at non-x86 optimizations might want to open a bug for the
problem mentioned in comment 5.
Interestingly, with this patch I see the following in the .ll file:
%136 = sub <32 x i8> %126, %88
%137 = sub <32 x i8> %129, %88
%138 = sub <32 x i8> %132, %88
%139 = sub <32 x i8> %135, %88
but in the assembly I see the following pattern:
b60: c4 81 7e 6f 94 0e a0 vmovdqu -0x360(%r14,%r9,1),%ymm2
b67: fc ff ff
b6a: c4 81 7e 6f 9c 0e c0 vmovdqu -0x340(%r14,%r9,1),%ymm3
b71: fc ff ff
b74: c4 81 7e 6f a4 0e e0 vmovdqu -0x320(%r14,%r9,1),%ymm4
b7b: fc ff ff
b7e: c4 81 7e 6f ac 0e 00 vmovdqu -0x300(%r14,%r9,1),%ymm5
b85: fd ff ff
b88: c4 e3 7d 19 d6 01 vextractf128 $0x1,%ymm2,%xmm6
b8e: c5 c9 f8 f1 vpsubb %xmm1,%xmm6,%xmm6
b92: c5 e9 f8 d0 vpsubb %xmm0,%xmm2,%xmm2
b96: c4 e3 6d 18 d6 01 vinsertf128 $0x1,%xmm6,%ymm2,%ymm2
b9c: c4 e3 7d 19 de 01 vextractf128 $0x1,%ymm3,%xmm6
ba2: c5 c9 f8 f1 vpsubb %xmm1,%xmm6,%xmm6
ba6: c5 e1 f8 d8 vpsubb %xmm0,%xmm3,%xmm3
baa: c4 e3 65 18 de 01 vinsertf128 $0x1,%xmm6,%ymm3,%ymm3
bb0: c4 e3 7d 19 e6 01 vextractf128 $0x1,%ymm4,%xmm6
bb6: c5 c9 f8 f1 vpsubb %xmm1,%xmm6,%xmm6
bba: c5 d9 f8 e0 vpsubb %xmm0,%xmm4,%xmm4
bbe: c4 e3 5d 18 e6 01 vinsertf128 $0x1,%xmm6,%ymm4,%ymm4
bc4: c4 e3 7d 19 ee 01 vextractf128 $0x1,%ymm5,%xmm6
bca: c5 c9 f8 f1 vpsubb %xmm1,%xmm6,%xmm6
bce: c5 d1 f8 e8 vpsubb %xmm0,%xmm5,%xmm5
bd2: c4 e3 55 18 ee 01 vinsertf128 $0x1,%xmm6,%ymm5,%ymm5
bd8: c4 81 7c 11 94 0c a0 vmovups %ymm2,-0x360(%r12,%r9,1)
bdf: fc ff ff
be2: c4 81 7c 11 9c 0c c0 vmovups %ymm3,-0x340(%r12,%r9,1)
be9: fc ff ff
bec: c4 81 7c 11 a4 0c e0 vmovups %ymm4,-0x320(%r12,%r9,1)
bf3: fc ff ff
bf6: c4 81 7c 11 ac 0c 00 vmovups %ymm5,-0x300(%r12,%r9,1)
bfd: fd ff ff
So it seems it does not want to do a psubb on ymm :)
Any idea why it happens?
Will try to look into it on Monday...
Never mind, I've re-run it on Skylake and it generates what is expected.
(In reply to Serguei Katkov from comment #8)
> Never mind, I've re-run it on Skylake and it generates what is expected.
The code in comment 6 is what we get with -mavx, right? This is a somewhat
known problem for the vectorizer. It sees that the target supports 256-bit
vectors, but because AVX1 is a half-step of a vector architecture (it has
256-bit floating-point ops but only 128-bit integer ops), we have to split
the integer ops back into 128-bit pieces.
It seems like we should be able to avoid that by tuning the cost model in this
case, so feel free to file a bug for that. :)
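(Something like this should show the difference; untested, but AVX2 is what adds the 256-bit integer ops, so -mavx2 or -march=skylake should give the single ymm vpsubb:)
./clang 35295.c -S -o - -O2 -fno-unroll-loops -mavx    # AVX1: sub split into two 128-bit vpsubb
./clang 35295.c -S -o - -O2 -fno-unroll-loops -mavx2   # AVX2: one 256-bit vpsubb on ymm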
(In reply to Sanjay Patel from comment #9)
> (In reply to Serguei Katkov from comment #8)
> > Never mind, I've re-run it on Skylake and it generates what is expected.
>
> The code in comment 6
That should have been: comment 7.
(In reply to Sanjay Patel from comment #9)
> (In reply to Serguei Katkov from comment #8)
> > Never mind, I've re-run it on Skylake and it generates what is expected.
>
> The code in comment 6 is what we get with -mavx, right? This is a somewhat
Right.