
Potential improvement for vectorization of sub #34268

Closed Quuxplusone closed 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR35295
Status RESOLVED FIXED
Importance P enhancement
Reported by Serguei Katkov (serguei.katkov@azul.com)
Reported on 2017-11-14 00:16:44 -0800
Last modified on 2017-11-27 02:25:23 -0800
Version trunk
Hardware PC Windows NT
CC craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR35299
Hi,

I'm not sure that the example below is valid, but I'd still like to understand
the difference in vectorizer behavior between add and sub.

Consider the following two C++ functions (https://godbolt.org/g/WxjdMP):
void testAdd(signed char a[], int N, signed int k) {
  for (int i = 0; i < N; i++)
    a[i] += k;
}
void testSub(signed char a[], int N, signed int k) {
  for (int i = 0; i < N; i++)
    a[i] -= k;
}

For testAdd the vectorizer is able to use the packed byte-size paddb instruction,
while for testSub it is not. Instead it uses psubd, resulting in a lot of
additional code to convert the vector of i8 to a vector of i32 and back.

I wonder whether it is possible to use psubb in the second case as well.
I noticed that both gcc and icc are able to do that...
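
To make the gap concrete, here is a hand-written LLVM IR sketch (not actual
compiler output; the names and vector width are assumed) of the two shapes
the vectorized loop body can take for a[i] -= k:

; What clang currently emits for testSub: widen each i8 lane to i32,
; subtract as <16 x i32> (selected as psubd), then narrow back.
define <16 x i8> @wide_form(<16 x i8> %wide.load, <16 x i32> %k.splat) {
  %ext = sext <16 x i8> %wide.load to <16 x i32>
  %sub = sub <16 x i32> %ext, %k.splat
  %res = trunc <16 x i32> %sub to <16 x i8>
  ret <16 x i8> %res
}

; What gcc and icc manage: a single narrow sub, selected as psubb.
; Equivalent, because only the low 8 bits of each lane survive the trunc.
define <16 x i8> @narrow_form(<16 x i8> %wide.load, <16 x i8> %k.splat) {
  %sub = sub <16 x i8> %wide.load, %k.splat
  ret <16 x i8> %sub
}
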
Quuxplusone commented 6 years ago

This looks like a missing opcode in X86ISelLowering::combineTruncatedArithmetic. Adding Simon

Quuxplusone commented 6 years ago

I do not know whether it is important, but the same behavior is observable for short arrays as well...

Quuxplusone commented 6 years ago

Also, the code for mul generated by icc is better than the code generated by clang... I guess the reason is close to this sub case, but I'm not sure.

Quuxplusone commented 6 years ago
Hopefully, this will be solved with an instcombine fix like bug 35299:
https://rise4fun.com/Alive/Ymp
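
For reference, a minimal IR sketch of that fold (my own example, not the
exact Alive input): truncation distributes over sub, so the wide subtract
can be rewritten as a narrow one.

; before: the sub is performed in i32 and then truncated
define i8 @src(i8 %x, i32 %k) {
  %sx = sext i8 %x to i32
  %s  = sub i32 %sx, %k
  %t  = trunc i32 %s to i8
  ret i8 %t
}

; after: trunc(sub X, Y) == sub(trunc X, trunc Y), and trunc(sext x) == x
define i8 @tgt(i8 %x, i32 %k) {
  %tk = trunc i32 %k to i8
  %s  = sub i8 %x, %tk
  ret i8 %s
}
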
Quuxplusone commented 6 years ago
There's a hole in our optimization logic here. It's caused by the
'shouldChangeType()' limitation that we've placed on instcombine. It doesn't
affect x86 on this test because x86 has legal (from a datalayout perspective)
i8 and i16.

So we are able to narrow the scalar ops in these loops for x86 (pre-
vectorization) and that produces optimal code (once I've applied my sub fix):

for.body.lr.ph:
  ; 'k' is truncated before the loop
  %0 = trunc i32 %k to i8
...
  ; ...and splatted before the loop
  %broadcast.splatinsert8 = insertelement <16 x i8> undef, i8 %0, i32 0
  %broadcast.splat9 = shufflevector <16 x i8> %broadcast.splatinsert8, <16 x i8> undef, <16 x i32> zeroinitializer

  ; ...and used as a loop invariant in the narrow vector sub.
  %3 = sub <16 x i8> %wide.load, %broadcast.splat9

---------------------------------------------------------------------------

But on something like aarch64:
./clang 35295.c -S -o - -emit-llvm -O2 -fno-unroll-loops -target aarch64

This is all in the vector body:
  %2 = trunc i32 %k to i8
  %3 = insertelement <16 x i8> undef, i8 %2, i32 0
  %4 = shufflevector <16 x i8> %3, <16 x i8> undef, <16 x i32> zeroinitializer
  %5 = sub <16 x i8> %wide.load, %4

------------------------------------------------------------------------------

It's possible that some backend pass will fix this, but I don't like that we
cripple target-independent instcombine transforms based on the datalayout.
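
To illustrate the limitation with a sketch (datalayout strings assumed and
abbreviated): instcombine's shouldChangeType() only narrows to integer widths
listed as native ("n...") in the datalayout. x86 targets carry n8:16:32:64,
so shrinking the scalar sub to i8 is allowed; aarch64 carries n32:64, so the
same transform is blocked and the trunc/splat stays inside the vector body.

; The scalar pattern instcombine must shrink before vectorization.
; With "...-n8:16:32:64" (x86-style) the sub is narrowed to i8;
; with "...-n32:64" (aarch64-style) shouldChangeType() rejects it.
target datalayout = "e-m:e-i64:64-n8:16:32:64"

define void @store_sub(i8* %p, i32 %k) {
  %v   = load i8, i8* %p
  %sx  = sext i8 %v to i32
  %sub = sub i32 %sx, %k
  %t   = trunc i32 %sub to i8
  store i8 %t, i8* %p
  ret void
}
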
Quuxplusone commented 6 years ago
(In reply to Serguei Katkov from comment #3)
> Also, the code for mul generated by icc is better than the code generated by
> clang... I guess the reason is close to this sub case, but I'm not sure.

Yes - this is very similar to bug 35299.

This should be fixed in IR with:
https://reviews.llvm.org/rL318404

We might still want to add ISD::SUB to x86's list of narrowable binops in
combineTruncatedArithmetic(), but I think we should open another bug if that's
needed.
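
For completeness, the DAG-level pattern such a combine would catch is just
the vector analogue of the same fold (shapes assumed):

; trunc(sub(x, y)) --> sub(trunc(x), trunc(y)), letting the backend
; select psubb on <16 x i8> instead of psubd plus repacking code.
define <16 x i8> @trunc_sub(<16 x i32> %x, <16 x i32> %y) {
  %s = sub <16 x i32> %x, %y
  %t = trunc <16 x i32> %s to <16 x i8>
  ret <16 x i8> %t
}
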

Also, someone looking at non-x86 optimizations might want to open a bug for the
problem mentioned in comment 5.
Quuxplusone commented 6 years ago
Interestingly, with this patch applied I see in the .ll file:
  %136 = sub <32 x i8> %126, %88
  %137 = sub <32 x i8> %129, %88
  %138 = sub <32 x i8> %132, %88
  %139 = sub <32 x i8> %135, %88

but in the assembly I see the following pattern:
     b60:       c4 81 7e 6f 94 0e a0    vmovdqu -0x360(%r14,%r9,1),%ymm2
     b67:       fc ff ff
     b6a:       c4 81 7e 6f 9c 0e c0    vmovdqu -0x340(%r14,%r9,1),%ymm3
     b71:       fc ff ff
     b74:       c4 81 7e 6f a4 0e e0    vmovdqu -0x320(%r14,%r9,1),%ymm4
     b7b:       fc ff ff
     b7e:       c4 81 7e 6f ac 0e 00    vmovdqu -0x300(%r14,%r9,1),%ymm5
     b85:       fd ff ff
     b88:       c4 e3 7d 19 d6 01       vextractf128 $0x1,%ymm2,%xmm6
     b8e:       c5 c9 f8 f1             vpsubb %xmm1,%xmm6,%xmm6
     b92:       c5 e9 f8 d0             vpsubb %xmm0,%xmm2,%xmm2
     b96:       c4 e3 6d 18 d6 01       vinsertf128 $0x1,%xmm6,%ymm2,%ymm2
     b9c:       c4 e3 7d 19 de 01       vextractf128 $0x1,%ymm3,%xmm6
     ba2:       c5 c9 f8 f1             vpsubb %xmm1,%xmm6,%xmm6
     ba6:       c5 e1 f8 d8             vpsubb %xmm0,%xmm3,%xmm3
     baa:       c4 e3 65 18 de 01       vinsertf128 $0x1,%xmm6,%ymm3,%ymm3
     bb0:       c4 e3 7d 19 e6 01       vextractf128 $0x1,%ymm4,%xmm6
     bb6:       c5 c9 f8 f1             vpsubb %xmm1,%xmm6,%xmm6
     bba:       c5 d9 f8 e0             vpsubb %xmm0,%xmm4,%xmm4
     bbe:       c4 e3 5d 18 e6 01       vinsertf128 $0x1,%xmm6,%ymm4,%ymm4
     bc4:       c4 e3 7d 19 ee 01       vextractf128 $0x1,%ymm5,%xmm6
     bca:       c5 c9 f8 f1             vpsubb %xmm1,%xmm6,%xmm6
     bce:       c5 d1 f8 e8             vpsubb %xmm0,%xmm5,%xmm5
     bd2:       c4 e3 55 18 ee 01       vinsertf128 $0x1,%xmm6,%ymm5,%ymm5
     bd8:       c4 81 7c 11 94 0c a0    vmovups %ymm2,-0x360(%r12,%r9,1)
     bdf:       fc ff ff
     be2:       c4 81 7c 11 9c 0c c0    vmovups %ymm3,-0x340(%r12,%r9,1)
     be9:       fc ff ff
     bec:       c4 81 7c 11 a4 0c e0    vmovups %ymm4,-0x320(%r12,%r9,1)
     bf3:       fc ff ff
     bf6:       c4 81 7c 11 ac 0c 00    vmovups %ymm5,-0x300(%r12,%r9,1)
     bfd:       fd ff ff

So it seems it does not want to do psubb on ymm :)
Any idea why that happens?
I will try to look into it on Monday...
Quuxplusone commented 6 years ago

Never mind, I've re-run it on Skylake and it generates what is expected.

Quuxplusone commented 6 years ago
(In reply to Serguei Katkov from comment #8)
> Never mind, I've re-run it on Skylake and it generates what is expected.

The code in comment 6 is what we get with -mavx, right? This is a somewhat
well-known problem for the vectorizer. It sees that the target supports
256-bit vectors, but because AVX1 is only a half-step of a vector
architecture, we have to split integer ops back into 128-bit pieces.

It seems like we should be able to avoid that by tuning the cost model in this
case, so feel free to file a bug for that. :)
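
For whoever files it, a minimal reproducer sketch (target features assumed):
with only +avx, the <32 x i8> sub below is legalized by splitting into two
128-bit vpsubb halves glued with vextractf128/vinsertf128, exactly the
pattern in comment 7; with +avx2 it selects a single vpsubb on a ymm register.

define <32 x i8> @sub32(<32 x i8> %a, <32 x i8> %b) #0 {
  %s = sub <32 x i8> %a, %b
  ret <32 x i8> %s
}
attributes #0 = { "target-features"="+avx" }
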
Quuxplusone commented 6 years ago
(In reply to Sanjay Patel from comment #9)
> (In reply to Serguei Katkov from comment #8)
> > Never mind, I've re-run it on Skylake and it generates what is expected.
>
> The code in comment 6

That should have been: comment 7.
Quuxplusone commented 6 years ago
(In reply to Sanjay Patel from comment #9)
> (In reply to Serguei Katkov from comment #8)
> > Never mind, I've re-run it on Skylake and it generates what is expected.
>
> The code in comment 6 is what we get with -mavx, right? This is a somewhat
Right.