Loop vectorizer produces extremely poor code for pattern fills

Quuxplusone commented 6 years ago


Bugzilla Link	PR37423
Status	NEW
Importance	P enhancement
Reported by	Fabian Giesen (fabian.giesen@epicgames.com)
Reported on	2018-05-11 12:32:39 -0700
Last modified on	2019-10-06 03:05:16 -0700
Version	6.0
Hardware	PC Windows NT
CC	andrea.dibiagio@gmail.com, florian_hahn@apple.com, greg.bedwell@sony.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also	PR20225

void patternFill(int *arr, int count)
{
    for (int i = 0; i < count; i++)
        arr[i] = (i & 1) ? 456 : 123;
}

With the clang 6.0 release, "clang -O2" on x86-64 turns this into a lot of code,
with the inner loop being:

.LBB0_8: # =>This Inner Loop Header: Depth=1
  movdqa %xmm1, %xmm7
  pand %xmm4, %xmm7
  movdqa %xmm2, %xmm0
  pand %xmm4, %xmm0
  pcmpeqd %xmm3, %xmm0
  pshufd $177, %xmm0, %xmm5 # xmm5 = xmm0[1,0,3,2]
  pand %xmm0, %xmm5
  pcmpeqd %xmm3, %xmm7
  pshufd $177, %xmm7, %xmm0 # xmm0 = xmm7[1,0,3,2]
  pand %xmm7, %xmm0
  shufps $136, %xmm0, %xmm5 # xmm5 = xmm5[0,2],xmm0[0,2]
  movaps %xmm5, %xmm0
  andnps %xmm8, %xmm0
  andps %xmm9, %xmm5
  orps %xmm0, %xmm5
  movups %xmm5, (%rdi,%rsi,4)
  movups %xmm5, 16(%rdi,%rsi,4)
  movups %xmm5, 32(%rdi,%rsi,4)
  movups %xmm5, 48(%rdi,%rsi,4)
  addq $16, %rsi
  paddq %xmm6, %xmm2
  paddq %xmm6, %xmm1
  addq $4, %rax
  jne .LBB0_8

By comparison, compiling with "-fno-vectorize" results in much better code (in
terms of both code size and expected execution time):

  movaps .LCPI0_0(%rip), %xmm0 # xmm0 = [123,456,123,456]
.LBB0_8: # =>This Inner Loop Header: Depth=1
  movups %xmm0, (%rdi,%rcx,4)
  addq $4, %rcx
  cmpq %rcx, %rdx
  jne .LBB0_8
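
For illustration, the store pattern in the "-fno-vectorize" output can be written by hand. This is only a sketch: `patternFillPairs` is a hypothetical name, not something clang or LLVM provides, and the block size of 4 is chosen to match the [123,456,123,456] constant shown above.

```c
/* Reference version from the report. */
void patternFill(int *arr, int count)
{
    for (int i = 0; i < count; i++)
        arr[i] = (i & 1) ? 456 : 123;
}

/* Hypothetical hand-written equivalent: the stored value depends only on
   the parity of i, so every block that starts at a multiple of 4 holds
   the same four constants. */
void patternFillPairs(int *arr, int count)
{
    static const int pattern[4] = {123, 456, 123, 456};
    int i = 0;

    /* Main loop: i stays a multiple of 4, so the block is loop-invariant.
       "count - i >= 4" avoids forming i + 4, which could overflow. */
    for (; count - i >= 4; i += 4)
        for (int k = 0; k < 4; k++)
            arr[i + k] = pattern[k];

    /* Remainder: at most 3 elements, indexed by i modulo 4. */
    for (; i < count; i++)
        arr[i] = pattern[i & 3];
}
```

Both functions should fill the array identically for any count, including counts that are not a multiple of 4.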

Quuxplusone commented 5 years ago

This is still an issue. It looks like we do not manage to fold the vector add, and, select chain that the vectorizer generates.
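
A rough scalar model of why that chain should be foldable: the vector induction variable advances by an even step, so the parity of each lane never changes and the select result is the same on every iteration. The sketch below assumes a vectorization factor of 4; `VF`, `selectChain` and `selectIsLoopInvariant` are hypothetical names used only for this illustration, not LLVM APIs.

```c
#include <stdbool.h>

enum { VF = 4 };  /* assumed vectorization factor (must be even) */

/* Model one vector iteration: the lanes hold base+0 .. base+VF-1 and the
   chain computes (lane & 1) ? 456 : 123 for each lane. */
void selectChain(int base, int out[VF])
{
    for (int k = 0; k < VF; k++) {
        int lane = base + k;        /* vector add of the lane offsets */
        int mask = lane & 1;        /* vector and */
        out[k] = mask ? 456 : 123;  /* vector select */
    }
}

/* Returns true if every iteration with base = 0, VF, 2*VF, ... below
   limit produces the same vector as the first iteration.
   limit is assumed to be well below INT_MAX. */
bool selectIsLoopInvariant(int limit)
{
    int first[VF], cur[VF];
    selectChain(0, first);
    for (int base = 0; base < limit; base += VF) {
        selectChain(base, cur);
        for (int k = 0; k < VF; k++)
            if (cur[k] != first[k])
                return false;
    }
    return true;
}
```

Since the result never changes, the whole chain could in principle be replaced by a single constant vector hoisted out of the loop, which is what the non-vectorized output effectively ends up with.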

Quuxplusone commented 5 years ago

Current Codegen: https://godbolt.org/z/R_K7FG

Quuxplusone commented 5 years ago

This variant also includes a const loop count: https://godbolt.org/z/OkPkSf

void patternFill_const(int *arr)
{
    for (int i = 0; i < 65536; i++)
        arr[i] = (i & 1) ? 456 : 123;
}

Here we should definitely keep the loop's indvar as an i32/vXi32 instead of
extending it to i64/vXi64 - with a constant trip count of 65536 we can
guarantee that i + #loopvectorelts never overflows.

That would improve the loop, but it probably wouldn't do much to help optimize
the select down to a constant mask.
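
The overflow argument is simple arithmetic and can be sketched as follows. `indvarFitsInI32` is a hypothetical helper for illustration only; it uses a conservative bound (trip count plus vector width) and does not reflect how LLVM actually performs this analysis.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: a loop running i = 0 .. tripCount-1 in steps of vf
   never forms a value larger than tripCount + vf. If that bound fits in
   a signed 32-bit integer, the induction variable does not need to be
   widened to 64 bits. */
bool indvarFitsInI32(int64_t tripCount, int64_t vf)
{
    if (tripCount < 0 || vf <= 0)
        return false;
    /* Rearranged from tripCount + vf <= INT32_MAX to avoid overflow. */
    return tripCount <= (int64_t)INT32_MAX - vf;
}
```

For the constant trip count of 65536 used above, the bound holds comfortably for any realistic vector width, whereas a trip count near INT32_MAX would not pass this check.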

Quuxplusone / LLVMBugzillaTest

Loop vectorizer produces extremely poor code for pattern fills #36396