Quuxplusone / LLVMBugzillaTest


Wrong generation of 256/512-bit vperm* from 128-bit mov #39786

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR40815
Status CONFIRMED
Importance P enhancement
Reported by Gael Guennebaud (gael.guennebaud@gmail.com)
Reported on 2019-02-22 00:58:52 -0800
Last modified on 2019-04-28 08:09:55 -0700
Version 7.0
Hardware PC All
CC chtz@informatik.uni-bremen.de, craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@ndave.org, llvm-dev@redking.me.uk, neeilans@live.com, richard-llvm@metafoo.co.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
Clang 6 and 7, with -O2 and either AVX or AVX512, wrongly optimize some sequences
of 128-bit loads/stores when the source memory has already been loaded into a
256- or 512-bit register.

See the self-contained demo:

  https://godbolt.org/z/oFhMze

This issue has been discovered in Eigen
(http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1684). The above demo includes
both a self-contained example and some Eigen-based examples at the bottom.
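
To make the problematic pattern concrete, here is a minimal sketch of the kind
of code involved, assuming a plain 128-bit chunk reversal; the function name
and exact shape are mine, not taken from the Godbolt demo:

  // Hypothetical sketch (not the demo's exact code): copy a 64-byte block
  // while reversing its four 128-bit chunks. With -O2 and -mavx512f, clang
  // 6/7 may merge the four loads/stores into a single vpermps of the kind
  // shown below.
  #include <immintrin.h>

  void reverse_chunks(float *dst, const float *src) {
    __m128 a = _mm_loadu_ps(src + 0);
    __m128 b = _mm_loadu_ps(src + 4);
    __m128 c = _mm_loadu_ps(src + 8);
    __m128 d = _mm_loadu_ps(src + 12);
    _mm_storeu_ps(dst + 0,  d);   // dst[0..3]   = src[12..15]
    _mm_storeu_ps(dst + 4,  c);   // dst[4..7]   = src[8..11]
    _mm_storeu_ps(dst + 8,  b);   // dst[8..11]  = src[4..7]
    _mm_storeu_ps(dst + 12, a);   // dst[12..15] = src[0..3]
  }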

The problem is much clearer with AVX512 than with AVX, as it generates:

  vmovaps zmm0, zmmword ptr [rip + .LCPI2_0] # zmm0 = [3,4,5,6,2,3,4,5,1,2,3,4,0,1,2,3]
  vpermps zmm0, zmm0, zmmword ptr [rdi]

instead of:

  # zmm0 = [12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3]
  vpermps zmm0, zmm0, zmmword ptr [rdi]

(By the way, I'm very impressed that it folded all this code into a single
vpermps; too bad it's wrong.)
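
For reference, a 512-bit vpermps computes dst[i] = src[idx[i]] over the 16
float lanes, so the two constants produce very different results. A small
scalar model of that behaviour (mine, not part of the original report):

  #include <stdio.h>

  /* Per-lane behaviour of a 512-bit vpermps: dst[i] = src[idx[i] & 15]. */
  static void permps_model(float *dst, const unsigned *idx, const float *src) {
    for (int i = 0; i < 16; ++i)
      dst[i] = src[idx[i] & 15];
  }

  int main(void) {
    float src[16], dst[16];
    for (int i = 0; i < 16; ++i) src[i] = (float)i;

    /* Constant emitted by clang 6/7 (wrong) vs. the one that actually
       reverses the four 128-bit chunks (expected). */
    const unsigned emitted[16]  = {3,4,5,6, 2,3,4,5, 1,2,3,4, 0,1,2,3};
    const unsigned expected[16] = {12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3};

    permps_model(dst, emitted, src);
    printf("emitted : %g %g %g %g ...\n", dst[0], dst[1], dst[2], dst[3]);
    permps_model(dst, expected, src);
    printf("expected: %g %g %g %g ...\n", dst[0], dst[1], dst[2], dst[3]);
    return 0;
  }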

With the "trunk" version on godbolt, the issue does not show up as clang/llvm
does not try to generate vperm* but instead it generates a sequence of vinsert*.

I still reported this issue because:

1- It is not clear whether this issue has been properly identified, or whether
it is simply hidden in trunk waiting to pop up again.

2- It would be worth fixing the 7 branch.

3- Do you have any suggestions for working around this issue with clang 6/7 on
Eigen's side? The only foolproof solution I have so far is to ban
clang 6/7 + AVX{512} with a #error... That would be extremely bad, as it would
mean roughly 8x slowdowns of matrix products, linear solves, and the like with
clang 6/7 on AVX512.

4- Very minor: performance-wise, on AVX512 the vperm approach is usually
significantly faster than a sequence of vinsert, though vperm requires a full
cache line to hold the indices.
Quuxplusone commented 5 years ago

Note: In the self-contained test, the __attribute__((noinline)) at line 4 needs to be removed to reproduce the error.

Quuxplusone commented 5 years ago

My bad, I played too much with it before sharing. Let me share a cleaned-up link:

https://godbolt.org/z/NkFM5w

Quuxplusone commented 5 years ago

IR: https://godbolt.org/z/I5q2jP

An initial investigation suggests it's actually MergeConsecutiveStores that is messing up, creating an extract_subvector with unaligned indices (which asserts); if assertions are disabled, a shuffle containing misaligned indices is generated later.

Quuxplusone commented 5 years ago

Added test case at rL359401