Quuxplusone / LLVMBugzillaTest


Performance regression in SLPVectorize between llvm 10.0 and 11.0 #47455

Open Quuxplusone opened 3 years ago

Quuxplusone commented 3 years ago
Bugzilla Link PR48486
Status NEW
Importance P normal
Reported by David Parks (code.optimizer@gmail.com)
Reported on 2020-12-11 09:45:40 -0800
Last modified on 2020-12-28 11:01:37 -0800
Version 11.0
Hardware PC Linux
CC a.bataev@hotmail.com, craig.topper@gmail.com, htmldeveloper@gmail.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, rscottmanley@gmail.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments morphology.ll (699015 bytes, text/plain)
morphology-10.llvm (404156 bytes, text/plain)
morphology-11.llvm (408742 bytes, text/plain)
morphology-10.s (422733 bytes, text/x-tex)
morphology-11.s (462937 bytes, text/x-tex)
perf-10.lst (2474 bytes, text/plain)
perf-11.lst (2494 bytes, text/plain)
Blocks
Blocked by
See also
With llvm 11.0, changes to the heuristics and/or instruction costs used in
SLPVectorize.cpp (opt) have caused a 30% regression in overall application
performance with routine __nv_MorphologyPrimitive_F1L2849_2 in the attached
morphology.ll, as measured on an Intel Skylake 40 core Xeon server.

With llvm 10.0, SLPVectorize promotes some of the loops from using xmm pd to
ymm pd.  Those same transformations do not happen with llvm 11.0.
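
One rough way to check this from the attached IR is to count floating-point vector types in each opt output: `<2 x double>` corresponds to 128-bit (xmm) packed double and `<4 x double>` to 256-bit (ymm). The sketch below is only an illustration; the function names `count_fp_vector_types` and `compare` are hypothetical, and it assumes the files are textual IR such as morphology-10.llvm and morphology-11.llvm. Occurrence counts are a crude proxy and say nothing about which loops changed.

```python
import re
from collections import Counter

# Matches fixed-width floating-point vector types in textual LLVM IR,
# e.g. "<4 x double>" or "<2 x double>".
VECTOR_TYPE = re.compile(r"<(\d+) x (half|float|double)>")


def count_fp_vector_types(ir_text):
    """Return a Counter keyed by (lanes, element type), e.g. (4, "double")."""
    return Counter((int(lanes), ty) for lanes, ty in VECTOR_TYPE.findall(ir_text))


def compare(path_a, path_b):
    """Print occurrence counts of each FP vector type for two IR files."""
    with open(path_a) as f:
        a = count_fp_vector_types(f.read())
    with open(path_b) as f:
        b = count_fp_vector_types(f.read())
    for key in sorted(set(a) | set(b)):
        lanes, ty = key
        print(f"<{lanes} x {ty}>: {a[key]} vs {b[key]}")
```

Calling `compare("morphology-10.llvm", "morphology-11.llvm")` would then show whether the `<4 x double>` count drops between the two releases.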

Attached in SLPV.tar are:
morphology.ll (used as input for llvm opt releases 10 and 11)
morphology-10.llvm (output of opt using --opt-bisect-limit=778 - just after the
SLP pass) - exactly:

lim=778
opt -O2 -mcpu=skylake-avx512 --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math --enable-no-signed-zeros-fp-math --opt-bisect-limit=${lim} ./obj/magick/morphology.ll -S -o ./obj/magick/morphology-10.llvm

morphology-11.llvm
morphology-10.s output from llc invoked with:
-mcpu=skylake-avx512 -O2 --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math --enable-no-signed-zeros-fp-math -fast-isel=0 -non-global-value-max-name-size=4294967295 -x86-cmov-converter=0 -filetype=obj

perf-10.lst and perf-11.lst: snapshots of perf report of the most costly loop in
routine __nv_MorphologyPrimitive_F1L2849_2
Quuxplusone commented 3 years ago

David - please can you attach the files you mention?

Quuxplusone commented 3 years ago

Hi Simon,

When I originally uploaded the tarball, I missed the error saying that it exceeds the 1MB limit. What's your preferred way to have the file uploaded?

Thanks

Quuxplusone commented 3 years ago

Attached morphology.ll (699015 bytes, text/plain): Input to opt

Quuxplusone commented 3 years ago

Attached morphology-10.llvm (404156 bytes, text/plain): Output from opt llvm 10

Quuxplusone commented 3 years ago

Attached morphology-11.llvm (408742 bytes, text/plain): Output from opt llvm 11

Quuxplusone commented 3 years ago

Attached morphology-10.s (422733 bytes, text/x-tex): Output from llc llvm 10

Quuxplusone commented 3 years ago

Attached morphology-11.s (462937 bytes, text/x-tex): Output from llc llvm 11

Quuxplusone commented 3 years ago

Attached perf-10.lst (2474 bytes, text/plain): perf detailed report using llvm 10

Quuxplusone commented 3 years ago

Attached perf-11.lst (2494 bytes, text/plain): perf detailed report using llvm 11