Quuxplusone opened this issue 4 years ago
As noted here https://reviews.llvm.org/D68298
That patch may be the reason for this slowdown.
Hopefully the fix (https://reviews.llvm.org/D72214) will land before the 10.0 release.
https://www.phoronix.com/scan.php?page=news_item&px=AMD-Znver2-Clang-10-Tests
The PHP compile-time regression is not fixed - a 12% regression is quite significant, I believe.
(In reply to David Bolvansky from comment #1)
> As noted here https://reviews.llvm.org/D68298
>
> That patch may be the reason for this slowdown.
>
> Hopefully the fix (https://reviews.llvm.org/D72214) will land before the 10.0 release.
Florian, do you know the status here?
(In reply to Hans Wennborg from comment #3)
> (In reply to David Bolvansky from comment #1)
> > As noted here https://reviews.llvm.org/D68298
> >
> > That patch may be the reason for this slowdown.
> >
> > Hopefully the fix (https://reviews.llvm.org/D72214) will land before the 10.0 release.
>
> Florian, do you know the status here?
I've reverted the commit causing the regression on Jan 14:
https://reviews.llvm.org/rG192cce10f67e4f22be6d9b8c0975f78ad246d1bd
The patch was causing a compile-time regression, but I am not sure it caused the regressions mentioned in the article.
I just checked, and the revert is on the 10.x release branch.
(In reply to David Bolvansky from comment #2)
> https://www.phoronix.com/scan.php?page=news_item&px=AMD-Znver2-Clang-10-Tests
>
> The PHP compile-time regression is not fixed - a 12% regression is quite
> significant, I believe.
I bisected the PHP regression and ended up at this commit:
---
commit 6da79ce1fed52f828b5e5d2e3d274ef1692f17af
Author: Alina Sbirlea <asbirlea@google.com>
Date: Wed Sep 4 19:16:04 2019 +0000
[MemorySSA] Re-enable MemorySSA use.
Differential Revision: https://reviews.llvm.org/D58311
llvm-svn: 370957
---
Alina, can you take a look?
It would at least be interesting to know what code is being slow to compile and
why.
I bisected like this:
$ cd /work/php
$ wget https://www.php.net/distributions/php-7.4.2.tar.gz
$ tar xf php-7.4.2.tar.gz
And then from my llvm checkout:
$ ninja -C build.release -j1000 clang && \
  ( cd /work/php && rm -rf build && mkdir build && cd build && \
    CC=/work/llvm.monorepo/build.release/bin/clang \
    CXX=/work/llvm.monorepo/build.release/bin/clang++ ../php-7.4.2/configure && \
    time make -j50 )
The build step goes from about 12m25s to 13m27s at this revision.
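For reference, a bisection like this could also be scripted with git bisect run; the following is only a sketch, where the endpoint commits and the timing cutoff are illustrative assumptions, not the exact values used:
$ cd /work/llvm.monorepo
$ git bisect start <bad-commit> <good-commit>
$ cat > bisect.sh <<'EOF'
#!/bin/sh
# Exit 125 so git bisect skips revisions that fail to build.
ninja -C build.release -j1000 clang || exit 125
cd /work/php && rm -rf build && mkdir build && cd build
CC=/work/llvm.monorepo/build.release/bin/clang \
CXX=/work/llvm.monorepo/build.release/bin/clang++ ../php-7.4.2/configure >/dev/null
start=$(date +%s); make -j50 >/dev/null 2>&1; end=$(date +%s)
# "Good" if the build stays under ~12m50s, a cutoff between the 12m25s and 13m27s data points.
test $((end - start)) -lt 770
EOF
$ chmod +x bisect.sh
$ git bisect run ./bisect.sh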
A partial explanation (it mentions DSE, but it makes sense as a reason for this slowdown) is here:
https://reviews.llvm.org/D72700
"The performance problem's scope goes beyond DSE. The current pass pipeline for
both pass managers has a sequence of (gvn, memcpyopt, dse), where all of these
use MemDepAnalysis. Switching DSE to MemorySSA may initially get worse compile
times, as we need to build both MemDepAnalysis and MemorySSA, but switching all
three (use newgvn and port memcpyopt and dse), may be worthwhile. This is the
long term goal I have in mind."
I suggest turning off MemorySSA until all passes are ready.
I've opened a related report at bug #44889, where LoopRotate in particular is affected. Also referenced there is https://perf.rust-lang.org/compare.html?start=9993408439c871c1bc0c0bac3b3008b24039443c&end=52e999afd3be3341463f3b87be02f2bcfb381c17, which shows compile-time regressions in Rust when MemorySSA is enabled.
Enabling MemorySSA regresses optimized builds by 1-8%.
(Note that even with MemorySSA disabled, LLVM 10 *still* regresses opt build
compile-times by 4-12%.)
Some clarifications:
- MemorySSA is only enabled for some loop passes (opt flag: -enable-mssa-loop-dependency) and in EarlyCSEwMSSA, which is a FunctionPass. It is not enabled for DSE or other function passes (NewGVN is not used). The comment in https://bugs.llvm.org/show_bug.cgi?id=44408#c6 refers to ongoing work; nothing is turned on yet.
- AFAICT https://bugs.llvm.org/show_bug.cgi?id=44408#c7 refers to a regression with asserts enabled, due to verifications in MemorySSA.
- AFAICT https://bugs.llvm.org/show_bug.cgi?id=44408#c5 does not appear to be a regression due to verifications. Looking into it, thank you Hans.
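For reference, both configurations can be timed directly; a sketch with placeholder input files (the flag is passed straight to opt, but needs -mllvm to reach LLVM from clang):
$ bin/opt -O2 -enable-mssa-loop-dependency=false input.ll -o /dev/null
$ bin/clang -O2 -mllvm -enable-mssa-loop-dependency=false -c input.c -o /dev/null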
(In reply to Alina Sbirlea from comment #8)
> Some clarifications:
>
> - MemorySSA is only enabled for some loop passes (opt flag:
>   -enable-mssa-loop-dependency) and in EarlyCSEwMSSA, which is a FunctionPass.
>   It is not enabled for DSE or other function passes (NewGVN is not used).
>   The comment in https://bugs.llvm.org/show_bug.cgi?id=44408#c6 refers to
>   ongoing work; nothing is turned on yet.
I believe David's point here was that enabling MemorySSA in loop passes added four extra MemorySSA computations, because it is not preserved by intermediate passes, so the same issue exists here.
Looking at the pipeline log, it's peculiar that one of the MemorySSA
computations happens before a LoopRotate-only loop pass manager, which (as far
as I know) does not need MemorySSA, only preserves it -- is it possible to not
compute it in that case?
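For anyone wanting to reproduce that pipeline log with the legacy pass manager, a sketch (the input file is a placeholder):
$ bin/opt -O2 -debug-pass=Structure input.ll -o /dev/null 2>&1 | \
    grep -i -e 'memory ssa' -e 'loop pass manager'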
> - AFAICT https://bugs.llvm.org/show_bug.cgi?id=44408#c7 refers to a
> regression with asserts enabled due to verifications in MemorySSA.
As I mentioned on the other report, this regression is with assertions disabled. Regressions with assertions enabled are an unrelated (and thankfully easily fixed) issue.
Understood, here's a preliminary update.
Profiling the test in PR44889 shows a good amount of time is spent updating the dominator tree. I had a pending FIXME which may help address this.
I made a quick prototype to check whether this is indeed a relevant cause; for that IR, the regression (with release-noasserts) went from 9.8% to 6.6% on my machine.
So there is indeed some room to improve there, but the infrastructure change
may take a bit to get precisely right.
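For reference, a sketch of the kind of profiling that surfaces such hotspots, assuming a Linux host with perf and the reproducer from PR44889:
$ perf record -g -- bin/opt -O3 -enable-mssa-loop-dependency=true ~/extracted.bc -o /dev/null
$ perf report    # look for dominator tree update symbols near the top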
@Nikita:
Yes, we can have LoopRotate not compute MemorySSA when it is alone in the loop pass manager. We have such a mechanism in the new pass manager, but not in the legacy one.
I sent out https://reviews.llvm.org/D74574 which makes LoopRotate preserve
MemorySSA when available, but the side effect is also that it splits the loop
pass pipeline when LICM and unswitch are run right after.
I'm not seeing a regression on the IR in PR44889 with this, but I'd be happy to
get additional timings here.
In order to resolve the loop pass pipeline being split, I'd need to pass the
info that the analysis is required in the pass manager, at
lib/Transforms/IPO/PassManagerBuilder.cpp:401.
I haven't profiled the php regression yet.
https://reviews.llvm.org/D74574 is now in tree.
I also sent out https://reviews.llvm.org/D74640, which only removes the creation of MemorySSA for LoopRotate when the pass is run alone.
Would you mind testing if these preliminary fixes help?
Thanks!
I was curious about the ImageMagick regression mentioned in the Phoronix
article. I built it like this:
$ wget https://imagemagick.org/download/ImageMagick-7.0.9-24.tar.gz
$ tar xf ImageMagick-7.0.9-24.tar.gz
$ cd ImageMagick-7.0.9-24/
$ mkdir build && cd build
$ CC=/work/llvm.monorepo.rel/build.noasserts/bin/clang \
  CXX=/work/llvm.monorepo.rel/build.noasserts/bin/clang++ ../configure
$ time make -j50
I didn't see any 50% slow-down, but more like 10%. It's not clear exactly what
the Phoronix article is measuring though.
Anyway, bisection points to the same MemorySSA enabling commit as the PHP
bisection.
I also tried building with the commit and patch in #c11, but it doesn't seem to make any significant difference.
Attached mogrify.i.gz (108163 bytes, application/gzip): ImageMagick mogrify.c, preprocessed.
So... revert "[MemorySSA] Re-enable MemorySSA use."?
> I didn't see any 50% slow-down, but more like 10%.
10% between two commits? Or Clang 9 vs Clang 10? Any data for Clang 9?
(In reply to David Bolvansky from comment #14)
> So.. revert "[MemorySSA] Re-enable MemorySSA use." ?
Presumably that leads to better optimization, so I'm not sure we just want to
turn it off.
(In reply to David Bolvansky from comment #15)
> > I didn't see any 50% slow-down, but more like 10%.
>
> 10 % between two commits? Or Clang 9 vs Clang 10? Any data for Clang 9?
Between Clang 9 (roughly) and Clang 10.
> Presumably that leads to better optimization
So we need data to confirm this in general, so that Clang 10 can justify worse compile times with much better performance of the compiled binaries.
It is probably worth analyzing compile-time changes for the -Os/-Oz/-Og levels as well.
Here are some of the timings I see at LLVM ToT (without patch in #c11, but with
the already committed ones):
time bin/clang -mllvm -enable-mssa-loop-dependency=true -fopenmp -g -O2 -w \
    -mtune=broadwell -fexceptions -pthread -c ~/mogrify.i -fPIC -o /dev/null
real 0m5.919s
time bin/clang -mllvm -enable-mssa-loop-dependency=false -fopenmp -g -O2 -w \
    -mtune=broadwell -fexceptions -pthread -c ~/mogrify.i -fPIC -o /dev/null
real 0m5.822s
For the test in PR44889:
time bin/opt -O3 -enable-mssa-loop-dependency=true ~/extracted.bc
real 0m5.774s
time bin/opt -O3 -enable-mssa-loop-dependency=false ~/extracted.bc
real 0m5.701s
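To attribute the remaining delta to individual passes, opt's -time-passes flag could also be used; a sketch with the same input:
$ bin/opt -O3 -enable-mssa-loop-dependency=true -time-passes ~/extracted.bc \
    -o /dev/null 2>&1 | head -n 25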
PHP build with CC=/myrelease/bin/clang CXX=/myrelease/bin/clang++ ../php-7.4.2/configure && time make -j50
Clang 9 release:
real 29.243s
user 7m31.280s
ToT before the fix in D74574:
real 31.604s
user 7m51.633s
ToT now:
real 31.116s
user 7m49.442s
ToT with the MSSA flag flip reverted:
real 30.778s
user 7m44.797s
Thanks for the data! It seems MSSA does not make PHP compile times much worse...
Sigh. We need to find out what causes this big regression:
29.243s -> 30.778s
@hans, can you run a bisect between the Clang 9 branch commit and this MSSA commit?
(In reply to Alina Sbirlea from comment #18)
> Here are some of the timings I see at LLVM ToT (without patch in #c11, but
> with the already committed ones):
>
> time bin/clang -mllvm -enable-mssa-loop-dependency=true -fopenmp -g -O2 -w \
>     -mtune=broadwell -fexceptions -pthread -c ~/mogrify.i -fPIC -o /dev/null
> real 0m5.919s
> time bin/clang -mllvm -enable-mssa-loop-dependency=false -fopenmp -g -O2 -w \
>     -mtune=broadwell -fexceptions -pthread -c ~/mogrify.i -fPIC -o /dev/null
> real 0m5.822s
Setting the -enable-mssa-loop-dependency=false flag does not make any
difference for me. Building at the tip of the release/10.x branch (da0fe2ade36):
$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null
10.860 +- 0.147 seconds
$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null -enable-mssa-loop-dependency=false
10.8424 +- 0.0784 seconds
Reverting the patch, however, makes a big difference:
$ git revert 6da79ce1fed52f828b5e5d2e3d274ef1692f17af
$ ninja -j1000 clang
$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null -enable-mssa-loop-dependency=false
7.836 +- 0.137 seconds
For some reason, things are looking much better at master (0e5ed1b2626):
$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null -enable-mssa-loop-dependency=true
7.818 +- 0.146 seconds
Maybe that's due to D74574 (1326a5a4cfe)...
At 1326a5a4cfe^1:
$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null
10.7749 +- 0.0153 seconds
At 1326a5a4cfe:
$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null
7.8181 +- 0.0668 seconds
Okay, that's it.
But then I tried applying https://reviews.llvm.org/D74640 on top:
10.670 +- 0.168 seconds
So I guess we should not land that?
Hans,
Yes, these times sound right to me. Yes, D74574 should have made the difference.
The "-enable-mssa-loop-dependency" flag is an LLVM flag, so adding "-mllvm" in
front should work.
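For clarity, a sketch of the difference (without -mllvm the option is never forwarded to LLVM's option parser, so it has no effect on the pass pipeline):
$ bin/clang -O2 -mllvm -enable-mssa-loop-dependency=false -c /tmp/mogrify.i -o /dev/null   # reaches LLVM
$ bin/clang -O2 -enable-mssa-loop-dependency=false -c /tmp/mogrify.i -o /dev/null          # does not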
Sorry, I forgot to clarify in the previous comment: yes, there is an added regression with D74640 currently, and I'm not going to land it until I resolve the underlying cause. I tracked it down to building a DT when doing updates and put together a quick fix to test the theory; the regression with D74640 went away.
I want to do the change properly though, so until then, D74640 remains on hold.
Please let me know if I can help track down the other regression causes, or in
any other way.
(In reply to Alina Sbirlea from comment #21)
> Hans,
>
> Yes, these times sound right to me. Yes, D74574 should have made the
> difference.
Thanks! I have cherry-picked this to 10.x as
8b0df8e1ed6842095388fce08a0a5f761cd905ed
>
> The "-enable-mssa-loop-dependency" flag is an LLVM flag, so adding "-mllvm"
> in front should work.
D'oh, I didn't realize I dropped the -mllvm flag.
>
> I'm sorry I forgot to clarify in the previous comment, yes, there is an
> added regression with D74640 currently, and I'm not going to land it until I
> resolve the underlying cause. I tracked that down to building a DT when
> doing updates and put together a quick fix to test the theory; the
> regression with D74640 went away.
> I want to do the change properly though, so until then, D74640 remains on
> hold.
But do we need D74640 for the release? It seems like D74574 fixed the compile-
time regression. What's D74640 for?
We don't need D74640 for the release.
The purpose of D74640 is to re-merge the loop pass manager pipeline in the legacy pass manager, because (I think) we traditionally tried to run loop passes in the same pipeline, running all passes in the pipeline on one loop before moving to the next loop in the loop nest.
AFAICT this doesn't change the optimizations in this case, so it's not an issue
to have it split.
@Hans: Could you please also cherry-pick https://github.com/llvm/llvm-project/commit/f0b57d8071853ec2ab459c0492854c67ea4fa93c? It's only relevant for assertion-enabled builds, so mainly to make CI builds less slow.
(In reply to Nikita Popov from comment #24)
> @Hans: Could you please also cherry-pick
> https://github.com/llvm/llvm-project/commit/
> f0b57d8071853ec2ab459c0492854c67ea4fa93c? It's only relevant for
> assertion-enabled builds, so mainly to make CI builds less slow.
Sure. Pushed to 10.x as 593a0dda7a683df9b3744c6391bb2f8de9ed5908.
Some new numbers on current release/10.x branch: https://perf.rust-lang.org/compare.html?start=9381e8178b49636d4604e4ec0f1263960691c958&end=c6d04150234d5cb973ec5213d27a38e2a7b67955
Comparing with a previous experiment that disables MSSA, it looks like the MSSA related regressions are now reduced to 2-3%, which is "good enough" I think.
LLVM 10 still regresses optimized build times by 5-15%, but I've had little luck in figuring out why. At least for some benchmarks, ThinLTO import seems to have become significantly slower (jumping from 210 to 260 seconds for the top regression).
(In reply to Nikita Popov from comment #26)
> Some new numbers on current release/10.x branch:
> https://perf.rust-lang.org/compare.html?start=9381e8178b49636d4604e4ec0f1263960691c958&end=c6d04150234d5cb973ec5213d27a38e2a7b67955
>
> Comparing with a previous experiment that disables MSSA, it looks like the
> MSSA related regressions are now reduced to 2-3%, which is "good enough" I
> think.
>
> LLVM 10 still regresses optimized build times by 5-15%, but I've had little
> luck in figuring out why. At least for some benchmarks, ThinLTO import seems
> to have become significantly slower (jumping from 210 to 260 seconds for the
> top regression).
Is this on a release branch or at HEAD?
There was a slowdown incurred by D70404 "Always import constants", which I then
disabled by default in D73724. But I made a bunch of efficiency improvements to
ThinLTO importing in D73851 which should have actually sped things up even
without constant importing. I did re-enable constant importing in D74512 after
those fixes.
Could you try disabling constant importing to see if that fixes things? The internal LLVM option is -import-constants-with-refs=false. I'm not sure how that would get passed to Rust though. For gold or LLVM LTO links it would be -Wl,-plugin-opt,-import-constants-with-refs=false.
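For example, on a ThinLTO link driven by clang it might look like this (a sketch; file names are placeholders):
$ clang -flto=thin -O2 -c a.c b.c
$ clang -flto=thin -fuse-ld=lld a.o b.o \
    -Wl,-plugin-opt,-import-constants-with-refs=false -o app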
If this is on a release branch it would be interesting to see which of the
patches I referenced are on it.
> There was a slowdown incurred by D70404 "Always import constants", which I
> then disabled by default in D73724. But I made a bunch of efficiency
> improvements to ThinLTO importing in D73851 which should have actually sped
> things up even without constant importing. I did re-enable constant
> importing in D74512 after those fixes.
[..]
>
> If this is on a release branch it would be interesting to see which of the
> patches I referenced are on it.
D70404 / 10cadee "Always import constants" landed soon after the 10.x branch point, so it is not on the release branch, and neither are any of the follow-ups.
(In reply to Hans Wennborg from comment #28)
> > There was a slowdown incurred by D70404 "Always import constants", which I
> > then disabled by default in D73724. But I made a bunch of efficiency
> > improvements to ThinLTO importing in D73851 which should have actually sped
> > things up even without constant importing. I did re-enable constant
> > importing in D74512 after those fixes.
> [..]
> >
> > If this is on a release branch it would be interesting to see which of the
> > patches I referenced are on it.
>
>
> D70404 / 10cadee "Always import constants" landed soon after the 10.x branch
> point, so it is not on the release branch, and neither are any of the
> follow-ups.
It might be worth trying to add D73851 to the branch regardless, as it should
speed up the general variable importing case as well.
Nikita, can you try HEAD to see if the importing is faster than with 10.x?
@Teresa: Thanks for the pointer! I gave cherry-picking D73851 a try, but the numbers came out pretty much the same (https://perf.rust-lang.org/compare.html?start=75cf41afb468152611212271bae026948cd3ba46&end=5ba7b02a22f34df566a821a0141548e567964d23).
(In reply to Nikita Popov from comment #30)
> @Teresa: Thanks for the pointer! I gave cherry-picking D73851 a try, but the
> numbers came out pretty much the same
> (https://perf.rust-lang.org/compare.html?start=75cf41afb468152611212271bae026948cd3ba46&end=5ba7b02a22f34df566a821a0141548e567964d23).
Hmm, ok, not really sure what could have made the thin link get significantly
slower. Would you be able to give me a reproducer?
I wanted to ping this bug in case there has been any more progress. We have ~1 month left to get fixes into 10.0.1.