Significant compile time regressions since Clang 9 #43378

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR44408
Status NEW
Importance P enhancement
Reported by David Bolvansky (david.bolvansky@gmail.com)
Reported on 2019-12-30 09:21:09 -0800
Last modified on 2020-05-20 14:51:50 -0700
Version trunk
Hardware PC Linux
CC abrahm.scully@gmail.com, alina.sbirlea@gmail.com, blitzrakete@gmail.com, craig.topper@gmail.com, dblaikie@gmail.com, dgregor@apple.com, dmajor@bugmail.cc, erik.pilkington@gmail.com, florian_hahn@apple.com, hans@chromium.org, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, nikita.ppv@gmail.com, richard-llvm@metafoo.co.uk, spatel+llvm@rotateright.com, sylvestre@debian.org, tejohnson@google.com, tstellar@redhat.com, yuanfang.chen@sony.com
Fixed by commit(s)
Attachments mogrify.i.gz (108163 bytes, application/gzip)
Blocks PR45309, PR44555
Blocked by
See also PR44539
A recent Phoronix article mentions huge compile-time / performance regressions.

For example:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-3960x&num=4

ImageMagick - 50% worse compile time.
PHP - 20% worse compile time.
Quuxplusone commented 4 years ago
As noted here https://reviews.llvm.org/D68298

That patch may be the reason for this slowdown.

Hopefully the fix (https://reviews.llvm.org/D72214) will land before the 10.0
release.
Quuxplusone commented 4 years ago
https://www.phoronix.com/scan.php?page=news_item&px=AMD-Znver2-Clang-10-Tests

The PHP compile-time regression is not fixed; 12% is quite a significant
regression, I believe.
Quuxplusone commented 4 years ago
(In reply to David Bolvansky from comment #1)
> As noted here https://reviews.llvm.org/D68298
>
> That patch may be the reason for this slowdown.
>
>
> Hopefully the fix (https://reviews.llvm.org/D72214) will land before the 10.0
> release.

Florian, do you know the status here?
Quuxplusone commented 4 years ago
(In reply to Hans Wennborg from comment #3)
> (In reply to David Bolvansky from comment #1)
> > As noted here https://reviews.llvm.org/D68298
> >
> > That patch may be the reason for this slowdown.
> >
> >
> > Hopefully the fix (https://reviews.llvm.org/D72214) will land before the 10.0
> > release.
>
> Florian, do you know the status here?

I've reverted the commit causing the regression on Jan 14:
https://reviews.llvm.org/rG192cce10f67e4f22be6d9b8c0975f78ad246d1bd

The patch was causing a compile-time regression, but I am not sure the patch
caused the ones mentioned in the article.

I just checked and it looks like it is on the 10.x release branch.
Quuxplusone commented 4 years ago
(In reply to David Bolvansky from comment #2)
> https://www.phoronix.com/scan.php?page=news_item&px=AMD-Znver2-Clang-10-Tests
>
>
> The PHP compile-time regression is not fixed; 12% is quite a significant
> regression, I believe.

I bisected the PHP regression and ended up at this commit

---
commit 6da79ce1fed52f828b5e5d2e3d274ef1692f17af
Author: Alina Sbirlea <asbirlea@google.com>
Date:   Wed Sep 4 19:16:04 2019 +0000

    [MemorySSA] Re-enable MemorySSA use.

    Differential Revision: https://reviews.llvm.org/D58311

    llvm-svn: 370957
---

Alina, can you take a look?
It would at least be interesting to know what code is being slow to compile and
why.

I bisected like this:

$ cd /work/php
$ wget https://www.php.net/distributions/php-7.4.2.tar.gz
$ tar xf php-7.4.2.tar.gz

And then from my llvm checkout:

$ ninja -C build.release -j1000 clang && \
  ( cd /work/php && rm -rf build && mkdir build && cd build && \
    CC=/work/llvm.monorepo/build.release/bin/clang \
    CXX=/work/llvm.monorepo/build.release/bin/clang++ ../php-7.4.2/configure && \
    time make -j50 )

The build step goes from about 12m25s to 13m27s at this revision.
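
For reference, a sketch of how such a bisection could be automated with git
bisect run (time-php-build.sh is a hypothetical wrapper around the steps above;
the 780-second threshold sits between the good ~12m25s and bad ~13m27s
timings):

$ git bisect start <bad-commit> <good-commit>
$ git bisect run ./time-php-build.sh

#!/bin/sh
# time-php-build.sh (hypothetical): fail when the PHP build is slow.
ninja -C build.release -j1000 clang || exit 125   # 125 = skip broken revision
cd /work/php && rm -rf build && mkdir build && cd build || exit 125
CC=/work/llvm.monorepo/build.release/bin/clang \
CXX=/work/llvm.monorepo/build.release/bin/clang++ ../php-7.4.2/configure || exit 125
start=$(date +%s)
make -j50 || exit 125
elapsed=$(( $(date +%s) - start ))
[ "$elapsed" -lt 780 ]   # exit 1 (= bad commit) when slower than 13 minutes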
Quuxplusone commented 4 years ago
Partial explanation (mentions DSE, but it makes sense as a reason for this
slowdown) is here:
https://reviews.llvm.org/D72700

"The performance problem's scope goes beyond DSE. The current pass pipeline for
both pass managers has a sequence of (gvn, memcpyopt, dse), where all of these
use MemDepAnalysis. Switching DSE to MemorySSA may initially get worse compile
times, as we need to build both MemDepAnalysis and MemorySSA, but switching all
three (use newgvn and port memcpyopt and dse), may be worthwhile. This is the
long term goal I have in mind."

I suggest turning off MemorySSA until all passes are ready.
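
For reference, a minimal sketch of what "turning off" would look like with the
flag used later in this thread (-enable-mssa-loop-dependency controls only the
loop-pass use of MemorySSA; it is an internal LLVM option, so clang needs
-mllvm in front; foo.c and foo.ll are placeholder inputs):

# Disable MemorySSA in the loop passes from the clang driver:
$ clang -O2 -mllvm -enable-mssa-loop-dependency=false -c foo.c -o foo.o
# The same flag is accepted directly by opt:
$ opt -O2 -enable-mssa-loop-dependency=false foo.ll -o foo.opt.ll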
Quuxplusone commented 4 years ago
I've opened a related report at bug #44889, where LoopRotate in particular is
affected. Also referenced there is
https://perf.rust-lang.org/compare.html?start=9993408439c871c1bc0c0bac3b3008b24039443c&end=52e999afd3be3341463f3b87be02f2bcfb381c17,
which shows compile-time regressions in Rust when MemorySSA is enabled.
Enabling MemorySSA regresses optimized builds by 1-8%.

(Note that even with MemorySSA disabled, LLVM 10 *still* regresses opt build
compile-times by 4-12%.)
Quuxplusone commented 4 years ago

Some clarifications:

- MemorySSA is only enabled for some loop passes (opt flag:
  -enable-mssa-loop-dependency) and in EarlyCSEwMSSA, which is a FunctionPass.
  It is not enabled for DSE or other function passes (new gvn is not used).
  The comment in https://bugs.llvm.org/show_bug.cgi?id=44408#c6 refers to
  ongoing work; nothing is turned on yet.
- AFAICT https://bugs.llvm.org/show_bug.cgi?id=44408#c7 refers to a regression
  with asserts enabled due to verifications in MemorySSA.

Quuxplusone commented 4 years ago
(In reply to Alina Sbirlea from comment #8)
> Some clarifications:
>
> - MemorySSA is only enabled for some loop passes (opt flag:
> -enable-mssa-loop-dependency) and in EarlyCSEwMSSA, which is a FunctionPass.
> It is not enabled for DSE or other function passes (new gvn is not used).
> The comment in https://bugs.llvm.org/show_bug.cgi?id=44408#c6 refers to
> ongoing work; nothing is turned on yet.

I believe David's point here was that enabling MemorySSA in loop passes added
four extra MemorySSA computations, because it is not preserved by intermediate
passes, so the same issue exists here.

Looking at the pipeline log, it's peculiar that one of the MemorySSA
computations happens before a LoopRotate-only loop pass manager, which (as far
as I know) does not need MemorySSA, only preserves it -- is it possible to not
compute it in that case?
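
For reference, a sketch of how such a pipeline log can be obtained from the
legacy pass manager, assuming a representative input foo.ll (placeholder; the
grep pattern matches pass names as the legacy PM prints them):

$ opt -O2 -enable-mssa-loop-dependency=true -debug-pass=Structure \
      foo.ll -o /dev/null 2>&1 | \
      grep -E 'Memory SSA|Loop Pass Manager|Rotate Loops'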

> - AFAICT https://bugs.llvm.org/show_bug.cgi?id=44408#c7 refers to a
> regression with asserts enabled due to verifications in MemorySSA.

As I mentioned on the other report, this regression is with assertions
disabled. Regressions with assertions enabled are an unrelated (and thankfully
easily fixed) issue.
Quuxplusone commented 4 years ago
Understood, here's a preliminary update.

Profiling the test in PR44889 shows a good amount of time is spent updating the
dominator tree. I had a pending FIXME which may help address this.
I made a quick prototype to check whether this is indeed a relevant cause, and
for that IR the regression (with release-noasserts) went from 9.8% to 6.6% on
my machine.
So there is indeed some room to improve there, but the infrastructure change
may take a bit to get precisely right.
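
For anyone wanting to reproduce this kind of profile, a sketch using the
extracted.bc test case timed later in this thread (the perf record/report usage
is an assumption about the setup, not the exact commands used here):

$ perf record -g bin/opt -O3 -enable-mssa-loop-dependency=true \
      ~/extracted.bc -o /dev/null
# Dominator tree update routines should rank near the top if they dominate:
$ perf report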

@Nikita:
Yes, we can have LoopRotate not use MemorySSA when it is alone in the loop pass
manager. We have such a mechanism in the new pass manager, but not in the
legacy one.

I sent out https://reviews.llvm.org/D74574 which makes LoopRotate preserve
MemorySSA when available, but the side effect is also that it splits the loop
pass pipeline when LICM and unswitch are run right after.
I'm not seeing a regression on the IR in PR44889 with this, but I'd be happy to
get additional timings here.

In order to resolve the loop pass pipeline being split, I'd need to pass the
info that the analysis is required in the pass manager, at
lib/Transforms/IPO/PassManagerBuilder.cpp:401.

I haven't profiled the php regression yet.
Quuxplusone commented 4 years ago

https://reviews.llvm.org/D74574 is now in tree.

I also sent out https://reviews.llvm.org/D74640, which only removes the creation of MemorySSA for LoopRotate when the pass is run alone.

Would you mind testing if these preliminary fixes help?

Thanks!

Quuxplusone commented 4 years ago
I was curious about the ImageMagick regression mentioned in the Phoronix
article. I built it like this:

$ wget https://imagemagick.org/download/ImageMagick-7.0.9-24.tar.gz
$ tar xf ImageMagick-7.0.9-24.tar.gz
$ cd ImageMagick-7.0.9-24/
$ mkdir build && cd build
$ CC=/work/llvm.monorepo.rel/build.noasserts/bin/clang \
  CXX=/work/llvm.monorepo.rel/build.noasserts/bin/clang++ ../configure
$ time make -j50

I didn't see any 50% slow-down, but more like 10%. It's not clear exactly what
the Phoronix article is measuring though.

Anyway, bisection points to the same MemorySSA enabling commit as the PHP
bisection.

I tried building also with the commit and patch in #c11, but it doesn't seem to
make any significant difference.
Quuxplusone commented 4 years ago

Attached mogrify.i.gz (108163 bytes, application/gzip): imagemagick mogrify.c preprocessed

Quuxplusone commented 4 years ago

So... revert "[MemorySSA] Re-enable MemorySSA use."?

Quuxplusone commented 4 years ago

I didn't see any 50% slow-down, but more like 10%.

10% between two commits? Or Clang 9 vs Clang 10? Any data for Clang 9?

Quuxplusone commented 4 years ago
(In reply to David Bolvansky from comment #14)
> So... revert "[MemorySSA] Re-enable MemorySSA use."?

Presumably that leads to better optimization, so I'm not sure we just want to
turn it off.

(In reply to David Bolvansky from comment #15)
> > I didn't see any 50% slow-down, but more like 10%.
>
> 10% between two commits? Or Clang 9 vs Clang 10? Any data for Clang 9?

Between Clang 9 (roughly) and Clang 10.
Quuxplusone commented 4 years ago

Presumably that leads to better optimization

So we need data to confirm this in general, so that Clang 10's worse compile
times can be justified by much better performance of the compiled binaries.

It is probably worth analyzing compile-time changes for the -Os/-Oz/-Og levels
as well.
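
A minimal sketch of such a measurement on the attached mogrify.i, reusing the
perf stat approach from the timings later in this thread (flags trimmed down;
an assumption, not a measurement that was actually run):

$ for lvl in -Os -Oz -Og; do
    echo "== $lvl =="
    perf stat -r5 bin/clang $lvl -w -c mogrify.i -o /dev/null
  done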

Quuxplusone commented 4 years ago
Here are some of the timings I see at LLVM ToT (without patch in #c11, but with
the already committed ones):

$ time bin/clang -mllvm -enable-mssa-loop-dependency=true -fopenmp -g -O2 -w \
    -mtune=broadwell -fexceptions -pthread -c ~/mogrify.i -fPIC -o /dev/null
real    0m5.919s
$ time bin/clang -mllvm -enable-mssa-loop-dependency=false -fopenmp -g -O2 -w \
    -mtune=broadwell -fexceptions -pthread -c ~/mogrify.i -fPIC -o /dev/null
real    0m5.822s

For the test in PR44889:
$ time bin/opt -O3 -enable-mssa-loop-dependency=true ~/extracted.bc
real    0m5.774s
$ time bin/opt -O3 -enable-mssa-loop-dependency=false ~/extracted.bc
real    0m5.701s

PHP build with:
$ CC=/myrelease/bin/clang CXX=/myrelease/bin/clang++ ../php-7.4.2/configure && \
  time make -j50

clang9 release:
real    29.243s
user    7m31.280s

ToT before the fix in D74574:
real    31.604s
user    7m51.633s

ToT now:
real    31.116s
user    7m49.442s

ToT and revert MSSA flag flip:
real    30.778s
user    7m44.797s
Quuxplusone commented 4 years ago

Thanks for the data! It seems MSSA does not make PHP compile times much
worse...

Sigh. We need to find out what causes this big regression:

29.243s -> 30.778s

@hans, can you bisect between the Clang 9 branch point and this MSSA commit?

Quuxplusone commented 4 years ago
(In reply to Alina Sbirlea from comment #18)
> Here are some of the timings I see at LLVM ToT (without patch in #c11, but
> with the already committed ones):
>
> time bin/clang -mllvm -enable-mssa-loop-dependency=true -fopenmp -g -O2 -w
> -mtune=broadwell -fexceptions -pthread -c ~/mogrify.i -fPIC -o /dev/null
> real  0m5.919s
> time bin/clang -mllvm -enable-mssa-loop-dependency=false -fopenmp -g -O2 -w
> -mtune=broadwell -fexceptions -pthread -c ~/mogrify.i -fPIC -o /dev/null
> real  0m5.822s

Setting the -enable-mssa-loop-dependency=false flag does not make any
difference for me. Building at the tip of the release/10.x branch (da0fe2ade36):

$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null

10.860 +- 0.147 seconds

$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null -enable-mssa-loop-dependency=false

10.8424 +- 0.0784

Reverting the patch, however, makes a big difference:

$ git revert 6da79ce1fed52f828b5e5d2e3d274ef1692f17af
$ ninja -j1000 clang
$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null -enable-mssa-loop-dependency=false

7.836 +- 0.137 seconds

For some reason, things are looking much better at master (0e5ed1b2626):

$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null -enable-mssa-loop-dependency=true

7.818 +- 0.146 seconds

Maybe that's due to D74574 (1326a5a4cfe)...

At 1326a5a4cfe^1

$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null

10.7749 +- 0.0153 seconds

At 1326a5a4cfe

$ perf stat -r5 /work/llvm.monorepo.rel/build.noasserts/bin/clang -fopenmp -g \
    -O2 -w -mtune=broadwell -fexceptions -pthread -c /tmp/mogrify.i -fPIC \
    -o /dev/null

7.8181 +- 0.0668 seconds

Okay, that's it.

But then I tried applying https://reviews.llvm.org/D74640 on top

10.670 +- 0.168 seconds

So I guess we should not land that?
Quuxplusone commented 4 years ago
Hans,

Yes, these times sound right to me. Yes, D74574 should have made the difference.

The "-enable-mssa-loop-dependency" flag is an LLVM flag, so adding "-mllvm" in
front should work.

I'm sorry, I forgot to clarify this in the previous comment: yes, there is
currently an added regression with D74640, and I'm not going to land it until I
resolve the underlying cause. I tracked that down to building a dominator tree
when doing updates, and put together a quick fix to test the theory; the
regression with D74640 went away.
I want to do the change properly though, so until then D74640 remains on hold.

Please let me know if I can help track down the other regression causes, or in
any other way.
Quuxplusone commented 4 years ago
(In reply to Alina Sbirlea from comment #21)
> Hans,
>
> Yes, these times sound right to me. Yes, D74574 should have made the
> difference.

Thanks! I have cherry-picked this to 10.x as
8b0df8e1ed6842095388fce08a0a5f761cd905ed

>
> The "-enable-mssa-loop-dependency" flag is an LLVM flag, so adding "-mllvm"
> in front should work.

D'oh, I didn't realize I dropped the -mllvm flag.

>
> I'm sorry I forgot to clarify in the previous comment, yes, there is an
> added regression with D74640 currently, and I'm not going to land it until I
> resolve the underlying cause. I tracked that down to building a DT when
> doing updates and put together a quick fix to test the theory; the
> regression with D74640 went away.
> I want to do the change properly though, so until then, D74640 remains on
> hold.

But do we need D74640 for the release? It seems like D74574 fixed the compile-
time regression. What's D74640 for?
Quuxplusone commented 4 years ago
We don't need D74640 for the release.
The purpose of D74640 is to re-merge the loop pass manager pipeline in the
legacy pass manager, because (I think) traditionally, we tried to run loop
passes in the same pipeline, hence running all passes in the pipeline on a loop
before moving to the next loop in the loop nest.
AFAICT this doesn't change the optimizations in this case, so it's not an issue
to have it split.
Quuxplusone commented 4 years ago

@Hans: Could you please also cherry-pick https://github.com/llvm/llvm-project/commit/f0b57d8071853ec2ab459c0492854c67ea4fa93c? It's only relevant for assertion-enabled builds, so mainly to make CI builds less slow.

Quuxplusone commented 4 years ago
(In reply to Nikita Popov from comment #24)
> @Hans: Could you please also cherry-pick
> https://github.com/llvm/llvm-project/commit/
> f0b57d8071853ec2ab459c0492854c67ea4fa93c? It's only relevant for
> assertion-enabled builds, so mainly to make CI builds less slow.

Sure. Pushed to 10.x as 593a0dda7a683df9b3744c6391bb2f8de9ed5908.
Quuxplusone commented 4 years ago

Some new numbers on current release/10.x branch: https://perf.rust-lang.org/compare.html?start=9381e8178b49636d4604e4ec0f1263960691c958&end=c6d04150234d5cb973ec5213d27a38e2a7b67955

Comparing with a previous experiment that disables MSSA, it looks like the MSSA related regressions are now reduced to 2-3%, which is "good enough" I think.

LLVM 10 still regresses optimized build times by 5-15%, but I've had little luck in figuring out why. At least for some benchmarks, ThinLTO import seems to have become significantly slower (jumping from 210 to 260 seconds for the top regression).

Quuxplusone commented 4 years ago
(In reply to Nikita Popov from comment #26)
> Some new numbers on current release/10.x branch:
> https://perf.rust-lang.org/compare.html?start=9381e8178b49636d4604e4ec0f1263960691c958&end=c6d04150234d5cb973ec5213d27a38e2a7b67955
>
> Comparing with a previous experiment that disables MSSA, it looks like the
> MSSA related regressions are now reduced to 2-3%, which is "good enough" I
> think.
>
> LLVM 10 still regresses optimized build times by 5-15%, but I've had little
> luck in figuring out why. At least for some benchmarks, ThinLTO import seems
> to have become significantly slower (jumping from 210 to 260 seconds for the
> top regression).

Is this on a release branch or at HEAD?

There was a slowdown incurred by D70404 "Always import constants", which I then
disabled by default in D73724. But I made a bunch of efficiency improvements to
ThinLTO importing in D73851 which should have actually sped things up even
without constant importing. I did re-enable constant importing in D74512 after
those fixes.

Could you try disabling constant importing to see if that fixes things? The
internal LLVM option is -import-constants-with-refs=false. I'm not sure how
that would get passed to Rust though. For gold or llvm LTO links it would be
-Wl,-plugin-opt,-import-constants-with-refs=false.

If this is on a release branch it would be interesting to see which of the
patches I referenced are on it.
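
For completeness, a sketch of both linker spellings (foo.o, bar.o, and prog are
placeholders; the -Wl,-mllvm spelling for lld is an assumption, only the gold
spelling above comes from this comment):

# gold plugin, as mentioned above:
$ clang -flto=thin -fuse-ld=gold foo.o bar.o -o prog \
    -Wl,-plugin-opt,-import-constants-with-refs=false
# lld (assumption: lld forwards internal LLVM options via -mllvm):
$ clang -flto=thin -fuse-ld=lld foo.o bar.o -o prog \
    -Wl,-mllvm,-import-constants-with-refs=false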
Quuxplusone commented 4 years ago
> There was a slowdown incurred by D70404 "Always import constants", which I
> then disabled by default in D73724. But I made a bunch of efficiency
> improvements to ThinLTO importing in D73851 which should have actually sped
> things up even without constant importing. I did re-enable constant
> importing in D74512 after those fixes.
[..]
>
> If this is on a release branch it would be interesting to see which of the
> patches I referenced are on it.

D70404 / 10cadee "Always import constants" landed soon after the 10.x branch
point, so it is not on the release branch, and neither are any of the
follow-ups.
Quuxplusone commented 4 years ago
(In reply to Hans Wennborg from comment #28)
> > There was a slowdown incurred by D70404 "Always import constants", which I
> > then disabled by default in D73724. But I made a bunch of efficiency
> > improvements to ThinLTO importing in D73851 which should have actually sped
> > things up even without constant importing. I did re-enable constant
> > importing in D74512 after those fixes.
> [..]
> >
> > If this is on a release branch it would be interesting to see which of the
> > patches I referenced are on it.
>
>
> D70404 / 10cadee "Always import constants" landed soon after the 10.x branch
> point, so it is not on the release branch, and neither are any of the
> follow-ups.

It might be worth trying to add D73851 to the branch regardless, as it should
speed up the general variable importing case as well.

Nikita - can you try HEAD to see if it is faster on the importing than with
10.x?
Quuxplusone commented 4 years ago

@Teresa: Thanks for the pointer! I gave cherry-picking D73851 a try, but the numbers came out pretty much the same (https://perf.rust-lang.org/compare.html?start=75cf41afb468152611212271bae026948cd3ba46&end=5ba7b02a22f34df566a821a0141548e567964d23).

Quuxplusone commented 4 years ago
(In reply to Nikita Popov from comment #30)
> @Teresa: Thanks for the pointer! I gave cherry-picking D73851 a try, but the
> numbers came out pretty much the same
> (https://perf.rust-lang.org/compare.html?start=75cf41afb468152611212271bae026948cd3ba46&end=5ba7b02a22f34df566a821a0141548e567964d23).

Hmm, ok, not really sure what could have made the thin link get significantly
slower. Would you be able to give me a reproducer?
Quuxplusone commented 4 years ago

I wanted to ping this bug in case there has been any more progress. We have
~1 month left to get fixes into 10.0.1.