New compile time regressions (12 vs 13) on znver3 (LoopMicroOpBufferSize too large?)

Quuxplusone commented 3 years ago


Bugzilla Link	PR51460
Status	REOPENED
Importance	P enhancement
Reported by	David Bolvansky (david.bolvansky@gmail.com)
Reported on	2021-08-12 12:26:43 -0700
Last modified on	2021-09-07 09:33:11 -0700
Version	trunk
Hardware	Other All
CC	alina.sbirlea@gmail.com, lebedev.ri@gmail.com, listmail@philipreames.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, matdzb@gmail.com, max.kazantsev@azul.com, nikita.ppv@gmail.com, spatel+llvm@rotateright.com, tstellar@redhat.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also	PR50584, PR51108

Phoronix reported some significant CT regressions for mplayer and ffmpeg

https://www.phoronix.com/scan.php?page=article&item=clang13-initial-epyc&num=4

Quuxplusone commented 3 years ago

This is a duplicate of https://bugs.llvm.org/show_bug.cgi?id=50584
This is expected, and will not be addressed for a release.

_This bug has been marked as a duplicate of bug 50584_

Quuxplusone commented 3 years ago

Shouldn't LoopMicroOpBufferSize for znver3 be reduced? While it may be correct from a micro-architectural perspective, the way it is used by LLVM is clearly inappropriate: LoopMicroOpBufferSize should be an upper bound on how much we partially unroll, but what we seem to be doing right now is to just unroll maximally up to that size. That's okay if the buffer size is reasonably small, but not if it's very large. That results in massive code bloat for no benefit. While there are scalability problems in other passes, that seems like the part that is causing the immediate problem and should be addressed. As this value appears to only be used for this unrolling heuristics, I'd suggest dropping it to a more sensible value.

Quuxplusone commented 3 years ago

I agree that cost/benefit model for full loop unrolling is broken.
It seems we want both to aggressively fully unroll if we can,
and not do that at the same time.

At the very least, i really don't think we should be doing full unrolling
before inlining/vectorization, but IMHO we should not be fully unrolling at all,
at least not unless we can tell that there will be other benefits
than lack of said loop.

+Reames - thoughts on changing profitability heuristics for full unrolling?

Quuxplusone commented 3 years ago

I don't think full unrolling is relevant here: LoopMicroOpBufferSize controls the PartialThreshold, which is used for partial and runtime unrolling.

Full unrolling is controlled by different thresholds and a more sophisticated cost model that takes into account how much simplification is expected from breaking the loop backedge. Fully unrolling a relatively large loop can still be beneficial if it results in significant simplification.

Partial and runtime unrolling have more tenuous profitability heuristics, and creating 4096 instruction loops because that's the micro-architectural loop buffer size is almost certainly not profitable. Runtime unrolling will at least limit to 8 unrolls by default, but partial unrolling will be happily unroll to the full threshold.

Quuxplusone commented 3 years ago

Shouldn't LoopMicroOpBufferSize for znver3 be reduced?

Yeah, there should be a more reasonable value because as wee see for reported projects, regressions are huge (and hardly with 30% better runtime perf to “justify” them).

Reopened, renamed a bit.

Quuxplusone commented 3 years ago

Thank you everyone for your thoughts.
It is always fascinating to read what people think.

As a word of caution, i would strongly advise from using some random
3'rd party numbers to pass hard judgments.

Unfortunately, my position still hasn't changed, and this is still WONFIX.
Although we can discuss temporarily reducing it *iff* someone
personally promises to resolve https://bugs.llvm.org/show_bug.cgi?id=50584.

Even halving the LoopMicroOpBufferSize (512->256) results in a number of
statistically-significant performance regressions (+ ~6%).

Quuxplusone commented 3 years ago

To be clear, my primary concern here is (for once!) not compile-time, but the fact that this simply generates terrible code. Just look at this: https://c.godbolt.org/z/617TjjKhs The compiler shouldn't be unrolling random loops to hundreds of iterations.

While I'd love to see issues like https://bugs.llvm.org/show_bug.cgi?id=50584 resolved as a matter of general principle, they are only tangentially related to the problem at hand, which is way too aggressive unrolling. The overzealous unrolling must be fixed independently of any compile-time problems it may also be triggering.

Quuxplusone commented 3 years ago

Sorry for late response here, was on vacation.

I started writing a long comment on unrolling heuristics, realized it was only
vaguely relevant here, and decided to save it to my public notes for ease of
later use.  See https://github.com/preames/public-notes/blob/master/llvm-loop-
opt-ideas.rst#unroll-heuristics if interested.

Specifically for this bug, I think it's unfortunate that our unrolling
heuristic is too tightly tied to an architectural feature, but I *don't* think
we should avoid correcting the feature value simply because it causes some
regressions on that architecture.  I would suggest we look at some workarounds
to preserve the previous behavior while letting the flag be correct.  A few
ideas:
* Cap the scheduler with a fixed upper bound to avoid complexity explosion with
a link specifically to the bug to fix.
* Introduce a "legacy feature for unrolling heuristic" TTI hook which used the
old value on that platform, and defaults to the feature value on all other
platforms.

I think it's unreasonable to ask Roman to rewrite the core partial unrolling
heuristic just because he happened to trip across a case where it's broken.  A
more constructive approach is workaround it for the moment, and let someone
interested with time and resources to tackle unrolling separately.

Quuxplusone commented 3 years ago

(In reply to listmail from comment #8)

Sorry for late response here, was on vacation.

I started writing a long comment on unrolling heuristics, realized it was only vaguely relevant here, and decided to save it to my public notes for ease of later use. See https://github.com/preames/public-notes/blob/master/llvm-loop-opt-ideas. rst#unroll-heuristics if interested.

Specifically for this bug, I think it's unfortunate that our unrolling heuristic is too tightly tied to an architectural feature, but I don't think we should avoid correcting the feature value simply because it causes some regressions on that architecture. I would suggest we look at some workarounds to preserve the previous behavior while letting the flag be correct. A few ideas:

Cap the scheduler with a fixed upper bound to avoid complexity explosion with a link specifically to the bug to fix.

Introduce a "legacy feature for unrolling heuristic" TTI hook which used the old value on that platform, and defaults to the feature value on all other platforms.
To be honest, i'm not quite convinced there is anything to fix wrt the threshold itself - unrolled loops are great, and as long as loop's body is less than 4K x86 uops (well, let's pretend that it maps 1:1 to 4K LLVM IR ops), then it fit within uop cache and we've also improved cpu dispatch performance.

So yes, if we unroll 100-instruction loop 40 times, and bloat it to 4000 ops, that's fine, ignoring the latent quadratic edge-cases it exposes in the rest of the compiler. Of course, if you care about binary size, perhaps you should use -Os/-Oz :]

But then yes, we probably should reevaluate the cases where the code is strictly sequential, like the contrived example from nikic, those are perhaps not worth unrolling, at least as much.

I think it's unreasonable to ask Roman to rewrite the core partial unrolling heuristic just because he happened to trip across a case where it's broken.

A more constructive approach is workaround it for the moment, and let someone interested with time and resources to tackle unrolling separately.

The thing is, the current threshold (512) is already a compromise/workaround. :)

Quuxplusone commented 3 years ago

Can we remove release-13.0.0 from the blocks field?

Quuxplusone commented 3 years ago

Thank you all for participating in this disscussion!

Quuxplusone / LLVMBugzillaTest

New compile time regressions (12 vs 13) on znver3 (LoopMicroOpBufferSize too large?) #50427