[NewPM] Inliner: Certain destructor calls are no longer inlined, preventing SROA, causing +10% runtime perf regression

Quuxplusone commented 3 years ago


Bugzilla Link	PR50099
Status	NEW
Importance	P enhancement
Reported by	Roman Lebedev (lebedev.ri@gmail.com)
Reported on	2021-04-23 07:07:18 -0700
Last modified on	2021-04-24 04:24:12 -0700
Version	trunk
Hardware	PC Linux
CC	aeubanks@google.com, david.bolvansky@gmail.com, davidxl@google.com, eraman@google.com, jdoerfert@anl.gov, llvm-bugs@lists.llvm.org, prazek@google.com, spatel+llvm@rotateright.com, stefomeister@gmail.com
Fixed by commit(s)
Attachments	`orig.ll.xz` (36220 bytes, application/x-xz)
Blocks
Blocked by
See also	PR4931

Created attachment 24799
unoptimized IR for reproduction

I'm seeing a rather big +10% runtime perf regression
from new-pm switch on a certain codepath.

I don't really have a small workable testcase yet,
but there's a full self-contained reproducer attached.

Here's how we can see that the problem is there:

$ clang -O3 -o old.ll -S -emit-llvm orig.ll -flegacy-pass-manager
$ clang -O3 -o new.ll -S -emit-llvm orig.ll
$ llvm-diff old.ll new.ll
<...>
function @_ZN8rawspeed6BufferD2Ev exists only in right module
<...>
in function _ZN8rawspeed23PanasonicDecompressorV511ProxyStream10parseBlockEv:
  in block %entry:
    >   %FirstSection = alloca %"class.rawspeed::Buffer", align 8
    >   %SecondSection = alloca %"class.rawspeed::Buffer", align 8
    >   %0 = bitcast %"class.rawspeed::Buffer"* %FirstSection to i8*
    >   call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %0) #17
    >   tail call void @llvm.experimental.noalias.scope.decl(metadata !135)
        %pos.i = getelementptr inbounds %"class.rawspeed::PanasonicDecompressorV5::ProxyStream", %"class.rawspeed::PanasonicDecompressorV5::ProxyStream"* %this, i64 0, i32 0, i32 1
        %0 = load i32, i32* %pos.i, align 4, !tbaa !4, !noalias !9
    >   tail call void @llvm.experimental.noalias.scope.decl(metadata !138)

<...>

Inliner/inline cost model isn't something i'm familiar with,
so i'm posting this as is. Any suggestions how this can be approached from the
LLVM side?

Quuxplusone commented 3 years ago

Attached orig.ll.xz (36220 bytes, application/x-xz): unoptimized IR for reproduction

Quuxplusone commented 3 years ago

adding -mllvm -debug-only=inline shows that some of the calls to _ZN8rawspeed6BufferD2Ev are just barely not inlined.

Inlining (cost=55, threshold=250), Call:   call void @_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer"* nonnull dereferenceable(13) %ref.tmp2) #22
NOT Inlining (cost=55, threshold=45), Call:   call void @_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer"* nonnull dereferenceable(20) %29) #22

Weird that the threshold goes down at some point. -flegacy-pass-manager shows that the threshold stays at 250. I'm also not super familiar with inline cost modeling, InlineCost.cpp is where that mostly lives.

Also, the function after optimizations isn't trivial; if the invoke were just a normal call it'd probably get inlined always.

define linkonce_odr hidden void @_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer" nonnull dereferenceable(13) %this) unnamed_addr #4 comdat align 2 personality i8 bitcast (i32 (...) @__gxx_personality_v0 to i8) { entry: %isOwner = getelementptr inbounds %"class.rawspeed::Buffer", %"class.rawspeed::Buffer" %this, i64 0, i32 2 %0 = load i8, i8 %isOwner, align 4, !tbaa !12, !range !80 %tobool.not = icmp eq i8 %0, 0 br i1 %tobool.not, label %if.end, label %if.then

if.then: ; preds = %entry %data = getelementptr inbounds %"class.rawspeed::Buffer", %"class.rawspeed::Buffer" %this, i64 0, i32 0 %1 = load i8, i8* %data, align 8, !tbaa !4 invoke void @_ZN8rawspeed19alignedFreeConstPtrEPKv(i8 %1) to label %if.end unwind label %terminate.lpad

if.end: ; preds = %if.then, %entry ret void

terminate.lpad: ; preds = %if.then %2 = landingpad { i8, i32 } catch i8 null %3 = extractvalue { i8, i32 } %2, 0 tail call void @__clang_call_terminate(i8 %3) #24 unreachable }

Quuxplusone commented 3 years ago

It is not argument promotion, or any IPO pass, as far as I can tell. The
functions in questions have external linkage.

My guess is the prune-eh and simplify-cfg are to blame. Diffing the statistics
shows the following two are missing in the new pipeline:

5 simplifycfg                  - Number of invokes with empty resume blocks
simplified into calls
2 prune-eh                     - Number of invokes removed

Quuxplusone commented 3 years ago

PruneEH is not in the new PM, it's supposed to be replaced by a combination of FuncAttrs + SimplifyCFG.

But even in the legacy PM, before being inlined away, _ZN8rawspeed6BufferD2Ev still has the invoke. Perhaps some marking some functions as noexcept is the real fix?

Quuxplusone commented 3 years ago

Looked a bit more at InlineCost.cpp There is one obvious bug: CallAnalyzer::visitExtractValue() should start with: // Usually extractvalue is modelled as free. if (TTI.getUserCost(&I, TargetTransformInfo::TCK_SizeAndLatency) == TargetTransformInfo::TCC_Free) return true;

that almost gets us over the edge, now the cost is 50 instead of 55, with threshold being 45 (because the callsite is cold).

Now, what i wonder is, perhaps we should be discounting noreturn calls?

Quuxplusone commented 3 years ago

At the C++ source level, can you mark _ZN8rawspeed19alignedFreeConstPtrEPKv as noexcept? That would almost certainly fix this. (or compile with -fno-exceptions, but maybe that's beyond the scope of this)

For visitExtractValue(), it could end with return Base::visitExtractValue(); instead of return false;, which would forward to visitInstruction() which does what you want.

Quuxplusone commented 3 years ago

(In reply to Arthur Eubanks from comment #5)

At the C++ source level, can you mark _ZN8rawspeed19alignedFreeConstPtrEPKv as noexcept? That would almost certainly fix this. (or compile with -fno-exceptions, but maybe that's beyond the scope of this) Yep. There is a number of things that could be fixed at source level, however i think this might not be the best way forward for compiler regressions.

For visitExtractValue(), it could end with return Base::visitExtractValue(); instead of return false;, which would forward to visitInstruction() which does what you want.

Yep.

Any thoughts regarding having smaller CallPenalty for noreturn calls?

Quuxplusone commented 3 years ago

It seems like the threshold is the issue, not the cost. You said that the threshold was lower because it's considered a cold call. If this is causing a 10% regression, then maybe the call shouldn't be considered cold and we should be using the higher threshold.

Quuxplusone commented 3 years ago

(In reply to Arthur Eubanks from comment #7)

It seems like the threshold is the issue, not the cost. You said that the threshold was lower because it's considered a cold call. If this is causing a 10% regression, then maybe the call shouldn't be considered cold and we should be using the higher threshold.

It is cold as per the (static) block frequency, because it's at the returns out of the function (well, because those are destructor calls), which are less frequent than the loops in function body.

I guess this bisects to https://reviews.llvm.org/D28331

How about not using isColdCallSite() iff there are arguments that are marked as SROA'ble?

Quuxplusone commented 3 years ago

Posted https://reviews.llvm.org/D101228 and https://reviews.llvm.org/D101229

Quuxplusone / LLVMBugzillaTest

[NewPM] Inliner: Certain destructor calls are no longer inlined, preventing SROA, causing +10% runtime perf regression #49068