Open Quuxplusone opened 3 years ago
Attached orig.ll.xz
(36220 bytes, application/x-xz): unoptimized IR for reproduction
adding -mllvm -debug-only=inline
shows that some of the calls to _ZN8rawspeed6BufferD2Ev are just barely not inlined.
Inlining (cost=55, threshold=250), Call: call void @_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer"* nonnull dereferenceable(13) %ref.tmp2) #22
NOT Inlining (cost=55, threshold=45), Call: call void @_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer"* nonnull dereferenceable(20) %29) #22
Weird that the threshold goes down at some point. -flegacy-pass-manager shows that the threshold stays at 250. I'm also not super familiar with inline cost modeling, InlineCost.cpp is where that mostly lives.
Also, the function after optimizations isn't trivial; if the invoke were just a normal call it'd probably get inlined always.
define linkonce_odr hidden void @_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer" nonnull dereferenceable(13) %this) unnamed_addr #4 comdat align 2 personality i8 bitcast (i32 (...) @__gxx_personality_v0 to i8) { entry: %isOwner = getelementptr inbounds %"class.rawspeed::Buffer", %"class.rawspeed::Buffer" %this, i64 0, i32 2 %0 = load i8, i8 %isOwner, align 4, !tbaa !12, !range !80 %tobool.not = icmp eq i8 %0, 0 br i1 %tobool.not, label %if.end, label %if.then
if.then: ; preds = %entry %data = getelementptr inbounds %"class.rawspeed::Buffer", %"class.rawspeed::Buffer" %this, i64 0, i32 0 %1 = load i8, i8* %data, align 8, !tbaa !4 invoke void @_ZN8rawspeed19alignedFreeConstPtrEPKv(i8 %1) to label %if.end unwind label %terminate.lpad
if.end: ; preds = %if.then, %entry ret void
terminate.lpad: ; preds = %if.then %2 = landingpad { i8, i32 } catch i8 null %3 = extractvalue { i8, i32 } %2, 0 tail call void @__clang_call_terminate(i8 %3) #24 unreachable }
It is not argument promotion, or any IPO pass, as far as I can tell. The
functions in questions have external linkage.
My guess is the prune-eh and simplify-cfg are to blame. Diffing the statistics
shows the following two are missing in the new pipeline:
5 simplifycfg - Number of invokes with empty resume blocks
simplified into calls
2 prune-eh - Number of invokes removed
PruneEH is not in the new PM, it's supposed to be replaced by a combination of FuncAttrs + SimplifyCFG.
But even in the legacy PM, before being inlined away, _ZN8rawspeed6BufferD2Ev still has the invoke. Perhaps some marking some functions as noexcept is the real fix?
Looked a bit more at InlineCost.cpp
There is one obvious bug: CallAnalyzer::visitExtractValue()
should start with:
// Usually extractvalue
is modelled as free.
if (TTI.getUserCost(&I, TargetTransformInfo::TCK_SizeAndLatency) ==
TargetTransformInfo::TCC_Free)
return true;
that almost gets us over the edge, now the cost is 50 instead of 55, with threshold being 45 (because the callsite is cold).
Now, what i wonder is, perhaps we should be discounting noreturn calls?
At the C++ source level, can you mark _ZN8rawspeed19alignedFreeConstPtrEPKv as noexcept? That would almost certainly fix this. (or compile with -fno-exceptions, but maybe that's beyond the scope of this)
For visitExtractValue(), it could end with return Base::visitExtractValue();
instead of return false;
, which would forward to visitInstruction()
which does what you want.
(In reply to Arthur Eubanks from comment #5)
At the C++ source level, can you mark _ZN8rawspeed19alignedFreeConstPtrEPKv as noexcept? That would almost certainly fix this. (or compile with -fno-exceptions, but maybe that's beyond the scope of this) Yep. There is a number of things that could be fixed at source level, however i think this might not be the best way forward for compiler regressions.
For visitExtractValue(), it could end with
return Base::visitExtractValue();
instead ofreturn false;
, which would forward tovisitInstruction()
which does what you want.
Yep.
Any thoughts regarding having smaller CallPenalty for noreturn calls?
It seems like the threshold is the issue, not the cost. You said that the threshold was lower because it's considered a cold call. If this is causing a 10% regression, then maybe the call shouldn't be considered cold and we should be using the higher threshold.
(In reply to Arthur Eubanks from comment #7)
It seems like the threshold is the issue, not the cost. You said that the threshold was lower because it's considered a cold call. If this is causing a 10% regression, then maybe the call shouldn't be considered cold and we should be using the higher threshold.
It is cold as per the (static) block frequency, because it's at the returns out of the function (well, because those are destructor calls), which are less frequent than the loops in function body.
I guess this bisects to https://reviews.llvm.org/D28331
How about not using isColdCallSite()
iff there are arguments that are marked as SROA'ble?
orig.ll.xz
(36220 bytes, application/x-xz)