dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.34k stars 4.74k forks source link

PhiDefinitions are not zero cost #88841

Open AndyAyersMS opened 1 year ago

AndyAyersMS commented 1 year ago

Was playing around with removing PhiDefinitions just after the optimization passes, but this lead to an unexpectedly large number of diffs. The culprit seems to be attribution of size cost to a PHI def store:

N004 (  0,  3) [000824] DA---------                         *  STORE_LCL_VAR int    V02 loc1         d:2 $VN.Void
N003 (  0,  0) [000823] -----------                         \--*  PHI       int    $201
N001 (  0,  0) [000830] ----------- pred BB06                  +--*  PHI_ARG   int    V02 loc1         u:5
N002 (  0,  0) [000825] ----------- pred BB01                  \--*  PHI_ARG   int    V02 loc1         u:1 $c1

Leading to more frequent code duplication in fgOptimizeBranch.

EG in benchmarks.run_pgo

[19:16:25] --------------------------------------------------------------------------------
[19:16:25] 749 contexts with diffs (10 improvements, 726 regressions, 13 same size)
[19:16:25]   -112/+10,097 bytes

On the whole this additional duplication might not be bad, as the heuristic thinks it is avoiding a branch to branch, but the duplication seems to put a lot of pressure on LSRA, leading to many more split edge resolution moves (with corresponding horrible block placements). So perhaps this optimization increases the density of critical edges or something?

At any rate attributing costs to PHIs like this is nonsensical; fixing that might actually have a broader impact as it probably feeds into decisions made during optimization as well.

My main motivation for eagerly removing the PhiDefinitions was better TP and less gunk possibly blocking optimizations, oddly my local runs show no TP benefit at all, though perhaps I mismeasured.

cc @dotnet/jit-contrib

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details
Was playing around with removing PhiDefinitions just after the optimization passes, but this lead to an unexpectedly large number of diffs. The culprit seems to be attribution of size cost to a PHI def store: ``` N004 ( 0, 3) [000824] DA--------- * STORE_LCL_VAR int V02 loc1 d:2 $VN.Void N003 ( 0, 0) [000823] ----------- \--* PHI int $201 N001 ( 0, 0) [000830] ----------- pred BB06 +--* PHI_ARG int V02 loc1 u:5 N002 ( 0, 0) [000825] ----------- pred BB01 \--* PHI_ARG int V02 loc1 u:1 $c1 ``` Leading to more frequent code duplication in `fgOptimizeBranch`. EG in benchmarks.run_pgo ``` [19:16:25] -------------------------------------------------------------------------------- [19:16:25] 749 contexts with diffs (10 improvements, 726 regressions, 13 same size) [19:16:25] -112/+10,097 bytes ``` On the whole this additional duplication might not be bad, as the heuristic thinks it is avoiding a branch to branch, but the duplication seems to put a lot of pressure on LSRA, leading to many more split edge resolution moves (with corresponding horrible block placements). So perhaps this optimization increases the density of critical edges or something? At any rate attributing costs to PHIs like this is nonsensical; fixing that might actually have a broader impact as it probably feeds into decisions made during optimization as well. My main motivation for eagerly removing the PhiDefinitions was better TP and less gunk possibly blocking optimizations, oddly my local runs show no TP benefit at all, though perhaps I mismeasured. cc @dotnet/jit-contrib
Author: AndyAyersMS
Assignees: -
Labels: `area-CodeGen-coreclr`
Milestone: Future
markples commented 1 year ago

Do you think this was intended as some sort of indirect cost of a phi assignment? For example, it "full SSA", un-ssa requires assignments which of course has costs. Are phis indicative of register shuffling that might need to happen? To be clear, I don't really think so - the normal case would keep the variable in the same register - but certainly phis indicate some sort of complexity of the code.

AndyAyersMS commented 1 year ago

We currently don't allow overlapping SSA lifetimes so there is no codegen implication for PHIs.

And if we ever did want to allow overlapping lifetimes, we'd need to fix our SSA maintenance story -- by the time we drop SSA it's no longer accurate.

I think a (possible) simpler next step would be to rename disjoint SSA webs right after we build SSA. That would give us some freedom to rearrange things in ways we can't do now, but I suspect other changes might also be needed to really see any benefit from this.

markples commented 1 year ago

Sorry, I was trying to contrast against overlapping SSA lifetimes by saying things like "indirect cost" and "full SSA".

A concrete example would be a phi of an integer remainder (rdx) and a call return (rax). The phi is the clue that a register shuffle will be needed. Whether this is important enough to matter (it's not like we assign a cost to an integer remainder (rdx) feeding the first argument of a call (rcx)) is a different question. And the cost of the phi is in the predecessor anyway...

SingleAccretion commented 1 year ago

In the current design, all PHI stores are costed to be zero (look for where they are created in ssabuilder.cpp). The fact we end up with this non-zero cost is accidental.

AndyAyersMS commented 1 year ago

Sorry, I was trying to contrast against overlapping SSA lifetimes by saying things like "indirect cost" and "full SSA".

A concrete example would be a phi of an integer remainder (rdx) and a call return (rax). The phi is the clue that a register shuffle will be needed. Whether this is important enough to matter (it's not like we assign a cost to an integer remainder (rdx) feeding the first argument of a call (rcx)) is a different question. And the cost of the phi is in the predecessor anyway...

I guess it's fair to say that LSRA rediscovers this stuff later on in its own way... note it does a poor job in general handling cyclic constraints (not surprising given its "linear" world view....)