Closed dibyendumajumdar closed 4 months ago
Handling anti-deps
From discord chat: Darrick 28/May/24
Ok, there are a few pieces:
In the Graph class:
public Graph build_ante_deps() {
for( Node n : _nodes() ) n.build_ante_deps();
return this;
}
in the Node class:
public void build_ante_deps() {
for( Node n : ante_users() ) createName(_g.ad(), n);
}
// By default a node has no ante-users.
public NodeUtil.Iter ante_users() {
return NodeUtil.Iter.get(null,0);
}
in nodes that are part of a chain, mem gets the previous node in the chain:
public NodeUtil.Iter ante_users() {
return mem().fast_users().filtered(n -> !(n instanceof CramStoreNode));
}
That's it!
My node structure has the idea that I can have named deps. I add a new named dep using createName, and the "name" is an AD object that I get by calling _g.ad():
public class AD {
public int _i;
public AD(int i) { _i = i; }
@Override public final int hashCode() { return (int)Util.mix_hash(_i); }
@Override public final boolean equals(Object x) { return x==this; }
public String toString() { return "AD[" + _i + "]"; }
}
More from Cliff: 01/June/24
I looked at C2's anti-deps. I understand what its doing (I did write the code, but 25yrs ago). Its fast & minimal, but complex. We can lose at least 50% of whats going on, since Simple is simpler, but there's still plenty in there. Here's a "short" description of the algorithm. Run schedule early. Run schedule late- add anti-deps DURING running schedule late.
Look backwards from the Load (follow the mem edge) to the "memory definer" - a start-mem-projection, or a phi-mem, or a store. Look forwards from the mem-def to all its mem-def users. There's always at least 1; there may be many. These completely (all of it) and exactly (no doubling up) cover the forward memory space, sorta like a network flow problem. (be sure to limit yourself to same-alias, because e.g. start-mem-proj produces all aliases). Also, we can limit loads against final fields (we have none at the moment). So now you got a candidate set of mem-defs; some of these need the anti-dep on the Load, but not all.
Since we're in the middle of "schedule late", we've already computed all the late schedules of a Load's users, and we have the Loads "Least Common Ancestor" of uses, the LCA or late position. We inspect the set of mem-defs that might impact the Load, and either add a anti-dep from Store to Load, or raise the Loads effective LCA. Going back one mem step from the Load, we have the Loads' mem-def. Going forwards 1 set fro the mem-def, we have a set of candidate mem-defs that might impact the Load. For Phi mem-defs, we look at the Phi inputs and place an effective use on that block; this will be used to raise the Load's LCA. For stores, we do the same - until/unless we find stores with the SAME block as the Load's LCA. Then we add the anti-dep edge to force ordering within the same block.
Anyways, I started at C2's anti-deps. As said its fast & accurate & a little subtle. It relies on the early-schedule, and the late-schedule of the Load's uses (before scheduling the Load). It needs a Basic BLock structure; this can be represented in any number of ways. It needs a side-field in the BB (again any number of ways, array-of-structs, struct-of-arrays, etc whatever)
Original
struct Bar { int b; }
Bar bar = new Bar;
bar.b = arg;
int x = 3;
while( hot_loop ) {
if( common ) {
b.b = 1;
} else { // Very rare
int tmp = b.b;
x = x + tmp;
b.b = 2;
}
}
return x;
Here I expect @Darrick Wiebe and @Hayley Patton 's algo to add anti deps between BOTH Stores to the Load. This will result in an effective schedule:
struct Bar { int b; }
Bar bar = new Bar;
bar.b = arg;
int x = 3;
while( hot_loop ) {
int tmp = b.b; // Load hoisted to hot common path
if( common ) {
b.b = 1;
} else { // Very rare
x = x + tmp;
b.b = 2;
}
}
return x;
Correct, but definitely sub-optimal.
From Discord 20th June
Cliff:
Here's rough draft of algo:
Main source of information: Global code motion/global value numbering, Cliff Click. https://courses.cs.washington.edu/courses/cse501/06wi/reading/click-pldi95.pdf
Additional sources of information:
Dominator Tree:
Figure 9.24, chapter 9, p 532, of Engineering a Compiler.
Also described in the paper 'A Simple, Fast Dominance Algorithm' by Keith D. Cooper, Timothy J. Harvey and Ken Kennedy.
Loop discovery is described in:
Principles, Techniques and Tools, p604, 1986 ed
Discord comments:
15/May/24 Cliff: the anti-deps will be needed and are not in the paper. so paper will work until we do fancy control flow with loads & stores, then you'll get rare loads-bypassing-stores illegally. we can cross that bridge when we get there.
The dom depth in the GVNGCM paper:
11/May/24 what is meant by dom_depth in the paper? Suppose I have dominator tree - is the depth the same for all children of a node?
Yasser: i think Hayley and I ran into it being the opposite direction to what we figured but it's depth in the dom tree yea
Hayley: Should do, it's how many dominators a node has. Indeed the comparisons need to be the other way around versus Cliff's thesis.
Impacts Find_FCA() - reverse the dom depth comparison.
Issue: Xmilia reported Scheduling moves Stores incorrectly
since both s.f = 1 are GVNed and then the least common block is outside of the if blocks
Cliff: 17/May/24
as usual, lots of ways to slice these games -
In general, think of these algos's and the SoN edges as tools to work with, not Golden Rules to follow. If some setup of your edges or graph walking isn't suiting you, change it. C2 does, I've left AA undecided for now. Simple needs to pick soon.
25/May/24 Cliff: pass#1, DFS post-order on inputs (a "upwards" walk, starting from Stop), sched early, set slot 0 of every data op to its highest input using dom-depth (computed in same pass). Unreached If T/F projections indicate a never-reached infinite loop - should be handled before. You can add anti-deps here as well, adding dependence-inputs to a Store before scheduling it high.
pass#2, sched late, DFS post-order on outputs (a "downwards" walk, starting from Start). In the idom-path from down-most and up (still in slot 0), pick lowest frequency, estimated here by loop depth.
Here I don't need a side-array for the global part. I do find it useful to have an array for the local part - the basic block - which you'll probably want to hand off to e.g. register allocation. If you're just going to an infinite register machine (say, emitting C or Java not asm) you might could skip that as well.
28/May/24
RIght - 3 possible fixes -
As soon as the Store does not CSE, it sinks in GCM to its lowest use, which is always where it started from