Chapter 11 - GCM and Scheduling

Main source of information: Global code motion/global value numbering, Cliff Click. https://courses.cs.washington.edu/courses/cse501/06wi/reading/click-pldi95.pdf

Additional sources of information:

Dominator Tree:

Figure 9.24, chapter 9, p 532, of Engineering a Compiler.

Also described in the paper 'A Simple, Fast Dominance Algorithm' by Keith D. Cooper, Timothy J. Harvey and Ken Kennedy.

Loop discovery is described in:

Principles, Techniques and Tools, p604, 1986 ed

Discord comments:

15/May/24 Cliff: the anti-deps will be needed and are not in the paper. so paper will work until we do fancy control flow with loads & stores, then you'll get rare loads-bypassing-stores illegally. we can cross that bridge when we get there.

The dom depth in the GVNGCM paper:

11/May/24 what is meant by dom_depth in the paper? Suppose I have dominator tree - is the depth the same for all children of a node?

Yasser: i think Hayley and I ran into it being the opposite direction to what we figured but it's depth in the dom tree yea

Hayley: Should do, it's how many dominators a node has. Indeed the comparisons need to be the other way around versus Cliff's thesis.

Impacts Find_FCA() - reverse the dom depth comparison.

Issue: Xmilia reported Scheduling moves Stores incorrectly

struct S { int f; }
S s = new S;
if (arg == 0) {
    s.f = 1;
} else if (arg+1 == 0) {
    s.f = 1;
}
return s;

since both s.f = 1 are GVNed and then the least common block is outside of the if blocks

Cliff: 17/May/24

as usual, lots of ways to slice these games -

split the CSE'd uses so no extra computation.  C2 eventually did this, but did not start this way
add Control to stores; C2 does this, so didn't CSE stores.

In general, think of these algos's and the SoN edges as tools to work with, not Golden Rules to follow. If some setup of your edges or graph walking isn't suiting you, change it. C2 does, I've left AA undecided for now. Simple needs to pick soon.

25/May/24 Cliff: pass#1, DFS post-order on inputs (a "upwards" walk, starting from Stop), sched early, set slot 0 of every data op to its highest input using dom-depth (computed in same pass). Unreached If T/F projections indicate a never-reached infinite loop - should be handled before. You can add anti-deps here as well, adding dependence-inputs to a Store before scheduling it high.

pass#2, sched late, DFS post-order on outputs (a "downwards" walk, starting from Start). In the idom-path from down-most and up (still in slot 0), pick lowest frequency, estimated here by loop depth.

Here I don't need a side-array for the global part. I do find it useful to have an array for the local part - the basic block - which you'll probably want to hand off to e.g. register allocation. If you're just going to an infinite register machine (say, emitting C or Java not asm) you might could skip that as well.

28/May/24

RIght - 3 possible fixes -

Add control.  Adds an edge, Stores won't CSE.
Make Stores just NOT CSE
After CSE, re-split stores if they not fully used on all paths.

As soon as the Store does not CSE, it sinks in GCM to its lowest use, which is always where it started from

Handling anti-deps

From discord chat: Darrick 28/May/24

Ok, there are a few pieces:

In the Graph class:

    public Graph build_ante_deps() {
        for( Node n : _nodes() )  n.build_ante_deps();
        return this;
    }

in the Node class:

    public void build_ante_deps() {
        for( Node n : ante_users() )  createName(_g.ad(), n);
    }

    // By default a node has no ante-users.
    public NodeUtil.Iter ante_users() {
        return NodeUtil.Iter.get(null,0);
    }

in nodes that are part of a chain, mem gets the previous node in the chain:

    public NodeUtil.Iter ante_users() {
        return mem().fast_users().filtered(n -> !(n instanceof CramStoreNode));
    }

That's it!

My node structure has the idea that I can have named deps. I add a new named dep using createName, and the "name" is an AD object that I get by calling _g.ad():

public class AD {
    public int _i;
    public AD(int i) { _i = i; }
    @Override public final int hashCode() { return (int)Util.mix_hash(_i); }
    @Override public final boolean equals(Object x) { return x==this; }
    public String toString() { return "AD[" + _i + "]"; }
}

More from Cliff: 01/June/24

I looked at C2's anti-deps. I understand what its doing (I did write the code, but 25yrs ago). Its fast & minimal, but complex. We can lose at least 50% of whats going on, since Simple is simpler, but there's still plenty in there. Here's a "short" description of the algorithm. Run schedule early. Run schedule late- add anti-deps DURING running schedule late.

Look backwards from the Load (follow the mem edge) to the "memory definer" - a start-mem-projection, or a phi-mem, or a store. Look forwards from the mem-def to all its mem-def users. There's always at least 1; there may be many. These completely (all of it) and exactly (no doubling up) cover the forward memory space, sorta like a network flow problem. (be sure to limit yourself to same-alias, because e.g. start-mem-proj produces all aliases). Also, we can limit loads against final fields (we have none at the moment). So now you got a candidate set of mem-defs; some of these need the anti-dep on the Load, but not all.

Since we're in the middle of "schedule late", we've already computed all the late schedules of a Load's users, and we have the Loads "Least Common Ancestor" of uses, the LCA or late position. We inspect the set of mem-defs that might impact the Load, and either add a anti-dep from Store to Load, or raise the Loads effective LCA. Going back one mem step from the Load, we have the Loads' mem-def. Going forwards 1 set fro the mem-def, we have a set of candidate mem-defs that might impact the Load. For Phi mem-defs, we look at the Phi inputs and place an effective use on that block; this will be used to raise the Load's LCA. For stores, we do the same - until/unless we find stores with the SAME block as the Load's LCA. Then we add the anti-dep edge to force ordering within the same block.

Anyways, I started at C2's anti-deps. As said its fast & accurate & a little subtle. It relies on the early-schedule, and the late-schedule of the Load's uses (before scheduling the Load). It needs a Basic BLock structure; this can be represented in any number of ways. It needs a side-field in the BB (again any number of ways, array-of-structs, struct-of-arrays, etc whatever)

Original

struct Bar { int b; }
Bar bar = new Bar;
bar.b = arg;
int x = 3;
while( hot_loop ) {
  if( common ) {
    b.b = 1;
  } else { // Very rare
    int tmp = b.b;
    x = x + tmp;
    b.b = 2;
  } 
}
return x;

Here I expect @Darrick Wiebe and @Hayley Patton 's algo to add anti deps between BOTH Stores to the Load. This will result in an effective schedule:

struct Bar { int b; }
Bar bar = new Bar;
bar.b = arg;
int x = 3;
while( hot_loop ) {
  int tmp = b.b;  // Load hoisted to hot common path
  if( common ) {
    b.b = 1;
  } else { // Very rare
    x = x + tmp;
    b.b = 2;
  } 
}
return x;

Correct, but definitely sub-optimal.

From Discord 20th June

Cliff:

Here's rough draft of algo:

Place Store first
Walk the idom path for where the Load could go (gonna get walked anyways looking for lowest frequency placement), and tag the path somehow.
Walk the Store's idom path from placement to shared Load->mem input.
- If you touch a tagged block, you need an anti-dep.

SeaOfNodes / Simple

Chapter 11 - GCM and Scheduling #80