FractalFir / rustc_codegen_clr

This rust compiler backend(module) emmits valid CIL (.NET IR), enabling you to use Rust in .NET projects.
MIT License
1.56k stars 35 forks source link

Added BLt, BLt.Un, BGe, BGe.Un, BLe, BGe instructions and introduced some optimizations #47

Closed EinarSnorrason closed 5 months ago

EinarSnorrason commented 5 months ago

Implemented several branching instructions, added some optimizations to the BTrue and BFalse roots, and added some unit tests for these optimizations.

Unfortunately, when testing this I discovered that while the optimizations do work, they don't actually appear in the generated IL. This is because the IL is not doing this:

  ldloc.0
  ldloc.1
 clt
brtrue bb_4_0

Instead it's doing this:

  ldloc.0
  ldloc.1
 clt
stloc.2
.line 40:8 './lt.rs'
 ldloc.2
brtrue bb_4_0

So none of the optimizations I've written actually do anything at the moment 😅

Changing the optimization function to remove the unnecessary pushes and pops from the stack is a bit more invasive, so I might work on that in another PR.

Let me know if there's anything I should change or work on!

Resolves #44

Resolves #43

FractalFir commented 5 months ago

Nice work!

As for removing unnecessary stloc/ldloc pairs, while it is a bigger task, it can be split into smaller, easier to implement parts.

1. Removing dead sets

I would first start with removing "dead" stlocs. Basically, you would first need to check if a local variable is never read or its address is never taken, and then replace all stlocs to this variable with pops - don't remove that tree entirely, since the non-root parts still can have side effects.

You should do that in Method::opt, and use the cil iterators to make things easier.

After that, you can call Method::realloc_locals to remove all unused local variables.

This optimization will not do much on its own, since such dead stlocs are very rare.

The next step will have a much greater impact.

2. Replace ldlocs with the original tree, where possible

This optimization should be done on a per-block basis (at least for now), to reduce complexity.

Iterate through all CILTrees of a basic block. If the current tree has a root of CILRoot::STLoc, check if the nodes calculating the value of that local have no side effects (by calling CILNode::has_side_effects). If it has no side effects, check if it does not dereference any pointers. If all of those conditions are true, then it is a candidate for a move.

When we "move" a node, we simply replace any future occurrences of ldloc.n , so that this:

  ldloc.0
  ldloc.1
 add
stloc.2
   ldloc.2
   ldloc.2
  add
stloc.3
   ldloc.2
   ldloc.4
  mul
stloc.3

becomes:

   ldloc.0
   ldloc.1
 add
stloc.2
       ldloc.0
       ldloc.1
     add
        ldloc.0
        ldloc.1
     add
  add
stloc.3
       ldloc.0
       ldloc.1
     add
   ldloc.4
  mul
stloc.3

We can propagate this replacement until we find a CILTree that prevents it.

A tree prevents a move if it may change anything the nodes we are moving depend on.

So when we are moving those nodes:

   ldloc.0
   ldloc.1
 add

We can move across this tree:

  ldc.i4.0
stloc.2

But not this tree

  ldc.i4.0
stloc.1 // Writes to 1, and the nodes we are moving depend on `1`

or this tree:

  ldloc.3
  ldloc.4
stind.i8 // Writes to memory, so could write to local `0` or `1`, so a move is impossible

or this tree:

     ldloc.5
  call int32 SomeMethod(int32) // We don't know what it does, so it could change local `0` or `1`, so we can't move
pop

For now, we can assume such a replacement can only move across:

  1. A tree with root CILRoot::STLoc, that sets a local different from n, that is not referenced by the nodes we are moving, an whose value calculation has no side effects(CILNode::has_side_effects returns false).
  2. A tree with root CILRoot::Pop, whose value calculation has no side effects(CILNode::has_side_effects returns false).
  3. SourceFileInfo - this root exists only for debugging, and if other conditions are met, moving across it should have no observable effects.

This list can be expanded in the future, but this alone should allow for a lot of optimizations.

3. Remove pops, whose value has no side effects.

As a final clean-up step, iterate through all CILTrees, and if it is a CILRoot::Pop, whose value calculation has no side effetcs, remove that tree.

How all of that will work in practice.

When those optimizations will be applied to this CIL:

    ldloc.0
    ldloc.1
   clt
stloc.2
   ldloc.2
brtrue bb_4_0

First, the stloc.2 will be "moved":

    ldloc.0
    ldloc.1
  clt
stloc.2
    ldloc.0
    ldloc.1
  clt
brtrue bb_4_0

Then, the stloc.2 will become dead, and get replaced with a pop:

    ldloc.0
    ldloc.1
  clt
pop
    ldloc.0
    ldloc.1
  clt
brtrue bb_4_0

Then, the pop will be removed(since it has no side effects).

    ldloc.0
    ldloc.1
  clt
brtrue bb_4_0

Then, clt and brtrue will get simplifed:

  ldloc.0
  ldloc.1
blt bb_4_0

leaving us with an optimal sequence of CIL instructions.

EinarSnorrason commented 5 months ago

Thanks for the reply @FractalFir! It's a really cool project. I'll take a look at the stloc optimization when I have time. I have a couple of questions for it if you have the time 😄

  1. Can I assume that a local variable that is stored in one basic block won't be loaded in a subsequent basic block?
  2. For part 2, isn't there a danger that this might undo a CSE optimization and duplicate an expensive operation? Would it be safer to only perform the operation on stored locals that get loaded exactly once?
FractalFir commented 5 months ago

Sorry for taking a bit to respond, I was busy.

  1. No. Due to the way MIR works, you can't assume anything about the subsequent use of local variables. This is exactly why I suggest this optimization is split into parts, and the "moves" occur only within a block.

The phrase "dead stloc" was a bit misleading, sorry. For now, you can assume that a stloc is dead when and only when this local variable is never read/its address is not taken. So, you could do something like this:

// Find all locals, which are not read, and their addres is not taken.
let is_loc_dead = vec![true;method.locals.len()];
for node in method.iter_cil().nodes(){
   match node{
       // If we read or take the address of a local, it is not dead.
       CILNode::LdLoc(loc) | CILNode::LdLocA(loc)=>is_loc_dead[loc as usize] = false;
      _=>(),
   }
}
// For each local *n* which is "dead", replace `STLoc(n,val)` with `Pop(val)`
for (n,is_dead) in is_loc_dead.iter().zip(){
  if is_dead{
    // Replace all writes to this local with Pops, since the value calcualtion might have side-effects
    todo!()
  }
}
// Call `realloc_locals` to remove those local variables
method.realloc_locals();
  1. Yeah, there is a danger of that happening. We probably should have some limitation on this optimization. I am thinking of something like load_count * number_of_nodes < 10. That should keep the node count reasonable.

Since all calls are assumed to have side effects, there would danger of ever duplicating more than 10 nodes, which I feel is reasonable. Besides that, due to the way MIR works, most node trees are relatively flat.

FractalFir commented 5 months ago

Also, removing a stloc/ldloc pair can remove the "hidden" casts in the case of U8, U16, I8, I16. So, when removing a stloc ldloc pair that uses a local of that type, an explicit cast needs to be added.