akkartik / mu

Soul of a tiny new machine. More thorough tests → More comprehensible and rewrite-friendly software → More resilient society.
http://akkartik.name/akkartik-convivial-20200607.pdf
Other
1.35k stars 47 forks source link

Static checks for function outputs #45

Closed akkartik closed 3 years ago

akkartik commented 3 years ago

Background

Mu (this repo) contains a compiler built in machine code, which converts a memory-safe high-level language into machine code. Since it's built in machine code, the compiler is intended to be as simple as possible. In particular, it performs no optimizations.

Most Mu statements translate to a single instruction of machine code. One consequence of this 1:1 constraint is that the language exposes registers, and requires programmers to manage them manually. However, the compiler will check registers, and raise errors if you read a variable after a different variable has clobbered its register.

Current design

Function headers currently specify the names, types and registers for their outputs. For example:

# example 1
fn foo -> result/eax: int {
  result <- copy 0x34
}

There is no return instruction.

Problems with the current design (Plan 0)

Failing to write outputs

We need some way to statically ensure that outputs are always written. Otherwise a caller may end up with access to random temporaries of any possible type, which violates the memory-safety guarantees I want:

# example 2
fn foo -> result/eax: int {
  {
    break-if-=
    result <- copy 0x34
  }
}

Detecting errors like this will require generating the control-flow graph for functions, and checking every possible path through them. I've built such analysis passes before, but never in machine code.

Outputs getting over-written

There's an even bigger problem: tracking variable lifetimes. Mu is designed so variable lifetimes nest perfectly like Russian dolls:

fn foo {
  var x/eax: int <- copy 0
  ...
  {
    var x2/eax: int <- copy 1
    ...
    {
      var x3/eax: int <- copy 2
      ...
    }  # x3 gets reclaimed here
    ...  # eax guaranteed to contain x2 here
    {
      var x4/eax: int <- copy 2
      ...
    }  # x4 gets reclaimed here
    ...  # eax guaranteed to contain x2 here
  }  # x2 gets reclaimed here
  ...  # eax guaranteed to contain x here
}

It's been relatively easy to do this in machine code. I just maintain a stack of variables and blocks, and I clean up all variables for a block when I encounter its }.

However, this nice property currently has a loophole due to function outputs. Consider this example:

# example 3
fn foo -> out/eax: int {
  var x/eax: int <- 0
  ...
  {
    var x2/eax: int <- 0  # should we push `x` to the stack here?
    ...
    break-if-=
    out <- copy 0x34
  }
  # Should `x` be accessible here?
}

Currently we save the state of registers when defining a new variable -- only if it wouldn't be written to by a function output later in the block. In example 3 above, we currently don't save x. However this strategy has two problems:

  1. Blocks can conditionally write function outputs. In this situation we lose access to x even in some situations where we could preserve it.
  2. Errors become hard to explain. "x isn't available here, because it may have been overwritten by out."

Back in March, my best attempt at clean error messages was to speak of live ranges in terms of static ranges of lines, ignoring control flow. Once an output is defined, you can't use other existing variables in that register in lower lines. (You can still shadow the output in new variables in nested blocks.) However, I've lately found myself breaking this rule as I write code. Fixed ranges of lines are easy in themselves, but they're also a very different rule from regular variables with shadowing, so it's been easy to forget the rule for outputs. So I need to add the static check to remind myself when I violate the rule. And I've been loath to get into that just because tracking possible control-flow paths seems so hard in machine code. Perhaps it's time to go back to the drawing board on this plan.

akkartik commented 3 years ago

Plan 1: the opposite of push isn't always pop

In Plan 0 I was implicitly assuming that if I pushed a variable to the stack (say when defining a new register), I had to pop the stack at the end of the variable's lifetime. Here's example 3 again:

# example 3
fn foo -> out/eax: int {
  var x/eax: int <- 0
  ...
  {
    var x2/eax: int <- 0  # always push `x` here
    ...
    break-if-=  # pop `x` back here
    out <- copy 0x34
  }  # don't pop `x` here, instead just drop it
  ... # A
}

This seems much nicer, and we no longer need to worry about liveness. Or don't we? We now no longer lose out, but we still can't rely on x being available by the time we get to A.

And we still have to detect when an output isn't written in some path through a function. Even though there's a plan there, the more I think about it, the more the difficulty of control-flow analysis dominates my thinking.

akkartik commented 3 years ago

Plan 2: drop named outputs, use return statements

So, back to the problem of conditionally writing outputs. After writing 10k+ lines, I'm starting to notice that I never actually want to write example 3. What I usually want is this:

# example 4
fn foo -> out/eax: int {
$foo:body: {
  var x/eax: int <- 0
  ...
  {
    var x2/eax: int <- 0  # always push `x` here
    ...
    break-if-=
    out <- copy 0x34
    break $foo:body
  }
  ... # A: `x` is always available here
}
}

This is much more unambiguous. We always know exactly what variable should be in a given register. However, we still need control flow analysis to detect when someone forgets to add the break $foo:body.

Given how common it is for me to add $___:body: labels around functions, perhaps we should just write the previous example like this:

# example 5
fn foo -> out/eax: int {
  var x/eax: int <- 0
  ...
  {
    var x2/eax: int <- 0
    ...
    break-if-=
    return 0x34
  }
  ...
  return ...
}

In this model we never write to out directly. We always use a return. The return instruction isn't restricted in which register or memory location it can use. We just copy the arguments to the appropriate registers.

Now we no longer need to perform any control flow analysis because writes to outputs are always in the final block of a function, after which control exits the function. Now we only need to ensure that the function always ends in a return instruction.

There are still an open question, though it seems far less serious: We no longer need to name the result. What should the syntax for function outputs look like? And we may want to change the syntax for all var definitions to take the function output into account and be more consistent everywhere. Some options:

  1. Just remove the name:
fn foo -> /eax: int  # I don't like this
  1. Switch the syntax of variables to something else that looks a little bit cleaner when the name is dropped. Some examples:

2a.

var x: int
var y: int@eax
var z: (addr int)@eax  # I don't like this
fn foo -> int@eax

2b.

var x: int
var y: int.eax
var z: (addr int).eax  # a little more natural looking, though '.' has a surprising meaning
fn foo -> int.eax

2c.

var x: int
var y: eax:int
var z: eax:(addr int)  # I don't like this
fn foo -> eax:int

Some observations:

  1. Use a wildcard:
fn foo -> _/eax: int

On balance, this seems the best option I can think of. In the short term, it's less work because I don't need to change the syntax for every var everywhere. In the long term, it seems about as clear as all our code so far. And it's now easy to specify all possible combinations of name, type and register.

var x: int  # simple type without a register
var y: (addr int)  # compound type without a register
var z/eax: (addr int)  # type in register
var _/eax: int  # function output in a register, but without a name
var _/eax: (t x)  # function output in a register with a (putative, but currently impossible) compound type

Summary

This seems like the winner at the moment:

# example 5
fn foo -> _/eax: int {
  var x/eax: int <- 0
  ...
  {
    var x2/eax: int <- 0
    ...
    break-if-=
    return 0x34
  }
  ...
  return ...
}

It's unfortunate that I need a noisy character on every. single. function output. Hopefully we'll think of a better var syntax one day. If so, we can switch to it down the road.

Task list:

  1. Implement a return instruction.
  2. Replace references to function outputs with return instructions in all Mu code.
  3. Stop pushing function outputs in the vars stack during parsing.
  4. Implement the wildcard "_" function output name.
  5. Rename function outputs to _ in all Mu code.
s-ol commented 3 years ago

@akkartik I'm only loosely following the thread here, but with Plan 2, what would the idiomatic way of returning an incrementally-computed result be?

As an example, say a strlen() for C strings. I would expect that to work by declaring a variable in a register and initializing it to zero, then incrementing it in a loop and finally returning it. It seems slightly awkward to have to match the register at the top of the function and the variable declaration, and return would have to handle the special case where the result is already in the right place rather than performing a mov.

akkartik commented 3 years ago

It would be the same as before, except the output variable isn't declared in the header, you have to declare it in the body. To keep the types simple, here's adding numbers from 1 to n:

fn sum-first n: int -> _/eax: int {
  var result/ebx: int <- copy 0
  var i/eax: int <- copy 1
  {
    compare i, n
    break-if->
    result <- add i
    i <- increment
    loop
  }
  return result
}

Since you're declaring it manually, you can declare it in any register. The return copies it to the right place.

akkartik commented 3 years ago

Hmm, one corner case here is returning multiple values from a function when the values are permuted across the output registers.

fn foo -> _/eax: int, _/ebx: int {
  var x/eax: int <- copy 3
  var y/ebx: int <- copy 4
  return y, x  # order is swapped
}

Naively, I would translate the return to:

89/<- %eax 3/r32/ebx
89/<- %ebx 0/r32/eax
c3/return

Which acts like return y, y.

Ugh :/