Closed akkartik closed 3 years ago
push
isn't always pop
In Plan 0 I was implicitly assuming that if I pushed a variable to the stack (say when defining a new register), I had to pop the stack at the end of the variable's lifetime. Here's example 3 again:
# example 3
fn foo -> out/eax: int {
var x/eax: int <- 0
...
{
var x2/eax: int <- 0 # always push `x` here
...
break-if-= # pop `x` back here
out <- copy 0x34
} # don't pop `x` here, instead just drop it
... # A
}
This seems much nicer, and we no longer need to worry about liveness. Or don't we? We now no longer lose out
, but we still can't rely on x
being available by the time we get to A
.
And we still have to detect when an output isn't written in some path through a function. Even though there's a plan there, the more I think about it, the more the difficulty of control-flow analysis dominates my thinking.
return
statementsSo, back to the problem of conditionally writing outputs. After writing 10k+ lines, I'm starting to notice that I never actually want to write example 3. What I usually want is this:
# example 4
fn foo -> out/eax: int {
$foo:body: {
var x/eax: int <- 0
...
{
var x2/eax: int <- 0 # always push `x` here
...
break-if-=
out <- copy 0x34
break $foo:body
}
... # A: `x` is always available here
}
}
This is much more unambiguous. We always know exactly what variable should be in a given register. However, we still need control flow analysis to detect when someone forgets to add the break $foo:body
.
Given how common it is for me to add $___:body:
labels around functions, perhaps we should just write the previous example like this:
# example 5
fn foo -> out/eax: int {
var x/eax: int <- 0
...
{
var x2/eax: int <- 0
...
break-if-=
return 0x34
}
...
return ...
}
In this model we never write to out
directly. We always use a return
. The return
instruction isn't restricted in which register or memory location it can use. We just copy the arguments to the appropriate registers.
Now we no longer need to perform any control flow analysis because writes to outputs are always in the final block of a function, after which control exits the function. Now we only need to ensure that the function always ends in a return
instruction.
There are still an open question, though it seems far less serious: We no longer need to name the result. What should the syntax for function outputs look like? And we may want to change the syntax for all var
definitions to take the function output into account and be more consistent everywhere. Some options:
fn foo -> /eax: int # I don't like this
2a.
var x: int
var y: int@eax
var z: (addr int)@eax # I don't like this
fn foo -> int@eax
2b.
var x: int
var y: int.eax
var z: (addr int).eax # a little more natural looking, though '.' has a surprising meaning
fn foo -> int.eax
2c.
var x: int
var y: eax:int
var z: eax:(addr int) # I don't like this
fn foo -> eax:int
Some observations:
addr
s, handles
s, addr
s to array
s, etc.) look ok in combination with registers.fn foo -> _/eax: int
On balance, this seems the best option I can think of. In the short term, it's less work because I don't need to change the syntax for every var
everywhere. In the long term, it seems about as clear as all our code so far. And it's now easy to specify all possible combinations of name, type and register.
var x: int # simple type without a register
var y: (addr int) # compound type without a register
var z/eax: (addr int) # type in register
var _/eax: int # function output in a register, but without a name
var _/eax: (t x) # function output in a register with a (putative, but currently impossible) compound type
This seems like the winner at the moment:
# example 5
fn foo -> _/eax: int {
var x/eax: int <- 0
...
{
var x2/eax: int <- 0
...
break-if-=
return 0x34
}
...
return ...
}
It's unfortunate that I need a noisy character on every. single. function output. Hopefully we'll think of a better var
syntax one day. If so, we can switch to it down the road.
Task list:
return
instruction.return
instructions in all Mu code.vars
stack during parsing._
in all Mu code.@akkartik I'm only loosely following the thread here, but with Plan 2, what would the idiomatic way of returning an incrementally-computed result be?
As an example, say a strlen() for C strings. I would expect that to work by declaring a variable in a register and initializing it to zero, then incrementing it in a loop and finally returning it. It seems slightly awkward to have to match the register at the top of the function and the variable declaration, and return would have to handle the special case where the result is already in the right place rather than performing a mov.
It would be the same as before, except the output variable isn't declared in the header, you have to declare it in the body. To keep the types simple, here's adding numbers from 1 to n
:
fn sum-first n: int -> _/eax: int {
var result/ebx: int <- copy 0
var i/eax: int <- copy 1
{
compare i, n
break-if->
result <- add i
i <- increment
loop
}
return result
}
Since you're declaring it manually, you can declare it in any register. The return
copies it to the right place.
Hmm, one corner case here is returning multiple values from a function when the values are permuted across the output registers.
fn foo -> _/eax: int, _/ebx: int {
var x/eax: int <- copy 3
var y/ebx: int <- copy 4
return y, x # order is swapped
}
Naively, I would translate the return
to:
89/<- %eax 3/r32/ebx
89/<- %ebx 0/r32/eax
c3/return
Which acts like return y, y
.
Ugh :/
Background
Mu (this repo) contains a compiler built in machine code, which converts a memory-safe high-level language into machine code. Since it's built in machine code, the compiler is intended to be as simple as possible. In particular, it performs no optimizations.
Most Mu statements translate to a single instruction of machine code. One consequence of this 1:1 constraint is that the language exposes registers, and requires programmers to manage them manually. However, the compiler will check registers, and raise errors if you read a variable after a different variable has clobbered its register.
Current design
Function headers currently specify the names, types and registers for their outputs. For example:
There is no
return
instruction.Problems with the current design (Plan 0)
Failing to write outputs
We need some way to statically ensure that outputs are always written. Otherwise a caller may end up with access to random temporaries of any possible type, which violates the memory-safety guarantees I want:
Detecting errors like this will require generating the control-flow graph for functions, and checking every possible path through them. I've built such analysis passes before, but never in machine code.
Outputs getting over-written
There's an even bigger problem: tracking variable lifetimes. Mu is designed so variable lifetimes nest perfectly like Russian dolls:
It's been relatively easy to do this in machine code. I just maintain a stack of variables and blocks, and I clean up all variables for a block when I encounter its
}
.However, this nice property currently has a loophole due to function outputs. Consider this example:
Currently we save the state of registers when defining a new variable -- only if it wouldn't be written to by a function output later in the block. In example 3 above, we currently don't save
x
. However this strategy has two problems:x
even in some situations where we could preserve it.out
."Back in March, my best attempt at clean error messages was to speak of live ranges in terms of static ranges of lines, ignoring control flow. Once an output is defined, you can't use other existing variables in that register in lower lines. (You can still shadow the output in new variables in nested blocks.) However, I've lately found myself breaking this rule as I write code. Fixed ranges of lines are easy in themselves, but they're also a very different rule from regular variables with shadowing, so it's been easy to forget the rule for outputs. So I need to add the static check to remind myself when I violate the rule. And I've been loath to get into that just because tracking possible control-flow paths seems so hard in machine code. Perhaps it's time to go back to the drawing board on this plan.