Closed akkartik closed 4 years ago
Ok, I have a broader proposal: create syntax not for function calls but for stack management.
Stack management is a crucial part of the book-keeping involved in Assembly programming, and it would be great if explicit `push` instructions become code smells if not utterly disallowed.
Currently we use `push` for three kinds of things:
a) Defining local variables (which we then must remember to clean up before `c3/return`, because otherwise we lose our return address).
b) Calling functions with arguments (which we must then remember to clean up after the callee returns)
c) Spilling registers to be reused later.
Here's an example syntax to support all 3: create a rudimentary stack-based language for lines beginning with some special token, say `{`. Such lines can have two kinds of expressions:
Later a line with a `}` would restore the stack to the same level as before the corresponding `{`.
For example:
{ 0 0 ->%ecx
...
}
This is equivalent to:
68/push 0/imm32
68/push 0/imm32
89/copy %ecx 4/ESP/r32
...
81 0/subop/add %esp, 8
Which is basically what you need to define a local variable (say a slice).
A function call:
{ %ecx "foo" %edx
e8/call foo/disp32
}
which is equivalent to the pseudocode:
push %ecx
push "foo"/imm32
push %edx
e8/call foo/disp32
81 0/subop/add %esp, 12  # three 4-byte pushes to undo
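To make the proposed translation concrete, here is a minimal Python sketch of a desugaring pass for `{`/`}` lines. It is not part of SubX: the function name `desugar` and the pseudo-assembly it emits are made up for illustration, and it assumes `->%reg` means "copy the current ESP into that register", which is my reading of the local-variable example. Every pushed word is assumed to be 4 bytes, so a line that pushes three words is undone by adding 12 to ESP.

```python
def desugar(lines):
    """Expand '{ ...' / '}' lines into explicit pushes and a matching ESP fix-up.

    Hypothetical sketch: emits readable pseudo-assembly, not real SubX opcodes.
    Words are split on whitespace, so string literals containing spaces are
    not handled.
    """
    out = []
    open_blocks = []  # bytes pushed by each still-open '{'
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("{"):
            pushed = 0
            for word in stripped[1:].split():
                if word.startswith("->%"):
                    # assumed meaning: save the current top of stack in a register
                    out.append(f"copy %esp to %{word[3:]}")
                elif word.startswith("%"):
                    out.append(f"push {word}")
                    pushed += 4
                else:
                    out.append(f"push {word}/imm32")
                    pushed += 4
            open_blocks.append(pushed)
        elif stripped == "}":
            if not open_blocks:
                raise ValueError("'}' without a matching '{'")
            out.append(f"add {open_blocks.pop()} to %esp")
        else:
            out.append(stripped)
    if open_blocks:
        raise ValueError("unclosed '{'")
    return out


if __name__ == "__main__":
    for instruction in desugar(['{ %ecx "foo" %edx', "e8/call foo/disp32", "}"]):
        print(instruction)
```

Because the pass counts the pushes itself, the cleanup amount can never drift out of sync with the words on the `{` line, which is the book-keeping this proposal is trying to remove.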
Hmm, this syntax is interesting, but it makes the original `tailor-exit-descriptor` scenario pretty terse and awfully hard to spot. For example, assuming the address to the exit descriptor is currently in ECX:
# call f(x, y, ed, z) that may call stop(ed) at some point
{ z ed y x ->*ed # last word tailors 'ed'
e8/call f/disp32
}
The only difference between a local variable and tailoring is that `%` turns into `*`.
> create a rudimentary stack-based language for lines beginning with some special token, say `{`
I really like this idea. Keeping the stack balanced is error prone, and this gives us more control over function calls than the normal function call syntax (i.e. whether we use disp32/disp8).
In terms of syntax, I think being able to reference the stack frame/scope by name would be handy:
{|stack1| 1 2 3 {|stack2| 3 *(stack1+4) 5 call blah}}
{|stack1| 1 (stack1+12) call some-func-that-uses-ed}
{|stack1| 1 (stack1 + stack1.retaddr) call some-function-that-uses-ed} # if we calculate the return address for every stack frame, just pass the return addr
I think the stack location can then be computed from the lexical location in the file (i.e. we know how many words are on the stack, so we can determine statically what the offset is to the address).
That is interesting, but where would these stack1 variables be stored? This may be harder than it seems at first glance.
I like how you've put the entire `{...}` on a single line. If it's not too hard I'd like to provide that single-line alternative. But we still need to support multiple lines between the `{...}`.
I was imagining something like this:
{|stack1| 1 2 3 {|a-call| %ebx *(stack1-4) e8/call somefunc}
=>
# Stack #
| 0x01 | <- stack1 is a pointer to this
| 0x02 |
| 0x03 |
| %ebx val | <- a-call points here
| *((ebp+32)-4) | # we know this is ebp-32 since there are only 4 words on the stack at this point in the program
Essentially it's just a label for the stack, which I think works(?).
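As a rough illustration of why such labels can be resolved statically, here is a small Python sketch that walks the tokens of a `{|label| ...` expression and computes, for every scope still open, the byte offset from the current top of stack to the first word pushed under that label. The function name `label_offsets`, the 4-byte word size, and the choice to have a label point at the first word of its scope are assumptions for illustration. The input is assumed to contain only labels, pushed words, and braces, with no call instruction.

```python
import re

WORD = 4  # bytes per pushed word (assumed 32-bit)


def label_offsets(source):
    """Map each still-open label to the offset (in bytes, relative to the
    current top of stack) of the first word pushed under it.

    Hypothetical sketch: every non-brace token counts as exactly one push.
    """
    depth = 0         # words pushed so far
    open_scopes = []  # (label, depth when the scope was opened)
    start = {}        # label -> depth when the scope was opened
    for tok in re.findall(r"\{\|[^|\s]+\||\}|[^\s{}]+", source):
        if tok.startswith("{|"):
            label = tok[2:-1]
            open_scopes.append((label, depth))
            start[label] = depth
        elif tok == "}":
            # closing a scope drops everything pushed inside it
            label, depth = open_scopes.pop()
            del start[label]
        else:
            depth += 1
    return {label: (depth - 1 - d) * WORD for label, d in start.items()}


if __name__ == "__main__":
    print(label_offsets("{|stack1| 1 2 3 {|a-call| %ebx *(stack1-4)"))
```

Nothing here is stored at run time: the offsets fall out of counting words in the source text, which is what makes the label a purely translation-time name.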
Yeah, mostly makes sense. The question in my mind is: where is `stack1` allocated? Is it a purely translation-time variable? Your first example also seemed to use it in complex ways like `stack1 + stack1.retaddr`.
> Is it a purely translation-time variable?
Yeah. That's what I was thinking.
`stack1.size` or `stack1.len` is probably a better name. It just becomes the length of the stack (which can be evaluated at compile time). This of course doesn't work with vararg functions and dynamic stack manipulation, but I feel like those are less common cases.
If a dynamic stack is required, I think something like this would work:
{|stack1| %esp 1 2 3 *dynamic pushes* ...
# length is required now
%esp - (*stack1) # calculate dynamically by subtracting the old esp with the current esp
}
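The dynamic case can be modelled with a toy stack in Python. This is only a simulation under stated assumptions: the `Stack` class, its starting address, and `demo` are hypothetical, words are 4 bytes, and the stack grows downward as on x86, so the number of bytes pushed is the saved ESP minus the current ESP. As on x86, pushing `%esp` stores the value ESP had before the push.

```python
class Stack:
    """Toy downward-growing stack of 4-byte words (illustration only)."""

    def __init__(self, top=0x1000):
        self.esp = top
        self.mem = {}

    def push(self, value):
        self.esp -= 4
        self.mem[self.esp] = value


def demo(n_dynamic):
    """Return the bytes pushed since the scope opened, computed at run time."""
    s = Stack()
    s.push(s.esp)    # 'push %esp': the argument is read before ESP is decremented
    stack1 = s.esp   # the label points at the slot holding the saved ESP
    for value in (1, 2, 3):
        s.push(value)
    for i in range(n_dynamic):  # stand-in for *dynamic pushes*
        s.push(i)
    # stack grows down, so the saved (old) ESP is the larger address
    return s.mem[stack1] - s.esp


if __name__ == "__main__":
    print(demo(2))
```

The result includes the slot that holds the saved ESP itself, so restoring the stack amounts to adding that many bytes back to ESP.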
Ok, I see.
It looks like your examples have expressions on multiple lines? Maybe this is a completely new language rather than just sugar?
I'm starting to grow less excited about this whole thread. For multiple reasons:
a) The new stack syntax adds a new gotcha to compensate for the gotcha it protects us from. You have to make sure you never exit except through the `}`. Otherwise the stack gets mismatched.
b) It seems to increase the reader's burden to have an additional 'language' that code in the repo may be written in. The alternative would be to treat the new syntax sugar as part of core SubX, support it in the C++ version, rewrite all our SubX code to use it, and treat any new phases as part of the core. That seems like a lot of work for unclear benefit, since the amount of progress we've made is a sort of existence proof that maybe SubX without the extra sugar isn't so bad after all.
c) Rather than attack gotchas one by one, we should just start on a new language. A memory-safe statement-oriented language implemented in SubX where each statement maps to a single x86 instruction.
In other words, I'd rather this be the next syntax:
var x : slice
...
than this:
{ 0 0 ->%ecx
...
}
Yeah, I was looking at the syntax I came up with the other day and I realized that it attempts to solve two problems: memory labeling (poorly) and stack balancing.
With the new approach are you thinking it’ll do stack balancing automatically since it should be memory safe?
You know, that's a good question. It's my top priority, and I think I'll have to violate my "1 instruction per line" design constraint to achieve it.
But yes, that's the plan.
Today, though, I'm enamored with the idea of a tiny Lisp interpreter. It's not going to be the final goal, but it would just be so cool to be able to type commands at `init`. And should be fairly quick. We're due for some fun.
Any fun little projects you want to try?
> Today, though, I'm enamored with the idea of a tiny Lisp interpreter. It's not going to be the final goal, but it would just be so cool to be able to type commands at `init`. And should be fairly quick. We're due for some fun.
That definitely could be fun. I've not implemented lisp in asm before (though I guess it has been done before which might be a handy reference).
I've been interested in Forth recently (particularly because of how simple it is to implement in asm, and because it allows mixing asm code with interpreted code). I started implementing a Forth in nasm a few months back (https://git.sr.ht/~nch/onward/tree/master/onward.s), but never quite finished it. Now might be the time for me to port it to subx and finish it off :)
Excellent idea!
SubX currently allows one to test the `exit()` syscall. It does so using a dependency-injected wrapper called `stop` that takes an exit-descriptor as an argument. If the exit-descriptor is null the program really `exit`s. If it is created using `tailor-exit-descriptor`, `stop` unwinds the stack until the frame that called `tailor-exit-descriptor`.
But `tailor-exit-descriptor` is klunky. The way it currently works is, you pass in the number of args of the function that's going to get passed in the exit-descriptor, and it computes where on the stack the return address for the current stack frame is going to be, saving it to the exit-descriptor. That way all `stop` has to do is set ESP to the return address in the exit-descriptor and then call `c3/return`.
If we moved to a more HLL syntax where function calls were all in a single line, we'd like to make `tailor-exit-descriptor` cleaner or maybe do away with it altogether.
One easier way to explain `tailor-exit-descriptor` is that it is equivalent to a special function call. Now for background, regular calls in SubX look like this:
`ESP`)
The special call to a function that may want to call `stop` looks like this:
Only the second step is new. But we now get to replace all of `tailor-exit-descriptor` with a single instruction (assuming the address of the exit-descriptor is always in a register).
Now my question becomes: what does a syntax for this special call look like?
If regular calls look like `f(arg1, arg2, ... argn)`, then some possibilities:
1.
2.
I don't like either, but they do have some benefits. The first allows for a single exit-descriptor to be reused across function calls in the same stack frame (totally safe). The second is all on one line to indicate that conceptually it's all a single call.
On the other hand, the second now looks like two operations on a single line, which is confusing and potentially sets a bad precedent. So I wonder what sort of approach we may take that makes it look less like two calls in a single line.
That's a lot of grammar for just one construct.
4.
We'd need to somehow figure out where `ed` is.
5.
6.
Anyways, that's my brain dump.
cc @charles-l