faster-cpython / ideas


0, 1, 2, or 3 address form? #489

Closed markshannon closed 1 year ago

markshannon commented 1 year ago

We currently have a 0-address (stack machine) instruction format. All inputs are taken from the stack and all outputs are pushed to the stack. We are investigating a 3-address (register machine) instruction format.

But what about one or two address formats?

One address format

A.K.A. the register-accumulator form. One input and the output are implicitly the accumulator; the other input is explicit.

Two address format

x86 machine instructions are typically two-address instructions. The output goes to the same place as one of the inputs.

I don't think the one-address format will be useful, but the two-address form is interesting: it consumes one of its argument references, but not both, which may have advantages in terms of object re-use.

Minimizing reference count operations and instruction counts.

Which format minimizes reference count operations and instruction counts?

I think an AST level analysis should be sufficient to answer this.
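As a rough, hypothetical sketch of what such an analysis could start from, Python's ast module can count the loads, binary operations, and calls in an expression; these counts are the raw material a per-format cost model would consume (the cost model itself is not shown and is an assumption):

import ast

def count_ops(expr):
    # Count name loads, binary operations, and calls in an expression.
    # A pure stack machine needs one instruction per load and per
    # operation; register forms can fold some loads into operand fields.
    tree = ast.parse(expr, mode="eval")
    loads = sum(isinstance(n, ast.Name) for n in ast.walk(tree))
    binops = sum(isinstance(n, ast.BinOp) for n in ast.walk(tree))
    calls = sum(isinstance(n, ast.Call) for n in ast.walk(tree))
    return loads, binops, calls

print(count_ops("f(a + b + c, d)"))  # (5, 2, 1): f, a, b, c, d; two '+'; one call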

Examples

f(a+b+c, d)

f is global, the others are local:

0-address

LOAD_GLOBAL f
LOAD_FAST a
LOAD_FAST b
BINARY +
LOAD_FAST c
BINARY +
LOAD_FAST d
CALL 2

2-address

LOAD_GLOBAL f -> r1
LOAD_FAST a -> r2
BINARY + r2, b
BINARY + r2, c
LOAD_FAST d -> r3
CALL 2, r1

3-address

LOAD_GLOBAL f -> r1
BINARY + a, b -> r2
BINARY + r2, c -> r2
LOAD_FAST d -> r3
CALL 2, r1, r2
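
For completeness, a hypothetical 1-address (accumulator) version of the same expression in the same style (the STORE opcode and register names are illustrative); as written it needs more instructions than the 2- and 3-address forms because results have to be moved out of the accumulator explicitly:

1-address

LOAD_GLOBAL f
STORE r1
LOAD_FAST a
BINARY + b
BINARY + c
STORE r2
LOAD_FAST d
STORE r3
CALL 2, r1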

The two address form looks promising as it is more compact. I don't know whether it would work as well in general.

@iritkatriel what do you think?

iritkatriel commented 1 year ago

What would be the representation of instructions with 2 args?

mdboom commented 1 year ago

As a way of answering this, could we look at the code generated by the RegCPython fork and analyse the number of places where its 3 arg instructions could be reduced to 2 arg form? I guess if they aren't already re-using a register that would require some sort of register usage tracking...
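
Purely as a sketch of that count, assuming the generated code can be extracted as (opcode, dst, src1, src2) tuples for the binary operations (the tuple format is an assumption; RegCPython's real encoding may differ):

def reducible_to_two_address(instructions):
    # instructions: iterable of (opcode, dst, src1, src2) tuples for
    # register-based binary operations (assumed format, for illustration).
    # An instruction is trivially reducible to 2-address form when its
    # destination already coincides with one of its inputs.
    reducible = total = 0
    for opcode, dst, src1, src2 in instructions:
        total += 1
        if dst in (src1, src2):
            reducible += 1
    return reducible, total

Instructions whose destination differs from both inputs would need liveness information (or an extra COPY) to decide whether an input register could safely have been overwritten, which is the register usage tracking mentioned above.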

iritkatriel commented 1 year ago

Not sure - their compiler wasn't restricted to 2 args, so it might use 3 when it doesn't need to.

I don't know whether we really need to decide this now. We could do something flexible (like I did in my branch): leave the instruction layout as it is, add opcodes to set OPARG1, OPARG2, OPARG3 to the register they need to be (merge them into superinstructions so there is only one dispatch no matter how many opargs there are). We can get pretty far with that, and then experiment.
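
A toy simulation of that scheme, just to make the layout concrete (the opcode names and the (opcode, oparg) stream format are invented for illustration; in the real interpreter the setter would be merged with the following instruction into a superinstruction so there is still only one dispatch):

def run(code):
    # SET_OPARG2 / SET_OPARG3 prefixes fill the extra operand slots,
    # then the next "real" instruction consumes and clears them.
    oparg2 = oparg3 = None
    for opcode, oparg in code:
        if opcode == "SET_OPARG2":
            oparg2 = oparg
        elif opcode == "SET_OPARG3":
            oparg3 = oparg
        else:
            print(opcode, oparg, oparg2, oparg3)
            oparg2 = oparg3 = None

run([("SET_OPARG2", 7), ("SET_OPARG3", 3), ("BINARY_ADD", 5)])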

arhadthedev commented 1 year ago

3-address

jneb commented 1 year ago

What would be the representation of instructions with 2 args?

I just realized that the recipe I wrote here works with 2-address instructions as well, since binary operations write their result in the same location as the first argument. And if you treat the top of the stack as the accumulator, it could do 1-address, but that's slightly more complicated.

o11c commented 1 year ago

In my experience, 1-address is the optimal tradeoff for infinite-register machines, since the existence of the accumulator (which almost never needs to be spilled in practice) means you can use a fixed 16 bits almost all of the time.

2-address works fine for actual hardware since the number of registers is very small, so you can encode 2 arguments within a word. But bytecode VMs do not want to limit themselves to such a small number of registers (it's possible, but it necessarily means the bytecode is only efficient on hardware with similar assumptions), and multiple EXTENDED_ARGs are quite suboptimal.

My suggested example bytecode:

def foo(a, b, d):
    return bar(a+b+42, d)

foo.__code__ == Code(
    local_names = ('a', 'b', 'd') # mostly for debugging
    global_names = ('bar',) # if possible, share `global_names` across the module.
    int_consts = (42,)
    str_consts = ()
    other_consts = ()
    callinfo = ((bare_function, 2 positional, no keywords),)
    code = '''
        # Incoming arguments come first in the stack map
        #   r0 = a
        #   r1 = b
        #   r2 = d
        # Then locals and temporaries (none in this example)
        # and finally outgoing arguments:
        #   r3 = function (there are other ways this might be handled)
        #   r4 = outgoing argument 1
        #   r5 = outgoing argument 2

        LOAD_GLOBAL $0 # bar
        STORE r3 # stash for ordering reasons

        LOAD r0 # a
        ADD r1 # b;  accumulator = a + b
        ADD_INT_CONST $0 # 42;  accumulator = a + b + 42
        STORE r4

        LOAD r2 # d
        STORE r5

        CALL $0 # per callinfo: bare function in r3, positionals in r4 and r5

        RETURN # rare opcode that doesn't need an argument since the value we want is already in the accumulator
    '''
)

The main point of flexibility is exactly how the CALL opcode(s) works with the callinfo table.

jneb commented 1 year ago

In my experience, 1-address is the optimal tradeoff for infinite-register machines, since the existence of the accumulator (which almost never needs to be spilled in practice) means you can use a fixed 16 bits almost all of the time.

But of course the important question here is: which is faster in an interpreter? 1 address because of decoding simplicity, or 2 address because it uses fewer instructions? To be honest, I have a hard time believing the 3 address solution is going to be the fastest option.

markshannon commented 1 year ago

What would be the representation of instructions with 2 args?

32 bits most likely. No bigger than the 3 address form, and no smaller, but with fewer EXTENDED_ARGS and room for some extra data or the occasional 3-address instruction (like CALL?)
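
One possible packing of such a 32-bit word, purely as an illustration (the field widths here are an assumption, not a decision):

def pack(opcode, oparg1, oparg2, extra=0):
    # 8-bit opcode, two 8-bit register operands, and 8 bits of extra
    # data (e.g. a comparison kind, or a third register for the
    # occasional 3-address instruction such as CALL).
    assert all(0 <= x < 256 for x in (opcode, oparg1, oparg2, extra))
    return opcode | (oparg1 << 8) | (oparg2 << 16) | (extra << 24)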

o11c commented 1 year ago

My point is: the architectural addition of the accumulator means that 1-address usually does not have many, if any, more instructions than 2-address. And always fewer bytes.

Adding dedicated 4-byte (or another table?) MOVE instructions (sigh, I do consider them a wart, but I suppose we don't have PHI in this form) for (some of) the "load then immediately store" cases (c(i,s,o)/g/l -> g/l(,a?)) should be enough to make it unambiguously smaller. (merged instructions might not support EXTENDED_ARG for both operands, but you can just use the non-merged instructions if you have to). I suppose CALL could also be a 4-byte; be sure to make jumping into the middle of such an instruction obviously invalid though. We have learned that macro-op fusion is a good thing (I guess I was just proud of cramming everything into 16 bits when I did this).

We're agreed about 3-argument being bad though.


Since the "change index meanings per BB" discussion raced me, I would note that actually changing for every BB is certainly suboptimal (as is having a new output register for every instruction - there are a lot of cases where the same local is used or nothing is stored), but if you instead change only when you hit the 256 limit, that means you can avoid EXTENDED_ARG entirely if you want (at the cost of a little complexity elsewhere - not sure it's really worth it even with the nice benefits).

The most important thing is eliminating the (in-function) stack, though. Temporaries, if they can't stay in the accumulator, are just registers with no (meaningful) name. a*b + c*d fundamentally requires 1 such anonymous local, for example.
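
For example, in the accumulator notation from the earlier sketch (with MUL assumed by analogy with ADD, and r0..r3 holding a, b, c, d), a*b + c*d could look like:

LOAD r0    # a
MUL r1     # accumulator = a * b
STORE r4   # r4 is the anonymous temporary
LOAD r2    # c
MUL r3     # accumulator = c * d
ADD r4     # accumulator = a*b + c*d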

(note that between functions, the "stack" does exist; outgoing arguments deliberately overlap with the next frame's incoming arguments (though they might instead need to be copied if we've run out of stack space in the current allocation - we do not want to copy the whole stack); they (whether used or not), along with the accumulator (... actually, I guess for Python there is always a return value to put in it), should always be considered clobbered across calls (but not all jumps, to avoid pessimizing and, or, and maybe ternary bytecode))

markshannon commented 1 year ago

@o11c You say "In my experience", care to elaborate?

You state that "1-address usually does not have many, if any, more instructions than 2-address". Why not?

"outgoing arguments deliberately overlap with the next frame's incoming arguments"

Where do the globals, builtins, etc. go?

We're agreed about 3-argument being bad though.

Published work suggests otherwise https://github.com/faster-cpython/ideas/issues/485

jneb commented 1 year ago

There are a few reasons why an n+1 address architecture may be better than an n address architecture in the case of a reference counting interpreter:

While playing around with manually compiling a bit of code, I saw the number of increfs/decrefs decrease as the number of instruction addresses increased. The step from 2 to 3 addresses was bigger than I expected.

The biggest surprise was that the calculating bytecodes (so: not jumps, etc) all take on average one INCREF and one DECREF per bytecode, regardless of architecture. So, shorter is better.

So I would not immediately reject a 3 register architecture, even with more parameter decoding and longer instructions.

Edit: I read the Shi, Casey, Ertl, Gregg paper. It is quite convincing, and it is obvious that a three address register machine (with decent move elimination) is the way forward.

iritkatriel commented 1 year ago

What would be the representation of instructions with 2 args?

32 bits most likely. No bigger than the 3 address form, and no smaller, but with fewer EXTENDED_ARGS and room for some extra data or the occasional 3-address instruction (like CALL?)

Extra data will be very useful - like which comparison operator to use for a COMPARE_OP, or which of the args need to be decref'ed.

Looks like we don't need to worry about EXTENDED_ARGS.

There may be a few COPYs that we still need (if neither arg can be overwritten) but I think the most common ones will go.

gvanrossum commented 1 year ago

Maybe we can just close this with the conclusion that we're striving for a 3-address form (with the occasional extension to 4-address, in particular for BINARY_OP)?

gvanrossum commented 1 year ago

I realized something that might have contributed to the perf slowdown Irit measured for her first register VM prototype.

The discussion about the -fno-omit-frame-pointer GCC option (to keep frame pointers) made me realize that we're starved for CPU registers (these have nothing to do with the "registers" in "register VM" -- those are really unnamed temporary locals).

This is the insight: having additional variables oparg2 and oparg3 that are referenced by each instruction (the decoding part always sets these) adds a lot of register pressure and will cause more "spills" (where CPU registers are saved to the stack), slowing our main interpreter loop down.

I'm not sure what to do about this yet, but given that the GCC discussion mentioned as much as a 10% slowdown, this might be a big effect.

jneb commented 1 year ago

There's one particular remark in that piece that may give a hint in how to proceed:

But I don't think _PyEval_EvalFrameDefault example is typical of how application code is written, nor is it, generally speaking, a good idea to do so much within single gigantic function.

I am a bit disappointed by GCC not doing a better job at variable lifetime analysis here, so we'll have to help it. The trick of course is to make sure that as few values as possible are live in any stretch of code. One example is decoding arguments only at the moment you need them, so that the opargs don't have to be kept around. Since this is no fun by hand, this probably needs a clever code generator and/or well chosen macros.

gvanrossum commented 1 year ago

So the new plan is that oparg2 and oparg3 are only set by the decoding part of instructions that need them. But they can't be locals in each instruction's block because of EXTENDED_ARG* decoding. They will be dead in most parts of the code though; maybe that's enough for GCC. Time will have to tell (I can't think of an easy experiment to get more clarity here).

markshannon commented 1 year ago

In summary, we decided on the three-address form. Although some instructions will need extra operands, and some fewer, the default will be three-address form. This might all be moot, as the register interpreter has been deferred indefinitely.