faster-cpython / ideas


Define opcodes in terms of smaller "micro-ops" #454

Closed: brandtbucher closed this issue 7 months ago

brandtbucher commented 2 years ago

We've talked about this concept on-and-off for some time now, so I thought I would create this issue to collect some concrete ideas and start trying to reach a consensus on what the best path to take here is. This is a big, high-risk project, so we should probably start discussing it sooner rather than later.

Currently, our instructions are pretty much defined as "whatever the corresponding blob of C code in ceval.c does". While there are several places (mark_stacks, stack_effect, dis and its docs, the various haswhatever lists in opcode.py) that describe aspects of their behavior, they are scattered all over the place and must be kept in-sync (by hand) with the actual implementation. In other words, they are a burden just as much as they are a useful resource.

It's also not always very clear (for example):

And potentially, in the future:

...and, of course, much more.

One way to improve this situation is by ditching the free-form C code and defining our opcodes in terms of structured, statically-analyzable micro-ops (or "uops"). These uops represent lowered, composable units of work, such as incrementing the instruction pointer, decrementing a value's reference count, writing values to the stack, etc. Ideally, they'd represent a sort of IR that bridges the gap between Python and native code.

Last week I put together a simple proof-of-concept that converts much of the common logic in our instructions to a couple dozen initial uops. There is much more to be done, of course, but it does a pretty good job of communicating what this could look like, and many opcodes have been mostly or entirely converted. Note that GitHub collapses the huge ceval.c diff by default, but most of the interesting stuff is there.

Other random thoughts to spark discussion:

markshannon commented 2 years ago

I think the micro-ops you propose are too low level and try to do several different things:

  • Define the semantics of the higher-level instructions
  • Implement VM features like tracing and recording the prev-instr
  • Specify the mechanism for handling the stack, rather than treating it as an abstract stack.

It also requires local variables, which moves it into the space of an IR similar to LLVM's.

From your proof of concept, LOAD_FAST is defined as:


            PyObject *value;
            UOP_JUMP(1);
            UOP_WRITE_PREV_INSTR();
            UOP_UPDATE_STATS();
            UOP_GET_FAST(value, oparg);
            if (value == NULL) {
                goto unbound_local_error;
            }
            Py_INCREF(value);
            PUSH(value);
            DISPATCH();
            UOP_STACK_ADJUST(1);
            UOP_STACK_SET(1, value);
            UOP_INCREF(value);
            UOP_NEXT_OPCODE();
            UOP_NEXT_OPARG();
            UOP_LLTRACE();
            UOP_CHECK_TRACING();
            UOP_DISPATCH();

LOAD_FAST does three things: it moves the local variable to the stack, checks whether that value is NULL, and increments its refcount:

LOAD_FAST = GET_LOCAL CHECK_UNBOUND_LOCAL INCREF_TOS ;

We need to bottom out somewhere, and the three primitives GET_LOCAL, CHECK_UNBOUND_LOCAL, INCREF_TOS seem simple enough already.

GET_LOCAL ( -- obj) {
    obj = frame->f_localsplus[oparg];
}
CHECK_UNBOUND_LOCAL(obj -- obj) {
    if (obj == NULL) {
         goto unbound_local_error;
    }
}
INCREF_TOS(obj -- obj) {
     Py_INCREF(obj);
}

The above are amenable to optimization.

Handling tracing, dispatching, etc are the responsibility of the code generator.
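
For example, a generator consuming those three micro-ops might emit something along these lines. This is a rough sketch only, reusing the existing TARGET/PUSH/DISPATCH macros from today's ceval.c; the push implied by the ( -- obj) stack effect is applied by the generator, not written by hand:

```c
TARGET(LOAD_FAST) {
    // GET_LOCAL ( -- obj)
    PyObject *obj = frame->f_localsplus[oparg];
    // CHECK_UNBOUND_LOCAL (obj -- obj)
    if (obj == NULL) {
        goto unbound_local_error;
    }
    // INCREF_TOS (obj -- obj)
    Py_INCREF(obj);
    // The generator applies the net (+1) stack effect and dispatches.
    PUSH(obj);
    DISPATCH();
}
```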

markshannon commented 2 years ago

Here's another example, LOAD_ATTR_INSTANCE_VALUE.

LOAD_ATTR_INSTANCE_VALUE needs to check the type version and whether the object has values before extracting the value, and finally check whether the result is NULL.

LOAD_ATTR_INSTANCE_VALUE =
    SKIP_COUNTER TYPE_VERSION_CHECK INSTANCE_VALUES_CHECK LOAD_VALUE NULL_CHECK INCREF_TOS;
SKIP_COUNTER(#counter) {}
TYPE_VERSION_CHECK(##version obj -- obj) {
    DEOPT_IF(version != Py_TYPE(obj)->tp_version);
}
INSTANCE_VALUES_CHECK(obj -- obj) {
    assert(Py_TYPE(obj)->tp_dictoffset < 0);
    assert(Py_TYPE(obj)->tp_flags & Py_TPFLAGS_MANAGED_DICT);
    PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(obj);
    DEOPT_IF(!_PyDictOrValues_IsValues(dorv));
}
LOAD_VALUE(#index obj -- attr) {
    PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(obj);
    attr = _PyDictOrValues_GetValues(dorv)->values[index];
    Py_DECREF(obj);
}
NULL_CHECK(obj -- obj) {
    DEOPT_IF(obj == NULL);
}

If we are willing to have non-objects on the stack (which we can, provided the code generator ensures that they never escape), then we can lower the above a bit further:

LOAD_ATTR_INSTANCE_VALUE =
    SKIP_COUNTER TYPE_VERSION_CHECK INSTANCE_VALUES_CHECK GET_VALUES_POINTER LOAD_VALUE_FROM_VALUES NULL_CHECK ;
GET_VALUES_POINTER(obj -- dorv: PyDictOrValues) {
    dorv = *_PyObject_DictOrValuesPointer(obj);
    Py_DECREF(obj);
}
LOAD_VALUE_FROM_VALUES(#index dorv: PyDictOrValues -- obj) {
    obj = _PyDictOrValues_GetValues(dorv)->values[index];
}

Note on stack effects.

(in0 -- out0 out1) is a stack effect, saying that this operation takes in0 from the stack, and puts out0 and out1 back on the stack.

#counter denotes that this operation takes a 16-bit code unit from the instruction stream; ##version is a 32-bit value.

It is the job of the code generator to make sure that the stack and instruction stream are unchanged after a DEOPT_IF branch.
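
For example, one way a generator could expand the DEOPT_IF inside TYPE_VERSION_CHECK is sketched below; first_instr_of_this_op and GOTO_INSTRUCTION are placeholders here, not a committed API:

```c
// Sketch: the guard runs before any stack writes, so deoptimizing only has to
// rewind the instruction pointer (undoing the #counter/##version reads) and
// jump to the unspecialized form of the instruction.
if (version != Py_TYPE(obj)->tp_version) {
    next_instr = first_instr_of_this_op;   // placeholder: start of this instruction
    GOTO_INSTRUCTION(LOAD_ATTR);           // placeholder: re-execute the generic opcode
}
```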

brandtbucher commented 2 years ago

I think the micro-ops you propose are too low level and try to do several different things:

  • Define the semantics of the higher-level instructions

Not accomplishing this first bullet would feel like a huge missed opportunity to me in terms of tooling/testing/maintenance, but I admit that my post probably overstated the value of this relative to other, more important design goals.

  • Implement VM features like tracing and recording the prev-instr
  • Specify the mechanism for handling the stack, rather than treating it as an abstract stack.

It also requires local variables, which moves it into the space of an IR similar to LLVM's.

I mostly agree with your second bullet: parts of the instruction preamble and dispatch sequence can probably be omitted when actually specifying the uop sequences for generation. As you say, "handling tracing, dispatching, etc are the responsibility of the code generator". I included them in the prototype because, well, I am the code generator in this case. ;)

If you remove some of that fluff, though, our uop specifications are quite similar. As you point out, the major differences are that you want to avoid local temporaries and explicit stack manipulations in favor of using just an abstract stack that is manipulated directly by many diverse uops (similar to our current instruction set). On the surface, those both seem less desirable to me, in terms of both maintenance and usefulness.

As a concrete example, here is how a few simple passes might optimize a recorded trace (or superinstruction) for the statement a = b + c. Click to expand:

Uop concatenation

```c
PyObject *value, *left, *right, *sum, *tmp;
// LOAD_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(value, 0); UOP_STACK_ADJUST(1); UOP_STACK_SET(1, value); UOP_INCREF(value);
// LOAD_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(value, 1); UOP_STACK_ADJUST(1); UOP_STACK_SET(1, value); UOP_INCREF(value);
// BINARY_OP:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_STACK_GET(left, 2); UOP_STACK_GET(right, 1); UOP_BINARY_OP(sum, NB_ADD, left, right); UOP_STACK_SET(2, sum); UOP_DECREF(left); UOP_DECREF(right); UOP_STACK_ADJUST(-1); UOP_ERROR_IF_NULL(sum); UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_STACK_GET(value, 1); UOP_STACK_ADJUST(-1); UOP_GET_FAST(tmp, 2); UOP_STORE_FAST(2, value); UOP_XDECREF(tmp);
```

SSA-ification

```c
PyObject *value_0, *value_1, *left, *right, *sum, *value_2, *tmp;
// LOAD_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(value_0, 0); UOP_STACK_ADJUST(1); UOP_STACK_SET(1, value_0); UOP_INCREF(value_0);
// LOAD_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(value_1, 1); UOP_STACK_ADJUST(1); UOP_STACK_SET(1, value_1); UOP_INCREF(value_1);
// BINARY_OP:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_STACK_GET(left, 2); UOP_STACK_GET(right, 1); UOP_BINARY_OP(sum, NB_ADD, left, right); UOP_STACK_SET(2, sum); UOP_DECREF(left); UOP_DECREF(right); UOP_STACK_ADJUST(-1); UOP_ERROR_IF_NULL(sum); UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_STACK_GET(value_2, 1); UOP_STACK_ADJUST(-1); UOP_GET_FAST(tmp, 2); UOP_STORE_FAST(2, value_2); UOP_XDECREF(tmp);
```

Stack-effect elimination

```c
PyObject *value_0, *value_1, *left, *right, *sum, *value_2, *tmp;
// LOAD_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(value_0, 0); UOP_STACK_SET(0, value_0); UOP_INCREF(value_0);
// LOAD_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(value_1, 1); UOP_STACK_SET(-1, value_1); UOP_INCREF(value_1);
// BINARY_OP:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_STACK_GET(left, 0); UOP_STACK_GET(right, -1); UOP_BINARY_OP(sum, NB_ADD, left, right); UOP_STACK_SET(0, sum); UOP_DECREF(left); UOP_DECREF(right); UOP_ERROR_IF_NULL(sum); UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_STACK_GET(value_2, 0); UOP_GET_FAST(tmp, 2); UOP_STORE_FAST(2, value_2); UOP_XDECREF(tmp);
```

Stack elimination

```c
PyObject *value_0, *value_1, *sum, *tmp;
// LOAD_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(value_0, 0); UOP_INCREF(value_0);
// LOAD_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(value_1, 1); UOP_INCREF(value_1);
// BINARY_OP:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_BINARY_OP(sum, NB_ADD, value_0, value_1); UOP_DECREF(value_0); UOP_DECREF(value_1); UOP_ERROR_IF_NULL(sum); UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(tmp, 2); UOP_STORE_FAST(2, sum); UOP_XDECREF(tmp);
```

Refcount elimination

```c
PyObject *value_0, *value_1, *sum, *tmp;
// LOAD_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(value_0, 0);
// LOAD_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(value_1, 1);
// BINARY_OP:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_BINARY_OP(sum, NB_ADD, value_0, value_1); UOP_ERROR_IF_NULL(sum); UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(tmp, 2); UOP_STORE_FAST(2, sum); UOP_XDECREF(tmp);
```

Frame write elimination

```c
PyObject *value_0, *value_1, *sum, *tmp;
// LOAD_FAST:
UOP_JUMP(1); UOP_GET_FAST(value_0, 0);
// LOAD_FAST:
UOP_JUMP(1); UOP_GET_FAST(value_1, 1);
// BINARY_OP:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_BINARY_OP(sum, NB_ADD, value_0, value_1); UOP_ERROR_IF_NULL(sum); UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(tmp, 2); UOP_STORE_FAST(2, sum); UOP_XDECREF(tmp);
```

Jump elimination

```c
PyObject *value_0, *value_1, *sum, *tmp;
// LOAD_FAST:
UOP_GET_FAST(value_0, 0);
// LOAD_FAST:
UOP_GET_FAST(value_1, 1);
// BINARY_OP:
UOP_JUMP(3); UOP_WRITE_PREV_INSTR(); UOP_BINARY_OP(sum, NB_ADD, value_0, value_1); UOP_ERROR_IF_NULL(sum);
// STORE_FAST:
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP + 1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(tmp, 2); UOP_STORE_FAST(2, sum); UOP_XDECREF(tmp);
```

Temporary variable elimination

```c
PyObject *sum, *tmp;
// LOAD_FAST:
// LOAD_FAST:
// BINARY_OP:
UOP_JUMP(3); UOP_WRITE_PREV_INSTR(); UOP_BINARY_OP(sum, NB_ADD, localsplus[0], localsplus[1]); UOP_ERROR_IF_NULL(sum);
// STORE_FAST:
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP + 1); UOP_WRITE_PREV_INSTR(); UOP_GET_FAST(tmp, 2); UOP_STORE_FAST(2, sum); UOP_XDECREF(tmp);
```

Or, if we know `localsplus[2]` (`a`) is already `NULL`:

```c
// LOAD_FAST:
// LOAD_FAST:
// BINARY_OP:
UOP_JUMP(3); UOP_WRITE_PREV_INSTR(); UOP_BINARY_OP(localsplus[2], NB_ADD, localsplus[0], localsplus[1]); UOP_ERROR_IF_NULL(localsplus[2]);
// STORE_FAST:
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP + 1);
```

The steps for analyzing and optimizing these uops are simple and intuitive to me, and hopefully to others. I tried working through the same example with a uop specification closer to what you suggest, and each step seems so much more difficult. I won't straw-man by sharing my attempts, but it seems like it requires either significantly increasing the number of uops, emitting a final result that's less optimized than the one I worked out above, or adding significant complexity to the various passes to handle all of the different possible uops (rather than a tiny subset specific to that pass).

brandtbucher commented 2 years ago

From your proof of concept, LOAD_FAST is defined as:

Just a minor note: this looks like a typo, since it includes both old and new code. Maybe copied-and-pasted from a unified diff of the changes to LOAD_FAST_CHECK?

gvanrossum commented 2 years ago

Just a minor note: this looks like a typo, since it includes both old and new code. Maybe copied-and-pasted from a unified diff of the changes to LOAD_FAST_CHECK?

Yeah, Mark was definitely talking about LOAD_FAST_CHECK, which has this definition:

        TARGET(LOAD_FAST_CHECK) {
            // TODO
            PyObject *value;
            UOP_JUMP(1);
            UOP_WRITE_PREV_INSTR();
            UOP_UPDATE_STATS();
            UOP_GET_FAST(value, oparg);
            if (value == NULL) {
                goto unbound_local_error;
            }
            UOP_STACK_ADJUST(1);
            UOP_STACK_SET(1, value);
            UOP_INCREF(value);
            UOP_NEXT_OPCODE();
            UOP_NEXT_OPARG();
            UOP_LLTRACE();
            UOP_CHECK_TRACING();
            UOP_DISPATCH();
        }

I do find this distractingly verbose. It is also lacking a bit more explanation (e.g. why does every case start with a jump?). It seems that you've eliminated the next_instr++ implied by TARGET() (in the past this was part of the instruction decoding, word = *next_instr++, etc.).

I am still trying to get my head around the rest of the various proposals; I find Mark's idea, closer to a proper IR, attractive for its compactness. I think the ease of implementation of the various optimization steps is key to which way we go.

Looking at Mark's first comment again, he writes

GET_LOCAL ( -- obj) {
    obj = frame->f_localsplus[oparg];
}

What seems almost hidden here behind the near-C syntax is that this pushes obj onto the stack. Let's explore this in more depth.
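
For instance, it would be the generator that materializes the implicit push, roughly along these lines (a sketch only, using ceval.c's stack_pointer; the exact spelling is illustrative):

```c
{
    PyObject *obj;                        // declared output of GET_LOCAL ( -- obj)
    obj = frame->f_localsplus[oparg];     // the body as written above
    *stack_pointer++ = obj;               // the ( -- obj) effect, applied by the generator
}
```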

brandtbucher commented 2 years ago

I do find this distractingly verbose. It is also lacking a bit more explanation (e.g. why does every case start with a jump?). It seems that you've eliminated the next_instr++ implied by TARGET() (in the past this was part of the instruction decoding, word = *next_instr++, etc.).

Yeah, my branch exposes these parts which are currently tucked away in the TARGET and various DISPATCH macros. They would likely be generated, not specified by us. So we can just ignore them.

The real body of the instruction is just:

PyObject *value;
UOP_GET_FAST(value, oparg);
if (value == NULL) {
    goto unbound_local_error;
}
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);

The if ... goto would probably be a single uop too, like Mark's CHECK_UNBOUND_LOCAL. I just haven't gotten to it yet. So more like:

PyObject *value;
UOP_GET_FAST(value, oparg);
UOP_CHECK_UNBOUND_LOCAL(value);
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);

gvanrossum commented 2 years ago

So how would the tooling for generating the eval loop work? I imagine there would be a file containing blocks like this (one block per opcode):

// opcode: LOAD_FAST_CHECK
PyObject *value;
UOP_GET_FAST(value, oparg);
UOP_CHECK_UNBOUND_LOCAL(value);
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);

Then the tool would read that file and generate the switch block for the eval loop in another file, which would then be #-included into ceval.c instead of the current sequence of cases, right?

So far, so good. In order to do the optimization stages you describe (very nicely) in your comment above we'd need an in-memory representation of the sequence of uops. I guess this could be somewhat similar to the blocks of instructions we currently use in the bytecode compiler. We'd have to have a way to reference variables like value; probably each variable would just be assigned a number. Then concatenating blocks of uops together would require renumbering the variables in each block consistently (I think this is what you do in your "SSA-ification" step).

And then what? Once you've reached the final optimized form you have to generate machine code (uops look terrible for an interpreter due to the dispatch overhead). So if we're going any of these routes (this applies to Mark's proposal as well) we probably should first think about our story for generating machine code.

Until then, UOPS (even the shortened form you give, augmented with more traditional TARGET() and DISPATCH() macros) look like a readability loss -- I'd much rather read the original code than the uops version, which isn't readable at all until I've memorized most of the uop definitions.

The benefit of being able to derive the stack effect and similar things just doesn't warrant the cost of the uops migration to me. So maybe we should not be so hasty to convert to uops before tackling eval loop generation.

(I have some thoughts about Mark's IR too, I'm putting those in a separate comment.)

gvanrossum commented 2 years ago

Regarding Mark's IR ideas, I was trying to see how it compares for Brandt's example of a = b + c but I got stuck on the proper definition of STORE_FAST.

As a refresher, LOAD_FAST_CHECK is defined like this (I'm laying it out vertically for readability):

LOAD_FAST_CHECK:
    GET_LOCAL
    CHECK_UNBOUND_LOCAL
    INCREF_TOS

There is no explicit PUSH uop here: GET_LOCAL pushes the local indicated by oparg on top of the stack. (OT: Why oh why did we rename LOAD_FAST to LOAD_FAST_CHECK, introducing a new LOAD_FAST that doesn't check, rather than keeping the definition of LOAD_FAST unchanged and adding LOAD_FAST_NOCHECK? That would have spared us some confusion here.)

Anyway, I was thinking about what STORE_FAST should look like in Mark's IR. He didn't give any examples that store anything so I tried this:

STORE_FAST:
    XDECREF_LOCAL
    SET_LOCAL

with IR definitions

XDECREF_LOCAL () {
    Py_XDECREF(frame->f_localsplus[oparg]);
}
SET_LOCAL (obj --) {
    frame->f_localsplus[oparg] = obj;
}

Alas, this is neither elegant, nor symmetric with the LOAD form, nor correct. (The XDECREF should happen after the store because decref operations may call out to Python code, which may access the frame and could find a pointer to freed memory.)

Second try:

STORE_FAST:
    GET_LOCAL
    SWAP
    SET_LOCAL
    XDECREF_TOS
    POP

with IR definitions:

XDECREF_TOS (obj -- obj) {
    Py_XDECREF(obj);
}
POP (obj --) {
}
SWAP (a b -- c d) {
    PyObject *tmp = b;
    d = a;
    c = tmp;
}

I don't like this definition of SWAP much. I tried to make it more elegant (note that the stack top is on the right, so b is the top of the input stack and d is the top of the output stack):

SWAP (a b -- c d) {
    c = b;
    d = a;
}

But that still requires some kind of implicit temporary, else the c = b assignment would overwrite a before it is read.

And even if we had a satisfactory SWAP, we'd still be stuck with the fact that we're loading the old variable on top of the stack just so we can decref it. This seems to require a lot of optimization effort before it's even equivalent to the current hand-written STORE_FAST definition.

So something is still missing. :-( I'm guessing this is why Brandt didn't want to post his attempts.

But we should look at MLIR and see what they are doing. Maybe we can learn from it. We should also look at the Cinder IRs (they have two, hir and lir).

brandtbucher commented 2 years ago

But we should look at MLIR and see what they are doing. Maybe we can learn from it.

I watched a presentation of theirs from 2020 earlier this week, and it appears on the surface that their framework is sort of overkill for what we're doing here. I guess if we wanted to, we would define our own dialect that maps to instructions/uops and define how to optimize and lower that dialect to LLVM IR. But I figure that at that point, if we want LLVM to be our backend, it may just be easier to optimize the uops ourselves and define a direct translation to LLVM IR.

Also, C++... :(

We should also look at the Cinder IRs (they have two, hir and lir).

Yeah, trycinder.com is great for exploring this. I preloaded that link with our little example, so you can see all of the passes they perform on it.

It appears that they do replace the stack with an SSA form almost immediately. Also interesting is that they include a pass to insert refcounts to IR that initially ignores them, rather than removing incref/decref pairs where possible. If we are defining our opcodes as a sequence of uops, the latter approach seems like it makes more sense, since we would just have that information anyways when we start compiling a trace. It would be interesting to learn why they went that route instead.

gvanrossum commented 2 years ago

it appears on the surface that their framework is sort of overkill for what we're doing here.

Yeah, we should probably not try to use MLIR directly. But we should understand it well enough to be either inspired by it to do something similar ourselves, or able to reject their approach confidently. Right now we can do neither because we don't understand their approach well enough. (At least I don't, although I've dug into their tutorial briefly.)

Yeah, trycinder.com is great for exploring this.

Oh, cool. I looked at Cinder's Hir source code a bit and it definitely looks like something we should understand better before we design our own.

carljm commented 2 years ago

Yeah, trycinder.com is great for exploring this.

I was just coming here to say this! Glad you're already aware of it.

It appears that they do replace the stack with an SSA form almost immediately. Also interesting is that they include a pass to insert refcounts to IR that initially ignores them, rather than removing incref/decref pairs where possible. If we are defining our opcodes as a sequence of uops, the latter approach seems like it makes more sense, since we would just have that information anyways when we start compiling a trace. It would be interesting to learn why they went that route instead.

We were faced with the task of translating Python bytecode (which hides refcount operations) into our HIR (where we wanted to expose refcount operations), so our choices were "manually handle refcounting in the translation of each Python opcode, and then later try to optimize some of them out" or "encode the externally visible refcount semantics of each HIR opcode (what references does it borrow/steal/create) and then have a pass to insert the minimal refcount operations." The latter approach seemed less error-prone since it avoids manual coding of refcounting operations, which is easy to get wrong.
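
To illustrate the idea (this is not Cinder's actual API, just a sketch of "declare the refcount semantics per opcode and let a pass insert the operations"; all names are made up):

```c
// Each IR opcode declares how it treats references; a later pass walks the IR
// and inserts only the Py_INCREF/Py_DECREF calls needed to satisfy these
// declarations.
typedef enum { REF_NONE, REF_BORROW, REF_STEAL, REF_CREATE } RefEffect;

typedef struct {
    const char *name;
    RefEffect operands[2];   // what happens to each popped input
    RefEffect output;        // what kind of reference the result is
} OpRefInfo;

static const OpRefInfo ref_info[] = {
    {"LoadLocal",  {REF_NONE,   REF_NONE},   REF_BORROW},  // borrows from the frame
    {"BinaryAdd",  {REF_BORROW, REF_BORROW}, REF_CREATE},  // returns a new reference
    {"StoreLocal", {REF_STEAL,  REF_NONE},   REF_NONE},    // consumes the stored reference
};
```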

swtaarrs commented 2 years ago

I'd love to write/chat more about our refcount insertion pass later, but quickly for now: we have a human-readable design doc for it here. It covers implementation rather than motivation, but what Carl said about making it harder to mess up refcounting operations in HIR pretty much covers it. Having worked on another JIT that went the other way, it's really nice having the freedom to optimize and otherwise manipulate HIR without having to worry about refcount operations.

In general, if you run `find Jit/ -name "*.md"` there are a few other high-level docs, including one about our type system and one about our deoptimization code (aka OSR/side-exits).

brandtbucher commented 2 years ago

And then what? Once you've reached the final optimized form you have to generate machine code (uops look terrible for an interpreter due to the dispatch overhead). So if we're going any of these routes (this applies to Mark's proposal as well) we probably should first think about our story for generating machine code.

Okay, let's think about it! You know I can't resist a good prototyping opportunity. ;)

Using an awesome x86-64 assembler library I found called PeachPy, I've created a tool that can compile sequences of the uops we've been using here. The compiler itself is just 100 lines of Python, and it's not too difficult to follow once you've gotten used to PeachPy's way of doing things:

https://github.com/brandtbucher/cpython/blob/uop-compiler/Tools/uop_compiler/uop_compiler.py

Here are some examples of its output for different forms of the uops for a = b + c:

Assuming a, b, and c are all bound:

```asm
MOV r10, rdi ; _PyInterpreterFrame *
MOV r11, [r10 + 56]
ADD r11, 2
ADD r11, 6
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
PUSH r10
MOV rdi, [r10 + 72]
MOV rsi, [r10 + 80]
MOV r10, 94743266116281 ; PyNumber_Add
CALL r10
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
MOV rax, rax
CMP rax, 0
JE error
ADD r11, 4
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
MOV r11, [r10 + 88]
MOV [r10 + 88], rax
DEC [r11]
JNZ end
PUSH r10
MOV rdi, r11
MOV rax, 94743266669343 ; _Py_Dealloc
CALL rax
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
end:
MOV eax, 1
RET
XOR eax, eax
RET
error:
MOV eax, -1
RET
```

Assuming a is unbound and b and c are bound:

```asm
MOV r10, rdi ; _PyInterpreterFrame *
MOV r11, [r10 + 56]
ADD r11, 2
ADD r11, 6
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
PUSH r10
MOV rdi, [r10 + 72]
MOV rsi, [r10 + 80]
MOV r10, 94743266116281 ; PyNumber_Add
CALL r10
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
MOV [r10 + 88], rax
CMP [r10 + 88], 0
JE error
ADD r11, 4
MOV eax, 1
RET
XOR eax, eax
RET
error:
MOV eax, -1
RET
```

Assuming a, b, and c are all bound, specialized for strings:

```asm
MOV r10, rdi ; _PyInterpreterFrame *
MOV r11, [r10 + 56]
ADD r11, 2
ADD r11, 6
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
MOV r11, [r10 + 72]
MOV r9, 94743270261024 ; &PyUnicode_Type
CMP [r11 + 8], r9
JNE deopt
MOV r11, [r10 + 80]
MOV r9, 94743270261024 ; &PyUnicode_Type
CMP [r11 + 8], r9
JNE deopt
PUSH r10
MOV rdi, [r10 + 72]
MOV rsi, [r10 + 80]
MOV r10, 94743266994542 ; PyUnicode_Concat
CALL r10
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
MOV rax, rax
CMP rax, 0
JE error
ADD r11, 4
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
MOV r11, [r10 + 88]
MOV [r10 + 88], rax
DEC [r11]
JNZ end
PUSH r10
MOV rdi, r11
MOV rax, 94743266669343 ; _Py_Dealloc
CALL rax
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
end:
MOV eax, 1
RET
deopt:
XOR eax, eax
RET
error:
MOV eax, -1
RET
```

Assuming a is unbound and b and c are bound, specialized for strings:

```asm
MOV r10, rdi ; _PyInterpreterFrame *
MOV r11, [r10 + 56]
ADD r11, 2
ADD r11, 6
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
MOV r11, [r10 + 72]
MOV r9, 94743270261024 ; &PyUnicode_Type
CMP [r11 + 8], r9
JNE deopt
MOV r11, [r10 + 80]
MOV r9, 94743270261024 ; &PyUnicode_Type
CMP [r11 + 8], r9
JNE deopt
PUSH r10
MOV rdi, [r10 + 72]
MOV rsi, [r10 + 80]
MOV r10, 94743266994542 ; PyUnicode_Concat
CALL r10
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
MOV [r10 + 88], rax
CMP [r10 + 88], 0
JE error
ADD r11, 4
MOV eax, 1
RET
deopt:
XOR eax, eax
RET
error:
MOV eax, -1
RET
```

If you check out the link, you can see that this machine code actually works! I've set up a little harness that lets you specify a sequence of uops, compile them to assembly, load the assembly, and run it within the context of arbitrary frames.

Obviously there's room for improvement and many unanswered questions, but I think this does a good job of answering the question of "then what?" by completing the story of how different specialized forms of our little a = b + c example can be lowered from bytecode, through uops, into machine code without too much difficulty.

gvanrossum commented 2 years ago

Yup, I see that clearly now.

So the main question is really whether uops are the right abstraction for optimizations. From what I've learned so far about Cinder's Jit, in a sense they have their own version of this, HIR (high-level IR), but they don't translate HIR directly into machine code. Instead, they have a few stages where they transform the HIR into more optimal HIR, before they eventually generate machine code. (Their LIR is just an abstraction so they can target different hardware architectures easily).

So now the next question is, do uops lend themselves to the same optimizations as HIR, or are there things that are easier to do with a representation like HIR? I don't know, but I worry that uops will mostly be useful to eliminate the instruction dispatch overhead, which IIRC is 20-30%, whereas Cinder claims speedups from 1.5-4x.

IIUC, HIR uses a register machine (with unlimited registers) and they have an elegant way of translating bytecode to HIR -- a form of symbolic interpretation where the evaluation stack is replaced by a stack that tells the builder in which register a given stack value referenced by the bytecode lives. The resulting register-based IR feels more powerful as a basis for optimizing transformations than an IR that is a sequence of uops that manipulate the same stack machine as our bytecode.
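
A rough sketch of that translation scheme (names invented for the example, not Cinder's code): the builder keeps an "abstract stack" of virtual-register ids, so the bytecode's pushes and pops never touch real memory and the emitted IR is register-based.

```c
#include <stdint.h>

typedef int32_t VReg;            // virtual register id; unlimited supply

typedef struct {
    VReg stack[256];             // abstract stack: which register holds each slot
    int depth;
    VReg next_vreg;
} Builder;

// Placeholder for "append an instruction to the IR"; signature is illustrative.
extern void emit_ir(Builder *b, const char *op, VReg dst, VReg arg0, VReg arg1);

// LOAD_FAST i  ->  "vN = LoadLocal i"; push vN on the abstract stack.
static VReg build_load_fast(Builder *b, int oparg) {
    VReg dst = b->next_vreg++;
    emit_ir(b, "LoadLocal", dst, oparg, -1);
    b->stack[b->depth++] = dst;
    return dst;
}

// BINARY_OP(add)  ->  "vN = BinaryAdd vL, vR"; pops two registers, pushes one.
static VReg build_binary_add(Builder *b) {
    VReg rhs = b->stack[--b->depth];
    VReg lhs = b->stack[--b->depth];
    VReg dst = b->next_vreg++;
    emit_ir(b, "BinaryAdd", dst, lhs, rhs);
    b->stack[b->depth++] = dst;
    return dst;
}
```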

What am I missing? -- I'm sure there's not one thing, I'm just trying to imagine what's the best approach for optimization passes. My ambition is to eventually (around Python 3.14?) have a JIT in CPython that makes it unnecessary for Cinder to maintain their own JIT.

markshannon commented 2 years ago

The resulting register-based IR feels more powerful as a basis for optimizing transformations than an IR that is a sequence of uops that manipulate the same stack machine as our bytecode.

I disagree. Stack-based instructions are much more easily composed than register-based ones. Forth and Joy are called concatenative programming languages because you can simply concatenate words to create larger ones. This is ideal for a trace-based optimizer as it can rapidly concatenate micro-ops to produce the IR.
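
As a sketch of what "simply concatenate" could mean in practice (all names here are hypothetical): each bytecode instruction maps to a fixed list of micro-ops, and building the IR for a trace is just appending those lists in order.

```c
#include <stdint.h>

typedef struct { uint16_t uopcode; uint16_t uoparg; } UOp;
typedef struct { const UOp *uops; int count; } Expansion;

extern const Expansion expansions[256];   // one fixed micro-op list per bytecode opcode

// Append the expansion of each bytecode in the selected region to the trace
// buffer. Returns the number of micro-ops written, or -1 if the buffer fills up.
static int
concatenate_trace(const uint8_t *opcodes, const uint8_t *opargs, int len,
                  UOp *out, int max_out)
{
    int n = 0;
    for (int i = 0; i < len; i++) {
        Expansion e = expansions[opcodes[i]];
        if (n + e.count > max_out) {
            return -1;
        }
        for (int j = 0; j < e.count; j++) {
            out[n] = e.uops[j];
            out[n].uoparg = opargs[i];    // simplification: every micro-op inherits the oparg
            n++;
        }
    }
    return n;
}
```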

I'm just trying to imagine what's the best approach for optimization passes

I'm not saying that this is the best approach, but it works:

  1. Region selection (short traces for tier 2)
  2. Guard elimination and strengthening
  3. Simple redundancy elimination
  4. Object allocation sinking and unboxing
  5. Reference count sinking
  6. Lowering
  7. Instruction selection and register allocation
  8. Machine code generation

An important advantage of using high-level "micro ops" is that the results of passes 1-4 can be executed by an interpreter, meaning that we can test the earlier passes independently and portably.

Depending on the IR, we might want to swap 5. and 6.

For passes 1-4 we want a quite high level IR, possibly a bit higher level than https://github.com/faster-cpython/ideas/issues/454#issuecomment-1231480990

gvanrossum commented 2 years ago

Thinking aloud as always... (There's a question at the end.)

Click to expand my thinking aloud:

**1. Region selection**

This looks like it takes a number of bytecodes, right? So once it's decided (based on various counters?) that a sequence of bytecodes together form a suitable region, we can create an array of uops by concatenating the lists of uops for each of those bytecodes. Easy-peasy.

**2. Guard elimination and strengthening**

I think I understand guard elimination -- this would take guards that are repeated and elide all but the first one. For example, if the Python code for a trace contains something like `obj.foo + obj.bar` and we've specialized both attribute accesses to `LOAD_ATTR_INSTANCE_VALUE`, there are two guards that verify that `obj` still has the same version, but the second one is redundant. (Or is it? If the GIL were dropped between the two instructions another thread might mess this up... We can probably safely assume that the GIL is held for the duration of a trace, at least as long as there is a GIL.) I'm guessing that guard strengthening refers to two guards that aren't identical but where one guard implies another, and we replace this by the stronger of the two. Both these optimizations require that we can figure out that two guards refer to the same object (`ob1.foo + ob2.bar` should not get the second guard removed). I guess to verify this we'll have to look at the LOAD_FAST (or equivalent) instructions that put the values on the stack, in particular the most recent `UOP_GET_FAST()`. Looking at Brandt's prototype we'd actually have to look for `UOP_GET_FAST(value, oparg)`, `UOP_STACK_ADJUST(1)`, `UOP_STACK_SET(value)` sequences, more or less. I can see that it would be easier to verify this if the uops were slightly higher level, like Mark's single `GET_LOCAL()` uop.

**3. Simple redundancy elimination**

Not sure what that is; I'm guessing other redundant ops like repeatedly loading the same variable can be replaced by loading it once and then using `DUP()`?

**4. Object allocation sinking and unboxing**

I've never heard of allocation sinking -- I'm guessing it's [this](http://wiki.luajit.org/Allocation-Sinking-Optimization#examples_motivating-example-again), so maybe this could be done for certain tuples (e.g. `for x, y in points: ...`). Unboxing I get.

I'll defer thinking about the remaining points. But I have a question: what does it mean that these forms can be executed by an interpreter? That would only be possible if said interpreter did dispatch on every micro-operation, which sounds like it would be quite a lot of overhead. Sure, we can write such an interpreter for testing, but it wouldn't be used for actual user code execution, would it? What am I missing here? Also, designing an instruction format for uops would be a challenge -- right now they are glorified macros, some of which need additional info (temp variables especially) that would have to be encoded in the uop instruction format.

gvanrossum commented 2 years ago

Off-line we established that Mark really does intend to interpret the uops. I assume that's why he doesn't want the uops to be quite so granular as in Brandt's prototype, because the instruction decoding overhead would kill any benefits we could get from optimization. It also gives a good reason for sticking with the current stack machine -- deoptimization can then just fall back on the bytecode.

On further thought I think the instruction format isn't a big issue since the uops (in this vision) will never be written to disk -- we can just have an array of structs containing the unpacked args to the uops, so we can write a straightforward decoding loop. Inline caches would just use extra fields in those structs. We'd lose a bit of memory locality but we'd otherwise reduce interpreter overhead which seems important.

```c
pc = ucode;  // start of the micro-code array
while (1) {
    switch (pc->uopcode) {
        case GET_LOCAL:
            *sp++ = frame->f_localsplus[pc->uoparg];
            break;
        ...
        ...
        ...
    }
    pc++;
}
```
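
For concreteness, the element type of such an array might look something like this (a sketch; field names and sizes are just guesses at what the decoding loop above would need):

```c
#include <stdint.h>

typedef struct {
    uint16_t uopcode;   // which micro-op to execute
    uint16_t uoparg;    // local index, stack offset, etc.
    uint32_t cache;     // inline cache slot (version tag, counter, ...) if the uop needs one
} UOpInstruction;
```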

It would be an interesting experiment to try and design some of the optimizer passes for the uop code. E.g. how to do guard elimination? What information is needed in the uops to do that effectively?
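
One possible shape for such a pass, sketched over an array of structs like the one above (GET_LOCAL, TYPE_VERSION_CHECK, and NOP are stand-ins for whatever the real micro-op codes end up being): a type-version guard is redundant if the same local was checked against the same version earlier in the trace, with nothing in between that could run arbitrary Python code.

```c
#include <stdbool.h>
#include <stdint.h>

enum { NOP, GET_LOCAL, TYPE_VERSION_CHECK /* , ... */ };   // stand-in uop codes

// Conservative sketch: only uops known not to call back into Python may sit
// between the two guards, and "same object" is approximated by "same local,
// not rebound in between".
static bool is_pure(uint16_t uopcode) {
    return uopcode == NOP || uopcode == GET_LOCAL || uopcode == TYPE_VERSION_CHECK;
}

static void
eliminate_redundant_guards(UOpInstruction *trace, int n)
{
    for (int i = 0; i < n; i++) {
        if (trace[i].uopcode != TYPE_VERSION_CHECK) {
            continue;
        }
        for (int j = i + 1; j < n && is_pure(trace[j].uopcode); j++) {
            if (trace[j].uopcode == TYPE_VERSION_CHECK &&
                trace[j].uoparg == trace[i].uoparg &&    // guards the same local...
                trace[j].cache == trace[i].cache) {      // ...against the same version
                trace[j].uopcode = NOP;                  // redundant; drop it
            }
        }
    }
}
```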

carljm commented 2 years ago

My ambition is to eventually (around Python 3.14?) have a JIT in CPython that makes it unnecessary for Cinder to maintain their own JIT.

Perhaps this is veering off-topic for this thread, but: I like this goal a lot, and judging by reaccs other folks do too :) The current design directions for the faster-cpython JIT would make this goal quite hard to achieve, though. Until/unless nogil happens, I'm not aware of any way we can efficiently utilize a multi-core server fleet in Python without prefork/multiprocess. And I think, under that execution model, a JIT that has to repeat the same compilation work in every process will have a difficult time competing with one that compiles before fork.

If the faster-cpython JIT were to be a method JIT, then there'd be some possibility of having multiple ways to trigger compilation of a method, either ahead-of-time in the prefork scenario or by call count in other scenarios. If it's a tracing JIT, it's difficult to see how it could be adapted to work efficiently in the prefork scenario at all.

gvanrossum commented 2 years ago

I have to admit I hadn't thought of that until you mentioned it. I guess we will need to plan an AOT interface too to cover that use case. Something that somehow forces an entire function to be compiled when it is first imported (or however Cinder does it, maybe just by calling a "compile this" API with a function or code object).

(PS. I refuse to say "JIT" in this context because of the original meaning of that acronym. :-)

carljm commented 2 years ago

I guess we will need to plan an AOT interface too to cover that use case. Something that somehow forces an entire function to be compiled when it is first imported (or however Cinder does it, maybe just by calling a "compile this" API with a function or code object).

It would be great if this were possible! "How Cinder does it" is with an env var that points to a file that lists function qualnames one per line; there's also a "compile this function now" function available. But I don't think we'd be at all picky about exactly how it works, as long as it is possible. If there's a "compile this function" function, anything else we'd need can be built on that.

(PS. I refuse to say "JIT" in this context because of the original meaning of that acronym. :-)

Fair! I suppose the original meaning (taken literally) could still apply if we observe that in our context "just in time" means "right before we fork" -- anything else would be "too late compilation" :P

gvanrossum commented 1 year ago

I had another idea here. I explained it to Brandt and he said "I like that a lot, go ahead and work on it, just assign the issue to yourself."

The idea is that instead of trying to put the input for the generator in a separate file and add dummy stuff to the top and bottom of that file to make it look like C, we keep the input to the generator in ceval.c, like this:

#ifdef SOME_SYMBOL
    switch (opcode) {
    case NOP:
        // stack-effect: 0
        break;
    ...
    }
#else
#include "generated.h"
#endif

The code generator can write the uglified, optimized code to generated.h, e.g. replacing case NOP: with TARGET(NOP) { and break; with DISPATCH();, adding a '}' where needed.
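
For the NOP block above, the generated output might look roughly like this (a sketch, not the actual generator's output):

```c
/* generated.h */
TARGET(NOP) {
    // stack-effect: 0
    DISPATCH();
}
```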

The code generator could even change the code it generates based on the compiler or platform.

Other things the code generator could do:

Anyways, we've come full circle -- this was one of the earliest ideas (#5); eventually we decided to skip it for 3.11, and now it's back on the 3.12 workflow as "Generate version of the interpreter better suited to emscripten and web assembly".

gvanrossum commented 1 year ago

This could maybe be closed since we now have a working code generator -- though the discussion about the level of uops is still useful.

gvanrossum commented 1 year ago

FWIW, I looked at mark_stacks() and experimented a bit, and I don't think we should strive to generate this function automatically. The special cases are too special to be derived even from extended metadata.

The only thing we might be able to do is, in the default case, to take the separate "popped" and "pushed" counts and ensure that we can actually pop that many values off the stack, rather than just accounting for the delta. E.g. BINARY_OP has a stack effect of -1 but it really pops 2 values and then pushes another, so we could:

But these are completely theoretical; I don't think our bytecode compiler would ever generate such code.

Also, there's an additional caveat: certain operations guarantee that certain stack elements are not changed even though they appear popped and pushed (e.g. LIST_EXTEND does this).

mdboom commented 1 year ago

I removed the 3.12 label for this, because I believe we no longer intend to do this for 3.12.

markshannon commented 7 months ago

We now have micro-ops.