Closed brandtbucher closed 7 months ago
I think the micro-ops you propose are too low level and try to do several different things:
- Define the semantics of the higher-level instructions
- Implement VM features like tracing and recording the prev-instr
- Specify the mechanism for handling the stack, rather than treating it as an abstract stack.

It also requires local variables, which moves it into the space of an IR similar to LLVM's.
From your proof of concept, LOAD_FAST is defined as:
PyObject *value;
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_UPDATE_STATS();
UOP_GET_FAST(value, oparg);
if (value == NULL) {
goto unbound_local_error;
}
Py_INCREF(value);
PUSH(value);
DISPATCH();
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);
UOP_NEXT_OPCODE();
UOP_NEXT_OPARG();
UOP_LLTRACE();
UOP_CHECK_TRACING();
UOP_DISPATCH();
LOAD_FAST does three things: it moves the local variable to the stack, checks whether that value is NULL, and increments its refcount:
LOAD_FAST = GET_LOCAL CHECK_UNBOUND_LOCAL INCREF_TOS ;
We need to bottom out somewhere, and the three primitives GET_LOCAL, CHECK_UNBOUND_LOCAL, and INCREF_TOS seem simple enough already.
GET_LOCAL ( -- obj) {
obj = frame->f_localsplus[oparg];
}
CHECK_UNBOUND_LOCAL(obj -- obj) {
if (obj == NULL) {
goto unbound_local_error;
}
}
INCREF_TOS(obj -- obj) {
Py_INCREF(obj);
}
The above are amenable to optimization.
Handling tracing, dispatching, etc. is the responsibility of the code generator.
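To make that division of labor concrete, here is a rough sketch (plain Python; the table of C bodies and the generated boilerplate are invented for illustration, not the real generator) of how a generator might splice the bodies of the three primitives into a single case, supplying the TARGET/DISPATCH boilerplate itself:

```python
# Hypothetical sketch: expand a concatenative instruction definition
# into a single C case by splicing together the bodies of its uops.
# All names and bodies here are illustrative only.
UOP_BODIES = {
    "GET_LOCAL": "obj = frame->f_localsplus[oparg];",
    "CHECK_UNBOUND_LOCAL": "if (obj == NULL) { goto unbound_local_error; }",
    "INCREF_TOS": "Py_INCREF(obj);",
}

def expand(name, uops):
    lines = [f"TARGET({name}) {{"]
    for uop in uops:
        # The generator, not the uop spec, supplies dispatch/tracing.
        lines.append(f"    {UOP_BODIES[uop]}  // {uop}")
    lines.append("    DISPATCH();")
    lines.append("}")
    return "\n".join(lines)

case = expand("LOAD_FAST", ["GET_LOCAL", "CHECK_UNBOUND_LOCAL", "INCREF_TOS"])
print(case)
```

The point is only that the concatenative form contains everything the generator needs; the surrounding machinery never appears in the specification.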
Here's another example, LOAD_ATTR_INSTANCE_VALUE. LOAD_ATTR_INSTANCE_VALUE needs to check the type version and whether the object has values before extracting the value and finally checking whether it is NULL.
LOAD_ATTR_INSTANCE_VALUE =
SKIP_COUNTER TYPE_VERSION_CHECK INSTANCE_VALUES_CHECK LOAD_VALUE NULL_CHECK INCREF_TOS;
SKIP_COUNTER(#counter) {}
TYPE_VERSION_CHECK(##version obj -- obj) {
DEOPT_IF(version != Py_TYPE(obj)->tp_version);
}
INSTANCE_VALUES_CHECK(obj -- obj) {
assert(tp->tp_dictoffset < 0);
assert(tp->tp_flags & Py_TPFLAGS_MANAGED_DICT);
PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(obj);
DEOPT_IF(!_PyDictOrValues_IsValues(dorv));
}
LOAD_VALUE(#index obj -- attr) {
PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(obj);
attr = _PyDictOrValues_GetValues(dorv)->values[index];
Py_DECREF(obj);
}
NULL_CHECK(obj -- obj) {
DEOPT_IF(obj == NULL);
}
If we are willing to have non-objects on the stack (which we can provided the code generator ensures that they never escape), then we lower the above a bit further:
LOAD_ATTR_INSTANCE_VALUE =
SKIP_COUNTER TYPE_VERSION_CHECK INSTANCE_VALUES_CHECK GET_VALUES_POINTER LOAD_VALUE_FROM_VALUES NULL_CHECK ;
GET_VALUES_POINTER(obj -- dorv: PyDictOrValues) {
dorv = *_PyObject_DictOrValuesPointer(obj);
Py_DECREF(obj);
}
LOAD_VALUE_FROM_VALUES(#index dorv: PyDictOrValues -- obj) {
obj = _PyDictOrValues_GetValues(dorv)->values[index];
}
Note on stack effects: (in0 -- out0 out1) is a stack effect, saying that this operation takes in0 from the stack, and puts out0 and out1 back on the stack. #counter notes that this operation takes a 16-bit code unit from the instruction stream; ##version is a 32-bit value.
It is the job of the code generator to make sure that the stack and instruction stream are unchanged after a DEOPT_IF branch.
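This notation is simple enough to check mechanically. As a sketch (a hypothetical helper, ignoring the typed-value extension like dorv: PyDictOrValues), the net stack effect can be derived straight from a signature:

```python
def net_stack_effect(sig: str) -> int:
    """Net stack effect of a uop signature like "(obj -- obj)"."""
    inputs, outputs = sig.strip("() ").split("--")
    # Items prefixed with # or ## come from the instruction stream,
    # not the stack, so they don't count toward the stack effect.
    popped = [t for t in inputs.split() if not t.startswith("#")]
    pushed = outputs.split()
    return len(pushed) - len(popped)

assert net_stack_effect("( -- obj)") == 1             # GET_LOCAL
assert net_stack_effect("(obj -- obj)") == 0          # CHECK_UNBOUND_LOCAL
assert net_stack_effect("(#index obj -- attr)") == 0  # LOAD_VALUE
```

Summing these per-uop effects over a definition like LOAD_ATTR_INSTANCE_VALUE would give the instruction's overall stack effect for free.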
I think the micro-ops you propose are too low level and try to do several different things:
- Define the semantics of the higher-level instructions
Not accomplishing this first bullet would feel like a huge missed opportunity to me in terms of tooling/testing/maintenance, but I admit that my post probably overstated the value of this relative to other, more important design goals.
- Implement VM features like tracing and recording the prev-instr
- Specify the mechanism for handling the stack, rather than treating it as an abstract stack.
It also requires local variables, which moves it into the space of an IR similar to LLVM's.
I mostly agree with your second bullet: parts of the instruction preamble and dispatch sequence can probably be omitted when actually specifying the uop sequences for generation. As you say, "handling tracing, dispatching, etc are the responsibility of the code generator". I included them in the prototype because, well, I am the code generator in this case. ;)
If you remove some of that fluff, though, our uop specifications are quite similar. As you point out, the major differences are that you want to avoid local temporaries and explicit stack manipulations in favor of using just an abstract stack that is manipulated directly by many diverse uops (similar to our current instruction set). On the surface, those both seem less desirable to me, in terms of both maintenance and usefulness.
As a simple, concrete example, here is how a couple of simple passes might optimize a recorded trace (or superinstruction) for the statement a = b + c. Click to expand:
The steps for analyzing and optimizing these uops are simple and intuitive to me, and hopefully others. I tried working through the same example with a uop specification closer to what you suggest, and each step seems so much more difficult. I won't straw-man by sharing my attempts, but it seems like it requires either significantly increasing the number of uops, emitting a final result that's less optimized than the one I worked out above, or adding significant complexity to the various passes to handle all of the different possible uops (rather than a tiny subset specific to that pass).
From your proof of concept, LOAD_FAST is defined as:
Just a minor note: this looks like a typo, since it includes both old and new code. Maybe copied-and-pasted from a unified diff of the changes to LOAD_FAST_CHECK?
Just a minor note: this looks like a typo, since it includes both old and new code. Maybe copied-and-pasted from a unified diff of the changes to LOAD_FAST_CHECK?
Yeah, Mark was definitely talking about LOAD_FAST_CHECK, which has this definition:
TARGET(LOAD_FAST_CHECK) {
// TODO
PyObject *value;
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_UPDATE_STATS();
UOP_GET_FAST(value, oparg);
if (value == NULL) {
goto unbound_local_error;
}
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);
UOP_NEXT_OPCODE();
UOP_NEXT_OPARG();
UOP_LLTRACE();
UOP_CHECK_TRACING();
UOP_DISPATCH();
}
I do find this distractingly verbose. It is also lacking a bit more explanation (e.g. why does every case start with a jump?). It seems that you've eliminated the next_instr++ implied by TARGET() (in the past this was part of the instruction decoding, word = *next_instr++, etc.).
I am still trying to get my head around the rest of the various proposals; I find Mark's idea, closer to a proper IR, attractive for its compactness. I think the ease of implementation of the various optimization steps is key to which way we go.
Looking at Mark's first comment again, he writes
GET_LOCAL ( -- obj) {
obj = frame->f_localsplus[oparg];
}
What seems almost hidden here behind the near-C syntax is that this pushes obj onto the stack. Let's explore this in more depth.
I do find this distractingly verbose. It is also lacking a bit more explanation (e.g. why does every case start with a jump?). It seems that you've eliminated the next_instr++ implied by TARGET() (in the past this was part of the instruction decoding, word = *next_instr++, etc.).
Yeah, my branch exposes these parts, which are currently tucked away in the TARGET and various DISPATCH macros. They would likely be generated, not specified by us. So we can just ignore them.
The real body of the instruction is just:
PyObject *value;
UOP_GET_FAST(value, oparg);
if (value == NULL) {
goto unbound_local_error;
}
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);
The if ... goto would probably be a single uop too, like Mark's CHECK_UNBOUND_LOCAL. I just haven't gotten to it yet. So more like:
PyObject *value;
UOP_GET_FAST(value, oparg);
UOP_CHECK_UNBOUND_LOCAL(value);
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);
So how would the tooling for generating the eval loop work? I imagine there would be a file containing blocks like this (one block per opcode):
// opcode: LOAD_FAST_CHECK
PyObject *value;
UOP_GET_FAST(value, oparg);
UOP_CHECK_UNBOUND_LOCAL(value);
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);
Then the tool would read that file and generate the switch block for the eval loop in another file, which would then be #-included into ceval.c instead of the current sequence of cases, right?
So far, so good. In order to do the optimization stages you describe (very nicely) in your comment above, we'd need an in-memory representation of the sequence of uops. I guess this could be somewhat similar to the blocks of instructions we currently use in the bytecode compiler. We'd have to have a way to reference variables like value; probably each variable would just be assigned a number. Then concatenating blocks of uops together would require renumbering the variables in each block consistently (I think this is what you do in your "SSA-ification" step).
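That renumbering step could look roughly like this (a toy sketch; the uop and temporary names are made up):

```python
import itertools

counter = itertools.count()

def renumber(block):
    """Rename a block's temporaries (v0, v1, ...) to globally fresh
    names (t0, t1, ...) so concatenated blocks never collide."""
    mapping = {}
    out = []
    for name, args in block:
        renamed = []
        for a in args:
            if isinstance(a, str) and a.startswith("v"):
                if a not in mapping:
                    mapping[a] = f"t{next(counter)}"
                renamed.append(mapping[a])
            else:
                renamed.append(a)
        out.append((name, tuple(renamed)))
    return out

# Two LOAD_FAST-style blocks, each using its own local temp "v0":
load_b = [("UOP_GET_FAST", ("v0", 1)), ("UOP_INCREF", ("v0",))]
load_c = [("UOP_GET_FAST", ("v0", 2)), ("UOP_INCREF", ("v0",))]
trace = renumber(load_b) + renumber(load_c)
# The two blocks' "v0" temporaries now have distinct names.
```

Each block keeps its internal consistency while the concatenated trace gets unique names, which is roughly what a single-assignment form needs.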
And then what? Once you've reached the final optimized form you have to generate machine code (uops look terrible for an interpreter due to the dispatch overhead). So if we're going any of these routes (this applies to Mark's proposal as well) we probably should first think about our story for generating machine code.
Until then, UOPS (even the shortened form you give, augmented with more traditional TARGET() and DISPATCH() macros) look like a readability loss -- I'd much rather read the original code than the uops version, which isn't readable at all until I've memorized most of the uop definitions.
The benefit of being able to derive the stack effect and similar things just doesn't warrant the cost of the uops migration to me. So maybe we should not be so hasty to convert to uops before tackling eval loop generation.
(I have some thoughts about Mark's IR too, I'm putting those in a separate comment.)
Regarding Mark's IR ideas, I was trying to see how it compares for Brandt's example of a = b + c, but I got stuck on the proper definition of STORE_FAST.
As a refresher, LOAD_FAST_CHECK is defined like this (I'm laying it out vertically for readability):
LOAD_FAST_CHECK:
GET_LOCAL
CHECK_UNBOUND_LOCAL
INCREF_TOS
There is no explicit PUSH uop here: GET_LOCAL pushes the local indicated by oparg on top of the stack. (OT: Why oh why did we rename LOAD_FAST to LOAD_FAST_CHECK, introducing a new LOAD_FAST that doesn't check, rather than keeping the definition of LOAD_FAST unchanged and adding LOAD_FAST_NOCHECK? That would have spared us some confusion here.)
Anyway, I was thinking about what STORE_FAST should look like in Mark's IR. He didn't give any examples that store anything, so I tried this:
STORE_FAST:
XDECREF_LOCAL
SET_LOCAL
with IR definitions
XDECREF_LOCAL () {
Py_XDECREF(frame->f_localsplus[oparg]);
}
SET_LOCAL (obj --) {
frame->f_localsplus[oparg] = obj;
}
Alas, this is neither elegant, nor symmetric with the LOAD form, nor correct. (The XDECREF should happen after the store, because decref operations may call out to Python code which may access the frame and could find a pointer to freed memory.)
Second try:
STORE_FAST:
GET_LOCAL
SWAP
SET_LOCAL
XDECREF_TOS
POP
with IR definitions:
XDECREF_TOS (obj -- obj) {
Py_XDECREF(obj);
}
POP (obj --) {
}
SWAP (a b -- c d) {
PyObject *tmp = b;
c = a;
d = tmp;
}
I don't like this definition of SWAP much. I tried to make it more elegant (note that the stack top is on the right, so b is the top of the input stack and d is the top of the output stack):
SWAP (a b -- c d) {
c = b;
d = a;
}
But that still requires some kind of implicit temporary, else the c = b assignment would overwrite a before it is read.
And even if we had a satisfactory SWAP, we'd still be stuck with the fact that we're loading the old variable on top of the stack just so we can decref it. This seems to require a lot of optimization effort before it's even equivalent to the current hand-written STORE_FAST definition.
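As a sanity check, the five-uop expansion can at least be executed on a toy model (invented Python stand-ins for the frame, stack, and refcounts; nothing here is real CPython semantics) to confirm it has the right visible behavior, even if it is inefficient:

```python
def execute(uops, frame, stack, oparg):
    # Interpret the uop names used in this thread on toy objects
    # (dicts with an explicit refcount field).
    for uop in uops:
        if uop == "GET_LOCAL":
            stack.append(frame[oparg])
        elif uop == "SET_LOCAL":
            frame[oparg] = stack.pop()
        elif uop == "SWAP":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif uop == "XDECREF_TOS":
            if stack[-1] is not None:
                stack[-1]["refcount"] -= 1
        elif uop == "POP":
            stack.pop()

old, new = {"refcount": 1}, {"refcount": 1}
frame, stack = [old], [new]   # storing `new` into local slot 0
execute(["GET_LOCAL", "SWAP", "SET_LOCAL", "XDECREF_TOS", "POP"],
        frame, stack, 0)
# frame[0] is now `new`, `old`'s reference was released after the
# store, and the stack is empty again.
```

So the sequence is correct (and decrefs after the store, unlike the first attempt); the objection is that it takes five uops and an extra stack slot to do what the hand-written code does directly.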
So something is still missing. :-( I'm guessing this is why Brandt didn't want to post his attempts.
But we should look at MLIR and see what they are doing. Maybe we can learn from it. We should also look at the Cinder IRs (they have two, hir and lir).
But we should look at MLIR and see what they are doing. Maybe we can learn from it.
I watched a presentation of theirs from 2020 earlier this week, and it appears on the surface that their framework is sort of overkill for what we're doing here. I guess if we wanted to, we would define our own dialect that maps to instructions/uops and define how to optimize and lower that dialect to LLVM IR. But I figure that at that point, if we want LLVM to be our backend, it may just be easier to optimize the uops ourselves and define a direct translation to LLVM IR.
Also, C++... :(
We should also look at the Cinder IRs (they have two, hir and lir).
Yeah, trycinder.com is great for exploring this. I preloaded that link with our little example, so you can see all of the passes they perform on it.
It appears that they do replace the stack with an SSA form almost immediately. Also interesting is that they include a pass to insert refcounts to IR that initially ignores them, rather than removing incref/decref pairs where possible. If we are defining our opcodes as a sequence of uops, the latter approach seems like it makes more sense, since we would just have that information anyways when we start compiling a trace. It would be interesting to learn why they went that route instead.
it appears on the surface that their framework is sort of overkill for what we're doing here.
Yeah, we should probably not try to use MLIR directly. But we should understand it well enough to be either inspired by it to do something similar ourselves, or able to reject their approach confidently. Right now we can do neither because we don't understand their approach well enough. (At least I don't, although I've dug into their tutorial briefly.)
Yeah, trycinder.com is great for exploring this.
Oh, cool. I looked at Cinder's Hir source code a bit and it definitely looks like something we should understand better before we design our own.
Yeah, trycinder.com is great for exploring this.
I was just coming here to say this! Glad you're already aware of it.
It appears that they do replace the stack with an SSA form almost immediately. Also interesting is that they include a pass to insert refcounts to IR that initially ignores them, rather than removing incref/decref pairs where possible. If we are defining our opcodes as a sequence of uops, the latter approach seems like it makes more sense, since we would just have that information anyways when we start compiling a trace. It would be interesting to learn why they went that route instead.
We were faced with the task of translating Python bytecode (which hides refcount operations) into our HIR (where we wanted to expose refcount operations), so our choices were "manually handle refcounting in the translation of each Python opcode, and then later try to optimize some of them out" or "encode the externally visible refcount semantics of each HIR opcode (what references does it borrow/steal/create) and then have a pass to insert the minimal refcount operations." The latter approach seemed less error-prone since it avoids manual coding of refcounting operations, which is easy to get wrong.
I'd love to write/chat more about our refcount insertion pass later, but quickly for now: we have a human-readable design doc for it here. It covers implementation rather than motivation, but what Carl said about making it harder to mess up refcounting operations in HIR pretty much covers it. Having worked on another JIT that went the other way, it's really nice having the freedom to optimize and otherwise manipulate HIR without having to worry about refcount operations.
In general, if you find Jit/ -name "*.md" there are a few other high-level docs, including one about our type system and one about our deoptimization code (aka OSR/side-exits).
And then what? Once you've reached the final optimized form you have to generate machine code (uops look terrible for an interpreter due to the dispatch overhead). So if we're going any of these routes (this applies to Mark's proposal as well) we probably should first think about our story for generating machine code.
Okay, let's think about it! You know I can't resist a good prototyping opportunity. ;)
Using an awesome x86-64 assembler library I found called PeachPy, I've created a tool that can compile sequences of the uops we've been using here. The compiler itself is just 100 lines of Python, and it's not too difficult to follow once you've gotten used to PeachPy's way of doing things:
https://github.com/brandtbucher/cpython/blob/uop-compiler/Tools/uop_compiler/uop_compiler.py
Here are some examples of its output for different forms of the uops for a = b + c:
If you check out the link, you can see that this machine code actually works! I've set up a little harness that lets you specify a sequence of uops, compile them to assembly, load the assembly, and run it within the context of arbitrary frames.
Obviously there's room for improvement and many unanswered questions, but I think this does a good job of answering the question of "then what?" by completing the story of how different specialized forms of our little a = b + c example can be lowered from bytecode, through uops, into machine code without too much difficulty.
Yup, I see that clearly now.
So the main question is really whether uops are the right abstraction for optimizations. From what I've learned so far about Cinder's Jit, in a sense they have their own version of this, HIR (high-level IR), but they don't translate HIR directly into machine code. Instead, they have a few stages where they transform the HIR into more optimal HIR, before they eventually generate machine code. (Their LIR is just an abstraction so they can target different hardware architectures easily).
So now the next question is, do uops lend themselves to the same optimizations as HIR, or are there things that are easier to do with a representation like HIR? I don't know, but I worry that uops will mostly be useful to eliminate the instruction dispatch overhead, which IIRC is 20-30%, whereas Cinder claims speedups from 1.5-4x.
IIUC, HIR uses a register machine (with unlimited registers) and they have an elegant way of translating bytecode to HIR -- a form of symbolic interpretation where the evaluation stack is replaced by a stack that tells the builder in which register a given stack value referenced by the bytecode lives. The resulting register-based IR feels more powerful as a basis for optimizing transformations than an IR that is a sequence of uops that manipulate the same stack machine as our bytecode.
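That symbolic-interpretation trick can be sketched in a few lines (toy opcodes and register names, all invented): keep a virtual stack of register names while walking the bytecode, so every stack reference is resolved to a register at build time.

```python
def to_register_code(bytecode):
    vstack, out, nreg = [], [], 0
    for op, arg in bytecode:
        if op == "LOAD_FAST":
            reg = f"r{nreg}"; nreg += 1
            out.append(("MOVE", reg, f"local[{arg}]"))
            vstack.append(reg)          # the "push" is symbolic
        elif op == "BINARY_ADD":
            rhs, lhs = vstack.pop(), vstack.pop()
            reg = f"r{nreg}"; nreg += 1
            out.append(("ADD", reg, lhs, rhs))
            vstack.append(reg)
        elif op == "STORE_FAST":
            out.append(("MOVE", f"local[{arg}]", vstack.pop()))
    return out

# a = b + c, with b and c in locals 1 and 2, a in local 0:
code = [("LOAD_FAST", 1), ("LOAD_FAST", 2),
        ("BINARY_ADD", None), ("STORE_FAST", 0)]
regcode = to_register_code(code)
```

The evaluation stack disappears entirely from the output; only register-to-register operations remain, which is the property that makes the register form attractive for optimization.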
What am I missing? -- I'm sure there's not one thing, I'm just trying to imagine what's the best approach for optimization passes. My ambition is to eventually (around Python 3.14?) have a JIT in CPython that makes it unnecessary for Cinder to maintain their own JIT.
The resulting register-based IR feels more powerful as a basis for optimizing transformations than an IR that is a sequence of uops that manipulate the same stack machine as our bytecode.
I disagree. Stack-based instructions are much more easily composed than register-based ones. Forth and Joy are called concatenative programming languages because you can simply concatenate words to create larger ones. This is ideal for a trace-based optimizer as it can rapidly concatenate micro-ops to produce the IR.
I'm just trying to imagine what's the best approach for optimization passes
I'm not saying that this is the best approach, but it works:
An important advantage of using high-level "micro ops", is that the results of passes 1-4 can be executed by an interpreter, meaning that we can test the earlier passes independently and portably.
Depending on the IR, we might want to swap 5. and 6.
For passes 1-4 we want a quite high level IR, possibly a bit higher level than https://github.com/faster-cpython/ideas/issues/454#issuecomment-1231480990
Thinking aloud as always... (There's a question at the end.)
Click to expand my thinking aloud:
I'll defer thinking about the remaining points. But I have a question: what does it mean that these forms can be executed by an interpreter? That would only be possible if said interpreter did dispatch on every micro-operation, which sounds like it would be quite a lot of overhead. Sure, we can write such an interpreter for testing, but it wouldn't be used for actual user code execution, would it? What am I missing here? Also, designing an instruction format for uops would be a challenge -- right now they are glorified macros, some of which need additional info (temp variables especially) that would have to be encoded in the uop instruction format.
Off-line we established that Mark really does intend to interpret the uops. I assume that's why he doesn't want the uops to be quite so granular as in Brandt's prototype, because the instruction decoding overhead would kill any benefits we could get from optimization. It also gives a good reason for sticking with the current stack machine -- deoptimization can then just fall back on the bytecode.
On further thought I think the instruction format isn't a big issue since the uops (in this vision) will never be written to disk -- we can just have an array of structs containing the unpacked args to the uops, so we can write a straightforward decoding loop. Inline caches would just use extra fields in those structs. We'd lose a bit of memory locality but we'd otherwise reduce interpreter overhead which seems important.
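Such an array-of-structs format and its decoding loop might be sketched like this (a Python stand-in for what would be C; the field and uop names are invented):

```python
from dataclasses import dataclass

@dataclass
class Uop:
    opcode: str
    oparg: int = 0
    cache: object = None  # room for inline-cache data, e.g. a type version

def run(trace, frame, stack):
    # No bytecode decoding needed: the args were unpacked into the
    # struct when the trace was built.
    for u in trace:
        if u.opcode == "GET_LOCAL":
            stack.append(frame[u.oparg])
        elif u.opcode == "SET_LOCAL":
            frame[u.oparg] = stack.pop()
        # ... one branch (or computed goto, in C) per uop kind.

frame, stack = [10, 20], []
run([Uop("GET_LOCAL", 1), Uop("SET_LOCAL", 0)], frame, stack)
```

The structs are larger than packed code units, hence the memory-locality tradeoff mentioned above, but the dispatch loop needs no shifting or masking at all.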
It would be an interesting experiment to try and design some of the optimizer passes for the uop code. E.g. how to do guard elimination? What information is needed in the uops to do that effectively?
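As a strawman for guard elimination: a pass could drop a guard when an identical guard already holds and nothing in between could invalidate it. A toy sketch (invented uop names; a real pass would need a much more careful notion of which operations can invalidate which guards):

```python
GUARDS = {"TYPE_VERSION_CHECK"}
INVALIDATES = {"CALL", "STORE_ATTR"}  # ops that may run arbitrary code

def eliminate_guards(trace):
    known, out = set(), []
    for op, *args in trace:
        if op in GUARDS:
            if (op, *args) in known:
                continue              # identical guard already holds
            known.add((op, *args))
        elif op in INVALIDATES:
            known.clear()             # conservatively forget everything
        out.append((op, *args))
    return out

trace = [("TYPE_VERSION_CHECK", "v0", 1234),
         ("LOAD_VALUE", "v0", 0),
         ("TYPE_VERSION_CHECK", "v0", 1234),   # redundant
         ("LOAD_VALUE", "v0", 1)]
optimized = eliminate_guards(trace)
```

This suggests what information the uops would need to carry: which value a guard protects, what it checks, and which uops can invalidate previously established facts.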
My ambition is to eventually (around Python 3.14?) have a JIT in CPython that makes it unnecessary for Cinder to maintain their own JIT.
Perhaps this is veering off-topic for this thread, but: I like this goal a lot, and judging by reaccs other folks do too :) The current design directions for the faster-cpython JIT would make this goal quite hard to achieve, though. Until/unless nogil happens, I'm not aware of any way we can efficiently utilize a multi-core server fleet in Python without prefork/multiprocess. And I think, under that execution model, a JIT that has to repeat the same compilation work in every process will have a difficult time competing with one that compiles before fork.
If the faster-cpython JIT were to be a method JIT, then there'd be some possibility of having multiple ways to trigger compilation of a method, either ahead-of-time in the prefork scenario or by call count in other scenarios. If it's a tracing JIT, it's difficult to see how it could be adapted to work efficiently in the prefork scenario at all.
I have to admit I hadn't thought of that until you mentioned it. I guess we will need to plan an AOT interface too to cover that use case. Something that somehow forces an entire function to be compiled when it is first imported (or however Cinder does it, maybe just by calling a "compile this" API with a function or code object).
(PS. I refuse to say "JIT" in this context because of the original meaning of that acronym. :-)
I guess we will need to plan an AOT interface too to cover that use case. Something that somehow forces an entire function to be compiled when it is first imported (or however Cinder does it, maybe just by calling a "compile this" API with a function or code object).
It would be great if this were possible! "How Cinder does it" is with an env var that points to a file that lists function qualnames one per line; there's also a "compile this function now" function available. But I don't think we'd be at all picky about exactly how it works, as long as it is possible. If there's a "compile this function" function, anything else we'd need can be built on that.
(PS. I refuse to say "JIT" in this context because of the original meaning of that acronym. :-)
Fair! I suppose the original meaning (taken literally) could still apply if we observe that in our context "just in time" means "right before we fork" -- anything else would be "too late compilation" :P
I had another idea here. I explained it to Brandt and he said "I like that a lot, go ahead and work on it, just assign the issue to yourself."
The idea is that instead of trying to put the input for the generator in a separate file and add dummy stuff to the top and bottom of that file to make it look like C, we keep the input to the generator in ceval.c, like this:
#ifdef SOME_SYMBOL
switch (opcode) {
case NOP:
// stack-effect: 0
break;
...
}
#else
#include "generated.h"
#endif
The code generator can write the uglified, optimized code to generated.h, e.g. replacing case NOP: with TARGET(NOP) { and break; with DISPATCH();, adding a '}' where needed.
The code generator could even change the code it generates based on the compiler or platform.
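The rewriting itself could be as small as a couple of substitutions (a sketch only; a real generator would parse the C rather than lean on regexes):

```python
import re

def uglify(src: str) -> str:
    # Rewrite the readable switch cases into the generated form.
    src = re.sub(r"case (\w+):", r"TARGET(\1) {", src)
    src = re.sub(r"\bbreak;", "DISPATCH();\n        }", src)
    return src

plain = """\
        case NOP:
            // stack-effect: 0
            break;
"""
generated = uglify(plain)
print(generated)
```

Because the input lives in ceval.c behind #ifdef SOME_SYMBOL, it stays compilable and debuggable on its own, while generated.h carries the optimized form.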
Other things the code generator could do:
Anyways, we've come full circle -- this was one of the earliest ideas (#5), eventually we decided to skip it for 3.11, now it's back on the 3.12 workflow as "Generate version of the interpreter better suited to emscripten and web assembly".
This could maybe be closed since we now have a working code generator -- though the discussion about the level of uops is still useful.
FWIW, I looked at mark_stacks() and experimented a bit, and I don't think we should strive to generate this function automatically. The special cases are too special to be derived even from extended metadata.
The only thing we might be able to do is, in the default case, to take the separate "popped" and "pushed" counts and ensure that we can actually pop that many values off the stack, rather than just accounting for the delta. E.g. BINARY_OP has a stack effect of -1 but it really pops 2 values and then pushes another, so we could:
- assert() that the stack depth is at least 2
- check that if the stack before this instruction is [Lasti, Exception], after this instruction we should see [Object], not [Lasti]

But these are completely theoretical, I don't think our bytecode compiler would ever generate such code.
Also, there's an additional caveat: certain operations guarantee that certain stack elements are not changed even though they appear popped and pushed (e.g. LIST_EXTEND does this).
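The popped/pushed idea can be sketched as a tiny checker (illustrative opcode table, not the real metadata):

```python
POPPED_PUSHED = {
    "LOAD_FAST": (0, 1),
    "BINARY_OP": (2, 1),   # stack effect -1, but pops 2 / pushes 1
    "STORE_FAST": (1, 0),
}

def simulate(code, depth=0):
    for op in code:
        popped, pushed = POPPED_PUSHED[op]
        # A single delta (pushed - popped) could not catch this:
        assert depth >= popped, f"{op} would underflow the stack"
        depth += pushed - popped
    return depth

# a = b + c: two loads, one add, one store; net effect is zero.
assert simulate(["LOAD_FAST", "LOAD_FAST", "BINARY_OP", "STORE_FAST"]) == 0
```

Running BINARY_OP on an empty stack trips the assertion, while a delta-only model would happily report a depth of -1.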
I removed the 3.12 label for this, because I believe we no longer intend to do this for 3.12.
We now have micro-ops.
We've talked about this concept on-and-off for some time now, so I thought I would create this issue to collect some concrete ideas and start trying to reach a consensus on what the best path to take here is. This is a big, high-risk project, so we should probably start discussing it sooner rather than later.
Currently, our instructions are pretty much defined as "whatever the corresponding blob of C code in ceval.c does". While there are several places (mark_stacks, stack_effect, dis and its docs, the various haswhatever lists in opcode.py) that describe aspects of their behavior, they are scattered all over the place and must be kept in sync (by hand) with the actual implementation. In other words, they are a burden just as much as they are a useful resource.

It's also not always very clear (for example) which instructions may push NULL to the stack, or expect NULL on the stack.

And potentially, in the future:

...and, of course, much more.
One way to improve this situation is by ditching the free-form C code and defining our opcodes in terms of structured, statically-analyzable micro-ops (or "uops"). These uops represent lowered, composable units of work, such as incrementing the instruction pointer, decrementing a value's reference count, writing values to the stack, etc. Ideally, they'd represent a sort of IR that bridges the gap between Python and native code.
Last week I put together a simple proof-of-concept that converts much of the common logic in our instructions to a couple dozen initial uops. There is much more to be done, of course, but it does a pretty good job of communicating what this could look like, and many opcodes have been mostly or entirely converted. Note that GitHub collapses the huge ceval.c diff by default, but most of the interesting stuff is there.

Other random thoughts to spark discussion:

mark_stacks and stack_effect, or eval loops with debugging features added or removed.