faster-cpython / ideas


Stitching it all together #621

Open markshannon opened 1 year ago

markshannon commented 1 year ago

We are currently building the region selector, the optimizer, and the components to execute the optimized regions, but we lack an architecture to link the regions together and give us good performance.

The general idea is that once we are executing in tier 2, we should try and stay in tier 2.

So exits should be able to adapt as they become hot.

We solve this problem with the Fundamental theorem of software engineering.

We want something that looks like an executor, but that takes a pointer to the pointer to the executor.

Something like:

typedef struct _PyAdaptiveExecutorObject {
    PyObject_HEAD
    struct _PyInterpreterFrame *(*execute)(struct _PyAdaptiveExecutorObject **ptr, struct _PyInterpreterFrame *frame, PyObject **stack_pointer);
    /* Data needed by the executor goes here (e.g. the hotness counter and the uops used by the stubs below) */
} _PyAdaptiveExecutorObject;

Note the additional indirection. This moves the memory load from the caller to the callee, making it a bit slower, but it allows the callee executor to modify which executor the caller calls next time.
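As a self-contained illustration of the pattern (a toy sketch, not CPython code), here the callee patches the slot it was called through, so later calls through the same slot reach a different function:

#include <stdio.h>

/* Toy demo of the double indirection: the caller holds a pointer to a slot,
 * the slot holds a pointer to an executor, and the callee may overwrite the
 * slot so that the next call through it reaches a different implementation. */
typedef struct executor {
    int (*execute)(struct executor **slot, int arg);
    int counter;
} executor;

static int
hot_execute(struct executor **slot, int arg)
{
    return arg * 2;  /* the "optimized" path */
}

static executor hot = { hot_execute, 0 };

static int
cold_execute_demo(struct executor **slot, int arg)
{
    executor *self = *slot;
    if (++self->counter >= 3) {
        *slot = &hot;  /* patch the caller's slot; this stub is now bypassed */
    }
    return arg * 2;  /* same result, "unoptimized" path */
}

static executor cold = { cold_execute_demo, 0 };

int
main(void)
{
    executor *slot = &cold;
    for (int i = 0; i < 5; i++) {
        const char *which = (slot == &cold) ? "cold" : "hot";
        int result = slot->execute(&slot, i);
        printf("call %d: %s path -> %d\n", i, which, result);
    }
    return 0;
}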

On exiting from tier 2 execution, instead of returning to the tier 1 interpreter, we should jump to an adaptive stub which checks for hotness and dispatches accordingly.

For side exits that have additional micro-ops to execute on the way out, the side-exit stub would look something like this:

_PyInterpreterFrame *
cold_execute(_PyAdaptiveExecutorObject **ptr, _PyInterpreterFrame *frame, PyObject **stack_pointer)
{
    _PyAdaptiveExecutorObject *self = *ptr;
    if (is_hot(self->counter)) {
        /* Hot: compile the attached uops and patch the caller's slot, so the
         * next time this exit is taken we go straight to the new executor. */
        _PyAdaptiveExecutorObject *executor = compile(self->uops);
        *ptr = executor;
        return executor->execute(ptr, frame, stack_pointer);
    }
    else {
        /* Still cold: just interpret the attached uops. */
        return uop_interpret(self->uops, frame, stack_pointer);
    }
}

For an end exit (or a side exit with no attached uops) the code looks like this:

_PyInterpreterFrame *
end_execute(_PyAdaptiveExecutorObject **ptr, _PyInterpreterFrame *frame, PyObject **stack_pointer)
{
    _PyAdaptiveExecutorObject *self = *ptr;
    if (!is_hot(self->counter)) {
        /* Still cold: return to the tier 1 interpreter. */
        return frame;
    }
    _PyAdaptiveExecutorObject *executor;
    if (frame->next_instr->op.code == ENTER_EXECUTOR) {
        /* An executor already exists at the target: link directly to it. */
        executor = frame->f_code->co_executors[frame->next_instr->op.arg];
    }
    else {
        /* No executor yet: ask the optimizer to create one. */
        executor = NULL;
        _PyOptimizerObject *opt = _PyInterpreterState_Get()->optimizer;
        int err = opt->optimize(opt, frame->f_code, frame->next_instr, &executor, stack_pointer);
        if (err < 0) {
            return NULL;
        }
        if (executor == NULL) {
            /* The optimizer declined; stay in tier 1. */
            return frame;
        }
    }
    *ptr = executor;
    return executor->execute(ptr, frame, stack_pointer);
}
gvanrossum commented 1 year ago

(I've filled in some missing detail and corrected some typos.)

This makes sense. The API change (to pass a pointer to a pointer) is simple. The rest can be elaborated over time.

From https://github.com/python/cpython/issues/108866 I understand that you want _PyAdaptiveExecutorObject to replace the current _PyExecutorObject type, right? It wouldn't make much sense to have both types.

Where do you imagine the calls to cold_execute() and end_execute() to originate? Are those new uops that replace EXIT_TRACE? Or is the idea that these are only used by the Tier 2 machine code, perhaps special-casing EXIT_TRACE in the copy-and-patch translator to machine code? (I realize there's some missing nomenclature here, we need to distinguish between the Tier 2 interpreter and the machine code executor.)

markshannon commented 1 year ago

I understand that you want _PyAdaptiveExecutorObject to replace the current _PyExecutorObject type, right? It wouldn't make much sense to have both types.

Yes, _PyAdaptiveExecutorObject is just a rename of _PyExecutorObject, with a few details changed.

Where do you imagine the calls to cold_execute() and end_execute() to originate?

cold_execute and end_execute are the execute functions for the _PyAdaptiveExecutorObjects that are attached to exits: cold_execute for side exits, and end_execute for the final exit.

EXIT_TRACE should jump to the next executor instead of returning. We will need to always compile the tier 2 interpreter in a way that enforces tail calls, though, which might be tricky.
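A minimal sketch of what that could look like, assuming clang's musttail statement attribute (illustration only, not the planned implementation):

/* Sketch only: an exit's handler chains to whatever executor is currently
 * installed in its slot, as a guaranteed tail call. */
_PyInterpreterFrame *
exit_trace_execute(_PyAdaptiveExecutorObject **ptr, _PyInterpreterFrame *frame, PyObject **stack_pointer)
{
    _PyAdaptiveExecutorObject *next = *ptr;
    /* clang's musttail compiles this call as a jump, so chained executors do
     * not grow the C stack; other compilers only do this as an optimization,
     * which is the "tricky" part. */
    __attribute__((musttail))
    return next->execute(ptr, frame, stack_pointer);
}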

gvanrossum commented 1 year ago

I am not ready yet to consider tail calls. Not until we've officially given up on being able to support other C compilers than clang and GCC.

cold_execute and end_execute are the execute functions for the _PyAdaptiveExecutorObjects that are attached to exits.

I'm not sure how a function is "attached" to an exit. I guess this requires translation to machine code (i.e., copy-and-patch)? In the Tier 2 interpreter, the best I can imagine doing is for the exit stubs to use different special uops instead of the current EXIT_TRACE uop that's used there, where those new uops call cold_execute and end_execute, respectively.

I also have some conceptual problem understanding how to code with tail calls. But that's just me, I will be working on it.

markshannon commented 1 year ago

I'm not sure how a function is "attached" to an exit.

The executor is attached to the exit. Executors are Python objects. The function pointer is in the executor.
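One way to picture that (the field names below are assumptions, not an actual layout):

/* Sketch only: an exit is a slot inside the parent executor that holds a
 * (strong) reference to another executor, initially a cold/end stub. */
typedef struct {
    _PyAdaptiveExecutorObject *executor;  /* what to run when this exit is taken */
    uint16_t target;                      /* tier 1 offset to resume at */
} _PyExitSlot;

/* Taking exit i in the parent then amounts to:
 *     _PyExitSlot *exit = &parent->exits[i];
 *     return exit->executor->execute(&exit->executor, frame, stack_pointer);
 * Because the slot's address is passed in, the stub can overwrite the slot
 * with a freshly compiled executor, exactly as cold_execute does above. */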

markshannon commented 1 year ago

In light of what we've learned so far, this is how I think all the parts of the tier 2 optimizer should work.

Optimizer interface

The VM will call the optimizer when a back edge (or a RESUME, once implemented) counter reaches the given threshold. It is the job of the optimizer to handle cold code and code it cannot optimize, not just to optimize hot code.
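A sketch of the control flow this implies; the helpers worth_optimizing(), reset_counter_with_backoff() and build_executor() are hypothetical stand-ins, not real APIs:

/* Sketch only: an optimize() entry point that also deals with code it
 * cannot or should not optimize, rather than assuming every call succeeds. */
int
optimize_hot_spot(_PyOptimizerObject *self, PyCodeObject *code,
                  _Py_CODEUNIT *instr, _PyAdaptiveExecutorObject **exec_out)
{
    if (!worth_optimizing(code, instr)) {
        /* Cold or unoptimizable: back off so the VM does not call us again soon. */
        reset_counter_with_backoff(instr);
        return 0;   /* no executor installed; not an error */
    }
    *exec_out = build_executor(code, instr);
    if (*exec_out == NULL) {
        return -1;  /* a real error, e.g. out of memory */
    }
    return 1;       /* executor installed */
}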

Region selection

The optimizer should build a linear sequence or tree (with possible jumps back to the top) in a pre-allocated buffer in the interpreter object. In order to perform good selection, we will need to record branch history. Although the region selection pass should not be optimizing, it does need to virtualize the instruction pointer. The stack pointer and frame pointer do not need special treatment in this pass.
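As one possible (assumed, not decided) way to record branch history: a simple shift register updated on every conditional branch, which the selector can consult when choosing which successor to follow:

#include <stdint.h>

/* Sketch only: a 64-branch history register. Each conditional branch shifts
 * in one bit (1 = taken); the selector can inspect recent bits to guess
 * which side of a branch to include in the region. */
typedef struct {
    uint64_t bits;
} branch_history;

static inline void
record_branch(branch_history *h, int taken)
{
    h->bits = (h->bits << 1) | (uint64_t)(taken != 0);
}

static inline int
mostly_taken_recently(const branch_history *h)
{
    /* Count taken bits among the last 16 branches (GCC/clang builtin). */
    return __builtin_popcountll(h->bits & 0xFFFF) > 8;
}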

Micro ops

The micro-ops in the selected region should contain as much information as necessary for optimization and efficient execution. Ideally they should contain all the information that the region selector had that is not readily available later. For example, all instructions implicitly start with SAVE_CURRENT_IP. The region selector should convert this to SAVE_IP with an operand of the instruction offset. Likewise, EXITs should contain the offset of the next instruction.
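A sketch of what baking the offset in during selection could look like; the buffer layout and names are assumptions:

/* Sketch only: during projection the selector knows the tier 1 offset of
 * the instruction it is translating, so it can emit SAVE_IP with that offset
 * as an immediate operand instead of the state-dependent SAVE_CURRENT_IP. */
typedef struct {
    uint16_t opcode;
    uint16_t oparg;
    uint64_t operand;
} uop;

static void
emit_instruction_prologue(uop *buffer, int *n, int tier1_offset)
{
    buffer[*n].opcode = SAVE_IP;   /* later renamed SET_IP in this thread */
    buffer[*n].oparg = 0;
    buffer[*n].operand = (uint64_t)tier1_offset;
    (*n)++;
    /* ... followed by the uops for the instruction body ... */
}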

Optimization

Using partial evaluation, peephole optimizers or some other magic, the uops in the buffer should be transformed into something that will run faster. Since the optimizer is not required for correctness, this can be worked on independently.

Making executor(s)

An executor, or graph of executors, should be created from the buffer. If we create a graph of executors, we can compile some, leave others to be interpreted, and have some that act as bridges.

gvanrossum commented 1 year ago

I am struggling to extract action items from @markshannon's last comment. I think I have the following:

markshannon commented 1 year ago

The business with SAVE_IP and SAVE_CURRENT_IP is confusing for all of us.

First of all, let's rename SAVE_IP to SET_IP; it doesn't save anything, and the new name makes the distinction between the two clearer.

Here are the definitions:
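Roughly, as a sketch of the intended semantics (first_instr and current_instr are stand-ins, not real names):

/* Sketch of the intended semantics only, not actual code:
 *
 * SET_IP (operand = an instruction offset, known when the region is built):
 *     frame->prev_instr = first_instr + oparg;
 *
 * SAVE_CURRENT_IP (no operand; refers to the currently executing instruction):
 *     frame->prev_instr = current_instr;
 */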

Note that neither of the above definitions depends on internal interpreter state, so they are independent of the tier and of whether we are compiling or interpreting.

gvanrossum commented 1 year ago

Okay, those definitions are helpful. To summarize: SET_IP sets the IP from the oparg, SAVE_CURRENT_IP from the current instruction.

Note that neither of the above definitions depends on internal interpreter state, so they are independent of the tier and of whether we are compiling or interpreting.

Hm, "the currently executing instruction" feels problematic. In Tier 1 this would (currently) be next_instr - 1, but in Tier 2 this would be the last value set using SET_IP, assuming it's not been elided by some optimization.

Also, the definition doesn't tell us when SAVE_CURRENT_IP should be used. (In practice, in the current code, IIRC it is used to sync from next_instr to frame->prev_instr in a case where the latter may not be up to date.)

markshannon commented 1 year ago

By "currently executing instruction", I mean "currently executing instruction as if executing in tier 1".

Ideally, SAVE_CURRENT_IP should be explicit in the instruction definition, although it is currently the implicit first uop of each instruction.

gvanrossum commented 1 year ago

Okay, so then SAVE_CURRENT_IP is what the execution engine must do at the start of each instruction. And SET_IP is a uop that implements it in Tier 2.

In Tier 1, the IP is implicitly saved at the start of each instruction, via TARGET(op) which calls INSTRUCTION_START(op) which does frame->prev_instr = next_instr++ (among other things).

If it were explicitly part of the instruction, we could drop it for instructions that don't call ERROR_IF() or DEOPT_IF() (like LOAD_FAST); we'd save some microscopic fraction of dispatch overhead, and we wouldn't have to worry about removing redundant SET_IP uops in Tier 2.

Now, there's an explicit SAVE_CURRENT_IP in some macros (before every _POP_FRAME and _PUSH_FRAME). That's because it also acts as a flag for the code generator, which, combined with the preceding SAVE_IP, makes it effectively save a pointer to the next instruction (which is what's needed by the frame push/pop uops). But there's so much special-casing here that we might as well either introduce a new uop for the combined special effects or special-case _POP_FRAME and _PUSH_FRAME.

Let me just copy all this into a new issue.