faster-cpython / ideas


Data structures and code for exits from executors. #644

Closed markshannon closed 4 months ago

markshannon commented 5 months ago

To support trace stitching we need data for exits embedded into trace objects.

First of all we need an exit struct: https://github.com/faster-cpython/ideas/blob/main/3.13/engine.md#making-cold-exits-small lists the fields we need.

typedef struct _exit_data {
    _PyExecutorObject *executor; // The executor to jump to
    uint32_t target; // offset in the code object of the target code unit.
    int16_t hotness; // How hot this exit is. Start at a negative number and count up?
} _PyExitData;

Since all executors are now micro-ops based, we can get rid of the distinction between _PyExecutorObject and _PyUOpExecutorObject. The new combined executor struct would look something like:

typedef struct _PyExecutorObject {
    PyObject_VAR_HEAD
    _PyVMData vm_data;
    union {
        void *jit_code; // Will be a function pointer
        _PyUOpInstruction *trace;
    };
    _PyExitData exits[1];
} _PyExecutorObject;

We will pre-allocate an array of executors for cold exits, one for each possible index into the exits array. Since there cannot be more exits than the maximum number of uops, the size of the array will be _Py_UOP_MAX_TRACE_LENGTH, currently 512.
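As an illustrative sketch (the struct names, opcode value, and layout here are simplified stand-ins, not the actual CPython definitions), the cold-exit executors could be built once at startup, with entry i holding a single _COLD_EXIT uop whose oparg is i:

```c
#include <stdint.h>

#define MAX_TRACE_LENGTH 512   /* stands in for _Py_UOP_MAX_TRACE_LENGTH */
#define COLD_EXIT_OPCODE 1     /* made-up opcode value for _COLD_EXIT */

/* Simplified stand-ins for the real uop/executor structs. */
typedef struct {
    uint16_t opcode;
    uint16_t oparg;
} uop_instruction;

typedef struct {
    uop_instruction trace[1];  /* a cold exit is a one-uop trace */
} cold_executor;

/* One cold-exit executor per possible index into an executor's
 * exits array; shared by all executors in the process. */
static cold_executor cold_exits[MAX_TRACE_LENGTH];

static void init_cold_exits(void) {
    for (int i = 0; i < MAX_TRACE_LENGTH; i++) {
        cold_exits[i].trace[0].opcode = COLD_EXIT_OPCODE;
        cold_exits[i].trace[0].oparg = (uint16_t)i;
    }
}
```

Because each entry's oparg equals its own index, a cold exit can recover which exit struct it belongs to without any per-executor state, which is what makes the array shareable process-wide.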

Cold exits will be implemented as a single micro-op, _COLD_EXIT, with oparg set to the index in the array. _COLD_EXIT needs to do the following:

- increment the exit's hotness counter and, while the exit is still cold, resume Tier 1 execution at the exit's target;
- once the exit becomes hot, invoke the optimizer for the target and, on success, attach the resulting executor to the exit and jump to it.

The JIT will need to compile the code for these cold exits on start up, but we can static allocate the data structures.

gvanrossum commented 5 months ago

Since there cannot be more exits than half the maximum number of uops,

Uh, why?

the size of the array will be _Py_UOP_MAX_TRACE_LENGTH/2, currently 256.

Why not dynamically allocated based on the size of the trace? (Which we know before we create the executor object, in make_executor_from_uops() in optimizer.c.)

I take it that the targets are basically moved from the _PyOptimizerObject struct into the _PyExitData struct; exit and hotness correspond to the executors and counters arrays in my defunct side-exits PR (gh-112902).

Cold exits will be implemented as a single micro-op _COLD_EXIT with oparg set to the index in the array.

I'm missing something here. My understanding was that exits are (largely) reached by calls to DEOPT_IF() (or EXIT_IF() if you wish) included in the C code for the bytecode definition in bytecodes.c. How does a new uop help here? (Fine if that's explained in another design issue, just point me to it.)

markshannon commented 5 months ago

Since there cannot be more exits than half the maximum number of uops,

Uh, why?

Because at least half will be _SET_IP/_CHECK_VALID.

Why not dynamically allocated based on the size of the trace? (Which we know before we create the executor object, in make_executor_from_uops() in optimizer.c.)

There will be 256 globally (per-process); as they are const, they can be shared.

I'm missing something here. My understanding was that exits are (largely) reached by calls to DEOPT_IF() (or EXIT_IF() if you wish) included in the C code for the bytecode definition in bytecodes.c. How does a new uop help here? (Fine if that's explained in another design issue, just point me to it.)

Each EXIT_IF will have an executor/exit attached (in the exits array), which is jumped to unconditionally. It is then the job of that exit to handle counts and further optimization. The _COLD_EXIT micro-op will perform that task.

markshannon commented 5 months ago

The _COLD_EXIT code will look something like this (untested, not even compiled):

        op(_COLD_EXIT, (--)) {
            TIER_TWO_ONLY
            _PyExitData *exit = &current_executor->exits[oparg];
            Py_DECREF(current_executor);
            exit->hotness++;
            /* Still cold: leave Tier 2 and resume in Tier 1. */
            UNLIKELY_EXIT_IF(exit->hotness < 0);
            PyCodeObject *code = _PyFrame_GetCode(frame);
            _Py_CODEUNIT *target = _PyCode_CODE(code) + exit->target;
            _PyOptimizerObject *opt = interp->optimizer;
            _PyExecutorObject *executor = NULL;
            int optimized = opt->optimize(opt, code, target, &executor, (int)(stack_pointer - _PyFrame_Stackbase(frame)));
            ERROR_IF(optimized < 0, error);
            if (optimized) {
                exit->executor = executor;
                Py_INCREF(executor);
                GOTO_TIER_TWO();
            }
            else {
                exit->hotness = -10000; /* Choose a better number */
            }
        }
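To make the counter semantics above concrete, here is a small runnable model of just the warm-up logic (the struct, function names, and threshold values are made up for illustration; they are not the CPython code): the counter starts at a negative threshold, counts up on every traversal of the cold exit, and only at zero do we try to attach a real executor; a failed optimization attempt pushes the counter far below zero to back off.

```c
#include <stdbool.h>

/* Illustrative model of the hotness scheme sketched above. */
typedef struct {
    int hotness;        /* starts negative, counts up */
    bool has_executor;  /* models exit->executor being attached */
} exit_model;

#define INITIAL_HOTNESS (-16)     /* made-up warm-up threshold */
#define FAILED_HOTNESS  (-10000)  /* back-off after failed optimization */

/* Returns true once a real executor has been attached; the bool
 * parameter simulates opt->optimize() succeeding or failing. */
static bool take_cold_exit(exit_model *e, bool optimize_succeeds) {
    e->hotness++;
    if (e->hotness < 0) {
        return false;            /* still cold: just resume Tier 1 */
    }
    if (optimize_succeeds) {
        e->has_executor = true;  /* exit->executor = new executor */
        return true;
    }
    e->hotness = FAILED_HOTNESS; /* don't retry for a long time */
    return false;
}
```

Starting at -16, the exit is traversed 15 times without any optimizer call; the 16th traversal is the first one that attempts to build an executor.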
markshannon commented 5 months ago

And the code for EXIT_IF will be something like this:

    _PyExitData *exit = &current_executor->exits[index];
    Py_INCREF(exit->executor);
    Py_DECREF(current_executor);
    current_executor = exit->executor;
    GOTO_TIER_TWO();

index is computed when creating the executor from the uop IR.
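A hypothetical sketch of that index computation (the struct, predicate, and function names here are illustrative, not the actual make_executor_from_uops() code): walk the uop IR in order and hand each uop that can exit the next free slot in the executor's exits array.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a uop in the IR being lowered. */
typedef struct {
    uint16_t opcode;
    uint16_t exit_index;   /* meaningful only if the uop can exit */
} ir_uop;

/* Made-up predicate: in real code this would check whether the uop
 * contains a DEOPT_IF/EXIT_IF. */
static bool can_exit(uint16_t opcode) {
    return opcode % 2 == 1;
}

/* Assigns consecutive exit indices to exiting uops and returns the
 * number of exits, i.e. the required size of the exits array. */
static int assign_exit_indices(ir_uop *trace, int length) {
    int next = 0;
    for (int i = 0; i < length; i++) {
        if (can_exit(trace[i].opcode)) {
            trace[i].exit_index = (uint16_t)next++;
        }
    }
    return next;
}
```

Since the indices are fixed at executor-creation time, the Tier 2 interpreter (or the JIT, which can bake the index into the emitted code) can map the current micro-instruction to its exit struct with no search.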

gvanrossum commented 5 months ago

Oh, I see what I didn't get.

The (dynamically allocated) array of exit structs is part of the executor, and each possibly-deoptimizing uop in the trace array (or each group of uops belonging to the same Tier 1 instruction?) has a corresponding exit struct, with its own target and hotness counter.

However, the exit fields in those exit structs initially all point to one of 256 statically allocated executors (i.e., to the one whose index in that static array equals the index of the exit struct in the array of such structs in the executor). And each of those static executors contains a single _COLD_EXIT uop whose oparg is that same index.

When the hotness counter warms up, eventually the exit field in the exit struct is made to point to a new, dynamically created executor representing the code where the hot side exit continues. (There's the issue of how ownership of that executor is determined, but that's solvable.)

I'm not sure I follow the reasoning about how many cold exit executors we need (initially 1/3rd of the uops are _SET_IP and 1/3rd are _CHECK_VALID, so maybe only MAX/3 instead of MAX/2?) but I can see the attraction in making this array shorter than the trace array. Possibly the allocation of entries can be different when using the Tier 2 interpreter than for the JIT, since the index can be baked into the JIT code. Anyways, this is a minor detail; we just must be able to calculate the index of the exit struct corresponding to the current micro instruction efficiently in the Tier 2 interpreter. (Echoes of an early, later abandoned, design for the 3.11 Tier 1 "inline" cache. :-)

PS 1: I take it that the target field is moved from the UOpInstruction struct to the exit struct?

PS 2: Maybe the field referencing an executor in the exit struct should be called executor instead of exit -- the reuse of "exit" added to my confusion.

markshannon commented 5 months ago

(There's the issue of how ownership of that executor is determined, but that's solvable.)

No one need own the executor; we have reference counting. Each exit->executor field would be a strong reference to the executor it points to. Since the cold exits are immortal, we don't need to worry about reference counts when initializing the field.

I'm not sure I follow the reasoning about how many cold exit executors...

1/2 is an approximation. Currently we only need 1/3, but some instructions expand to several micro-ops, so we could end up needing more than half. Maybe we should just make the array MAX for safety. As you say, it isn't important for the design.

PS 1: I take it that the target field is moved from the UOpInstruction struct to the exit struct?

Yes.

gvanrossum commented 5 months ago

Each exit->executor field would be a strong reference to the executor it points to

Makes sense (as does the rest). The destructor for executors needs to know how many exit structs an executor has, and decref the executors in those exits. The static ones can be immortal.
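A minimal model of that teardown rule (everything here is a simplified sketch with made-up names; the real code uses CPython's refcounting, not this hand-rolled one): the destructor must know how many exit structs the executor has and drop the strong reference each one holds, while immortal static cold-exit executors are never touched.

```c
#include <stdlib.h>

/* Toy refcounted executor: refcount < 0 models immortality. */
typedef struct executor {
    int refcount;                      /* < 0 means immortal */
    int n_exits;
    struct executor **exit_executors;  /* n_exits strong references */
} executor;

static void executor_decref(executor *e) {
    if (e->refcount < 0) {
        return;  /* immortal (static cold exit): never freed */
    }
    if (--e->refcount == 0) {
        /* Drop the strong reference held by each exit struct. */
        for (int i = 0; i < e->n_exits; i++) {
            executor_decref(e->exit_executors[i]);
        }
        free(e->exit_executors);
        free(e);
    }
}
```

Destroying a hot executor thus cascades through its exits, but a cold-exit target survives unconditionally, which is what lets the 256 static executors be shared without bookkeeping.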

gvanrossum commented 5 months ago

I would still plead for renaming the exit field.

markshannon commented 5 months ago

I would still plead for renaming the exit field.

Already done 🙂

gvanrossum commented 5 months ago

@markshannon Presumably before we can work on this, we need to implement having separate structs for uops during the trace collection/optimization chain and uops in the executor, right? Because in the executor the 'target' field lives in the exit blocks, but during optimization, 'target' is part of the instruction struct.

markshannon commented 4 months ago

Done 🎉