Closed: markshannon closed this issue 4 months ago.
> Since there cannot be more exits than half the maximum number of uops,

Uh, why?

> the size of the array will be `_Py_UOP_MAX_TRACE_LENGTH/2`, currently 256.

Why not dynamically allocate it based on the size of the trace? (Which we know before we create the executor object, in `make_executor_from_uops()` in optimizer.c.)
I take it that the targets are basically moved from the `_PyOptimizerObject` struct into the `_PyExitData` struct; `exit` and `hotness` correspond to the `executors` and `counters` arrays in my defunct side-exits PR (gh-112902).
> Cold exits will be implemented as a single micro-op `_COLD_EXIT` with `oparg` set to the index in the array.

I'm missing something here. My understanding was that exits are (largely) reached by calls to `DEOPT_IF()` (or `EXIT_IF()` if you wish) included in the C code for the bytecode definition in bytecodes.c. How does a new uop help here? (Fine if that's explained in another design issue, just point me to it.)
> > Since there cannot be more exits than half the maximum number of uops,
>
> Uh, why?

Because at least half will be `_SET_IP`/`_CHECK_VALID`.
> Why not dynamically allocate it based on the size of the trace? (Which we know before we create the executor object, in `make_executor_from_uops()` in optimizer.c.)

There will be 256 globally (per-process); since they are `const`, they can be shared.
> I'm missing something here. My understanding was that exits are (largely) reached by calls to `DEOPT_IF()` (or `EXIT_IF()` if you wish) included in the C code for the bytecode definition in bytecodes.c. How does a new uop help here? (Fine if that's explained in another design issue, just point me to it.)

Each `EXIT_IF` will have an executor/exit attached (in the `exits` array), which is jumped to unconditionally. It is then the job of that exit to handle counts and further optimization. The `_COLD_EXIT` micro-op will perform that task.
The `_COLD_EXIT` code will look something like this (untested, not even compiled):
```c
op(_COLD_EXIT, (--)) {
    TIER_TWO_ONLY
    _PyExitData *exit = current_executor->exits[oparg];
    Py_DECREF(current_executor);
    exit->hotness++;
    UNLIKELY_EXIT_IF(exit->hotness < 0);
    PyCodeObject *code = _PyFrame_GetCode(frame);
    _Py_CODEUNIT *target = _PyCode_CODE(code) + exit->target;
    _PyOptimizerObject *opt = interp->optimizer;
    _PyExecutorObject *executor = NULL;
    int optimized = opt->optimize(opt, code, target, &executor,
                                  (int)(stack_pointer - _PyFrame_Stackbase(frame)));
    ERROR_IF(optimized < 0, error);
    if (optimized) {
        exit->executor = executor;
        Py_INCREF(executor);
        GOTO_TIER_TWO();
    }
    else {
        exit->hotness = -10000; /* Choose a better number */
    }
}
```
And the code for `EXIT_IF` will be something like this:
```c
_PyExitData *exit = current_executor->exits[index];
Py_INCREF(exit->exit);
Py_DECREF(current_executor);
current_executor = exit->executor;
GOTO_TIER_TWO();
```
`index` is computed when creating the executor from the uop IR.
Oh, I see what I didn't get.

The (dynamically allocated) array of exit structs is part of the executor, and each possibly-deoptimizing uop in the `trace` array (or each group of uops belonging to the same Tier 1 instruction?) has a corresponding exit struct, with its own target and hotness counter.

However, the `exit` fields in those exit structs all initially point to one of 256 statically allocated executors (i.e., to the one whose index in that static array equals the index of the exit struct in the array of such structs in the executor). And each of those static executors contains a single `_COLD_EXIT` uop whose `oparg` is that same index.

When the hotness counter warms up, eventually the `exit` field in the exit struct is made to point to a new, dynamically created executor representing the code where the hot side exit continues. (There's the issue of how ownership of that executor is determined, but that's solvable.)

I'm not sure I follow the reasoning about how many cold exit executors we need (initially 1/3rd of the uops are `_SET_IP` and 1/3rd are `_CHECK_VALID`, so maybe only MAX/3 instead of MAX/2?), but I can see the attraction in making this array shorter than the `trace` array. Possibly the allocation of entries can be different when using the Tier 2 interpreter than for the JIT, since the index can be baked into the JIT code. Anyways, this is a minor detail; we just must be able to calculate the index of the exit struct corresponding to the current micro-instruction efficiently in the Tier 2 interpreter. (Echoes of an early, later abandoned, design for the 3.11 Tier 1 "inline" cache. :-)
PS 1: I take it that the `target` field is moved from the `UOpInstruction` struct to the exit struct?

PS 2: Maybe the field referencing an executor in the exit struct should be called `executor` instead of `exit` -- the reuse of "exit" added to my confusion.
> (There's the issue of how ownership of that executor is determined, but that's solvable.)

No one need own the executor; we have reference counting. Each `exit->executor` field would be a strong reference to the executor it points to. Since the cold exits are immortal, we don't need to worry about reference counts when initializing the field.
> I'm not sure I follow the reasoning about how many cold exit executors...

1/2 is an approximation. Currently we only need 1/3, but some instructions expand to several micro-ops, so we could end up needing more than half. Maybe we should just make the array `MAX` for safety. As you say, it isn't important for the design.
> PS 1: I take it that the `target` field is moved from the `UOpInstruction` struct to the exit struct?

Yes.
> Each `exit->executor` field would be a strong reference to the executor it points to

Makes sense (as does the rest). The destructor for executors needs to know how many exit structs an executor has, and decref the executors in those exits. The static ones can be immortal.
I would still plead for renaming the `exit` field.
> I would still plead for renaming the `exit` field.

Already done 🙂
@markshannon Presumably before we can work on this, we need to implement having separate structs for uops during the trace collection/optimization chain and uops in the executor, right? Because in the executor the `target` field lives in the exit blocks, but during optimization, `target` is part of the instruction struct.
Done 🎉
To support trace stitching we need data for exits embedded into trace objects.
First of all we need an exit struct: https://github.com/faster-cpython/ideas/blob/main/3.13/engine.md#making-cold-exits-small lists the fields we need.
Since all executors are now micro-ops based, we can get rid of the distinction between `_PyExecutorObject` and `_PyUOpExecutorObject`. The new combined executor struct would look something like:

We will pre-allocate an array of executors for cold exits, one for each possible index into the `exits` array. Since there cannot be more exits than the maximum number of uops, the size of the array will be `_Py_UOP_MAX_TRACE_LENGTH`, currently 512.

Cold exits will be implemented as a single micro-op `_COLD_EXIT` with `oparg` set to the index in the array. `_COLD_EXIT` needs to do the following:

- `exit = current_executor->exit[oparg]`
- Increment `exit->hotness`
- If `exit->hotness == 0`, trigger generation of a new executor to replace `exit->executor`.
- Otherwise, `next_instr = PyCode_CODE(code) + exit->target; ENTER_TIER1();`

The JIT will need to compile the code for these cold exits on start up, but we can statically allocate the data structures.