Open markshannon opened 2 years ago
Another use case for this is avoiding jumping in and out of the bytecode interpreter, e.g. `tuple(f(v) for v in seq)`. The call to `tuple` goes into C code, and then makes repeated calls into `PyEval_EvalDefault()` via several layers of `gen_next()`.
So, for generators and comprehensions we might want to replace `tuple` with something like:

```python
def tuple(seq):
    return LIST_TO_TUPLE(  # Not an actual call, but the instruction
        [val for val in seq]  # No need for the extra scope; this can be inlined.
    )
```
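As a sanity check in plain Python (not CPython internals), the rewrite above is observably equivalent today: `tuple` over a generator expression and `tuple` over the corresponding list comprehension produce the same result, and the latter avoids resuming a generator frame once per item:

```python
# Pure-Python illustration of the proposed rewrite. f and seq are
# arbitrary examples, not names from the interpreter.

def f(v):
    return v * 2

seq = range(5)

# Enters and leaves the generator frame once per item.
via_generator = tuple(f(v) for v in seq)

# Builds the list in one pass, then converts (LIST_TO_TUPLE-style).
via_list = tuple([f(v) for v in seq])

assert via_generator == via_list == (0, 2, 4, 6, 8)
```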
One problem with schemes like this is that they prevent specialization, as all calls to `tuple` end up in the same code. So we might want several specialized versions depending on the kind of `seq`.
> the increasing size of the interpreter makes for more icache misses
Not to side-track from your focus in this issue, but how much of this could be mitigated with special-case eval loops (e.g. via our code generation tooling)?
@markshannon: What's the status of this? Is this still targeted for 3.12?
Probably not, but we might get some of this done for 3.12
https://github.com/python/cpython/pull/107788 changes the output of `LOAD_ATTR` with `oparg & 1` from `NULL attr` to `attr NULL`. If we want to replace `LOAD_ATTR` with `oparg & 1` with a call, we will need to push an additional shim frame to push the `NULL` after the call instead of pushing the `NULL` before the call. It will be a bit more expensive, but we already do something similar in `CALL_NO_KW_ALLOC_AND_ENTER_INIT`.
@markshannon Speaking of `CALL_NO_KW_ALLOC_AND_ENTER_INIT`, how would we translate that into Tier 2 uops? There is no Tier 2 equivalent to `goto start_frame` (the executor can only choose between `goto resume_frame` and `goto resume_with_error`). I'm asking because splitting `CALL_PY_EXACT_ARGS` also needs to figure out how to `goto start_frame` -- currently it ends up just skipping the call to `_Py_EnterRecursivePy`, which is obviously wrong; without copying all of `_Py_CheckRecursiveCallPy` from ceval.c, I fear I will just have to do `tstate->py_recursion_remaining--` and pray.
(And why does the `start_frame` block of code use `goto exit_unwind` instead of `goto error` on error?)
Also, perhaps more pertinent to this issue, when the superblock projector gains the ability to project through a call (soon: gh-107793), will it need to trace through trampolines too? How will it find the bytecode for the trampoline? (I guess the trampoline needs to be an actual function object, not just a code object -- the design of my draft PR caches function objects indexed by `func_version`.)
Reviewing https://github.com/python/cpython/pull/110317 got me thinking that we could handle this by defining the long-tail specializations in terms of micro-ops, much as we do for common ones like `LOAD_ATTR_INSTANCE_VALUE`, but generating a single tier 1 instruction that does additional dispatch on one of its inline cache entries. Let's use `LOAD_ATTR_WITH_HINT` as an example.
Currently we define it as:

```c
macro(LOAD_ATTR_WITH_HINT) =
    unused/1 +
    _GUARD_TYPE_VERSION +
    _CHECK_ATTR_WITH_HINT +
    _LOAD_ATTR_WITH_HINT +
    unused/5;
```

so there is plenty of unused space for an additional operand.
For tier 1 we could generate `LOAD_ATTR_EXTRA`, which would dispatch to the long-tail specializations, something like:

```c
TARGET(LOAD_ATTR_EXTRA) {
    switch (inner_opcode) {
        case LOAD_ATTR_WITH_HINT:
            /* Generated code for LOAD_ATTR_WITH_HINT goes here */
            break;
        ...
    }
}
```
For tier2 we simply emit the micro-ops as we do now.
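A minimal sketch in plain Python (hypothetical names, not the generated code) of how the two-level dispatch could work: the single tier 1 `LOAD_ATTR_EXTRA` handler reads an inner opcode from one of its inline cache entries and dispatches to that long-tail specialization's code, while common specializations keep their own top-level opcodes:

```python
# Two-level dispatch sketch. Handler names and the cache layout are
# illustrative assumptions, not CPython's actual structures.

def load_attr_with_hint(frame, oparg):
    return f"WITH_HINT handled oparg={oparg}"

def load_attr_other_rare_case(frame, oparg):
    return f"OTHER handled oparg={oparg}"

# Table indexed by the inner opcode stored in the inline cache entry.
EXTRA_HANDLERS = {
    0: load_attr_with_hint,
    1: load_attr_other_rare_case,
}

def load_attr_extra(frame, oparg, cache):
    inner_opcode = cache[0]  # first cache entry selects the specialization
    return EXTRA_HANDLERS[inner_opcode](frame, oparg)

assert load_attr_extra(None, 3, [0]) == "WITH_HINT handled oparg=3"
assert load_attr_extra(None, 3, [1]) == "OTHER handled oparg=3"
```

The point is that only one opcode slot and one icache-resident handler body are spent on the whole long tail, while tier 2 still sees the individual micro-ops.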
We would need a special name to indicate that an inline cache entry is to be used for additional dispatching, e.g. `inner_opcode`:

```c
macro(LOAD_ATTR_WITH_HINT) =
    unused/1 +
    inner_opcode/1 +
    _GUARD_TYPE_VERSION +
    _CHECK_ATTR_WITH_HINT +
    _LOAD_ATTR_WITH_HINT +
    unused/4;
```
Maybe with an additional qualifier as well:

```c
uncommon macro(LOAD_ATTR_WITH_HINT) =
    unused/1 +
    inner_opcode/1 +
    _GUARD_TYPE_VERSION +
    _CHECK_ATTR_WITH_HINT +
    _LOAD_ATTR_WITH_HINT +
    unused/4;
```
We are adding increasingly uncommon specializations of the `LOAD_ATTR` and `CALL` instructions. Each of these specializations increases our hit rate, improving the potential for tier-2 optimizations, but often has a negligible or even negative impact on tier-1 (PEP 659) performance, as the increasing size of the interpreter makes for more icache misses.

Instead of adding more and more specializations, we can add trampoline functions (in bytecode, but not full Python functions) to perform some of the work that would be done in a specialized instruction.

These trampolines are unlikely to boost tier-1 performance -- in fact they may make it a bit worse -- but they allow much more open-ended specializations, and they expose the details of the specializations to higher-tier optimizers in a way those optimizers can understand without needing custom code for each specialization.
This idea isn't limited to `CALL` and `LOAD_ATTR`, but those two have the most complex and varied semantics.

Examples:
Calling a Python function with complex arguments or parameters. When specializing a call, we know the shape of both the arguments and the parameters, allowing us to compute (or, for the most common cases, look up) a sequence of operations that moves the arguments to the right places in the locals. Having individual specializations for each of these cases would be silly. We could have a custom format describing the transformation packed into the cache, but that would need a custom mini-interpreter. Better to create a trampoline function that does the argument shuffle and then tail-calls into the function.
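A rough Python sketch of the precomputed argument shuffle (the helper and its layout are hypothetical, not CPython code): the mapping from the call site's argument shape to the callee's parameter slots is computed once at specialization time, and the trampoline just replays it on every call:

```python
# Hypothetical specialization-time helper: given the callee's parameter
# names and the call site's keyword-argument names, precompute where each
# passed value lands in the callee's locals.

def make_shuffle(param_names, kwarg_names):
    n_pos = len(param_names) - len(kwarg_names)
    # Keyword names are resolved to slot indices once, not per call.
    kw_slots = [param_names.index(name) for name in kwarg_names]

    def shuffle(args, kwarg_values):
        slots = [None] * len(param_names)
        slots[:n_pos] = args                    # positionals drop straight in
        for slot, value in zip(kw_slots, kwarg_values):
            slots[slot] = value                 # keywords go to precomputed slots
        return slots

    return shuffle

# Call site f(1, 2, c=3) for a callee def f(a, b, c): the name lookup for
# "c" happens once, in make_shuffle, not on every call.
shuffle = make_shuffle(["a", "b", "c"], ["c"])
assert shuffle([1, 2], [3]) == [1, 2, 3]
```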
Calling a Python class. https://github.com/python/cpython/pull/93221 specializes calling Python classes using a custom specialized instruction and a stub function to clean up after the call. This could be replaced with a more general trampoline specialized instruction. Doing so would allow trampolines to handle differing shapes of arguments, without a proliferation of instructions.