Open markshannon opened 2 years ago
Another use case for this is avoiding jumping in and out of the bytecode interpreter, e.g. `tuple(f(v) for v in seq)`. The call to `tuple` goes into C code, and then makes repeated calls into `PyEval_EvalDefault()` via several layers of `gen_next()`.
So, for generators and comprehensions we might want to replace `tuple` with something like:

```python
def tuple(seq):
    return LIST_TO_TUPLE(  # Not an actual call, but the instruction
        [val for val in seq]  # No need for the extra scope; this can be inlined.
    )
```
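As a sanity check in plain Python (not CPython internals), the rewrite above is observably equivalent today: `tuple` over a generator expression and `tuple` over the corresponding list comprehension produce the same result, and the latter avoids resuming a generator frame once per item:

```python
# Pure-Python illustration of the proposed rewrite. f and seq are
# arbitrary examples, not names from the interpreter.

def f(v):
    return v * 2

seq = range(5)

# Enters and leaves the generator frame once per item.
via_generator = tuple(f(v) for v in seq)

# Builds the list in one pass, then converts (LIST_TO_TUPLE-style).
via_list = tuple([f(v) for v in seq])

assert via_generator == via_list == (0, 2, 4, 6, 8)
```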
One problem with schemes like this is that they prevent specialization, as all calls to `tuple` end up in the same code. So we might want several specialized versions depending on the kind of `seq`.
> the increasing size of the interpreter makes for more icache misses
Not to side-track from your focus in this issue, but how much of this could be mitigated with special-case eval loops (e.g. via our code generation tooling)?
@markshannon: What's the status of this? Is this still targeted for 3.12?
Probably not, but we might get some of this done for 3.12
https://github.com/python/cpython/pull/107788 changes the output of `LOAD_ATTR` with `oparg & 1` from `NULL attr` to `attr NULL`. If we want to replace `LOAD_ATTR` with `oparg & 1` with a call, we will need to push an additional shim frame to push the `NULL` after the call instead of pushing the `NULL` before the call. It will be a bit more expensive, but we already do something similar in `CALL_NO_KW_ALLOC_AND_ENTER_INIT`.
@markshannon Speaking of `CALL_NO_KW_ALLOC_AND_ENTER_INIT`, how would we translate that into Tier 2 uops? There is no Tier 2 equivalent to `goto start_frame` (the executor can only choose between `goto resume_frame` and `goto resume_with_error`). I'm asking because splitting `CALL_PY_EXACT_ARGS` also needs to figure out how to `goto start_frame` -- currently it ends up just skipping the call to `_Py_EnterRecursivePy`, which is obviously wrong; without copying all of `_Py_CheckRecursiveCallPy` from ceval.c, I fear I will just have to do `tstate->py_recursion_remaining--` and pray.
(And why does the `start_frame` block of code use `goto exit_unwind` instead of `goto error` on error?)
Also, perhaps more pertinent to this issue, when the superblock projector gains the ability to project through a call (soon: gh-107793), will it need to trace through trampolines too? How will it find the bytecode for the trampoline? (I guess the trampoline needs to be an actual function object, not just a code object -- the design of my draft PR caches function objects indexed by `func_version`.)
Reviewing https://github.com/python/cpython/pull/110317 got me thinking that we could handle this by defining the long-tail specializations in terms of micro-ops, much as we do for common ones like `LOAD_ATTR_INSTANCE_VALUE`, but generating a single tier 1 instruction that does additional dispatch on one of its inline cache entries. Let's use `LOAD_ATTR_WITH_HINT` as an example.
Currently we define it as:

```c
macro(LOAD_ATTR_WITH_HINT) =
    unused/1 +
    _GUARD_TYPE_VERSION +
    _CHECK_ATTR_WITH_HINT +
    _LOAD_ATTR_WITH_HINT +
    unused/5;
```

so there is plenty of unused space for an additional operand.
For tier 1 we could generate `LOAD_ATTR_EXTRA`, which would dispatch to the long-tail specializations, something like:

```c
TARGET(LOAD_ATTR_EXTRA) {
    switch (inner_opcode) {
        case LOAD_ATTR_WITH_HINT:
            /* Generated code for LOAD_ATTR_WITH_HINT goes here */
            break;
        ...
    }
}
```
For tier2 we simply emit the micro-ops as we do now.
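A minimal sketch in plain Python (hypothetical names, not the generated code) of how the two-level dispatch could work: the single tier 1 `LOAD_ATTR_EXTRA` handler reads an inner opcode from one of its inline cache entries and dispatches to that long-tail specialization's code, while common specializations keep their own top-level opcodes:

```python
# Two-level dispatch sketch. Handler names and the cache layout are
# illustrative assumptions, not CPython's actual structures.

def load_attr_with_hint(frame, oparg):
    return f"WITH_HINT handled oparg={oparg}"

def load_attr_other_rare_case(frame, oparg):
    return f"OTHER handled oparg={oparg}"

# Table indexed by the inner opcode stored in the inline cache entry.
EXTRA_HANDLERS = {
    0: load_attr_with_hint,
    1: load_attr_other_rare_case,
}

def load_attr_extra(frame, oparg, cache):
    inner_opcode = cache[0]  # first cache entry selects the specialization
    return EXTRA_HANDLERS[inner_opcode](frame, oparg)

assert load_attr_extra(None, 3, [0]) == "WITH_HINT handled oparg=3"
assert load_attr_extra(None, 3, [1]) == "OTHER handled oparg=3"
```

The point is that only one opcode slot and one icache-resident handler body are spent on the whole long tail, while tier 2 still sees the individual micro-ops.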
We would need a special name to indicate that an inline cache entry is to be used for additional dispatching, e.g. `inner_opcode`:

```c
macro(LOAD_ATTR_WITH_HINT) =
    unused/1 +
    inner_opcode/1 +
    _GUARD_TYPE_VERSION +
    _CHECK_ATTR_WITH_HINT +
    _LOAD_ATTR_WITH_HINT +
    unused/4;
```
Maybe with an additional qualifier as well:

```c
uncommon macro(LOAD_ATTR_WITH_HINT) =
    unused/1 +
    inner_opcode/1 +
    _GUARD_TYPE_VERSION +
    _CHECK_ATTR_WITH_HINT +
    _LOAD_ATTR_WITH_HINT +
    unused/4;
```
We are adding increasingly uncommon specializations of the `LOAD_ATTR` and `CALL` instructions. Each of these specializations increases our hit rate, improving the potential for tier-2 optimizations, but often has a negligible or even negative impact on tier-1 (PEP 659) performance, as the increasing size of the interpreter makes for more icache misses.

Instead of adding more and more specializations, we can add trampoline functions (in bytecode, but not full Python functions) to perform some of the work that would be done in a specialized instruction.

These trampolines are unlikely to boost tier-1 performance -- in fact they may make it a bit worse -- but they allow much more open-ended specializations, and they expose the details of the specializations to higher-tier optimizers in a way those optimizers can understand without needing custom code for each specialization.
This idea isn't limited to `CALL` and `LOAD_ATTR`, but those two have the most complex and varied semantics.

Examples:
Calling a Python function with complex arguments or parameters. When specializing a call, we know the shape of both the arguments and the parameters, allowing us to compute (or, for the most common cases, look up) a sequence of operations that moves the arguments to the right places in the locals. Having individual specializations for each of these cases would be silly. We could have a custom format describing the transformation packed into the cache, but that would need a custom mini-interpreter. Better to create a trampoline function that does the argument shuffle and then tail-calls into the function.
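A rough Python sketch of the precomputed argument shuffle (the helper and its layout are hypothetical, not CPython code): the mapping from the call site's argument shape to the callee's parameter slots is computed once at specialization time, and the trampoline just replays it on every call:

```python
# Hypothetical specialization-time helper: given the callee's parameter
# names and the call site's keyword-argument names, precompute where each
# passed value lands in the callee's locals.

def make_shuffle(param_names, kwarg_names):
    n_pos = len(param_names) - len(kwarg_names)
    # Keyword names are resolved to slot indices once, not per call.
    kw_slots = [param_names.index(name) for name in kwarg_names]

    def shuffle(args, kwarg_values):
        slots = [None] * len(param_names)
        slots[:n_pos] = args                    # positionals drop straight in
        for slot, value in zip(kw_slots, kwarg_values):
            slots[slot] = value                 # keywords go to precomputed slots
        return slots

    return shuffle

# Call site f(1, 2, c=3) for a callee def f(a, b, c): the name lookup for
# "c" happens once, in make_shuffle, not on every call.
shuffle = make_shuffle(["a", "b", "c"], ["c"])
assert shuffle([1, 2], [3]) == [1, 2, 3]
```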
Calling a Python class. https://github.com/python/cpython/pull/93221 specializes calling Python classes using a custom specialized instruction and a stub function to clean up after the call. This could be replaced with a more general trampoline specialized instruction. Doing so would allow trampolines to handle differing shapes of arguments, without a proliferation of instructions.