Creation of superblocks

A superblock will be a linear sequence of "micro ops", that may have multiple side exits, but will only have one entry point.

The start of the superblock will be determined by "hotspots" in the code. The tricky bit is creating the rest of the superblock. For non-branching code, extending the superblock is easy enough. For branches we either need the base interpreter to record which way the branch goes, or to make a estimate based on static analysis and the value being tested, if it is available. For non-local jumps, like calls and return, we need to rely on the information recorded by the specializing adaptive interpreter.

We should end the superblock when either the estimated likelihood of execution staying in the superblock drops below a threshold, say 40%, or when we cannot estimate that likelihood.

An example:

    ...
    x = typing.cast(int, x)
    if unpredicable:
        ...

The bytecode for the above snippet is:

 LOAD_GLOBAL              0 (typing)
 LOAD_ATTR                3 (NULL|self + cast)
 LOAD_GLOBAL              4 (int)
 LOAD_FAST                0 (x)
 CALL                     2
 STORE_FAST               0 (x)
 LOAD_GLOBAL              6 (unpredicable)
 POP_JUMP_IF_FALSE       13 (to 100)

and the bytecode for typing.cast is:

 RESUME                   0

 LOAD_FAST                1 (val)
 RETURN_VALUE

Starting at the LOAD_GLOBAL 0 (typing) the resulting superblock might look something like:

 SAVE_IP
 CHECK_GLOBALS_DICT_KEY_VERSION 
 SAVE_IP
 LOAD_GLOBAL_MODULE_UNCHECKED     (typing)
 SAVE_IP
 CHECK_MODULE_DICT_VERSION
 SAVE_IP
 LOAD_ATTR_MODULE_UNCHECKED       3 (NULL|self + cast)
 SAVE_IP
 CHECK_BUILTINS_DICT_KEY_VERSION
 SAVE_IP
 LOAD_GLOBAL_BUILTINS_UNCHECKED  (int)
 SAVE_IP
 LOAD_FAST                0 (x)
 SAVE_IP
 CHECK_FUNCTION_VERSION 
 SAVE_IP
 PUSH_FRAME
 SAVE_IP
 CHECK_EVAL_BREAKER
 SAVE_IP
 LOAD_FAST                1 (val)
 SAVE_IP
 RETURN_VALUE
 SAVE_IP
 STORE_FAST               0 (x)
 SAVE_IP
 CHECK_GLOBALS_DICT_KEY_VERSION 
 SAVE_IP
 LOAD_GLOBAL_MODULE_UNCHECKED     (unpredicable)
 SAVE_IP
 EXIT                     # Can't predict where we are going

All the SAVE_IP instructions exist to make sure that the frame->prev_instr field gets updated correctly. Don't worry, optimization should remove almost all of them.

We need a way to create the above superblock.

Creation of the superblock involves abstract interpretation over the (hopefully specialized) bytecodes. During interpretation we should maintain five values:

trace: The sequence of micro-ops generated so far
code: The current code object
offset: The current offset into the code object of the next instruction
call_stack: A stack of code, offset pairs (for tracking calls, sends, etc.)
on_target: The estimated likelihood that execution will reach this point in the superblock.

The initial values are:

trace: Empty
code and offset: As provided by the optimizer API call.
call_stack: Empty
on_target: 100%

The algorithm is something like:

while (on_target > on_target_threshold) {
    inst = read_instruction(code, offset)
    switch(inst.opcode) {
        case NOP:
            /* no micro-ops to add */
            /* code object is unchanged */
            offset += 1
            on_target = on_target * 1.0;
            break;
        /* etc, etc... */
    }
}

This would incredibly tedious and error-prone to write by hand. So we need to extend the interpreter generator.

With sufficient metadata we can generate the above loop. We need to know which instructions are jumps, branches, contain guards, etc.

We will need to transform some micro-ops.

Non-local jumps like calls, yields, etc should be explicit in the bytecode. We may need to understand things like RESUME_FRAME, to handle these. We need to recognize frame pops and pushes, not only so that we can emit the right micro-op, but so that we push to the call_stack.

Local jumps (jumps and branches) need explicit handling. Jumps become no-ops (as all instructions need to update offset), and branches become conditional exits.

Guards remain as guards (the behavior differs, but that is dealt with by the tier2 interpreter/compiler).

An incremental way of doing this using the code generator might be to add a default case to the switch that bails out of the loop when an unsupported opcode is encountered, and then trying to add stuff to bytecodes.c and/or to the generator to allow more and more opcodes to be supported.

Some metadata can be recovered by scanning the C code for an instruction (@iritkatriel is starting to do this in https://github.com/python/cpython/pull/105482 already); presumably there's also metadata that will require us to mark certain instructions explicitly.

faster-cpython / ideas

Creation of superblocks #558

An example: