Micro-op (tier 2) interpreter

markshannon commented 1 year ago

The output of the tier 2 optimization pipeline is a superblock of micro-ops. We will need an interpreter for these micro-ops for a few reasons:

To execute the micro-ops if no JIT compiler is present.
To serve as a specification for the JIT
To help debug or verify the JIT

Instructions

To execute the micro-ops they will need to be represented in memory in an efficient format. We can reject rare or complex bytecodes when optimizing, so we can ignore many corner cases.

Each instruction should consist of an opcode and oparg, as they do in normal bytecode, but the oparg can be either the original oparg or a cache entry, so we will need more than 8 bits. A 32bit instruction with 8 bit opcode, 22 oparg and 2 spare bits should be a good starting pointing. We might want a 9 or 10 bit opcode if we are to have a large number of superinstructions, or we might need 24 bit operands. We'll see.

The interpreter

The interpreter should be created from bytecodes.c, much as we do for PyEval_EvalDefault()

Changes to bytecodes.c

In order to create the interpreter each micro-op can take at most one operand: either the original oparg or one cache entry. To do this we will need to break down some instructions into quite small parts.

E.g. LOAD_ATTR_METHOD_WITH_VALUES has the signature: inst(LOAD_ATTR_METHOD_WITH_VALUES, (unused/1, type_version/2, keys_version/2, descr/4, self -- res2 if (oparg & 1), res))

We can ignore the oparg as the optimizer will eliminate it. That leaves the type_version, keys_version and descr operand, so this instruction will be need split into at least three parts, something like:

macro(LOAD_ATTR_METHOD_WITH_VALUES) = 
    SKIP_COUNTER + CHECK_TYPE_VERSION + 
    CHECK_SHARED_DICT_KEYS_VERSION + LOAD_INLINED_ATTRIBUTE;

Where the parts would be defined as something like:

op(SKIPCOUNTER, (unused/1 -- )) {}

op(CHECK_TYPE_VERSION, (type_version/2, self -- self))
{
    PyTypeObject *self_cls = Py_TYPE(self);
    assert(type_version != 0);
    DEOPT_IF(self_cls->tp_version_tag != type_version);
    assert(self_cls->tp_flags & Py_TPFLAGS_MANAGED_DICT);
}

op(CHECK_SHARED_DICT_KEYS_VERSION, (keys_version/2, self -- self))
{
    PyTypeObject *self_cls = Py_TYPE(self);
    PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(self);
    DEOPT_IF(!_PyDictOrValues_IsValues(dorv), LOAD_ATTR);
    PyHeapTypeObject *self_heap_type = (PyHeapTypeObject *)self_cls;
    DEOPT_IF(self_heap_type->ht_cached_keys->dk_version != keys_version);
    STAT_INC(opcode, hit);
}

op(LOAD_INLINED_ATTRIBUTE, (descr/4, self -- res, self))
{
    assert(descr != NULL);
    res = Py_NewRef(descr);
    assert(_PyType_HasFeature(Py_TYPE(res), Py_TPFLAGS_METHOD_DESCRIPTOR));
}

Fitting the 64 bit descr field into 22 bits is difficult, if not impossible. However, we need to shrink this entry in order to reduce the size of code objects. So we'll need to do that first, or make the instructions 64 bit each.

iritkatriel commented 1 year ago

Why can't the micro-ops have caches // variable length inputs too?

markshannon commented 1 year ago

It is not an absolute requirement, but having only a single operand per instruction has a number of advantages.

It makes the optimizer simpler if each micro-op has one action. That way most, if not all, optimizations can remove or replace one micro-op at a time.
It speeds up the optimizer if each instruction is a fixed width.
Backward passes over the IR are tricky if the instructions aren't of a fixed size.

In terms of the interpreter, it is less important. However, using the same format as the optimizer means we don't need an additional translation step which should help test the optimizer and compiler.

gvanrossum commented 1 year ago

I always assumed that the tier 2 IR would have fixed-width instructions. But how important is it to have 32 bits instead of 64 bits? Seems a classic time/space tradeoff.
How do we fit an 8-byte pointer in 7 bytes? It seems CPU specific and possibly OS specific which bits of a pointer are always zero (apart from the lower three).
Could you elaborate on how the optimizer would be simpler? (Or link to a description of the optimizer design from which this can be derived.)

gvanrossum commented 1 year ago

I'm not sure at what point we will be blocked on this, but based on my recent experiences modernizing parts of bytecodes.c, splitting all bytecodes into uops given some constraint on the size of the cache entries + opcode will be quite time consuming, so we probably need to get a head start on this.

As a counterproposal to Marks suggestion of 32-bit instructions with an opcode and either an oparg or one cache entry, let me suggest that we could start out with 64-bit instructions with an opcode and up to 56 bits worth of opcode and cache data. This would waste space for short instructions like LOAD_FAST, but will reduce (initially) the need to split up too many instructions into tiny uops.

A hack we could use eventually once we are ready for 32-bit instructions: we could still have the occasional variable-length instruction, e.g. we could have the top 8 bits as opcode and the lower 24 bits for cache and/or opcode, and when we need 48 bits, we could use two words where the second word has an opcode that causes a debug exit when executed. You could then even read the instruction stream backwards without confusing cache data with instructions, since the top 8 bits will always be a dedicated bit pattern. This would be more like the current cache entries in bytecode (fixed per opcode) and not like the current EXTENDED_ARG opcode (which must be decoded dynamically), but combine some benefits of both.

markshannon commented 1 year ago

I wouldn't worry too much about the size. Just make the instructions a struct. We can start with

struct Inst {
    int32_t opcode;
    int64_t oparg;
}

And whittle it down to 64 bits, then to

struct Inst {
    int32_t opcode:8;
    int32_t oparg:22;
}

without having to change the interpreter.

markshannon commented 1 year ago

How do we fit an 8-byte pointer in 7 bytes? It seems CPU specific and possibly OS specific which bits of a pointer are always zero (apart from the lower three).

We don't. We eliminate pointers from the instruction stream, but that's a separate issue.

faster-cpython / ideas