Open markshannon opened 1 year ago
Why can't the micro-ops have caches // variable length inputs too?
It is not an absolute requirement, but having only a single operand per instruction has a number of advantages.
In terms of the interpreter, it is less important. However, using the same format as the optimizer means we don't need an additional translation step which should help test the optimizer and compiler.
I'm not sure at what point we will be blocked on this, but based on my recent experiences modernizing parts of bytecodes.c, splitting all bytecodes into uops given some constraint on the size of the cache entries + opcode will be quite time consuming, so we probably need to get a head start on this.
As a counterproposal to Marks suggestion of 32-bit instructions with an opcode and either an oparg or one cache entry, let me suggest that we could start out with 64-bit instructions with an opcode and up to 56 bits worth of opcode and cache data. This would waste space for short instructions like LOAD_FAST
, but will reduce (initially) the need to split up too many instructions into tiny uops.
A hack we could use eventually once we are ready for 32-bit instructions: we could still have the occasional variable-length instruction, e.g. we could have the top 8 bits as opcode and the lower 24 bits for cache and/or opcode, and when we need 48 bits, we could use two words where the second word has an opcode that causes a debug exit when executed. You could then even read the instruction stream backwards without confusing cache data with instructions, since the top 8 bits will always be a dedicated bit pattern. This would be more like the current cache entries in bytecode (fixed per opcode) and not like the current EXTENDED_ARG
opcode (which must be decoded dynamically), but combine some benefits of both.
I wouldn't worry too much about the size. Just make the instructions a struct. We can start with
struct Inst {
int32_t opcode;
int64_t oparg;
}
And whittle it down to 64 bits, then to
struct Inst {
int32_t opcode:8;
int32_t oparg:22;
}
without having to change the interpreter.
How do we fit an 8-byte pointer in 7 bytes? It seems CPU specific and possibly OS specific which bits of a pointer are always zero (apart from the lower three).
We don't. We eliminate pointers from the instruction stream, but that's a separate issue.
The output of the tier 2 optimization pipeline is a superblock of micro-ops. We will need an interpreter for these micro-ops for a few reasons:
Instructions
To execute the micro-ops they will need to be represented in memory in an efficient format. We can reject rare or complex bytecodes when optimizing, so we can ignore many corner cases.
Each instruction should consist of an opcode and oparg, as they do in normal bytecode, but the oparg can be either the original oparg or a cache entry, so we will need more than 8 bits. A 32bit instruction with 8 bit opcode, 22 oparg and 2 spare bits should be a good starting pointing. We might want a 9 or 10 bit opcode if we are to have a large number of superinstructions, or we might need 24 bit operands. We'll see.
The interpreter
The interpreter should be created from bytecodes.c, much as we do for
PyEval_EvalDefault()
Changes to bytecodes.c
In order to create the interpreter each micro-op can take at most one operand: either the original oparg or one cache entry. To do this we will need to break down some instructions into quite small parts.
E.g.
LOAD_ATTR_METHOD_WITH_VALUES
has the signature:inst(LOAD_ATTR_METHOD_WITH_VALUES, (unused/1, type_version/2, keys_version/2, descr/4, self -- res2 if (oparg & 1), res))
We can ignore the
oparg
as the optimizer will eliminate it. That leaves thetype_version
,keys_version
anddescr
operand, so this instruction will be need split into at least three parts, something like:Where the parts would be defined as something like:
Fitting the 64 bit
descr
field into 22 bits is difficult, if not impossible. However, we need to shrink this entry in order to reduce the size of code objects. So we'll need to do that first, or make the instructions 64 bit each.