gvanrossum opened 1 year ago
Let's take the `BINARY_OP` family as an example. This has one oparg (indicating the operation, e.g. `+` or `*=`) and one cache entry, a counter:
```c
inst(BINARY_OP, (unused/1, lhs, rhs -- res)) {
    #if ENABLE_SPECIALIZATION
    _PyBinaryOpCache *cache = (_PyBinaryOpCache *)next_instr;
    if (ADAPTIVE_COUNTER_IS_ZERO(cache->counter)) {
        next_instr--;
        _Py_Specialize_BinaryOp(lhs, rhs, next_instr, oparg, &GETLOCAL(0));
        DISPATCH_SAME_OPARG();
    }
    STAT_INC(BINARY_OP, deferred);
    DECREMENT_ADAPTIVE_COUNTER(cache->counter);
    #endif  /* ENABLE_SPECIALIZATION */
    assert(0 <= oparg);
    assert((unsigned)oparg < Py_ARRAY_LENGTH(binary_ops));
    assert(binary_ops[oparg]);
    res = binary_ops[oparg](lhs, rhs);
    DECREF_INPUTS();
    ERROR_IF(res == NULL, error);
}
```
Note that the `unused/1` in the `inst` header is a bit of a lie -- it is in fact used (`cache->counter`) but we don't want the generator to generate instructions to load it. (This is a bit of a historic limitation of the generator -- the "counter" cache entry is special, and is always manipulated using macros like `ADAPTIVE_COUNTER_IS_ZERO` and `DECREMENT_ADAPTIVE_COUNTER`.)
How would we split this up into micro-ops with at most one argument? It's tempting to try to make the specialization part into one opcode and the actual operation into another, since the former uses the cache and the latter uses the oparg. But the former also needs the oparg, in order to pass it to the specialization function!
The best I can think of is to indeed split it up like this, using a hack to access the oparg of the second instruction from the definition of the first:
```c
macro(BINARY_OP) = _BINARY_OP_SPECIALIZE + _BINARY_OP_ACTION;

op(_BINARY_OP_SPECIALIZE, (unused/1 --)) {
    _PyBinaryOpCache *cache = (_PyBinaryOpCache *)next_instr;
    if (ADAPTIVE_COUNTER_IS_ZERO(cache->counter)) {
        next_instr--;
        _Py_Specialize_BinaryOp(lhs, rhs, next_instr, ?????, &GETLOCAL(0));
        DISPATCH_SAME_OPARG();
    }
    STAT_INC(BINARY_OP, deferred);
    DECREMENT_ADAPTIVE_COUNTER(cache->counter);
}

op(_BINARY_OP_ACTION, (lhs, rhs -- res)) {
    assert(0 <= oparg);
    assert((unsigned)oparg < Py_ARRAY_LENGTH(binary_ops));
    assert(binary_ops[oparg]);
    res = binary_ops[oparg](lhs, rhs);
    DECREF_INPUTS();
    ERROR_IF(res == NULL, error);
}
```
Note the `?????` in the first opcode definition -- here we have to substitute something that peeks into the next instruction. We can do this using something similar to what we do for super-instructions like `LOAD_FAST__LOAD_FAST`. It should be a macro so that the generator can substitute something different when generating tier-1 bytecode (in which case it can just reference `oparg`) compared to when it's generating tier-2 micro-ops.
> Note that the `unused/1` in the inst header is a bit of a lie -- it is in fact used (`cache->counter`) but we don't want the generator to generate instructions to load it. (This is a bit of a historic limitation of the generator -- the "counter" cache entry is special, and is always manipulated using macros like `ADAPTIVE_COUNTER_IS_ZERO` and `DECREMENT_ADAPTIVE_COUNTER`.)
One way to improve this situation would be to have a way to mark the `counter` cache entry as an lvalue (possibly just by special-casing the name) and expand it in the generated code in a way that allows assigning to it. The expansion could be as simple as `next_instr[0].cache`, I think.
We don't want to split up `BINARY_OP` as it isn't specialized.

Is this reuse of `oparg` a problem in specialized instructions?
For various reasons the Tier 2 optimizer and interpreter prefer an instruction format that contains only one argument per instruction -- the argument can be either the oparg or a cache value.

This is a substantial task, but luckily we can parallelize it.

I expect that we'll find we need some improvements to the generator, to ensure that the original bytecodes (which become macro-instructions) remain as efficient as their original versions.