faster-cpython / ideas


Split bytecodes into micro-ops #592

Open gvanrossum opened 1 year ago

gvanrossum commented 1 year ago

For various reasons the Tier 2 optimizer and interpreter prefer an instruction format that contains only one argument per instruction -- the argument can be either the oparg or a cache value, but not both.

This is a substantial task, but luckily we can parallelize it.

I expect that we'll find we need some improvements to the generator, to ensure that the original bytecodes (which become macro-instructions) remain as efficient as they are today.

gvanrossum commented 1 year ago

Let's take the BINARY_OP family as an example. This has one oparg (indicating the operation, e.g. + or *=) and one cache entry, a counter:

        inst(BINARY_OP, (unused/1, lhs, rhs -- res)) {
            #if ENABLE_SPECIALIZATION
            _PyBinaryOpCache *cache = (_PyBinaryOpCache *)next_instr;
            if (ADAPTIVE_COUNTER_IS_ZERO(cache->counter)) {
                next_instr--;
                _Py_Specialize_BinaryOp(lhs, rhs, next_instr, oparg, &GETLOCAL(0));
                DISPATCH_SAME_OPARG();
            }
            STAT_INC(BINARY_OP, deferred);
            DECREMENT_ADAPTIVE_COUNTER(cache->counter);
            #endif  /* ENABLE_SPECIALIZATION */
            assert(0 <= oparg);
            assert((unsigned)oparg < Py_ARRAY_LENGTH(binary_ops));
            assert(binary_ops[oparg]);
            res = binary_ops[oparg](lhs, rhs);
            DECREF_INPUTS();
            ERROR_IF(res == NULL, error);
        }

Note that the unused/1 in the inst header is a bit of a lie -- it is in fact used (cache->counter) but we don't want the generator to generate instructions to load it. (This is a bit of a historic limitation of the generator -- the "counter" cache entry is special, and is always manipulated using macros like ADAPTIVE_COUNTER_IS_ZERO and DECREMENT_ADAPTIVE_COUNTER.)

How would we split this up into micro-ops with at most one argument? It's tempting to try to make the specialization part into one opcode and the actual operation into another, since the former uses the cache and the latter uses the oparg. But the former also needs the oparg, in order to pass it to the specialization function!

The best I can think of is to indeed split it up like this, using a hack to access the oparg of the second instruction from the definition of the first:

    macro(BINARY_OP) = _BINARY_OP_SPECIALIZE + _BINARY_OP_ACTION;
    op(_BINARY_OP_SPECIALIZE, (unused/1, lhs, rhs -- lhs, rhs)) {
            _PyBinaryOpCache *cache = (_PyBinaryOpCache *)next_instr;
            if (ADAPTIVE_COUNTER_IS_ZERO(cache->counter)) {
                next_instr--;
                _Py_Specialize_BinaryOp(lhs, rhs, next_instr, ?????, &GETLOCAL(0));
                DISPATCH_SAME_OPARG();
            }
            STAT_INC(BINARY_OP, deferred);
            DECREMENT_ADAPTIVE_COUNTER(cache->counter);
    }
    op(_BINARY_OP_ACTION, (lhs, rhs -- res)) {
            assert(0 <= oparg);
            assert((unsigned)oparg < Py_ARRAY_LENGTH(binary_ops));
            assert(binary_ops[oparg]);
            res = binary_ops[oparg](lhs, rhs);
            DECREF_INPUTS();
            ERROR_IF(res == NULL, error);
    }

Note the ????? in the first opcode definition -- here we have to substitute something that peeks into the next instruction. We can do this using something similar to what we do for super-instructions like LOAD_FAST__LOAD_FAST. It should be a macro so that the generator can substitute something different when generating tier-1 bytecode (in which case it can just reference oparg) compared to when it's generating tier-2 micro-ops.

gvanrossum commented 1 year ago

> Note that the unused/1 in the inst header is a bit of a lie -- it is in fact used (cache->counter) but we don't want the generator to generate instructions to load it. (This is a bit of a historic limitation of the generator -- the "counter" cache entry is special, and is always manipulated using macros like ADAPTIVE_COUNTER_IS_ZERO and DECREMENT_ADAPTIVE_COUNTER.)

One way to improve this situation would be to have a way to mark the counter cache entry as an lvalue (possibly just by special-casing the name) and expand it in the generated code in a way that allows assigning to it. The expansion could be as simple as next_instr[0].cache, I think.

markshannon commented 1 year ago

We don't want to split up BINARY_OP as it isn't specialized. Is this reuse of oparg a problem in specialized instructions?