faster-cpython / ideas

1.67k stars 49 forks source link

A proposal for variable-length instruction decoding #539

Closed gvanrossum closed 1 year ago

gvanrossum commented 1 year ago

This contains an idea on how to do EXTENDED_ARG for variable-length instructions. It was weirdly inspired by the blog post that @lpereira mentioned in #537.

gvanrossum commented 1 year ago

Sorry about the force pushes, I'm a gidiot. It should now be correct.

brandtbucher commented 1 year ago

Just a thought I had while reading this: we could avoid the (slightly clunkly, in my opinion) double-dispatch into the middle of instructions by using a decoding scheme like this:

#define DECODE_OPARGS(OPARG_A, OPARG_B) \
    do {                                \
        assert((OPARG_A) & 0xFF == 0);  \
        assert((OPARG_B) & 0xFF == 0);  \
        word = *next_instr++;           \
        (OPARG_A) |= word.first_byte;   \
        (OPARG_B) |= word.second_byte;  \
    } while (0)

#define CLEAR_OPARGS() \
    do {               \
        oparg1 = 0;    \
        oparg2 = 0;    \
        oparg3 = 0;    \
        oparg4 = 0;    \
        oparg5 = 0;    \
    } while (0)

uint32_t oparg1, oparg2, oparg3, oparg4, oparg5;
CLEAR_OPARGS();
while (true) {
    _Py_CODEUNIT word = *next_instr++;
    uint8_t opcode = word.first_byte;
    assert(oparg1 & 0xFF == 0);
    oparg1 |= word.second_byte;
    switch (oparg) {
        case OP_1:
            // ...
            CLEAR_OPARGS();
            break;
        case OP_3:
            DECODE_OPARGS(oparg2, oparg3);
            // ...
            CLEAR_OPARGS();
            break;
        case OP_5:
            DECODE_OPARGS(oparg2, oparg3);
            DECODE_OPARGS(oparg4, oparg5);
            // ...
            CLEAR_OPARGS();
            break;
        case EXTENDED_ARG_1:
            oparg1 <<= 8;
            break;
        case EXTENDED_ARG_3:
            DECODE_OPARGS(oparg2, oparg3);
            oparg1 <<= 8;
            oparg2 <<= 8;
            oparg3 <<= 8;
            break;
        case EXTENDED_ARG_5:
            DECODE_OPARGS(oparg2, oparg3);
            DECODE_OPARGS(oparg4, oparg5);
            oparg1 <<= 8;
            oparg2 <<= 8;
            oparg3 <<= 8;
            oparg4 <<= 8;
            oparg5 <<= 8;
            break;
    }
}

Basically, the aforementioned clunkiness comes from the assignments to opargs, which would clobber any extended values. Instead, this zeroes all opargs at the end of each normal instruction and reads them in at the start of the next instruction using |.

I'm not sure how the cost of zeroing and or-ing the opargs in the common case compares to the double-switches in the uncommon case, but I know that I find this quite a bit easier to reason about.

lpereira commented 1 year ago

@brandtbucher wrote:

I'm not sure how the cost of zeroing and or-ing the opargs in the common case compares to the double-switches in the uncommon case, but I know that I find this quite a bit easier to reason about.

Is this pattern of zero+or necessary? Could opargs be just assigned to the locals instead?

brandtbucher commented 1 year ago

@brandtbucher wrote:

I'm not sure how the cost of zeroing and or-ing the opargs in the common case compares to the double-switches in the uncommon case, but I know that I find this quite a bit easier to reason about.

Is this pattern of zero+or necessary? Could opargs be just assigned to the locals instead?

But how would you extend them? The idea here is that the entry to an instruction is the same whether the opargs have been extended or not.

gvanrossum commented 1 year ago

I originally had a version where the opargs were zeroed out, but it seems pretty crazy to waste time writing zeros as part of every instruction, so I came up with the nested switch. At least the nested switch only exists for EXTENDED_ARG_{3,5} rather than in every DISPATCH() call as in the blog that L mentioned.

gvanrossum commented 1 year ago

Since we're not doing the register VM (yet) this is no longer important. Closing.