faster-cpython / ideas

What should the family be of a super-instruction in our DSL? #495

Open gvanrossum opened 1 year ago

gvanrossum commented 1 year ago

It looks like "families" have a double meaning: they serve to map specialized/super-instructions back to their "base" instruction, and they are used to double-check that all family members have the same cache and stack effects. From the DSL spec it appeared that the latter was the most important, so I only implemented that, but I found out that super-instructions are also in the families as defined by opcode.py (_specializations).
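(For context, a family declaration in the DSL looks roughly like this, with the member list abbreviated:)

family(binary_op) = {
    BINARY_OP, BINARY_OP_ADD_FLOAT, BINARY_OP_ADD_INT,
    BINARY_OP_INPLACE_ADD_UNICODE, ... };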

ISTM that super-instructions shouldn't be required to have the same stack/cache effects as their first constituent -- their effects should be the combination of all their constituents' effects. The table that maps these back to their base instruction can be generated separately from the super-instruction definitions (e.g., LOAD_FAST__LOAD_CONST clearly maps to LOAD_FAST).
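To make the combination concrete with the existing super() syntax:

super(LOAD_FAST__LOAD_CONST) = LOAD_FAST + LOAD_CONST;
// LOAD_FAST pushes one value ( -- value), and so does LOAD_CONST,
// so the super-instruction's net stack effect is ( -- value1, value2),
// not the single-push effect of its first constituent.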

I also finally realized there are some pseudo-super-instructions (e.g. BINARY_OP_INPLACE_ADD_UNICODE). These behave like super-instructions in that they skip the next opcode, but are not generated using super(X) = Y + Z; -- instead, they use an explicit code block. We might be able to define these as proper super-instructions constructed from constituent parts where the second part is a dummy that ends up shrinking the stack but does nothing else. I'll play with this.
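A rough sketch of the idea, with made-up names (the real split of work between the two parts would need more thought):

inst(_B_INPLACE_PART1, (left, right, unused/1 -- unused)) {
    <in-place unicode concatenation, storing into the local>
}
inst(_B_DUMMY_POP, (unused -- )) {
    // Shrinks the stack by one; does nothing else.
}
super(BINARY_OP_INPLACE_ADD_UNICODE) = _B_INPLACE_PART1 + _B_DUMMY_POP;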

Sorry for the ramble. Thanks to @brandtbucher for clearing up some of my confusion.

gvanrossum commented 1 year ago

For now I've solved this by putting only the first sub-instruction of a super-instruction in the family. For example, if we have a super-instruction BINARY_OP_INPLACE_ADD_UNICODE that executes _B_PART1 followed by _B_PART2 (not their real names :-), the binary_op family lists _B_PART1 as one of its members, and _B_PART1's stack and cache effects must match those of BINARY_OP (the "head" of the family).

It follows that only _B_PART1 may use DEOPT_IF(), since the family's main purpose is (presumably) to name the destination of deoptimization. I expect that eventually we will have to relax this constraint, at least if we allow the components of a super-instruction themselves to be macro instructions, each composed of several micro-ops.
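Concretely, sticking with the made-up names (the DEOPT_IF conditions here are the ones the real BINARY_OP_INPLACE_ADD_UNICODE uses; the effect declaration is illustrative):

inst(_B_PART1, (left, right, unused/1 -- unused)) {
    // Guards live only in the first part; deoptimization jumps
    // back to the head of the family:
    DEOPT_IF(!PyUnicode_CheckExact(left), BINARY_OP);
    DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
    <rest of the first part>
}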

However, there may be a more fundamental constraint. If specialization replaces a single instruction with a more specialized version (e.g. BINARY_OP -> BINARY_OP_ADD_INT), the requirement is that the cache and stack effects of the latter match the former. But when specialization replaces several instructions with a specialized super-instruction (e.g. BINARY_OP + STORE_FAST -> BINARY_OP_INPLACE_ADD_UNICODE), the real requirement is that the super-instruction's cache and stack effects match the combined cache and stack effects of the original sequence.
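To spell that out for this example: BINARY_OP is (left, right -- res) plus one cache entry, and STORE_FAST is (value -- ). In the sequence, the res pushed by the first is the value popped by the second, so the net effect the super-instruction must declare is (left, right, unused/1 -- ).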

Right now the super-instruction syntax is only designed to easily define simple compositions like LOAD_FAST__LOAD_CONST = LOAD_FAST + LOAD_CONST, where this constraint holds by construction. It would be nice if there were an alternate syntax that could express the constraint separately from the definition, e.g.

super(BINARY_OP_INPLACE_ADD_UNICODE, BINARY_OP + STORE_FAST) {
    <code block implementing the new instruction>
}

Edit: The new syntax should allow specifying stack and cache effects for the new instruction, so it would more likely be a variation of this:

super(BINARY_OP_INPLACE_ADD_UNICODE,
      BINARY_OP + STORE_FAST,
      (left, right, unused/1 -- ))
{
    <code block>
}

gvanrossum commented 1 year ago

Possibly the info should be rolled into the inst syntax instead, e.g.

inst(BINARY_OP_INPLACE_ADD_UNICODE,
     (left, right, unused/1 -- ),
     BINARY_OP + STORE_FAST)
{
    <code block>
}

If we did this for all specialized instructions, we wouldn't even need family definitions any more. E.g.

inst(BINARY_OP_MULTIPLY_INT, (left, right, unused/1 -- prod), BINARY_OP) {
    <code block>
}
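As a side effect, the family table that opcode.py currently hand-maintains in _specializations could then be derived by the generator: every inst() that names a base (BINARY_OP in these examples) contributes one entry under that base.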