faster-cpython / ideas

1.67k stars 49 forks source link

Tier 2 IR format is not flexible enough for LOAD_ATTR_PROPERTY #634

Open gvanrossum opened 8 months ago

gvanrossum commented 8 months ago

In trying to split LOAD_ATTR_PROPERTY into uops I encountered a problem: its guard needs to check that two cache entries match: func_version must equal to fget->func_version. Both func_version and fget are inline cache entries. Because Tier 2 IR is limited to a single cache entry per uop, we can't express this.

I tried to hack around it by loading fget onto the stack, but of course that can't work: if the guard fails, there's an extra entry on top of the stack. A worse hack might pop it off the stack before calling DEOPT_IF(), but that seems very unprincipled (what's the JIT going to do about this?).

Moreover it's possible that we're already at the pre-calculated max stack level. We could do a stack space check before pushing, growing the frame if neede (it would be safe to deopt if there's no room), but that also seems very unprincipled. (Also in a generator, whose frame has already been copied, we can't grow the frame.)

All in all I'm stumped here. It would be nice if we could translate this specialized opcode to Tier 2, since simple property getters would be good targets for function inlining in the optimizer (a common idiom being @property / def foo(self): return self._foo) -- but we don't seem to have the machinery here.

Maybe @Fidget-Spinner has a thought here?

Fidget-Spinner commented 8 months ago

Hmm I don't have any ideas on how to fix this. One possible solution (which matches Mark's plan on shrinking inline caches) is to store pointers as 32 bit offsets from a certain point in memory (maybe the first allocated function object?) and shrink the function version.

markshannon commented 8 months ago

The problematic micro-op needs to get the function, and then deopt if the version doesn't match, only pushing the result if the version matches.

That needs 96 bits of data. We already have that in the tier 2 IR, it is just that we call the 32 bit value oparg. Given that the micro-op doesn't need an oparg, we could allow two operands, one of the 32 bits and one of 64 bits. Implicitly setting the 32 bit operand to the oparg, if that is used.

In other words, this would be OK:

op(_LOAD_AND_CHECK_PROPERTY_FUNC, (func_version/2, fget/4 -- f) {
    assert(Py_IS_TYPE(fget, &PyFunction_Type));
    PyFunctionObject *f = (PyFunctionObject *)fget;
   assert(func_version != 0);
   DEOPT_IF(f->func_version != func_version);    
}

only because oparg is not present.

Given that the opcode should always fit into 16 bits, and a 16 bit oparg should be enough for almost all code, we could add another operand:

struct {
    uint16_t opcode;
    uint16_t oparg;
    uint32_t operand32;
    uintptr_t operandp;
};

As long as we don't need the space for the unused operands in the interpreter or JIT compiled code, we can have as many operands as we want.

gvanrossum commented 8 months ago

Thanks, I will try this next. I think Ken already has plans to put 16 bits of the opcode to some other use (the deopt target IIRC) so putting the extra thing in oparg if oparg isn't otherwise used by the uop makes sense.

Not sure what the optimizer will make of it, it'll just have to be special-cased I'm afraid.

gvanrossum commented 7 months ago

Alas, we've now added a 32-bit 'target' field, which gives the deopt target; to fit it we've reduced 'opcode' and 'oparg' to 16 bits. Since the critical uop would need 32 bits for func_version and 64 bits for fget, we'd have to expand the size of _PyUOpInstruction even further, from 128 bits to 192 bits... :-( Before we go there we'd probably have to do something different about the code representation in the executor. In the benchmarks this probably won't make a big difference though; LOAD_ATTR_PROPERTY is on a modest 3rd place among unsupported uops, with 5k occurrences, behind FOR_ITER_GEN (74k) and CALL (9k).