faster-cpython / ideas


Avoid BUILD_TUPLE followed by RETURN_VALUE if UNPACK_SEQUENCE is right after CALL #509

Closed: lpereira closed this issue 1 year ago

lpereira commented 1 year ago

This should avoid an unnecessary round-trip to the allocator in what I believe is a very common case: a function that returns a tuple which the caller immediately unpacks, as in `a, b = f()`.
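To make the pattern concrete, the standard-library `dis` module can show the bytecode of a small callee that returns a tuple and a caller that unpacks the result right away. This is a sketch: `pair`, `caller` and the helper functions are hypothetical names for illustration, and opcode names differ across CPython versions (`CALL_FUNCTION` before 3.11, `CALL` from 3.11 on), so the checks only look for the adjacency loosely.

```python
import dis

# Instructions that move no values; skipped so the adjacency checks are not
# thrown off by them (NOT_TAKEN only exists on newer CPython versions).
_SKIP = {"NOP", "NOT_TAKEN"}


def pair(a, b):
    # Callee: the tuple is built from locals, so it cannot be constant-folded
    # and compiles to BUILD_TUPLE 2 followed by RETURN_VALUE.
    return a, b


def caller(a, b):
    # Caller: the call result is consumed straight away by UNPACK_SEQUENCE 2.
    x, y = pair(a, b)
    return x + y


def opnames(func):
    """Opcode names of func (inline cache entries are not listed by default)."""
    return [ins.opname for ins in dis.get_instructions(func) if ins.opname not in _SKIP]


def followed_by(func, first, second):
    """True if an opcode accepted by `first` is immediately followed by `second`."""
    ops = opnames(func)
    return any(first(a) and b == second for a, b in zip(ops, ops[1:]))


def tuple_is_returned(func):
    return followed_by(func, lambda op: op == "BUILD_TUPLE", "RETURN_VALUE")


def call_is_unpacked(func):
    # CALL_FUNCTION before 3.11, CALL from 3.11 on; both start with "CALL".
    return followed_by(func, lambda op: op.startswith("CALL"), "UNPACK_SEQUENCE")


if __name__ == "__main__":
    dis.dis(pair)
    dis.dis(caller)
    print(tuple_is_returned(pair), call_is_unpacked(caller))
```

The tuple built by `pair` lives only between `RETURN_VALUE` and `UNPACK_SEQUENCE`, which is the allocation the proposal aims to skip.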

The way I see this being implemented is:

Thoughts on this?

gvanrossum commented 1 year ago

Looks like a winner for various common cases; let's ping @markshannon.

Could we do something similar for for-loops, e.g. `for i, x in enumerate(xs)`?
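The loop case has the same shape, with `FOR_ITER` producing the tuple and `UNPACK_SEQUENCE` consuming it. A minimal check with `dis`, assuming a hypothetical function `weighted_sum` as the example loop:

```python
import dis


def weighted_sum(xs):
    total = 0
    # enumerate() yields (index, item) tuples that the loop header
    # unpacks into i and x right away.
    for i, x in enumerate(xs):
        total += i * x
    return total


def iter_is_unpacked(func):
    """True if FOR_ITER is immediately followed by UNPACK_SEQUENCE in func."""
    # NOP/NOT_TAKEN are skipped because they move no values.
    ops = [ins.opname for ins in dis.get_instructions(func)
           if ins.opname not in {"NOP", "NOT_TAKEN"}]
    return any(a == "FOR_ITER" and b == "UNPACK_SEQUENCE" for a, b in zip(ops, ops[1:]))


if __name__ == "__main__":
    dis.dis(weighted_sum)
    print(iter_is_unpacked(weighted_sum))
```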

lpereira commented 1 year ago

FOR_ITER followed by UNPACK_SEQUENCE seems like another great candidate for something like this, but it looks like a big can of worms: if I understood the machinery correctly, it would require changing how tp_iternext() works.

Maybe we could add a new iterator interface, tp_iternext_inplace(), that takes a pointer to a fixed-size array (like the stack) and writes the items there? Since UNPACK_SEQUENCE needs a fixed number of items anyway, this seems like a plausible solution.
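A pure-Python model of that idea, for illustration only: `EnumerateInplace` and its `iternext_inplace` method are hypothetical stand-ins for a C-level `tp_iternext_inplace()`, and the two-element list stands in for the fixed-size stack slots. Python itself cannot avoid allocations this way; the sketch only shows the shape of the interface.

```python
from typing import Any, Iterable, List, Tuple


class EnumerateInplace:
    """Hypothetical model of tp_iternext_inplace() for enumerate.

    Instead of returning a fresh (index, item) tuple, each step writes the
    two values into a caller-provided, fixed-size buffer.
    """

    def __init__(self, iterable: Iterable[Any], start: int = 0) -> None:
        self._it = iter(iterable)
        self._count = start

    def iternext_inplace(self, out: List[Any]) -> bool:
        """Fill out[0] and out[1]; return False once the iterator is exhausted."""
        try:
            item = next(self._it)
        except StopIteration:
            return False
        out[0] = self._count
        out[1] = item
        self._count += 1
        return True


def collect(iterable: Iterable[Any]) -> List[Tuple[int, Any]]:
    """Drive the model the way a fused FOR_ITER + UNPACK_SEQUENCE 2 might."""
    buf: List[Any] = [None, None]  # fixed-size "stack slots", allocated once
    seen = []
    it = EnumerateInplace(iterable)
    while it.iternext_inplace(buf):
        i = buf[0]
        x = buf[1]
        seen.append((i, x))  # tuple built only to check the result
    return seen
```

The design choice being modeled is that the caller owns the storage: because UNPACK_SEQUENCE always wants a known number of items, the buffer size can be fixed at compile time.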

lpereira commented 1 year ago

Ah, I just looked into enumobject.c, and it seems that enumerate caches its result tuple (modifying it in place before returning it, as long as nothing else still references it), so this might not be as necessary for the FOR_ITER + UNPACK_SEQUENCE case.

markshannon commented 1 year ago

This is an optimization for the tier-2 (short-trace) optimizer.

The sequence

BUILD_TUPLE n
RETURN_VALUE
UNPACK_SEQUENCE n

can be transformed into a bulk transfer of values from callee to caller.
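A toy version of that rewrite, assuming a trace is a flat list of `(opname, oparg)` pairs in which the callee's instructions run straight into the caller's, and using a made-up `RETURN_UNPACKED` opcode for the bulk transfer. Because UNPACK_SEQUENCE pushes items in reverse order (so the first item ends up on top), the fused op has to leave the n values reversed. The tiny interpreter treats RETURN_VALUE as a no-op on one shared stack, which is a simplification of real frame handling.

```python
from typing import Any, List, Tuple

Instr = Tuple[str, int]  # (opname, oparg); hypothetical, simplified trace format


def fuse_return_unpack(trace: List[Instr]) -> List[Instr]:
    """Replace BUILD_TUPLE n; RETURN_VALUE; UNPACK_SEQUENCE n with one bulk op."""
    out: List[Instr] = []
    i = 0
    while i < len(trace):
        window = trace[i:i + 3]
        if (
            len(window) == 3
            and window[0][0] == "BUILD_TUPLE"
            and window[1][0] == "RETURN_VALUE"
            and window[2][0] == "UNPACK_SEQUENCE"
            and window[0][1] == window[2][1]  # tuple size must match unpack count
        ):
            out.append(("RETURN_UNPACKED", window[0][1]))  # hypothetical opcode
            i += 3
        else:
            out.append(trace[i])
            i += 1
    return out


def run(trace: List[Instr], stack: List[Any]) -> List[Any]:
    """Tiny stack interpreter for just the opcodes above (single shared stack)."""
    stack = list(stack)
    for op, n in trace:
        if op == "BUILD_TUPLE":
            items = tuple(stack[len(stack) - n:])
            del stack[len(stack) - n:]
            stack.append(items)  # the allocation the fusion avoids
        elif op == "RETURN_VALUE":
            pass  # simplification: callee and caller share one stack here
        elif op == "UNPACK_SEQUENCE":
            seq = stack.pop()
            if len(seq) != n:
                raise ValueError("wrong number of values to unpack")
            stack.extend(reversed(seq))  # first item ends up on top
        elif op == "RETURN_UNPACKED":
            top = stack[len(stack) - n:]
            del stack[len(stack) - n:]
            stack.extend(reversed(top))  # same stack effect, no tuple built
        else:
            raise NotImplementedError(op)
    return stack
```

The size check on the two opargs matters: if they differ, the original code raises at runtime, so the pattern must be left alone.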

sweeneyde commented 1 year ago

Adding FOR_ITER_DICT_ITEMS and FOR_ITER_ENUMERATE here did speed up dict.items() and enumerate() a bit, but for some reason, it slowed other things down when I measured (I may have measured incorrectly).

I'm not sure if we're still interested in virtual (index on the stack) iterators, but maybe that could be revisited.
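A rough illustration of a virtual iterator, assuming the loop runs over a list: the iteration state is just the sequence and an integer index kept on the value stack, so no iterator object exists, and a hypothetical fused FOR_ITER_ENUMERATE + UNPACK_SEQUENCE 2 step pushes the index and item directly without building a tuple. The function names and stack layout are illustrative assumptions, not CPython's.

```python
from typing import Any, List, Tuple


def virtual_enumerate_step(stack: List[Any]) -> bool:
    """Hypothetical fused FOR_ITER_ENUMERATE + UNPACK_SEQUENCE 2 over a list.

    Assumed stack layout on entry: [..., seq, index]. On success the index is
    advanced in place and x, then i, are pushed (i on top, as UNPACK_SEQUENCE
    would leave them). On exhaustion the two state slots are popped.
    """
    seq, index = stack[-2], stack[-1]
    if index >= len(seq):
        del stack[-2:]  # loop exit: drop the "virtual iterator"
        return False
    stack[-1] = index + 1     # advance the index kept on the stack
    stack.append(seq[index])  # x
    stack.append(index)       # i (top of stack)
    return True


def run_loop(xs: List[Any]) -> Tuple[List[Tuple[int, Any]], List[Any]]:
    """Model `for i, x in enumerate(xs)`; return the pairs seen and the final stack."""
    stack: List[Any] = [xs, 0]  # instead of GET_ITER, push (seq, 0)
    seen = []
    while virtual_enumerate_step(stack):
        i = stack.pop()  # STORE_FAST i
        x = stack.pop()  # STORE_FAST x
        seen.append((i, x))
    return seen, stack
```

Re-reading `len(seq)` on every step keeps the model consistent with list iteration when the list is modified inside the loop.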