faster-cpython / ideas


Avoid reallocations for list comprehensions #549

Open · lpereira opened 1 year ago

lpereira commented 1 year ago

Can we cache the resulting list size and pre-allocate the list when executing a list comprehension? (And then adjust ob_size to the real size when done. If there's a large discrepancy between ob_size and the list's allocated field, the cached size can be adjusted, or the resulting list reallocated to fit.)

This should avoid quite a few reallocations.
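A minimal sketch, using only the public C API, of this pre-allocate-then-shrink idea. build_with_cached_size is a made-up helper name and guess stands in for the cached size; the real change would live inside the BUILD_LIST/LIST_APPEND handlers rather than in a function like this.

```c
#include <Python.h>

/* Build a list from `iterable`, pre-allocated to `guess` slots
 * (the size cached from a previous run; guess >= 0). */
static PyObject *
build_with_cached_size(PyObject *iterable, Py_ssize_t guess)
{
    PyObject *list = PyList_New(guess);  /* ob_size == guess, items are NULL */
    if (list == NULL) {
        return NULL;
    }
    PyObject *it = PyObject_GetIter(iterable);
    if (it == NULL) {
        Py_DECREF(list);
        return NULL;
    }
    Py_ssize_t n = 0;
    PyObject *item;
    while ((item = PyIter_Next(it)) != NULL) {
        if (n < guess) {
            PyList_SET_ITEM(list, n, item);      /* fills a slot, no realloc */
        }
        else {
            int err = PyList_Append(list, item); /* guess too small: grow */
            Py_DECREF(item);
            if (err < 0) {
                goto fail;
            }
        }
        n++;
    }
    Py_DECREF(it);
    if (PyErr_Occurred()) {  /* the iterator raised rather than finishing */
        Py_DECREF(list);
        return NULL;
    }
    if (n < guess) {
        /* Trim ob_size to the real length; the NULL tail is never visible,
         * and list deallocation tolerates NULL items anyway. */
        Py_SET_SIZE(list, n);
    }
    return list;

fail:
    Py_DECREF(it);
    Py_DECREF(list);
    return NULL;
}
```

Note that after the trim, ob_size and the backing allocation can still disagree, which is exactly the discrepancy the cached size would be adjusted against.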

brandtbucher commented 1 year ago

Some previous discussion over at https://github.com/python/cpython/issues/80732.

lpereira commented 1 year ago

> Some previous discussion over at python/cpython#80732.

Interesting discussion! Some of the points may no longer apply now that we have inline cache entries and the like.

For instance, off the top of my head: instead of ending a listcomp with a RETURN_VALUE opcode, we could end it with a RETURN_LISTCOMP_VALUE (or similar) that takes an oparg pointing back at the BUILD_LIST opcode. This new opcode would store the length of the list on top of the stack into the BUILD_LIST's cache entry, then execute as if it were a RETURN_VALUE. The next time that BUILD_LIST executes, the list would be preallocated to that size.

Something like that would address many of the concerns in that previous discussion, since there would no longer be any reliance on __length_hint__.
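With the interpreter machinery stripped away, the feedback loop might look like the following self-contained sketch. The build_list_cache struct and both function names are invented for illustration; in the real proposal the cache entry would live inline in the bytecode, next to the BUILD_LIST instruction.

```c
#include <stdio.h>
#include <stddef.h>

/* One cache entry per BUILD_LIST site. */
typedef struct {
    size_t hint;  /* list length observed the last time this site ran */
} build_list_cache;

/* BUILD_LIST side: how many slots to pre-allocate on this run. */
static size_t
preallocation_for(const build_list_cache *cache)
{
    return cache->hint;  /* 0 on the first run: no guess, grow as usual */
}

/* RETURN_LISTCOMP_VALUE side: write the real length back into the
 * BUILD_LIST cache so the next run pre-allocates accordingly. */
static void
record_result_length(build_list_cache *cache, size_t actual)
{
    cache->hint = actual;
}

int
main(void)
{
    build_list_cache cache = {0};
    size_t actual_lengths[] = {10, 10, 1000};  /* three runs of one comprehension */
    for (size_t i = 0; i < 3; i++) {
        printf("run %zu: preallocate %zu, actual %zu\n",
               i, preallocation_for(&cache), actual_lengths[i]);
        record_result_length(&cache, actual_lengths[i]);
    }
    return 0;
}
```

The third run illustrates the failure mode raised below: the cached hint only pays off when a given comprehension site produces similar lengths from run to run.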

JelleZijlstra commented 1 year ago

That would rely on the assumption that when the same listcomp executes multiple times, it's likely to build a list of the same size each time, right?

I don't have data, but I wouldn't expect that assumption to hold up well. For example, I just looked at some of my code, and in this file there are three listcomps: one iterates over the children of a module AST node, one over the children of a classdef AST node, and one over all the names defined in a module. Typical usage would call those functions on a bunch of different modules, so there's no reason to think the list lengths will be the same each time.

carljm commented 1 year ago

I agree that an optimization that caches the length, assuming that different runs will result in similar-sized output, probably does not make sense. But I think something more similar to what was originally pursued in https://github.com/python/cpython/issues/80732 (pre-allocating per-run based on the actual length of the input to the comprehension) may still make sense. I think this is a good summary from the OP of the conclusions in that issue:

> My conclusion was that the list comprehension should be initialized to the length of the target, before GET_ITER is run. This would remove the overhead for range objects, because you could simply call _PyObject_HasLen, which would return true for dict, list, tuple, and set, but false for range objects (which is what you want).
>
> The issue is that GET_ITER is called outside the code object for the comprehension, so you'd have to pass an additional argument to the comprehension generator.
>
> This is way outside of my expertise, but it's the only way I can see to get a sizeable benefit, with minimal code and no edge cases that are slower.

Inlining can also solve this problem and make it pretty trivial to cheaply presize the output list based on the known size of the input, if it's a known type.
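For concreteness, here is a rough sketch, using only the public C API, of the "length of the input before GET_ITER" check from the quoted summary. presize_hint is a made-up helper name, and CPython's internal _PyObject_HasLen is cheaper still, since it only tests for the sq_length/mp_length slots rather than calling them.

```c
#include <Python.h>

/* Compute a pre-size hint for the output list from the comprehension
 * input, before GET_ITER runs.  Returns 0 when there is no safe hint. */
static Py_ssize_t
presize_hint(PyObject *iterable)
{
    /* Only trust exact builtin containers, where len() is O(1) and
     * matches the number of items the comprehension will consume. */
    if (PyList_CheckExact(iterable) ||
        PyTuple_CheckExact(iterable) ||
        PyDict_CheckExact(iterable) ||
        PyAnySet_CheckExact(iterable))
    {
        return PyObject_Size(iterable);
    }
    return 0;  /* no hint: build the list the usual way */
}
```

Once the comprehension is inlined into the enclosing function rather than run as a separate code object, the GET_ITER on the input and the BUILD_LIST for the output sit in the same frame, so a hint like this could be consumed directly instead of being passed in as an extra argument.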