faster-cpython / ideas


Handling "tuple/any/all comprehensions" in the compiler. #545

Open markshannon opened 1 year ago

markshannon commented 1 year ago

Consider the following function containing a "tuple comprehension":

def f():
    return tuple(expr for var in seq)

This compiles to the following bytecode:

>>> dis.dis(f)
  1           0 RESUME                   0

  2           2 LOAD_GLOBAL              1 (NULL + tuple)
             14 LOAD_CONST               1 (<code object <genexpr> at 0x7f0074ed0d50, file "<stdin>", line 2>)
             16 MAKE_FUNCTION            0
             18 LOAD_GLOBAL              2 (seq)
             30 GET_ITER
             32 CALL                     0
             42 CALL                     1
             52 POP_TOP
             54 LOAD_CONST               0 (None)
             56 RETURN_VALUE

Disassembly of <code object <genexpr> at 0x7f0074ed0d50, file "<stdin>", line 2>:
  2           0 RETURN_GENERATOR
              2 POP_TOP
              4 RESUME                   0
              6 LOAD_FAST                0 (.0)
        >>    8 FOR_ITER                11 (to 34)
             12 STORE_FAST               1 (var)
             14 LOAD_GLOBAL              0 (expr)
             26 YIELD_VALUE              2
             28 RESUME                   1
             30 POP_TOP
             32 JUMP_BACKWARD           13 (to 8)
        >>   34 END_FOR
             36 LOAD_CONST               0 (None)
             38 RETURN_VALUE
        >>   40 CALL_INTRINSIC_1         3
             42 RERAISE                  1
ExceptionTable:
  4 to 38 -> 40 [0] lasti

What we would like is:

>>> dis.dis(f)
  RESUME                   0
  BUILD_LIST               0
  LOAD_GLOBAL              2 (seq)
  FOR_ITER
  STORE_FAST               1 (var)
  LOAD_GLOBAL              0 (expr)
  LIST_APPEND              2
  JUMP_BACKWARD  
  END_FOR
  LIST_TO_TUPLE
  RETURN_VALUE
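
As a sanity check (my own example, not from the issue), the rewrite preserves semantics when tuple is the builtin: building a list eagerly and converting it yields the same result as feeding a generator to tuple().

```python
# Hypothetical example values; `x + 1` stands in for `expr`.
seq = [1, 2, 3]

via_genexpr = tuple(x + 1 for x in seq)  # what the source says
via_list = tuple([x + 1 for x in seq])   # what BUILD_LIST + LIST_TO_TUPLE would compute

assert via_genexpr == via_list == (2, 3, 4)
```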

For list comprehensions, we can use escape analysis and inlining to remove the overhead. @carljm is working on doing just that.

However, the above "tuple comprehension" is resistant to escape analysis, because the name tuple could have been rebound to anything. We could use an approach like the one described in @vstinner's FAT Python and convert:

def f():
    return tuple(expr for var in seq)

into:

def f():
    $tmp = tuple
    if $tmp is __the_builtin_tuple_type:
        return LIST_TO_TUPLE([expr for var in seq])
    return tuple(expr for var in seq)

where LIST_TO_TUPLE is a VM instruction, not a call.
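
Written out as ordinary Python rather than bytecode, the guard amounts to something like this sketch (my own illustration, not CPython internals; builtins.tuple plays the role of __the_builtin_tuple_type, and x * 2 over seq stands in for the comprehension):

```python
import builtins

def f(seq):
    # Guard: has the global name `tuple` been shadowed or rebound?
    if tuple is builtins.tuple:
        # Fast path: build a list, then convert it in one step. In the
        # proposed bytecode this conversion is the LIST_TO_TUPLE
        # instruction, not a call into PySequence_Tuple.
        return tuple([x * 2 for x in seq])
    # Slow path: `tuple` means something else; preserve the original
    # semantics exactly by calling it with a generator.
    return tuple(x * 2 for x in seq)
```

Since the guard is just an identity check on a global, a tier 2 trace that has already specialized on the global can drop it, leaving only the fast path.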

With escape analysis and inlining, we should get the ideal code, with the small prefix:

    LOAD_GLOBAL 'tuple'
    LOAD_CONST tuple
    IS_OP
    POP_JUMP_IF_FALSE

We would expect the tier 2 optimizer to eliminate the prefix.

The same optimization applies to any and all as well as tuple. The potential performance gain is even greater in those cases, since any and all typically do not exhaust the generator.
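
To see why any and all benefit more, note that they stop pulling from the generator at the first decisive element. A small demonstration (my own, using a hypothetical record helper to count how many elements are actually produced):

```python
seen = []

def record(x):
    seen.append(x)  # track which elements were actually evaluated
    return x

# any() returns True as soon as one element satisfies the predicate,
# so the generator is abandoned after the fourth item.
result = any(record(x) > 2 for x in range(1000))

assert result is True
assert seen == [0, 1, 2, 3]  # the remaining 996 items were never produced
```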

itamaro commented 1 year ago

Is this done now, with the PEP 709 implementation as it landed?

JelleZijlstra commented 1 year ago

No, PEP 709 only handles list/set/dict comprehensions, not generator expressions.

iritkatriel commented 11 months ago

Unless we can implement LIST_TO_TUPLE in a way that doesn't double the necessary space, I think memory might be an issue for large outputs.

markshannon commented 11 months ago

PySequence_Tuple also needs to perform a number of copies. It is mainly a matter of luck whether an additional allocation is needed at the end.

The main difference is likely to be the number of resizes while building the result: PySequence_Tuple doubles the buffer, but PyList_Append() grows it only by a factor of about 9/8.

markshannon commented 11 months ago

Amortizing to 1/n allocations and copying one pointer per element is going to be much cheaper than a round trip through PySequence_Tuple, the tower of gen_iternext, gen_send, etc., and the interpreter.

markshannon commented 11 months ago

With a growth factor of 9/8 we need 6 reallocations per doubling, so that's ~10 additional copies per element for a large tuple. I suspect that will still be faster than the round-trip through PySequence_Tuple, but we should measure it.
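
The arithmetic above can be checked with a quick back-of-the-envelope script (my own sketch, not CPython code): growing a buffer from n to 2n with growth factor g takes log(2)/log(g) reallocations, and since each reallocation copies every live element, the total copies per element over the whole build approach g/(g - 1).

```python
import math

def reallocs_per_doubling(g):
    # number of resizes needed to grow the capacity from n to 2n
    return math.log(2) / math.log(g)

def copies_per_element(g):
    # each resize copies the live elements; summing the geometric
    # series n + n/g + n/g**2 + ... gives n * g / (g - 1) copies total
    return g / (g - 1)

print(round(reallocs_per_doubling(9 / 8)))  # ~6 resizes per doubling
print(copies_per_element(2))                # 2.0 copies/element when doubling
print(copies_per_element(9 / 8))            # 9.0 copies/element at 9/8
```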

We can look to change the growth factor for lists in another issue.