faster-cpython / ideas


Superblock optimization: Partial evaluation #561

Open markshannon opened 1 year ago

markshannon commented 1 year ago

This pass defers the creation and update of objects, including the frame stack.

By maintaining a "shadow frame stack" we can defer operations that create new objects or modify the frame stack, only materializing the frame stack when needed.

Doing so for SAVE_IP is trivial: we just record the current prev_instr on the shadow frame. I will omit the SAVE_IPs for clarity in the following example.
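
As an illustration, absorbing SAVE_IP into a shadow frame might look like this (a minimal sketch; the class and field names are mine, not the actual tier 2 data structures):

```python
from dataclasses import dataclass, field

@dataclass
class ShadowFrame:
    """Deferred stand-in for a real interpreter frame."""
    locals: tuple = ()
    stack: list = field(default_factory=list)
    prev_instr: int = -1  # last SAVE_IP target; -1 means "not yet saved"

def save_ip(frame: ShadowFrame, instr_offset: int) -> None:
    # SAVE_IP is absorbed: we just remember the offset on the shadow
    # frame; no real frame object is touched until materialization.
    frame.prev_instr = instr_offset

f = ShadowFrame()
save_ip(f, 42)
assert f.prev_instr == 42
```

Only when the frame must be materialized (e.g. at a deopt or an unpredictable exit) does the recorded offset get written into a real frame.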

Taking the code from the specialization example:

 POP_TOP
 LOAD_CONST  (typing.cast)
 LOAD_CONST (int)
 LOAD_FAST                0 (x)
 PUSH_FRAME               2
 CHECK_EVAL_BREAKER
 LOAD_FAST                1 (val)
 RETURN_VALUE
 STORE_FAST               0 (x)
 LOAD_GLOBAL_MODULE_UNCHECKED     (unpredictable)
 EXIT                     # Can't predict where we are going

This produces the following superblock with commented shadow frame stack:

 NOP    # stack: [ typing ]
 NOP    # stack: [  ]
 NOP    # stack: [ typing.cast ]
 NOP    # stack: [ typing.cast, int ]
 NOP    # stack: [ typing.cast, int, LOCAL(0) ]
 NOP    # prev_frame: []  frame: (locals=(int, LOCAL(0)), stack: [])   We push a shadow frame instead of a real one
 CHECK_EVAL_BREAKER # We still need to check the eval breaker, but only once per superblock or loop.
 NOP    # prev_frame: []  frame: (locals=(int, LOCAL(0)), stack: [ LOCAL(0) ]) 
 NOP    # stack: [ LOCAL(0) ]  The return is performed on the shadow stack.
 NOP    # stack: []   (this is a NOP as we are storing a value to itself)
 LOAD_GLOBAL_MODULE_UNCHECKED     (unpredictable)
 SAVE_IP
 EXIT                     # Can't predict where we are going

Which after the cleanup pass is:

 CHECK_EVAL_BREAKER
 LOAD_GLOBAL_MODULE_UNCHECKED     (unpredictable)
 SAVE_IP
 EXIT                     # Can't predict where we are going

This is obviously a somewhat contrived example, but you get the idea.
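
The cleanup pass referred to above can be sketched as a simple filter over the superblock (a hypothetical uop representation; a real pass would also have to fix up jump targets and exit stubs):

```python
def cleanup(superblock):
    """Drop the NOPs left behind by the partial-evaluation pass."""
    return [uop for uop in superblock if uop[0] != "NOP"]

trace = [
    ("NOP",), ("CHECK_EVAL_BREAKER",), ("NOP",), ("NOP",),
    ("LOAD_GLOBAL_MODULE_UNCHECKED", "unpredictable"),
    ("SAVE_IP",), ("EXIT",),
]
assert cleanup(trace) == [
    ("CHECK_EVAL_BREAKER",),
    ("LOAD_GLOBAL_MODULE_UNCHECKED", "unpredictable"),
    ("SAVE_IP",), ("EXIT",),
]
```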

Fidget-Spinner commented 1 year ago

Please bear with me as I'm quite confused. What keeps track of the shadow stack (how does the shadow stack suddenly know that typing or other things are on the stack)? Or is it just another operand stack?

Fidget-Spinner commented 1 year ago

Hopefully this helps anyone trying to understand this better. After reading Mark's PhD thesis https://theses.gla.ac.uk/2975/, this is what I surmise:

Ref page 114 (5.7, Deferred Object Creation): What's a shadow frame stack?

"The DOC pass defers creating objects for as long as possible. To do this, it maintains a shadow data stack and a shadow frame stack to record objects that it is currently deferring. The shadow data stack and a shadow frame stack record the difference between the original, non-deferred state and the actual, deferred state. When the DOC pass encounters an instruction that would create a new object which would be of a type that the DOC pass understands, such as tuple, then a deferred object is pushed to the shadow stack. The DOC pass also maintains a shadow line number."

(Shannon, 2011)

What happens at boundaries between shadow stack objects and operand stack objects

"There are a number of instructions that the DOC pass understands but cannot defer. To handle these, the DOC pass is able to mix objects that have already been created with deferred ones. For example, if the DOC pass encounters an i_add instruction it must ensure that the top two values on the stack actually exist, emitting the code to create any deferred objects. It then emits the i_add instruction and pushes a marker to the deferred stack, showing that the object on top of the shadow stack corresponds to the one on top of the real stack."

(Shannon, 2011)
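
The mixing described in that passage can be sketched like so (hypothetical names; `Deferred` stands for an object whose creation the pass has postponed, and "emitting code" is modelled as appending to a list):

```python
class Deferred:
    """Placeholder for an object whose creation has been deferred."""
    def __init__(self, make):
        self.make = make  # zero-argument callable that builds the real object

def materialize(shadow_stack, emitted, n):
    """Ensure the top n shadow-stack entries exist as real objects,
    emitting creation code for any that are still deferred."""
    for i in range(-n, 0):
        v = shadow_stack[i]
        if isinstance(v, Deferred):
            emitted.append(("MAKE_OBJECT", v.make))
            shadow_stack[i] = v.make()  # now real and usable by e.g. an add

emitted = []
stack = [Deferred(lambda: (1, 2)), 3]  # a deferred tuple below a real int
materialize(stack, emitted, 2)         # an "add"-like uop needs both to exist
assert stack == [(1, 2), 3]
assert len(emitted) == 1
```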

So in the example Mark gives above, it is effectively seeing:

x = typing.cast(int, x)
return unpredictable  # return or some other exit, not sure what EXIT is, unpredictable is a module global

and deciding that, after inlining typing.cast, the whole intermediate operation is effectively a no-op; the only thing that really matters is the return unpredictable. Thus x = typing.cast(int, x) becomes a no-op.

In short, partial evaluation, to my understanding, is when the optimizer can determine the net input and output effects, and optimize away any known intermediate operations. Really cool idea!

markshannon commented 1 year ago

Everything you could possibly want to know about partial evaluation (and then some): https://www.itu.dk/~sestoft/pebook/pebook.html

In this case we are partially evaluating the function interp(sb, ctx), where interp is the interpreter, sb is the unoptimized superblock and ctx is the executing context (known types, etc) to produce a residual program opt. Full partial evaluation would include compilation, such that opt(ctx) == interp(sb, ctx), but we want to produce a program for a tier2 interpreter interp2, such that interp2(opt) == interp(sb, ctx).

So this is a limited form of partial evaluation, but many of the same concepts apply.
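
A toy version of this may help make the equations concrete (entirely my own sketch, with a made-up three-uop instruction set; it only illustrates the shape of "fold what is static, emit a residual trace for the rest"):

```python
def interp(sb, ctx):
    """A tiny stack interpreter: sb is a list of (uop, arg) pairs."""
    stack = []
    for op, arg in sb:
        if op == "LOAD_CONST":
            stack.append(arg)
        elif op == "LOAD_VAR":
            stack.append(ctx[arg])      # dynamic: depends on ctx
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack[-1]

def specialize(sb):
    """Partial evaluator: fold operations whose inputs are all constant,
    leaving a residual trace for everything that depends on ctx."""
    shadow, residual = [], []
    for op, arg in sb:
        if op == "LOAD_CONST":
            shadow.append(arg)          # static: defer on the shadow stack
        elif op == "ADD" and len(shadow) >= 2:
            b, a = shadow.pop(), shadow.pop()
            shadow.append(a + b)        # folded at specialization time
        else:
            for v in shadow:            # flush deferred constants first, so
                residual.append(("LOAD_CONST", v))  # stack order is preserved
            shadow.clear()
            residual.append((op, arg))
    for v in shadow:
        residual.append(("LOAD_CONST", v))
    return residual

sb = [("LOAD_CONST", 2), ("LOAD_CONST", 3), ("ADD", None),
      ("LOAD_VAR", "x"), ("ADD", None)]
opt = specialize(sb)                    # (2 + 3) is folded to 5
assert interp(opt, {"x": 10}) == interp(sb, {"x": 10}) == 15
```

Here `interp` plays both roles of the tier 1 and tier 2 interpreter, so the residual-program equation reduces to `interp(opt, ctx) == interp(sb, ctx)`.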

Fidget-Spinner commented 3 months ago

I've read enough literature to understand this properly now. Here's a proposed definition/invariant/pseudo-algo:

  1. Binding-time analysis: according to the literature, we need a static/dynamic (S/D) split of all variables, at bytecode granularity.
  2. The (truncated) lattice is below.
  3. For the simplest starting scheme, all variables from LOAD_CONST are considered static.
  4. All else are approximated as dynamic.
  5. Non-escaping instructions can be eliminated if all stack inputs are static.
  6. Escaping instructions can never be eliminated.
  7. All stack outputs of an escaping instruction are dynamic, except as noted in rule 9.
  8. Stack outputs from a non-escaping instruction are static iff all its stack inputs are static; otherwise they are dynamic, again except as noted in rule 9.
  9. All stack peeks preserve their static/dynamic nature.
  10. At each instruction, we store a shadow stack, which will allow us to reify the stack when needed on deopt.

We shall start with a simple optimization: e.g. eliminating LOAD_FAST/POP_TOP pairs, or constant propagation, will do.
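
Rules 3-9 above can be sketched as a single forward pass over a trace (a sketch with a hypothetical uop table; `pops`/`pushes` model stack effects and `escapes` marks the instructions that rule 6 forbids eliminating):

```python
# name: (pops, pushes, escapes) -- hypothetical stack effects
UOPS = {
    "LOAD_CONST": (0, 1, False),
    "LOAD_FAST":  (0, 1, False),
    "BINARY_ADD": (2, 1, False),
    "CALL":       (2, 1, True),
}

def bta(trace):
    """Binding-time analysis: return the non-escaping instructions whose
    stack inputs are all static, i.e. the candidates for elimination."""
    stack, removable = [], []
    for op in trace:
        pops, pushes, escapes = UOPS[op]
        inputs = stack[len(stack) - pops:] if pops else []
        if pops:
            del stack[-pops:]
        if escapes:
            out = "D"                                            # rule 7
        elif op == "LOAD_CONST":
            out = "S"                                            # rule 3
        elif op == "LOAD_FAST":
            out = "D"                                            # rule 4
        else:
            out = "S" if all(t == "S" for t in inputs) else "D"  # rule 8
        if not escapes and inputs and all(t == "S" for t in inputs):
            removable.append(op)  # rule 5, restricted to ops with inputs
        stack.extend([out] * pushes)
    return removable

# 2 + 3 is fully static, so the add is removable; x + const is not.
assert bta(["LOAD_CONST", "LOAD_CONST", "BINARY_ADD"]) == ["BINARY_ADD"]
assert bta(["LOAD_FAST", "LOAD_CONST", "BINARY_ADD"]) == []
```

Rule 10's shadow stack is the `stack` list here, extended to also record the deferred values, not just their S/D tags.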

flowchart BT
    A(Bottom, i.e. empty) --> B(S)
    A --> C(D)
    B --> E(S,S)
    B --> D(S, D)
    C --> D
    C --> F(D, D);
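
One way to realize the lattice in code is a per-slot join where D absorbs S (my own sketch, reading the lattice pointwise; merge points with mismatched stack depths are simply rejected):

```python
def join(a: str, b: str) -> str:
    """Least upper bound on the S/D lattice: D absorbs S."""
    return "S" if a == b == "S" else "D"

def join_stacks(x, y):
    """Pointwise join of two abstract stack states at a merge point."""
    assert len(x) == len(y), "cannot merge stacks of different depth"
    return [join(a, b) for a, b in zip(x, y)]

assert join("S", "S") == "S"
assert join("S", "D") == "D"
assert join_stacks(["S", "D"], ["S", "S"]) == ["S", "D"]
```
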

Fidget-Spinner commented 2 months ago

Relevant paper https://dl.acm.org/doi/10.1145/1929501.1929508