faster-cpython / ideas


[Proposal] Abstract interpretation algorithm for optimization of tier 2 uops #615

Open · Fidget-Spinner opened 10 months ago

Fidget-Spinner commented 10 months ago

Sub-issue of https://github.com/faster-cpython/ideas/issues/559

Please see our proposal for optimizations to tier 2 uops https://docs.google.com/presentation/d/1jyqyO-OxF53NzPYZqkTzfCfCbzX2BC4Vo-BP5Da3zdk/

Proposed optimizations to be performed:

gvanrossum commented 10 months ago

I see that you are proposing a DSL change like this:

op(_BINARY_OP_MULTIPLY_INT,
    (unused/1, left: &PyLong_Type, right: &PyLong_Type -- res: &PyLong_Type),
    flags=[
        pure,
    ],
    symbolic=ADD_INT(left, right)
)

I'm not sure what syntax you are proposing for types that allows &PyLong_Type, but assuming this just means it's a Python int, I propose that you use PyLongObject* instead, because that maps directly to what the code generator would/could generate for Tier 1. In fact we could write this today (I added the postfix * operator recently):

        op(_GUARD_BOTH_INT, (left: PyLongObject*, right: PyLongObject* -- left: PyLongObject*, right: PyLongObject*)) {
            ...
        }
        op(_BINARY_OP_ADD_INT, (unused/1, left: PyLongObject*, right: PyLongObject* -- res)) {
            ...
        }

(Cross-ref: https://github.com/faster-cpython/ideas/issues/611)

gvanrossum commented 10 months ago

Continuing on that DSL proposal, why flags=[pure] and not pure=true?

And where is ADD_INT(left, right) supposed to be defined? Won't moving the definition there from bytecodes.c just move code without adding any real power?

iritkatriel commented 10 months ago

Continuing on that DSL proposal, why flags=[pure] and not pure=true?

It seems like a good idea to have syntax for hard-coding values of metadata flags. We will probably have more use cases for this, like irregular bytecodes where the derivation logic isn't working well.

gvanrossum commented 10 months ago

In your GDoc about reading the paper you write

Instead of updating the symbolic expressions by doing data-flow analysis through the loops, express the symbolic relations as recursive definitions (see figure on the right). Xl2 is defined recursively. These are called cyclic term graphs (symbolic terms that can refer to themselves)

I do not understand what is happening here. I saw this in the YouTube video and didn't get it there either. How do they get from

(l2) x +-> x[l2]
  x = x + 1
(l3) x +-> x[l2] + 1

to recognizing the fixpoint?

gvanrossum commented 10 months ago

Continuing on that DSL proposal, why flags=[pure] and not pure=true?

It seems like a good idea to have syntax for hard-coding values of metadata flags. We will probably have more use cases for this, like irregular bytecodes where the derivation logic isn't working well.

That's another argument for pure=true -- we might very well want to say "set HAS_ARG to false" and the flags=[flag1,flag2,...] form doesn't easily extend to that, whereas has_arg=false comes out very naturally.

Fidget-Spinner commented 10 months ago

In your GDoc about reading the paper you write

Instead of updating the symbolic expressions by doing data-flow analysis through the loops, express the symbolic relations as recursive definitions (see figure on the right). Xl2 is defined recursively. These are called cyclic term graphs (symbolic terms that can refer to themselves)

I do not understand what is happening here. I saw this in the YouTube video and didn't get it there either. How do they get from

(l2) x +-> x[l2]
  x = x + 1
(l3) x +-> x[l2] + 1

to recognizing the fixpoint?

I can't claim to fully understand this part either (I did not pay much attention to it, since we won't need it for CPython unless we also plan to do loop-invariant code motion). However, here's what I understand:

At location $l2$, the algorithm realizes it has visited this location before, so it looks at all incoming edges and creates the $\phi$ node of those edges. In this case, one incoming edge (from before the loop) is x = xl2 = 0, and the other, the backward edge, is x = xl2 + 1. The phi of both is $\phi(0, x_{l2} + 1)$, which is the recursive definition.

Continuing on that DSL proposal, why flags=[pure] and not pure=true?

It seems like a good idea to have syntax for hard-coding values of metadata flags. We will probably have more use cases for this, like irregular bytecodes where the derivation logic isn't working well.

That's another argument for pure=true -- we might very well want to say "set HAS_ARG to false" and the flags=[flag1,flag2,...] form doesn't easily extend to that, whereas has_arg=false comes out very naturally.

Hmm, this does look cleaner. I don't mind doing this first, then switching to the list form if we accumulate too many flags.

And where is ADD_INT(left, right) supposed to be defined? Won't moving the definition there from bytecodes.c just move code without adding any real power?

I plan to use the abstract interpreter to create the symbolic expression nodes. The symbolic expressions will use tagged unions (just a field set in a struct indicating its real type). So that can be read as "construct a tagged union of type ~ADD_INT~ (oops, wrong one) _BINARY_OP_MULTIPLY_INT with the symbolic operands left and right on the abstract stack". I will add that to the slides.
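For illustration, here is a minimal Python sketch of what such tagged-union symbolic nodes could look like. The names (`SymExpr`, the `CONST` tag, the helper function) are hypothetical, not the actual CPython implementation; in C the tag would be an enum field inside a struct.

```python
from dataclasses import dataclass

@dataclass
class SymExpr:
    # The tag says which kind of node this really is (the "field set
    # in a struct indicating its real type" from the comment above).
    tag: str                 # e.g. "CONST" or "_BINARY_OP_MULTIPLY_INT"
    operands: tuple = ()     # child SymExpr nodes
    value: object = None     # payload for CONST nodes

def binary_op_multiply_int(stack):
    # The abstract interpreter pops the symbolic operands and pushes
    # a new node tagged with the uop that produced the result.
    right = stack.pop()
    left = stack.pop()
    stack.append(SymExpr("_BINARY_OP_MULTIPLY_INT", (left, right)))

stack = [SymExpr("CONST", value=6), SymExpr("CONST", value=7)]
binary_op_multiply_int(stack)
res = stack[-1]  # _BINARY_OP_MULTIPLY_INT node over the two constants
```

A later pass can then pattern-match on `tag` to fold constants or eliminate redundant guards.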

Fidget-Spinner commented 10 months ago

I'm not sure what syntax you are proposing for types that allows &PyLong_Type, but assuming this just means it's a Python int, I propose that you use PyLongObject* instead, because that maps directly to what the code generator would/could generate for Tier 1. In fact we could write this today (I added the postfix * operator recently):

This is for convenience during the analysis phase, where we want to check Py_TYPE(x) == &PyLong_Type for example.

gvanrossum commented 10 months ago

This is for convenience during the analysis phase, where we want to check Py_TYPE(x) == &PyLong_Type for example.

Oh, interesting. It seems we have somewhat conflicting goals for the type attribute of stack effects. Maybe we can introduce a convention where if the type is PyFooObject * the type field is assumed to be &PyFoo_Type?

gvanrossum commented 10 months ago

Regarding my question about loops and your answer, I think I get it now.

At the point (l2) where the loop goes back to previous control flow, at the start of the loop we just have x = 0. After the first iteration we have x = phi(0, 0+1). But because this is a phi node, it gets a new name, and we write x = xl2 (understanding that xl2 stands for phi(0, 0+1)). Then we continue with the next iteration; at the bottom of the loop (at l3) we get x = xl2+1. (Not x = xl3, because there's no new phi here.) But then we loop around to l2, where we do have a merge point, so we need a phi, and we write x = xl2, but now xl2 = phi(0, xl2+1). Because x = xl2 is something we saw before at this point, the algorithm realizes it has reached a fixpoint, and it stops.

The crux here is that we don't write out the phi functions in the symbolic store; instead of x = phi(0, 0+1) we write x = xl2 in the symbolic store, whereas the term graph holds the definition of xl2 etc., in this case {xl2 = phi(0, xl2+1)}. This distinction between abstract store and term graph is introduced a bit earlier in the talk, with the goal of naming terms (initially introducing x0, x1, ... instead of xl0, xl1, ... -- the latter refinement is needed for loops).
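To make that store/term-graph separation concrete, here is a toy Python sketch (my own construction, not from the paper or CPython) of abstractly interpreting `x = 0; while ...: x = x + 1`. The phi definition lives in a separate term graph, so the symbolic store at l2 can stop changing even though the definition is recursive:

```python
# Terms are tuples; "x@l2" is a hypothetical name for the phi at l2.
term_graph = {}  # name -> phi definition

def join_at_l2(incoming_terms):
    """Merge the incoming edges at l2 into one named phi term."""
    name = "x@l2"
    term_graph[name] = ("phi",) + tuple(incoming_terms)
    return name

# First visit to l2: only the entry edge, x = 0.
store_l2 = {"x": join_at_l2([("const", 0)])}

# Loop body: x = x + 1, producing the store at l3.
store_l3 = {"x": ("add", store_l2["x"], ("const", 1))}

# Back edge: re-join at l2, now with both incoming edges.
prev = dict(store_l2)
store_l2 = {"x": join_at_l2([("const", 0), store_l3["x"]])}

# The store is unchanged (still x -> "x@l2"), so we have reached a
# fixpoint; the recursion is confined to the term graph:
#   x@l2 = phi(0, x@l2 + 1)
reached_fixpoint = (store_l2 == prev)
```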


Watching a bit more of the video, what strikes me is that the abstract interpretation (if there are loops) may actually iterate an indefinite number of times, until the fixpoints are all found. I think in our case things will be simpler, because (at least initially) we only attempt to optimize one superblock at a time. If this superblock contains JUMP_TO_TOP uops there is a phi node at the very start (with as many terms as there are JUMP_TO_TOP uops, plus one), but we can just assume that at the top all variables have arbitrary contents, so we don't even need that phi, and everything becomes much simpler. Right? At least for the current semester's work.

It does concern me a bit that the talk has an Open Question on whether their approach can outperform traditional approaches. Although in our case we only need to be faster than PEP 659. :-)


Trivia question: what does the "square U" symbol mean? It is used in some slide titles.

gvanrossum commented 10 months ago

@Fidget-Spinner Another question: How do your current plans compare to the plans that Mark wrote up back in March or so, gathered under the label epic-tier2-optimizer?

Fidget-Spinner commented 10 months ago

Trivia question: what does the "square U" symbol mean? It is used in some slide titles.

From the paper: "the join operation ⊔ℓ, which merges abstract stores at control-flow join points such as ℓ3".
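In lattice terms, ⊔ computes the least upper bound of two abstract values. A toy sketch of such a join on a flat constants lattice (Bottom below all constants, Top above them), purely illustrative and not the paper's actual domain:

```python
BOTTOM, TOP = "bottom", "top"

def join_value(a, b):
    """Least upper bound of two values on the flat constants lattice."""
    if a == BOTTOM:
        return b
    if b == BOTTOM:
        return a
    if a == b:
        return a
    return TOP  # conflicting constants: lose precision

def join_store(s1, s2):
    """Pointwise join of two abstract stores at a control-flow merge."""
    return {v: join_value(s1.get(v, BOTTOM), s2.get(v, BOTTOM))
            for v in s1.keys() | s2.keys()}

# Two branches agree on x but disagree on y: x stays 1, y goes to Top.
merged = join_store({"x": 1, "y": 2}, {"x": 1, "y": 3})
```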

@Fidget-Spinner Another question: How do your current plans compare to the plans that Mark wrote up back in March or so, gathered under the label epic-tier2-optimizer?

They are complementary. Our proposal concerns only the "superblock optimization": the partial evaluation and specialization passes. It will generate the same, if not better, code than traditional partial evaluation alone. The other parts (superblock management, creation, etc.) are unaffected.

mlemerre commented 10 months ago

Hi, @Fidget-Spinner told me that you were discussing ideas from my paper here. I am very happy that these ideas could find widespread use! I took the liberty of jumping into the conversation to introduce myself and add some comments; don't hesitate to ask me if you have further questions about the paper!

Regarding my question about loops and your answer, I think I get it now. [...]

Your understanding is correct.

It does concern me a bit that the talk has an Open Question on whether their approach can outperform traditional approaches. Although in our case we only need to be faster than PEP 659. :-)

The main potential performance problem in my opinion is the worst-case complexity when you have to do a fixpoint iteration. The traditional approach is more "pessimistic" and thus converges faster (but would be less precise in that it would initially introduce more phi nodes). It seems to me that your loop-free "superblocks" should not suffer from this potential performance issue.