Fidget-Spinner opened this issue 10 months ago
I see that you are proposing a DSL change like this:

```
op(_BINARY_OP_MULTIPLY_INT,
    (unused/1, left: &PyLong_Type, right: &PyLong_Type -- res: &PyLong_Type),
    flags=[
        pure,
    ],
    symbolic=ADD_INT(left, right)
)
```
I'm not sure what syntax you are proposing for types that allows `&PyLong_Type`, but assuming this just means it's a Python int, I propose that you use `PyLongObject*` instead, because that maps directly to what the code generator would/could generate for Tier 1. In fact we could write this today (I added the postfix `*` operator recently):
```
op(_GUARD_BOTH_INT, (left: PyLongObject*, right: PyLongObject* -- left: PyLongObject*, right: PyLongObject*)) {
    ...
}
op(_BINARY_OP_ADD_INT, (unused/1, left: PyLongObject*, right: PyLongObject* -- res)) {
    ...
}
```
(Cross-ref: https://github.com/faster-cpython/ideas/issues/611)
Continuing on that DSL proposal, why `flags=[pure]` and not `pure=true`?

And where is `ADD_INT(left, right)` supposed to be defined? Won't moving the definition there from bytecodes.c just move code without adding any real power?
> Continuing on that DSL proposal, why `flags=[pure]` and not `pure=true`?

It seems like a good idea to have syntax for hard-coding values of metadata flags. We will probably have more use cases for this, like irregular bytecodes where the derivation logic isn't working well.
In your GDoc about reading the paper you write:

> Instead of updating the symbolic expressions by doing data-flow analysis through the loops, instead express the symbolic relations as recursive definitions (see figure on the right). Xl2 is defined recursively. These are called cyclic term graphs (symbolic terms that can refer to themselves).

I do not understand what is happening here. I saw this in the YouTube video and didn't get it there either. How do they get from

```
(l2) x +-> x[l2]
x = x + 1
(l3) x +-> x[l2] + 1
```

to recognizing the fixpoint?
> > Continuing on that DSL proposal, why `flags=[pure]` and not `pure=true`?
>
> It seems like a good idea to have syntax for hard-coding values of metadata flags. We will probably have more use cases for this, like irregular bytecodes where the derivation logic isn't working well.

That's another argument for `pure=true` -- we might very well want to say "set HAS_ARG to false", and the `flags=[flag1,flag2,...]` form doesn't easily extend to that, whereas `has_arg=false` comes out very naturally.
> In your GDoc about reading the paper you write:
>
> > Instead of updating the symbolic expressions by doing data-flow analysis through the loops, instead express the symbolic relations as recursive definitions (see figure on the right). Xl2 is defined recursively. These are called cyclic term graphs (symbolic terms that can refer to themselves).
>
> I do not understand what is happening here. I saw this in the YouTube video and didn't get it there either. How do they get from
>
> ```
> (l2) x +-> x[l2]
> x = x + 1
> (l3) x +-> x[l2] + 1
> ```
>
> to recognizing the fixpoint?
I cannot claim to fully understand this part either (I did not pay too much attention to it, since we won't need it for CPython unless we plan to do loop-invariant code motion as well). However, here's what I understand:

At location $l2$, the analysis realizes it has visited this location before; it then looks at all incoming edges and creates the $\phi$ node of those edges. In this case, the incoming edge from before the loop is `x = xl2 = 0`, and the other incoming edge, from the backward edge, is `x = xl2 + 1`. The $\phi$ of both is $\phi(0, xl2 + 1)$, which is the recursive definition.
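The phi construction above can be sketched as a tiny cyclic term graph in Python. This is only an illustration of the idea, not CPython code; the names `TermGraph` and `define_phi` are invented for this example:

```python
class TermGraph:
    """Holds definitions of named terms; a term may refer to its own name."""
    def __init__(self):
        self.defs = {}  # term name -> definition (nested tuples)

    def define_phi(self, name, incoming):
        # At a merge point, the term is the phi of all incoming edges.
        self.defs[name] = ("phi", tuple(incoming))

g = TermGraph()
# Incoming edge from before the loop: x = 0.
# Incoming edge along the back edge:  x = xl2 + 1 (refers back to xl2).
g.define_phi("xl2", [("const", 0), ("add", ("ref", "xl2"), ("const", 1))])

print(g.defs["xl2"])
```

The definition of `xl2` contains a reference to `xl2` itself, which is exactly the "symbolic terms that can refer to themselves" property of cyclic term graphs.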
> > Continuing on that DSL proposal, why `flags=[pure]` and not `pure=true`?
> >
> > It seems like a good idea to have syntax for hard-coding values of metadata flags. We will probably have more use cases for this, like irregular bytecodes where the derivation logic isn't working well.
>
> That's another argument for `pure=true` -- we might very well want to say "set HAS_ARG to false", and the `flags=[flag1,flag2,...]` form doesn't easily extend to that, whereas `has_arg=false` comes out very naturally.

Hmm, this does look cleaner. I don't mind doing this first, then if we have too many flags, use the list method.
> And where is `ADD_INT(left, right)` supposed to be defined? Won't moving the definition there from bytecodes.c just move code without adding any real power?

I plan to use the abstract interpreter to create the symbolic expression nodes. The symbolic expressions will use tagged unions (just a field set in a struct indicating its real type). So that can be read as "construct a tagged union of type ~~ADD_INT~~ (oops, wrong one) `_BINARY_OP_MULTIPLY_INT` with the symbolic operands `left` and `right` on the abstract stack". I will add that to the slides.
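To make the tagged-union reading concrete, here is a minimal Python sketch (in C this would be a struct with a tag field plus a union of payloads; `SymExpr` and the `CONST` tag are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class SymExpr:
    tag: str                 # which uop (or constant) produced this value
    operands: tuple = ()     # symbolic operands taken from the abstract stack

# Abstract stack while abstractly interpreting _BINARY_OP_MULTIPLY_INT:
stack = [SymExpr("CONST", (3,)), SymExpr("CONST", (4,))]
right = stack.pop()
left = stack.pop()
# The symbolic=... clause would emit a node like this and push it:
stack.append(SymExpr("_BINARY_OP_MULTIPLY_INT", (left, right)))

print(stack[-1].tag)  # _BINARY_OP_MULTIPLY_INT
```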
> I'm not sure what syntax you are proposing for types that allows `&PyLong_Type`, but assuming this just means it's a Python int, I propose that you use `PyLongObject*` instead, because that maps directly to what the code generator would/could generate for Tier 1. In fact we could write this today (I added the postfix `*` operator recently):

This is for convenience during the analysis phase, where we want to check `Py_TYPE(x) == &PyLong_Type`, for example.
> This is for convenience during the analysis phase, where we want to check `Py_TYPE(x) == &PyLong_Type`, for example.

Oh, interesting. It seems we have somewhat conflicting goals for the type attribute of stack effects. Maybe we can introduce a convention where if the type is `PyFooObject *`, the type field is assumed to be `&PyFoo_Type`?
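The suggested convention could be derived mechanically. A hypothetical sketch (the function name is invented; this relies on the usual CPython `PyFooObject` / `PyFoo_Type` naming pattern and is not an actual code generator):

```python
def type_field_for(ctype: str) -> str:
    # Map a C object type like "PyLongObject*" to its type-check pointer.
    name = ctype.rstrip("* ")
    assert name.startswith("Py") and name.endswith("Object")
    return "&" + name[:-len("Object")] + "_Type"

print(type_field_for("PyLongObject*"))    # &PyLong_Type
print(type_field_for("PyFloatObject *"))  # &PyFloat_Type
```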
Regarding my question about loops and your answer, I think I get it now.

At the point (`l2`) where the loop goes back to previous control flow, at the start of the loop we just have `x = 0`. After the first iteration we have `x = phi(0, 0+1)`. But because this is a `phi` node, it gets a new name, and we write `x = xl2` (understanding that `xl2` stands for `phi(0, 0+1)`). Then we continue with the next iteration; at the bottom of the loop (at `l3`) we get `x = xl2+1`. (Not `x = xl3`, because there's no new `phi` here.) But then we loop around to `l2`, where we do have a merge point, so we need a `phi`, and we write `x = xl2`, but now `xl2 = phi(0, xl2+1)`. Because `x = xl2` is something we saw before at this point, the algorithm realizes it has reached a fixpoint and it stops.
The crux here is that we don't write out the `phi` functions in the symbolic store; instead of `x = phi(0, 0+1)` we write `x = xl2` in the symbolic store, whereas the term graph holds the definition of `xl2` etc., in this case `{xl2 = phi(0, xl2+1)}`. This distinction between abstract store and term graph is introduced a bit earlier in the talk, with the goal of naming terms (initially introducing `x0, x1, ...` instead of `xl0, xl1, ...` -- the latter refinement is needed for loops).
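The store/term-graph split and the fixpoint check can be sketched in a few lines of Python (all names invented for illustration): the store maps variables to term *names*, the term graph maps names to definitions, and the iteration stops when re-analyzing the merge point leaves the store unchanged.

```python
term_graph = {}  # term name -> definition

def merge_at_l2(from_before_loop, from_back_edge):
    # Merge point at l2: define (or redefine) the phi term for x,
    # and make the abstract store refer to it only by name.
    term_graph["xl2"] = ("phi", from_before_loop["x"], from_back_edge["x"])
    return {"x": "xl2"}

store_1 = merge_at_l2({"x": ("const", 0)}, {"x": ("add", "xl2", ("const", 1))})
# One more trip around the loop: the body computes x = xl2 + 1, so the
# back edge carries the same symbolic value and we merge again.
store_2 = merge_at_l2({"x": ("const", 0)}, {"x": ("add", "xl2", ("const", 1))})

# Same store at the same point -> fixpoint reached, stop iterating.
print(store_1 == store_2)  # True
```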
Watching a bit more of the video, what strikes me is that the abstract interpretation (if there are loops) may actually iterate an indefinite number of times, until the fixpoints are all found. I think in our case things will be simpler, because (at least initially) we only attempt to optimize one superblock at a time. If this superblock contains `JUMP_TO_TOP` uops there is a `phi` node at the very start (with as many terms as there are `JUMP_TO_TOP` uops, plus one), but we can just assume that at the top all variables have arbitrary contents, so we don't even need that `phi`, and everything becomes much simpler. Right? At least for the current semester's work.
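The simplification described above amounts to seeding the abstract store with fresh "unknown" symbols instead of building a phi over all the `JUMP_TO_TOP` edges. A minimal sketch, with invented names:

```python
import itertools

_counter = itertools.count()

def fresh_unknown(varname):
    # An opaque symbolic value: we know nothing about it, so no phi is
    # needed regardless of how many edges re-enter the superblock top.
    return ("unknown", varname, next(_counter))

# Store at the top of the superblock: everything is arbitrary.
top_store = {v: fresh_unknown(v) for v in ("x", "y")}
print(top_store["x"][0])  # unknown
```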
It does concern me a bit that the talk has an Open Question on whether their approach can outperform traditional approaches. Although in our case we only need to be faster than PEP 659. :-)
Trivia question: what does the "square U" symbol mean? It is used in some slide titles.
@Fidget-Spinner Another question: How do your current plans compare to the plans that Mark wrote up back in March or so, gathered under the label epic-tier2-optimizer?
> Trivia question: what does the "square U" symbol mean? It is used in some slide titles.

From the paper: "the join operation ⊔ℓ, which merges abstract stores at control-flow join points such as ℓ3".
> @Fidget-Spinner Another question: How do your current plans compare to the plans that Mark wrote up back in March or so, gathered under the label epic-tier2-optimizer?

They are complementary. Our proposal concerns only the "superblock optimization": partial evaluation and specialization passes. It will generate the same code as, if not better code than, just traditional partial evaluation. The other parts about superblock management, creation, etc. are unaffected.
Hi, @Fidget-Spinner told me that you were discussing ideas from my paper here. I am very happy that these ideas could find widespread use! I took the liberty of jumping into the conversation to introduce myself and add some comments; don't hesitate to ask me if you have further questions about the paper!
> Regarding my question about loops and your answer, I think I get it now. [...]

Your understanding is correct.
> It does concern me a bit that the talk has an Open Question on whether their approach can outperform traditional approaches. Although in our case we only need to be faster than PEP 659. :-)

The main potential performance problem, in my opinion, is the worst-case complexity when you have to do a fixpoint iteration. The traditional approach is more "pessimistic" and thus converges faster (but would be less precise, in that it would initially introduce more phi nodes). It seems to me that your loop-free "superblocks" should not suffer from this potential performance issue.
Sub-issue of https://github.com/faster-cpython/ideas/issues/559
Please see our proposal for optimizations to tier 2 uops: https://docs.google.com/presentation/d/1jyqyO-OxF53NzPYZqkTzfCfCbzX2BC4Vo-BP5Da3zdk/
Proposed optimizations to be performed: