Open markshannon opened 1 year ago
What's the general timeline you're looking for implementing all these?
What's the general timeline you're looking for implementing all these?
3.13. Except the compiler, which will probably be 3.14
Sorry I'm confused: which compiler are you referring to? The one used for the optimization passes on the tier 2 code?
A superblock is a linear piece of code with one entry and multiple exits. It differs from an extended (basic) block in that it may duplicate some code.
Take this CFG:
With extended blocks the CFG would look like this:
Whereas with superblocks, we get this:
When considering regions for optimization, we want as large as regions as possible without excessive duplication. But we also want regions that we can optimize quickly and effectively. Linear sequences of code can be optimized quickly. However basic blocks are too small to optimized effectively. So we choose superblocks.
In the example above, the duplication of D
allows it to be specialized for the context.
Obviously we don't want excessive duplication of code, but used judiciously it can help optimization considerably.
Sorry I'm confused: which compiler are you referring to? The one used for the optimization passes on the tier 2 code?
The compiler to machine code. Nothing else is referred to as a "compiler" in this tier in order to avoid any more confusion. It is already confusing enough that we have the source -> bytecode compiler, the C compiler and in the future a bytecode -> machine compiler.
[Accidently edited your comment earlier]
Whereas with superblocks, we get this:
The basic blocks in the lazy basic block versioning paper generate this and not the former (in fact, it's exactly the same as the diagram you drew). The paper's name is a little misleading. The only part of it that's about "basic blocks" is that the type propagation occurs at basic block boundaries, the actual code generated are superblocks as well.
In fact, this is how it would look like:
I don't see how they have the type information available to compile the specialized D'
when compiling A
.
They want the final code to look like the above, but it is built a basic block at a time. Isn't it?
They cannot know whether B
or C
is the most likely successor of A
, just which happens to execute first immediately after A
is compiled.
Compiling one block at a time allows propagation of type information, but can prevent effective partial evaluation.
They cannot know whether
B
orC
is the most likely successor ofA
, just which happens to execute first immediately afterA
is compiled.
Yes their argument is that the code will naturally generate like that since it's just generated according to the path taken at runtime.
They want the final code to look like the above, but it is built a basic block at a time. Isn't it?
Yes. The code object is like a stack, so the next generated block is written onto the empty space following the next basic block. That illustration is after both branches are generated.
Compiling one block at a time allows propagation of type information, but can prevent effective partial evaluation.
If we're emitting tier 2 bytecode, the final machine code compiler can convert the entire chunk that is generated from the basic block versioning and do the same large-scale partial evaluation and other optimisations that this proposes right?
So like
Tier 1 --- (n times) ---> Tier 2 bytecode ------ (x times) -----> Machine code
By the time the machine code is generated, assuming the tier 2 bytecode is executed many times, there will be enough basic blocks generated to do the same large scale optimisations you're suggesting.
We don't want to have to convert to machine code to do partial evaluation.
Creating larger regions and optimizing those is the job of the tier 3 optimizer.
The optimizer pipeline should look something like this:
Yes their argument is that the code will naturally generate like that since it's just generated according to the path taken at runtime.
When considering branches that are taken for dynamic type tests, it happens very often that these branches are heavily biased in one direction (or even only ever go one direction). So, being able to see which branch is first executed at run time can give you a nice linear sequence of blocks. It's a very good heuristic in practice.
Am I correct that peephole optimizations like removing unnecessary stack operations and so on will be performed on uops, which a bit earlier have been formed by tracer? If so, how do we convert optimized uops to the super instructions? It looks like that we need to return to the interpreter with a super instruction whose logic can have a huge number of variations of combinations of uops, and in this case the interpreter will need to know a particular variation of combinations in order to be able to interpret
Hi @markshannon!
I have a question about Tier 2 optimizer. If this feature is implemented, will the format of future .pyc
be impacted? Will this change influence only runtime bytecode format?
This all happens at runtime. The format of .pyc
files will not be changed.
We might change the format of .pyc
files for other reasons, like faster imports, though.
@GoEdgar
Am I correct that peephole optimizations like removing unnecessary stack operations and so on will be performed on uops, which a bit earlier have been formed by tracer? If so, how do we convert optimized uops to the super instructions? It looks like that we need to return to the interpreter with a super instruction whose logic can have a huge number of variations of combinations of uops, and in this case the interpreter will need to know a particular variation of combinations in order to be able to interpret
Maybe there's some terminological confusion here. "Super-instructions" are the exact opposite of what we emit to superblocks (for the latter, we emit micro-ops, usually called uops). For an example of a superblock, see this comment: https://github.com/python/cpython/issues/106529#issuecomment-1631649920
Also, we're not using tracing to form superblocks -- we're using projection.
This is the top level issue for the tier 2 optimizer for CPython 3.13 and beyond.
Try to keep discussion here at a high level and discuss the details on the sub-issues.
The tier 2 optimizer has always promised to optimize larger regions than the tier 1 (PEP 659) optimizer. But we have been a bit vague as to what those regions would be.
In an earlier discussion, I referred to them as "projected short traces". The term "trace" is a bit misleading, as it suggest some sort of recording of program execution.
The optimization I propose is more akin to basic block versioning, than the trace recording of PyPy. However, instead of basic blocks, we would be optimizing dynamic superblocks. The extent of the superblocks would be determined at runtime from profiling data gathered by the tier 1 interpreter.
The term "superblocks" might also be a bit misleading as they might include inlined calls, but I it's the best name I could come up with for now. We could call the tier 2 optimization, "superblock versioning", as we intend to handle polymorphism in much the same way as BBV.
For this we to work, we need to be able to do the following: