faster-cpython / ideas


Tier 2 optimizer. #557

markshannon opened this issue 1 year ago (status: open)

markshannon commented 1 year ago

This is the top level issue for the tier 2 optimizer for CPython 3.13 and beyond.

Try to keep discussion here at a high level and discuss the details on the sub-issues.

The tier 2 optimizer has always promised to optimize larger regions than the tier 1 (PEP 659) optimizer. But we have been a bit vague as to what those regions would be.

In an earlier discussion, I referred to them as "projected short traces". The term "trace" is a bit misleading, as it suggests some sort of recording of program execution.

The optimization I propose is more akin to basic block versioning (BBV) than to the trace recording of PyPy. However, instead of basic blocks, we would be optimizing dynamic superblocks. The extent of the superblocks would be determined at runtime from profiling data gathered by the tier 1 interpreter.

The term "superblocks" might also be a bit misleading, as they might include inlined calls, but it's the best name I could come up with for now. We could call the tier 2 optimization "superblock versioning", as we intend to handle polymorphism in much the same way as BBV.

For this to work, we need to be able to do the following:

Fidget-Spinner commented 1 year ago

What's the general timeline you're looking at for implementing all these?

markshannon commented 1 year ago

> What's the general timeline you're looking at for implementing all these?

3.13, except the compiler, which will probably be 3.14.

Fidget-Spinner commented 1 year ago

Sorry, I'm confused: which compiler are you referring to? The one used for the optimization passes on the tier 2 code?

markshannon commented 1 year ago

What is a "superblock"?

A superblock is a linear piece of code with one entry and multiple exits. It differs from an extended (basic) block in that it may duplicate some code.

Take this CFG:

[diagram: basic_blocks.gv]

With extended blocks, the CFG would look like this: [diagram: extended_blocks.gv]

Whereas with superblocks, we get this: [diagram: super_blocks.gv]

Why use superblocks?

When considering regions for optimization, we want regions as large as possible without excessive duplication. But we also want regions that we can optimize quickly and effectively. Linear sequences of code can be optimized quickly, but basic blocks are too small to optimize effectively. So we choose superblocks.

In the example above, the duplication of D allows it to be specialized for the context. Obviously we don't want excessive duplication of code, but used judiciously it can help optimization considerably.
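
To make the duplication concrete, here is a minimal sketch of projecting a superblock from tier 1 profiling counts. Everything here (the `Block` class, the counts, `form_superblock`) is invented for illustration; none of it is the actual CPython machinery:

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based hashing, so blocks can be dict keys
class Block:
    name: str
    # successor block -> execution count observed by tier 1 profiling
    successors: dict = field(default_factory=dict)

def form_superblock(entry, max_len=8):
    """Follow the hottest successor from `entry`, copying blocks as we go.

    A block with several predecessors (like D above) is duplicated
    rather than merged into, so its copy can be specialized for the
    context of this particular path. The edges we don't follow become
    side exits.
    """
    trace = [entry]
    block = entry
    while block.successors and len(trace) < max_len:
        block = max(block.successors, key=block.successors.get)
        trace.append(block)
    return trace

# The CFG from the diagrams: A branches to B and C, which both fall into D.
d = Block("D")
b = Block("B", {d: 90})
c = Block("C", {d: 10})
a = Block("A", {b: 90, c: 10})

print([blk.name for blk in form_superblock(a)])  # ['A', 'B', 'D']
```

The cold path through C is left as a side exit with its own copy of D, which is where the duplication in the diagram comes from.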

markshannon commented 1 year ago

> Sorry, I'm confused: which compiler are you referring to? The one used for the optimization passes on the tier 2 code?

The compiler to machine code. To avoid any more confusion, nothing else in this tier is referred to as a "compiler". It is already confusing enough that we have the source -> bytecode compiler, the C compiler, and in the future a bytecode -> machine code compiler.

[Accidentally edited your comment earlier]

Fidget-Spinner commented 1 year ago

> Whereas with superblocks, we get this: [diagram: super_blocks.gv]

The lazy basic block versioning paper's approach generates this, not the former (in fact, it's exactly the same as the diagram you drew). The paper's name is a little misleading: the only part of it that is about "basic blocks" is that type propagation occurs at basic block boundaries; the code actually generated consists of superblocks as well.

In fact, this is how it would look: [image]

markshannon commented 1 year ago

I don't see how they have the type information available to compile the specialized D' when compiling A. They want the final code to look like the above, but it is built one basic block at a time, isn't it? They cannot know whether B or C is the more likely successor of A, just which happens to execute first immediately after A is compiled.

Compiling one block at a time allows propagation of type information, but can prevent effective partial evaluation.
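
To make that trade-off concrete, here is a toy pass over a linear trace of symbolic uops. The uop names and the `remove_redundant_guards` helper are invented for illustration; this is not the real tier 2 representation:

```python
def remove_redundant_guards(trace):
    """Forward pass over a linear trace, tracking what each guard proved.

    Because the whole superblock is visible at once, a fact proved by
    one guard stays valid for the rest of the trace, so later identical
    guards can simply be dropped. (A real pass would also invalidate
    facts when a uop can change them.)
    """
    known = set()
    optimized = []
    for op, arg in trace:
        if op == "GUARD_TYPE":
            if arg in known:        # already proved earlier on this path
                continue
            known.add(arg)
        optimized.append((op, arg))
    return optimized

trace = [
    ("GUARD_TYPE", ("x", int)),
    ("BINARY_ADD", ("x", "y")),
    ("GUARD_TYPE", ("x", int)),   # redundant: the first guard still holds
    ("BINARY_SUB", ("x", "y")),
]
for op, arg in remove_redundant_guards(trace):
    print(op, arg)
```

Seen whole, the second guard is provably redundant; compiled one block at a time, every block boundary is a point where such facts must be re-established or explicitly threaded through.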

Fidget-Spinner commented 1 year ago

> They cannot know whether B or C is the more likely successor of A, just which happens to execute first immediately after A is compiled.

Yes, their argument is that the code will naturally be generated like that, since it's generated according to the path taken at runtime.

> They want the final code to look like the above, but it is built one basic block at a time, isn't it?

Yes. The code object is like a stack: each newly generated block is written into the empty space following the previously generated block. That illustration shows the state after both branches have been generated.

> Compiling one block at a time allows propagation of type information, but can prevent effective partial evaluation.

If we're emitting tier 2 bytecode, the final machine code compiler can take the entire chunk generated by the basic block versioning and do the same large-scale partial evaluation and other optimisations that this proposes, right?

So like

Tier 1 --- (n times) ---> Tier 2 bytecode ------ (x times) -----> Machine code

By the time the machine code is generated, assuming the tier 2 bytecode is executed many times, there will be enough basic blocks generated to do the same large-scale optimisations you're suggesting.
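
Purely as a sketch of that staging (the `Region` class and both thresholds are invented for illustration; CPython's real counters live in the C interpreter and differ from this):

```python
class Region:
    """Toy model of one executable region moving through the tiers."""

    def __init__(self, tier2_after=16, machine_code_after=64):
        self.tier = 1
        self.count = 0
        self.tier2_after = tier2_after                # hypothetical "n times"
        self.machine_code_after = machine_code_after  # hypothetical "x times"

    def execute(self):
        self.count += 1
        if self.tier == 1 and self.count >= self.tier2_after:
            self.tier = 2   # hot enough: project a superblock of tier 2 uops
        elif self.tier == 2 and self.count >= self.machine_code_after:
            self.tier = 3   # hotter still: hand the region to the machine
                            # code compiler

region = Region()
for _ in range(100):
    region.execute()
print(region.tier)  # 3: hot enough to reach machine code
```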

markshannon commented 1 year ago

We don't want to have to convert to machine code to do partial evaluation.

Creating larger regions and optimizing those is the job of the tier 3 optimizer.

markshannon commented 1 year ago

The optimizer pipeline should look something like this:

[diagram: optimize_pipeline.gv]

maximecb commented 1 year ago

> Yes, their argument is that the code will naturally be generated like that, since it's generated according to the path taken at runtime.

When considering branches that implement dynamic type tests, it happens very often that these branches are heavily biased in one direction (or even only ever go one direction). So, being able to see which branch executes first at run time can give you a nice linear sequence of blocks. It's a very good heuristic in practice.
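
A quick simulation of why that heuristic works, assuming a guard whose outcome is 99% biased toward one direction (the numbers are invented but typical of type tests):

```python
import random

def first_direction_accuracy(outcomes):
    """Fraction of executions that agree with the first observed direction."""
    first = outcomes[0]
    return sum(o == first for o in outcomes) / len(outcomes)

# A type guard that passes ~99% of the time.
random.seed(0)
outcomes = [random.random() < 0.99 for _ in range(10_000)]
print(f"{first_direction_accuracy(outcomes):.1%}")  # ~99%
```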

GoEdgar commented 1 year ago

Am I correct that peephole optimizations, like removing unnecessary stack operations and so on, will be performed on uops that were formed a bit earlier by the tracer? If so, how do we convert the optimized uops into super-instructions? It seems that we would need to return to the interpreter with a super-instruction whose logic could be any of a huge number of combinations of uops, and in that case the interpreter would need to know the particular combination in order to interpret it.

diohabara commented 1 year ago

Hi @markshannon! I have a question about the Tier 2 optimizer. If this feature is implemented, will the format of future .pyc files be affected? Or will this change influence only the runtime bytecode format?

markshannon commented 1 year ago

This all happens at runtime. The format of .pyc files will not be changed. We might change the format of .pyc files for other reasons, like faster imports, though.

gvanrossum commented 1 year ago

@GoEdgar

> Am I correct that peephole optimizations, like removing unnecessary stack operations and so on, will be performed on uops that were formed a bit earlier by the tracer? If so, how do we convert the optimized uops into super-instructions? It seems that we would need to return to the interpreter with a super-instruction whose logic could be any of a huge number of combinations of uops, and in that case the interpreter would need to know the particular combination in order to interpret it.

Maybe there's some terminological confusion here. "Super-instructions" are the exact opposite of what we emit to superblocks (for the latter, we emit micro-ops, usually called uops). For an example of a superblock, see this comment: https://github.com/python/cpython/issues/106529#issuecomment-1631649920

Also, we're not using tracing to form superblocks -- we're using projection.
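
To illustrate the distinction with a toy example (both tables are simplified stand-ins; the real instruction-to-uop mappings are generated by CPython's cases generator, and `project` here only gestures at projection):

```python
# Fusing: a super-instruction replaces a *pair* of adjacent tier 1
# instructions with one combined instruction. Shown only for contrast.
SUPERINSTRUCTIONS = {
    ("LOAD_FAST", "LOAD_FAST"): "LOAD_FAST__LOAD_FAST",
}

# Splitting: projection expands each specialized tier 1 instruction into
# a fixed sequence of micro-ops (uops), so the superblock optimizer can
# later remove individual guards.
UOP_EXPANSIONS = {
    "BINARY_OP_ADD_INT": ["_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT"],
}

def project(tier1_code):
    """Expand a tier 1 instruction sequence into a flat uop trace."""
    trace = []
    for instr in tier1_code:
        trace.extend(UOP_EXPANSIONS.get(instr, [instr]))
    return trace

print(project(["LOAD_FAST", "LOAD_FAST", "BINARY_OP_ADD_INT"]))
# ['LOAD_FAST', 'LOAD_FAST', '_GUARD_BOTH_INT', '_BINARY_OP_ADD_INT']
```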