faster-cpython / ideas


Optimization of superblocks #559

Open markshannon opened 1 year ago

markshannon commented 1 year ago

Once a superblock is created it will have a lot of redundancy.

We want to remove that redundancy and either eliminate expensive operations or move them to the slow path.
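As a purely hypothetical illustration of the kind of redundancy involved: back-to-back specialized operations often re-check facts the superblock has already established, and a simple forward pass can drop a guard once an identical guard has been seen with nothing in between that could invalidate it. The uop names and the invalidation rule below are made up for the sketch, not the real tier 2 instruction set:

```python
# Minimal sketch of redundant-guard elimination over a superblock,
# represented here as a list of (opcode, operand) pairs.
# Opcode names ("_GUARD_INT", "_CALL", ...) are illustrative only.

def remove_redundant_guards(trace):
    established = set()   # guards known to hold at this point
    result = []
    for op, operand in trace:
        if op.startswith("_GUARD"):
            if (op, operand) in established:
                continue              # already proven; drop the duplicate
            established.add((op, operand))
        elif op == "_CALL":
            established.clear()       # a call may invalidate everything we know
        result.append((op, operand))
    return result

trace = [
    ("_GUARD_INT", "x"),
    ("_BINARY_ADD_INT", None),
    ("_GUARD_INT", "x"),              # redundant: nothing changed x's type
    ("_BINARY_ADD_INT", None),
]
assert len(remove_redundant_guards(trace)) == 3
```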

There are three major passes to consider for tier 2+ optimizers

We will also need a few minor passes to fill in gaps and create code suitable for execution.

Note that I'm ignoring reference count elimination, as we don't know exactly where it will go.

Superblock creation will need to know about types in order to estimate the likelihood of staying in the superblock, so we might want to merge some parts of specialization into superblock creation.

The optimization pipeline will look something like

Once we have a compiler, preparation for execution will consist of translation to machine code, and setting the necessary code pointer.

If we are going to interpret the superblock, preparation will consist of:

Peephole cleanup.

Optimizations tend to leave NOPs and some redundant stack manipulations. We insert a cleanup pass between major passes to clean them up and keep the other passes simpler.
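A minimal sketch of such a cleanup pass, assuming the superblock is just a list of uop names (the names and the push/pop cancellation rule are illustrative, not the actual tier 2 opcodes):

```python
# Hypothetical peephole cleanup: strip NOPs left behind by earlier passes
# and cancel a push that is immediately followed by a pop of the same value.

def peephole_cleanup(uops):
    out = []
    for uop in uops:
        if uop == "_NOP":
            continue                      # earlier passes leave these behind
        if uop == "_POP_TOP" and out and out[-1] == "_PUSH_NULL":
            out.pop()                     # push immediately followed by pop: drop both
            continue
        out.append(uop)
    return out

assert peephole_cleanup(["_LOAD_FAST", "_NOP", "_PUSH_NULL", "_POP_TOP"]) == ["_LOAD_FAST"]
```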

Super instructions

We expect the final superblock to consist of quite low-level instructions. We can reduce interpretative overhead with superinstructions. The choice and creation of superinstructions should be automated using data gathered from running "real code".
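One possible way to automate the selection, assuming we can log executed uop traces from benchmark runs: count adjacent pairs and triples across the traces and keep the most frequent combinations as super-uop candidates. The function below is only a sketch of that idea, not an existing tool:

```python
from collections import Counter

def superinstruction_candidates(traces, max_candidates=1000):
    """Pick the most frequently executed adjacent uop pairs and triples."""
    counts = Counter()
    for trace in traces:
        for i in range(len(trace) - 1):
            counts[tuple(trace[i:i + 2])] += 1     # adjacent pairs
        for i in range(len(trace) - 2):
            counts[tuple(trace[i:i + 3])] += 1     # adjacent triples
    return [combo for combo, _ in counts.most_common(max_candidates)]

# Example with made-up uop names:
traces = [["_LOAD_FAST", "_LOAD_CONST", "_BINARY_ADD_INT"]] * 100
print(superinstruction_candidates(traces)[:2])
```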

markshannon commented 8 months ago

Tier 2 super instructions

Unlike tier 1, where superinstructions cannot cross line or function boundaries, there are no restrictions on superinstruction formation in tier 2.

Assuming this is representative and making the worst-case assumption that all uops occur independently, we could halve the number of tier 2 instructions executed with about 1000 super-uops. With ~200 super-uops combining 3 uops and ~800 combining 2 uops, we would expect about 20% 3-uops, 60% 2-uops and 20% single uops.
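As a sanity check on that mix (my arithmetic, not from the original data): 0.2 × 3 + 0.6 × 2 + 0.2 × 1 = 2.0 original uops per executed tier 2 instruction, which is exactly the factor-of-two reduction.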

A thousand super-uops might seem like a lot, but we can auto-generate them.

What the performance impact of this would be, I don't know.

The impact of adding more super-uops is likely to become negative before we get to 1000, so we will need to experiment. Once we can auto-generate the super-uops, experimentation should be easy.

Inserting super-uops should be the last pass of the optimizer, before final code generation.
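A sketch of what that final pass might look like, assuming the chosen super-uops are available as a table mapping uop sequences to combined opcodes and matching greedily, longest first (all names here are hypothetical):

```python
# Hypothetical final pass: greedily replace runs of uops with super-uops,
# preferring triples over pairs at each position.

SUPER_UOPS = {
    ("_LOAD_FAST", "_LOAD_FAST", "_BINARY_ADD_INT"): "_LOAD2_ADD_INT",
    ("_LOAD_FAST", "_LOAD_CONST"): "_LOAD_FAST_LOAD_CONST",
}

def insert_super_uops(trace, table=SUPER_UOPS):
    out, i = [], 0
    while i < len(trace):
        for width in (3, 2):                      # longest match first
            combo = tuple(trace[i:i + width])
            if combo in table:
                out.append(table[combo])
                i += width
                break
        else:
            out.append(trace[i])                  # no super-uop starts here
            i += 1
    return out
```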