faster-cpython / ideas


Optimization of superblocks #559

Open markshannon opened 1 year ago

markshannon commented 1 year ago

Once a superblock is created it will have a lot of redundancy.

We want to remove that redundancy and either eliminate expensive operations or move them to the slow path.
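As a purely hypothetical illustration of the kind of redundancy involved: back-to-back specialized operations often re-check facts the superblock has already established, and a simple forward pass can drop a guard once an identical guard has been seen with nothing in between that could invalidate it. The uop names and the invalidation rule below are made up for the sketch, not the real tier 2 instruction set:

```python
# Minimal sketch of redundant-guard elimination over a superblock,
# represented here as a list of (opcode, operand) pairs.
# Opcode names ("_GUARD_INT", "_CALL", ...) are illustrative only.

def remove_redundant_guards(trace):
    established = set()   # guards known to hold at this point
    result = []
    for op, operand in trace:
        if op.startswith("_GUARD"):
            if (op, operand) in established:
                continue              # already proven; drop the duplicate
            established.add((op, operand))
        elif op == "_CALL":
            established.clear()       # a call may invalidate everything we know
        result.append((op, operand))
    return result

trace = [
    ("_GUARD_INT", "x"),
    ("_BINARY_ADD_INT", None),
    ("_GUARD_INT", "x"),              # redundant: nothing changed x's type
    ("_BINARY_ADD_INT", None),
]
assert len(remove_redundant_guards(trace)) == 3
```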

There are three major passes to consider for tier 2+ optimizers

We will also need a few minor passes to fill in gaps and create code suitable for execution.

Note that I'm ignoring reference count elimination, as we don't know exactly where it will go.

Superblock creation will need to know about types in order to estimate the likelihood of staying in the superblock, so we might want to merge some parts of specialization into superblock creation.

The optimization pipeline will look something like

Once we have a compiler, preparation for execution will consist of translation to machine code, and setting the necessary code pointer.

If we are going to interpret the superblock, preparation will consist of:

Peephole cleanup.

Optimizations tend to leave NOPs and some redundant stack manipulations. We insert a cleanup pass between major passes to clean them up and keep the other passes simpler.
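A minimal sketch of such a cleanup pass, assuming the superblock is just a list of uop names (the names and the push/pop cancellation rule are illustrative, not the actual tier 2 opcodes):

```python
# Hypothetical peephole cleanup: strip NOPs left behind by earlier passes
# and cancel a push that is immediately followed by a pop of the same value.

def peephole_cleanup(uops):
    out = []
    for uop in uops:
        if uop == "_NOP":
            continue                      # earlier passes leave these behind
        if uop == "_POP_TOP" and out and out[-1] == "_PUSH_NULL":
            out.pop()                     # push immediately followed by pop: drop both
            continue
        out.append(uop)
    return out

assert peephole_cleanup(["_LOAD_FAST", "_NOP", "_PUSH_NULL", "_POP_TOP"]) == ["_LOAD_FAST"]
```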

Super instructions

We expect the final superblock to consist of quite low-level instructions. We can reduce interpretative overhead with superinstructions. The choice and creation of superinstructions should be automated using data gathered from running "real code".
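One possible way to automate the selection, assuming we can log executed uop traces from benchmark runs: count adjacent pairs and triples across the traces and keep the most frequent combinations as super-uop candidates. The function below is only a sketch of that idea, not an existing tool:

```python
from collections import Counter

def superinstruction_candidates(traces, max_candidates=1000):
    """Pick the most frequently executed adjacent uop pairs and triples."""
    counts = Counter()
    for trace in traces:
        for i in range(len(trace) - 1):
            counts[tuple(trace[i:i + 2])] += 1     # adjacent pairs
        for i in range(len(trace) - 2):
            counts[tuple(trace[i:i + 3])] += 1     # adjacent triples
    return [combo for combo, _ in counts.most_common(max_candidates)]

# Example with made-up uop names:
traces = [["_LOAD_FAST", "_LOAD_CONST", "_BINARY_ADD_INT"]] * 100
print(superinstruction_candidates(traces)[:2])
```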

markshannon commented 8 months ago

Tier 2 super instructions

Unlike tier 1, where superinstructions cannot cross line or function boundaries, there are no restrictions on superinstruction formation in tier 2.

Assuming this is representative and making the worst-case assumption that all uops occur independently, we could halve the number of tier 2 instructions executed with about 1000 super-uops. With ~200 super-uops combining 3 uops and ~800 combining 2 uops, we would expect about 20% 3-uops, 60% 2-uops and 20% single uops.
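As a sanity check on that mix (my arithmetic, not from the original data): 0.2 × 3 + 0.6 × 2 + 0.2 × 1 = 2.0 original uops per executed tier 2 instruction, which is exactly the factor-of-two reduction.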

A thousand super-uops might seem like a lot, but we can auto-generate them.

What the performance impact of this would be, I don't know.

The impact of adding more super-uops is likely to become negative before we get to 1000, so we will need to experiment. Once we can auto-generate the super-uops, experimentation should be easy.

Inserting super-uops should be the last pass of the optimizer, before final code generation.
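A sketch of what that final pass might look like, assuming the chosen super-uops are available as a table mapping uop sequences to combined opcodes and matching greedily, longest first (all names here are hypothetical):

```python
# Hypothetical final pass: greedily replace runs of uops with super-uops,
# preferring triples over pairs at each position.

SUPER_UOPS = {
    ("_LOAD_FAST", "_LOAD_FAST", "_BINARY_ADD_INT"): "_LOAD2_ADD_INT",
    ("_LOAD_FAST", "_LOAD_CONST"): "_LOAD_FAST_LOAD_CONST",
}

def insert_super_uops(trace, table=SUPER_UOPS):
    out, i = [], 0
    while i < len(trace):
        for width in (3, 2):                      # longest match first
            combo = tuple(trace[i:i + width])
            if combo in table:
                out.append(table[combo])
                i += width
                break
        else:
            out.append(trace[i])                  # no super-uop starts here
            i += 1
    return out
```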