faster-cpython / ideas


Tier 2 Optimizer: Superblock Type Propagation & Unboxing #564

Open · Fidget-Spinner opened this issue 1 year ago

Fidget-Spinner commented 1 year ago

This is joint work by @JuliaPoo and me. It is inspired by lazy basic block versioning by Maxime Chevalier-Boisvert and Marc Feeley, and also learns from the Multi-Level Quickening paper by Stefan Brunthaler.

We can use the specialised instructions to generate type guard instructions at runtime; these type guard instructions guarantee the types seen by certain operations. We then propagate these known facts to all code in that superblock, eliminating type checks and enabling large-scale typed operations, without requiring any type hints. Guard failures are exits from the superblock.

To do this, we will need a "types interpreter" that walks over these instructions, and we will need typed stack effects to generate this interpreter. The types interpreter operates on references to types, not the types themselves; this allows us to update multiple type references with a single operation. There are only a few sources of references: the const array, the locals array, and intermediate operation values. There shouldn't be any reference cycles except when there's a COPY instruction.
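
Here is a minimal sketch of such a types interpreter; the names (TypeRef, TypesInterpreter, and the methods) are illustrative only, not taken from the actual branch:

# A minimal, illustrative sketch of a "types interpreter" built on shared
# type references. All names here are hypothetical.

class TypeRef:
    """A mutable cell holding the currently known type of one value."""
    def __init__(self, typ="unknown"):
        self.typ = typ

class TypesInterpreter:
    def __init__(self, nlocals):
        # Reference sources: the locals array (the const array would be
        # another); the stack holds references into these, or references
        # to fresh intermediate values.
        self.locals = [TypeRef() for _ in range(nlocals)]
        self.stack = []

    def load_fast(self, i):
        # Push a *reference* to the local's type cell, not a copy, so a
        # later guard on the stack operand also refines the local itself.
        self.stack.append(self.locals[i])

    def store_fast(self, i):
        # The local now aliases whichever type cell was on top of stack.
        self.locals[i] = self.stack.pop()

    def copy(self):
        # COPY pushes the same reference again -- the one aliasing source
        # mentioned above.
        self.stack.append(self.stack[-1])

    def load_const_int(self):
        self.stack.append(TypeRef("int"))

    def guard_int_binary(self):
        # Models BINARY_CHECK_INT: one write per operand cell refines the
        # type everywhere that cell is referenced.
        self.stack[-1].typ = "int"
        self.stack[-2].typ = "int"

    def binary_add(self):
        rhs, lhs = self.stack.pop(), self.stack.pop()
        out = "int" if lhs.typ == rhs.typ == "int" else "unknown"
        self.stack.append(TypeRef(out))

Because load_fast pushes the local's own cell rather than a copy, a guard on a stack operand retroactively types the local too; that is the single-write, multiple-reference update described above.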

Each superblock shall store a type context (a snapshot of the type stack and the types of the fast locals) at the end of its execution. This will help propagate information across superblocks.
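
Continuing the sketch above, a type context could be a plain snapshot of the inferred types (again, hypothetical helpers):

def snapshot_context(ti):
    # Snapshot the *types* (not the mutable cells) at superblock exit.
    return [r.typ for r in ti.stack], [r.typ for r in ti.locals]

def seed_from_context(ti, ctx):
    # Seed a successor superblock's analysis with the predecessor's facts.
    stack_types, local_types = ctx
    ti.stack = [TypeRef(t) for t in stack_types]
    ti.locals = [TypeRef(t) for t in local_types]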

A demo of type propagation:

We already have working code demonstrating this (although the type propagation algorithm used here is simpler than the one we propose, which is a work in progress). Take the following code:

def f(x, y):
    z = x + y
    return z + 1

f(1, 2)

# This is working code, running on my computer.
>>> dis.dis(f, tier2=True)
  1           0 RESUME_QUICK             0

  2           2 LOAD_FAST                0 (x)
              4 LOAD_FAST                1 (y)
              6 BINARY_CHECK_INT
              8 BB_BRANCH_IF_FLAG_UNSET     0
             10 POP_TOP

  3          12 BINARY_OP_ADD_INT_REST
             14 STORE_FAST               2 (z)
             16 LOAD_FAST                2 (z)
             18 LOAD_CONST               1 (1)
             20 BINARY_OP_ADD_INT_REST
             22 RETURN_VALUE
>>>

Ignore the BB_BRANCH instructions for now. BINARY_OP_ADD_INT_REST is an instruction that performs a binary addition without any type checks. Note that only the first addition is guarded by a type check (BINARY_CHECK_INT); the type propagator has determined that the second addition needs no check, and has therefore eliminated it.
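
Tracing the demo through the sketch above shows the propagator's reasoning (hypothetical code, mirroring the listing):

ti = TypesInterpreter(nlocals=3)   # x, y, z
ti.load_fast(0)                    # LOAD_FAST x      -> unknown
ti.load_fast(1)                    # LOAD_FAST y      -> unknown
ti.guard_int_binary()              # BINARY_CHECK_INT -> x, y now int
ti.binary_add()                    # BINARY_OP_ADD_INT_REST -> int
ti.store_fast(2)                   # STORE_FAST z     -> z is int
ti.load_fast(2)                    # LOAD_FAST z
ti.load_const_int()                # LOAD_CONST 1
ti.binary_add()                    # int + int -> int: no guard emitted
assert ti.stack[-1].typ == "int"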

With this method of type propagation, many optimisations are possible. For example, we can generate even more efficient code than PEP 659 or the multi-level quickening paper when operating on unboxed values. Note that both PEP 659 and multi-level quickening still require repeated type guards when loading from locals and for certain other operations, because their instructions are speculative. With type guard instructions, we don't just believe that an instruction has a certain type effect; we know it does. This allows us to have large-scale (mostly branchless) unboxed operations.
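
As a toy illustration of the difference (not CPython code): a speculative scheme re-checks operands at each specialized op, while a guard scheme checks once per superblock and runs the rest unchecked.

def superblock_speculative(x, y):
    # Speculative specialization: every specialized op carries its own
    # guard, because nothing *guarantees* the operand types.
    if not (type(x) is int and type(y) is int):
        raise TypeError("deopt")       # per-op guard
    z = x + y
    if type(z) is not int:             # redundant: int + int is int
        raise TypeError("deopt")
    return z + 1

def superblock_guarded(x, y):
    # Guard scheme: one type guard up front is a guarantee for the whole
    # superblock; the remaining ops are check-free (and, in the real
    # design, operate on unboxed machine values).
    if not (type(x) is int and type(y) is int):
        raise TypeError("side exit")   # exit from the superblock
    return (x + y) + 1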

Future extensions: associating type version tags with attribute type information (again, this is from Maxime's and Marc's paper on Extending basic block versioning with typed object shapes).

Fidget-Spinner commented 1 year ago

Status update:

sbrunthaler commented 1 year ago

Hello everyone!

Sounds awesome!

For clarification, two things to keep in mind:

  1. In MLQ, the last 1x or so of performance improvement is due to the native-machine-type delimited superinstructions, i.e., essentially arbitrarily long instruction sequences operating on unboxed native-machine data types. So my gut feeling is that you won't be able to go beyond MLQ with the presented design. (I'd guesstimate a 2.5x–3.5x improvement, though [over threaded-code dispatch].)
  2. IIRC (would need to check with the code, can do tomorrow if required/desired), the MLQ interpreter does not necessarily have type guards on all LOAD instructions: using a technique I described in my PhD called "partial stack frame caching," I assigned local variables to dedicated interpreter variables. Since their type cannot change easily, I did not type-check them in the interpreter main loop, but only upon promotion from the fast_locals to their dedicated variables; see the sketch after this list. (Note: I did have a very conservative invalidation policy, e.g., each STORE_GLOBAL would undo lots of optimizations.)
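
A rough, hypothetical sketch of that idea (not the actual MLQ code):

# Partial stack frame caching, schematically: a local is promoted once,
# with one type check, into a dedicated interpreter variable; later uses
# skip the check entirely.

def run_optimized_region(fast_locals):
    # Promotion: one type check when the local enters the cached frame.
    x = fast_locals[0]
    if type(x) is not int:
        raise TypeError("bail out of the optimized region")
    # Every later use of x is check-free. A store that could change x's
    # type (conservatively, e.g. any STORE_GLOBAL) must invalidate this
    # region.
    total = 0
    for _ in range(1000):
        total += x        # no per-use type guard
    return total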

Other than that: This idea rocks and I'm positive Maxime C.-B. loves to see her academic work getting promotion, too!

All the best from Munich, --stefan

PS: Am somewhat busy with teaching this week, but let me know if I can help in any way!

sbrunthaler commented 1 year ago

Ah, well, of course I think of something additional just after commenting, sorry.

I've also worked on a super-cool paper with Mohaned Qunaibit a while back when I was still at UCI: https://drops.dagstuhl.de/opus/volltexte/2018/9221/pdf/LIPIcs-ECOOP-2018-16.pdf

The idea there is to use type feedback to predict future type changes. We did this to move computationally intensive mathematical kernels from the CPU to the GPU, but the system and the underlying idea could (and IMO should) generalize to remove many more type guards.

HTH, --stefan

Fidget-Spinner commented 1 year ago

Thanks for the insights Stefan!

I did some quick-and-dirty benchmarks, and for unboxed float operations we are roughly 70% faster than CPython 3.10 (or about 20% faster than CPython main). So you're right that a big part of the speedup in the MLQ paper comes from the profile-guided superinstructions.

I'll take a look at that MEGAGUARDS paper later this week. I'm somewhat busy taking classes :)

sbrunthaler commented 1 year ago

Ken,

this seems to be a low estimate. IIRC, for spectralnorm on my old i7-920, classical inline caching via quickening (called INCA in my experiments in the ECOOP '10 paper) yielded about 70% as well. Unboxing normally is already a big win, especially since the interpreter instruction implementations can then be reduced to a single machine instruction for many mathematical operations. Only when an instruction implementation is that "small" do superinstructions pay off (and you will also benefit from the optimizations gcc/llvm perform on the long superinstructions, e.g., very good register allocation and instruction selection). Maybe the baseline got a lot faster as well; my experiments were relative to Python 3.4.

If there is some branch you're working on, maybe I can take a look and help out.

Good luck with your classes ;)

Fidget-Spinner commented 1 year ago

Thanks! The branch is in https://github.com/Fidget-Spinner/cpython/tree/tier2_interpreter_no_separate_eval_no_tracer_contiguous_bbs.

The float unboxing op is at https://github.com/Fidget-Spinner/cpython/blob/tier2_interpreter_no_separate_eval_no_tracer_contiguous_bbs/Python/generated_cases.c.h#L455-L465

I suspect float unboxing had smaller speedups than I was expecting because the CPython 3.11 Specializing Adaptive Interpreter / INCA instructions already do machine-instruction float additions directly; they just require a dereference of the float objects. See https://github.com/python/cpython/blob/82eb9469e717e0047543732287a8342e646c2262/Python/generated_cases.c.h#L423-L440

I will try int unboxing next. I suspect that will have the greatest speedup, because the INCA/PEP 659 instructions cannot do a native machine int add and still have to go through the _PyLong_Add routine. See https://github.com/python/cpython/blob/82eb9469e717e0047543732287a8342e646c2262/Python/generated_cases.c.h#L442-L460
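
Schematically, the difference looks like this (an illustrative model, not the actual C):

# Illustrative model of why int unboxing should help more than float
# unboxing did. Box stands in for a heap-allocated PyObject.

class Box:
    def __init__(self, value):
        self.value = value

def specialized_float_add(a, b):
    # The 3.11 specialized float add: two dereferences, one machine add,
    # one new box -- already close to optimal, so unboxing saves little.
    return Box(a.value + b.value)

def general_long_add(a, b):
    # Stands in for _PyLong_Add's general path (sign/digit handling,
    # allocation, ...).
    return Box(a.value + b.value)

def specialized_int_add(a, b):
    # The 3.11 specialized int add still goes through the general
    # routine, so replacing it with a raw machine add saves much more.
    return general_long_add(a, b)

def unboxed_int_add(a, b):
    # The unboxing target: one machine add, no dereference, no call.
    return a + b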

sbrunthaler commented 1 year ago

Aloha!

Sorry for being late; exams and such seem to happen much more frequently than I care for...

I glanced at your implementation two weeks ago, and there are two things I want to highlight: (a) as you already mentioned, floats are not ideal for comparison, because they are rather "shallow"; (b) your implementation uses a dedicated UNBOX_x instruction that performs the unboxing (I had a similar first implementation, too, btw). So, without superinstructions, you will suffer dispatch overhead on instructions that are already quite low-level. Since going to full superinstructions may be expensive w.r.t. implementation, I think it's much simpler (and equally worthwhile for testing/evaluation) to move the unboxing into the loads, which is what I ended up with in my MLQ implementation.
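
Schematically, with hypothetical instruction names and one dispatch per instruction:

# Illustrative only: folding the unbox into the load removes one
# dispatch per operand in a dispatch-per-instruction interpreter.

TRACE_DEDICATED = ["LOAD_FAST", "UNBOX_FLOAT",
                   "LOAD_FAST", "UNBOX_FLOAT", "FADD_UNBOXED"]
TRACE_FOLDED    = ["LOAD_FAST_UNBOXED",
                   "LOAD_FAST_UNBOXED", "FADD_UNBOXED"]

# Each element costs one dispatch: the folded form saves two dispatches
# for a single addition.
assert len(TRACE_FOLDED) < len(TRACE_DEDICATED)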

All the best from Munich, --stefan

Fidget-Spinner commented 1 year ago

Just to close the loop, this is the final report on the preliminary experiments with type propagation over basic blocks. For the superblock approach, type propagation should be significantly simpler, as a superblock is a straight-line code sequence.

https://discuss.python.org/t/preliminary-experiments-for-tier-2-interpreter-ideas-in-cpython/25874

(EDIT) NOTE

This report documents many limitations, and it is not a guide for language implementers to follow; rather, it is a set of experiments in CPython. We encourage people to learn from it, but please do not follow the implementation as-is if you want better performance.

Chief among them, it has the following drawbacks:

We go into a lot more detail about the limitations and the lack of speedups in the report.

Thanks to Stefan Brunthaler for reminding me and pointing these out.