faster-cpython / ideas


Design discussion for https://github.com/python/cpython/pull/114059 #648

Closed. Fidget-Spinner closed this issue 4 months ago.

Fidget-Spinner commented 5 months ago

I want to separate out high-level discussion from that PR into this issue; the PR comment chain should henceforth be mostly for discussion of the implementation. My reasoning is that seeing high-level discussion and implementation details mixed in with commits makes things hard to follow.

I want to summarise my thoughts below. This is in response to @markshannon's comment at https://github.com/python/cpython/pull/114059#issuecomment-1912088213

Addressing the current concerns

Performance

For some reason that I have not yet determined, the run in that PR has the optimizer failing more often than expected. The latest run suggests a 1% speedup, especially benefiting numeric code. My own local benchmarks suggest 0-4% speedup on default pyperformance (the benchmark runner here is a little different because it pulls pyston benchmarks as well).

A lack of generality / Abstract Interpreter DSL plans

There are concerns about whether the abstract interpreter generated in that PR is too specific to a single use case. For comparison, Mark wrote a proof-of-concept (but not yet working) abstract interpreter DSL on his branch.

The future plan is to have a DSL similar to Mark's for rewrite rules and peephole optimizations on uop sequences. It is elegant and, IMO, a very clean way of expressing such rules.

My current plan, however, is to keep the abstract interpreter in the PR as-is, and then augment its cases with the DSL in the future. I have a very simple reason: the current abstract interpreter represents what can be inferred automatically from the bytecode definitions (apart from the type annotations), which reduces the amount of DSL we have to maintain. It does not make sense to write DSL rules for constant evaluation when that can be inferred automatically (by the definition of a pure op). The vast majority of bytecodes follow simple rules that do not need an abstract interpreter DSL to define them.
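
To illustrate the point, here is a minimal sketch (not the PR's actual generated code; `Sym`, `sym_const`, and `abstract_binary_op_add_int` are all stand-in names) of the kind of case a generator could emit mechanically for a pure uop such as `_BINARY_OP_ADD_INT`, with no hand-written DSL rule:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for a symbolic stack entry in the abstract interpreter. */
typedef struct {
    bool is_const;   /* do we know the exact value at optimization time? */
    int64_t value;   /* valid only when is_const is true */
} Sym;

static Sym sym_const(int64_t v) { return (Sym){ .is_const = true, .value = v }; }
static Sym sym_unknown(void)    { return (Sym){ .is_const = false }; }

/* What a generator could emit for a pure uop such as _BINARY_OP_ADD_INT:
 * two inputs, one output, folded eagerly when both inputs are known. */
static Sym abstract_binary_op_add_int(Sym left, Sym right)
{
    if (left.is_const && right.is_const) {
        /* Both operands known: fold now; the emitter can then replace
         * the uop with a constant load. */
        return sym_const(left.value + right.value);
    }
    /* Otherwise the result is an int of unknown value. */
    return sym_unknown();
}

int main(void)
{
    Sym r = abstract_binary_op_add_int(sym_const(2), sym_const(3));
    printf("folded=%d value=%lld\n", r.is_const, (long long)r.value);
    return 0;
}
```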

That said, I definitely want to adopt Mark's approach when optimizing the bytecode. In short, I want to combine both approaches: maximize what can be inferred automatically, and leave everything else to DSL rules for optimization.

This also answers the concerns about extensibility.

What does the PR actually offer?

Assuming we don't merge the current DSL's abstract interpreter generator.

Data structures

The core thing is that it sets up the data structures required for an abstract interpreter. I cannot stress this enough. This is subtle, but the abstract interpreter data structure itself is the reduced version of the full optimizer I originally wrote. I made various design decisions to make optimization simpler, with the benefit of hindsight (and three rewrites :)). For example, its state already models inlined calls. This is why it uses a single contiguous stack to share separate frames' localsplus, so that the calculation for inlined locals is simple. A lot of experimentation has gone into the infrastructure (and admittedly less into the DSL, which has much room for improvement, as mentioned above).
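
A rough sketch of that layout (simplified stand-ins, not the PR's real structs): one contiguous buffer backs every abstract frame, so an inlined callee's localsplus is just a window further along the same array, and its locals offsets fall out of simple pointer arithmetic:

```c
typedef struct _Sym Sym;           /* symbolic value; details elided */

typedef struct {
    Sym **locals;                  /* window into the shared buffer below */
    Sym **stack;                   /* operand stack, in the same buffer */
    Sym **stack_pointer;
} AbstractFrame;

#define MAX_ABSTRACT_STACK  1024
#define MAX_ABSTRACT_FRAMES 16

typedef struct {
    Sym *buffer[MAX_ABSTRACT_STACK];     /* one contiguous localsplus + stack
                                            shared by all frames */
    AbstractFrame frames[MAX_ABSTRACT_FRAMES];
    int frame_depth;
} AbstractContext;

/* "Pushing" an inlined frame needs no copying: the callee's locals simply
 * begin where the caller's operand stack currently ends. */
static AbstractFrame *
push_inlined_frame(AbstractContext *ctx, int nlocals)
{
    AbstractFrame *parent = &ctx->frames[ctx->frame_depth - 1];
    AbstractFrame *child = &ctx->frames[ctx->frame_depth++];
    child->locals = parent->stack_pointer;
    child->stack = child->locals + nlocals;
    child->stack_pointer = child->stack;
    return child;
}
```

Popping an inlined frame is then just decrementing `frame_depth`; nothing has to be moved.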

Type propagation

The next thing it offers is an effective type analyzer. On the latest branch, _GUARD_BOTH_INT executions are cut by over 70%. Mark has said it is not an instruction we should focus optimization efforts on, and I fully agree; I am purely using its statistics as a proxy for how effective our type propagation is.
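
For concreteness, a minimal sketch of the check involved, assuming a simple per-value type tag (names are illustrative, not the PR's): if the analyzer has already proven both operands are exact ints, the guard is dropped; otherwise it is kept and its outcome recorded so later guards on the same values can go:

```c
#include <stdbool.h>

typedef enum { TYPE_UNKNOWN, TYPE_INT, TYPE_FLOAT } AbsType;
typedef struct { AbsType type; } Sym;

/* Returns true if _GUARD_BOTH_INT must stay in the trace. */
static bool needs_guard_both_int(Sym *left, Sym *right)
{
    if (left->type == TYPE_INT && right->type == TYPE_INT) {
        return false;            /* already proven: elide the guard */
    }
    /* Keep the guard; once it has run, both values are known to be
     * ints, so record that for the rest of the trace. */
    left->type = TYPE_INT;
    right->type = TYPE_INT;
    return true;
}
```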

Poor man's loop invariant code motion for guards

I originally called this loop peeling, but that gave the wrong idea. The main reason I repeat the body of a loop a second time is that during the first iteration, the types are not yet known to the type analyzer. Consider this:

_ITER_NEXT_RANGE
_LOAD_FAST
_LOAD_FAST
_GUARD_BOTH_INT
_BINARY_OP_ADD_INT
_JUMP_TO_TOP

Running the type propagator one more time and duplicating the body gives this code:

_ITER_NEXT_RANGE
_LOAD_FAST
_LOAD_FAST
_GUARD_BOTH_INT
_BINARY_OP_ADD_INT
_JUMP_ABSOLUTE_HEADER
# New loop body
_ITER_NEXT_RANGE
_LOAD_FAST
_LOAD_FAST
_BINARY_OP_ADD_INT
_JUMP_ABSOLUTE (to after _JUMP_ABSOLUTE_HEADER)

Note: the main loop body is now completely guard-free! This is the benefit of "peeling" the loop: it acts as a poor man's loop-invariant code motion for guards.

This doubled the proportion of float and int guards we eliminate at runtime: from roughly 15% to roughly 30% for floats, and from roughly 35% to roughly 70% for ints.
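
Sketched in C (an illustration of the idea, not the PR's code; the helper names are assumptions), the peeling step amounts to appending a second copy of the body and re-checking each guard against the types the first pass established:

```c
#include <stdbool.h>

typedef struct { int opcode; int oparg; } UOpInst;

enum { _NOP, _GUARD_BOTH_INT, _GUARD_BOTH_FLOAT /* ... */ };

static bool is_guard(int opcode)
{
    return opcode == _GUARD_BOTH_INT || opcode == _GUARD_BOTH_FLOAT;
}

/* Stand-in: the real pass would consult the abstract types computed
 * while interpreting the first copy of the body. */
static bool guard_already_proven(const UOpInst *inst)
{
    (void)inst;
    return true;
}

/* Append a second copy of the loop body, omitting guards whose
 * conditions the first pass established. Returns the new trace length,
 * or -1 if the buffer is too small. The copy becomes the real loop,
 * jumping back to itself rather than to the peeled header (cf. the
 * _JUMP_ABSOLUTE in the listing above). */
static int peel_loop(UOpInst *trace, int body_start, int body_len, int max_len)
{
    int n = body_start + body_len;
    for (int i = body_start; i < body_start + body_len; i++) {
        if (is_guard(trace[i].opcode) && guard_already_proven(&trace[i])) {
            continue;   /* loop-invariant guard: it stays only in copy #1 */
        }
        if (n >= max_len) {
            return -1;  /* out of space: give up on peeling */
        }
        trace[n++] = trace[i];
    }
    return n;
}
```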

gvanrossum commented 5 months ago

(What's the DSL generator?)

Fidget-Spinner commented 5 months ago

(What's the DSL generator?)

Sorry I meant to write the DSL's abstract interpreter generator.

Fidget-Spinner commented 5 months ago

Latest performance result, pure uops, no JIT: 0-2% faster on Linux, 3% faster on macOS ARM64

gvanrossum commented 5 months ago

It's possible that my comment in the PR about extensibility belongs here.

gvanrossum commented 5 months ago

I'd like to explore an idea for simplifying the code generation a bit. See here for reference.

Currently, when we do constant evaluation for e.g. a binary add operation, we replace load A; load B; add with load A; load B; pop 2; load_const (A+B) and then peephole the load A; load B; pop 2 into oblivion.

(I'm writing load instead of any of LOAD_FAST or LOAD_CONST*, and pop for _SHRINK_STACK; load_const actually stands for _LOAD_CONST_INLINE or _LOAD_CONST_INLINE_BORROW.)

Looking at peephole_optimizations(), we only walk back over a few specific instructions, defined in op_is_zappable(): the set includes the "bookkeeping" instructions _SET_IP and _CHECK_VALIDITY, all flavors of _LOAD_CONST, and _LOAD_FAST. (But not _LOAD_CONST_CHECK, which is impure, i.e., it can fail, so we should never erase it.)
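
Roughly, that set amounts to the following (a paraphrase of the description above, not the exact source; the opcode constants are stand-ins):

```c
#include <stdbool.h>

/* Stand-in opcode numbers for this sketch. */
enum {
    _SET_IP, _CHECK_VALIDITY, _LOAD_FAST,
    _LOAD_CONST, _LOAD_CONST_INLINE, _LOAD_CONST_INLINE_BORROW,
    _LOAD_CONST_CHECK,
};

static bool op_is_zappable(int opcode)
{
    switch (opcode) {
        case _SET_IP:                    /* bookkeeping */
        case _CHECK_VALIDITY:            /* bookkeeping */
        case _LOAD_FAST:
        case _LOAD_CONST:                /* the _LOAD_CONST flavors... */
        case _LOAD_CONST_INLINE:
        case _LOAD_CONST_INLINE_BORROW:
            return true;
        default:
            /* ...but not _LOAD_CONST_CHECK, which can fail and so
             * must never be erased. */
            return false;
    }
}
```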

My first question is, when are we unable to optimize the _SHRINK_STACK instruction away? We know that this is only emitted as part of constant evaluation, which only happens when the arguments (of add, in the example) are all known constants.

I don't want to work out a formal proof, but I'm pretty sure we can prove that the only way we can have a constant on the stack is if it was pushed by one of the following:

- a _LOAD_CONST* flavor;
- a _LOAD_FAST of a variable known to hold a constant;
- an operation that was itself constant-evaluated, e.g. BINARY_OP_ADD_INT with all-constant inputs.

Since the first two categories are considered zappable, they are fine. However, BINARY_OP_ADD_INT is not obviously zappable. Let's try to construct an example that might not get zapped: A + B*C, where A, B and C are known constants (either actual constants or a _LOAD_FAST of a variable known to hold a constant, e.g. after x = 1). This generates the following unoptimized uops: load A; load B; load C; mul; add, which then gets optimized to load A; load B; load C; pop 2; load_const (B*C); pop 2; load_const (A + B*C).

If the peepholer were to look at the final pop 2 first, it would find it preceded by the initial pop 2, which is not zappable. However! The peepholer sees the initial pop 2 first. That one is preceded by two zappable loads (B and C), so the final pop 2 then finds itself preceded by load A; nop; nop; nop; load_const (B*C), and this should be zappable.

(Alas, it currently isn't, because op_is_zappable() omits _NOP from its list. This seems like a simple omission -- if I add it, nothing breaks.)
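
Putting that together, here is a sketch of the backward walk for _SHRINK_STACK, with _NOP treated as transparent as suggested above (the helper names and structure are assumptions, not the actual peepholer):

```c
#include <stdbool.h>

typedef struct { int opcode; int oparg; } UOpInst;

enum {
    _NOP, _SET_IP, _CHECK_VALIDITY, _LOAD_FAST,
    _LOAD_CONST_INLINE, _LOAD_CONST_INLINE_BORROW, _SHRINK_STACK,
};

static bool op_is_zappable(int op)   /* as sketched above, plus _NOP */
{
    return op == _NOP || op == _SET_IP || op == _CHECK_VALIDITY ||
           op == _LOAD_FAST || op == _LOAD_CONST_INLINE ||
           op == _LOAD_CONST_INLINE_BORROW;
}

static bool op_pushes_one(int op)    /* loads push exactly one value */
{
    return op == _LOAD_FAST || op == _LOAD_CONST_INLINE ||
           op == _LOAD_CONST_INLINE_BORROW;
}

/* Try to erase a _SHRINK_STACK at trace[i] by zapping the instructions
 * that pushed the values it pops. Gives up (keeping the _SHRINK_STACK)
 * on hitting anything non-zappable, e.g. a COPY. */
static void zap_shrink_stack(UOpInst *trace, int i)
{
    int to_pop = trace[i].oparg;
    for (int j = i - 1; j >= 0 && to_pop > 0; j--) {
        if (!op_is_zappable(trace[j].opcode)) {
            return;                   /* blocked: leave the pop alone */
        }
        if (op_pushes_one(trace[j].opcode)) {
            trace[j].opcode = _NOP;   /* erase one pushed value */
            to_pop--;
        }
        /* _NOP and bookkeeping ops are simply stepped over. */
    }
    if (to_pop == 0) {
        trace[i].opcode = _NOP;       /* all pops accounted for */
    }
}
```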

So what other things might intervene to make the sequence non-zappable? Well, there could be a non-constant LOAD_FAST followed by a POP_TOP in the middle. I can't think of what would cause that, but I don't want to rule it out either. Another example (using the walrus operator) is easier to imagine: A + (X := B*C) produces something like load A; load B; load C; mul; copy 1; store X; add. And while COPY is pure, it's not zappable, so after peepholing we're stuck with load A; load_const (B*C); copy 1; store X; pop 2; load_const (A + B*C).

That's too bad, because looking at it I can see that it ought to be simplifiable to load_const (B*C); store X; load_const (A + B*C) -- but I am okay with leaving that to a future optimization pass.

Hm, so all my idea boils down to is that we can probably move the _SHRINK_STACK peephole work into emit_const(), and the _CHECK_PEP_523 handling into a hand-written case in uop_abstract_interpret_single_inst(). Oh well.

(What I had hoped to arrive at was a proof that we will never need more output space than input space, and that we can do this on the fly so we won't need a separate output write buffer. But at least until the walrus example is solved, we can't.)

gvanrossum commented 5 months ago

(What's the DSL generator?)

Sorry I meant to write the DSL's abstract interpreter generator.

Why would we not merge that? It seems to be an integral part of the PR.

Fidget-Spinner commented 5 months ago

(What's the DSL generator?)

Sorry I meant to write the DSL's abstract interpreter generator.

Why would we not merge that? It seems to be an integral part of the PR.

The alternative is Mark's version here: https://github.com/faster-cpython/cpython/blob/8aad399240928a7dfe548c0154d7354c96a1f6ff/Python/tier2_specializer.c

I plan to eventually integrate something like that, though.

gvanrossum commented 5 months ago

The file you linked to would just be another input to the generator, though; you'd still need a generator. All it saves you is about 150 lines in optimizer_analysis.c, at the cost of some more work in the generator. One possible advantage: because the new file is specific to the symbolic-interpreter part, you could move the stack-effect annotations there, and possibly invent annotations that are more suitable to that use case (but less proper for describing opcodes in general).

Fidget-Spinner commented 4 months ago

This is solved.