Single instruction "superinstructions"

markshannon commented 1 year ago

Superinstructions can be problematic, as they break the one-instruction-at-a-time model that instrumentation relies on, and are likely to cause problems in optimizers that make the same one-instruction-at-a-time assumption.

Instrumentation rewrites all superinstructions back to the simpler form. Super instructions also prevent specialization, so superinstructions cannot include any specializable instruction. https://github.com/faster-cpython/ideas/issues/584 proposes a specialization of LOAD_CONST, so that leaves only LOAD_FAST and STORE_FAST to be combined.

If we remove LOAD_CONST we have only:

super(LOAD_FAST__LOAD_FAST) = LOAD_FAST + LOAD_FAST;
super(STORE_FAST__LOAD_FAST)  = STORE_FAST + LOAD_FAST;
super(STORE_FAST__STORE_FAST) = STORE_FAST + STORE_FAST;

That is still quite a lot of instructions, dynamically, so we don't want to just remove them. Instead of removing them, we can combine them into a single instruction. Given that most locals will have a index in range(16) we combine the operations into a single instruction.

inst(LOAD_FAST_LOAD_FAST_COMPACT, ( -- val1, val2)) {
    val1 = GETLOCAL(oparg&15);
    val1 = GETLOCAL(oparg>>4);
    Py_INCREF(val1);
    Py_INCREF(val2);
}

There will be fewer LOAD_FAST_LOAD_FAST_COMPACT than LOAD_FAST__LOAD_FAST, but they should be a little faster (fewer memory reads). I would expect the performance to be about the same.

brandtbucher commented 1 year ago

Just a note from our previous discussions for anyone else reading: this is only legal if both instructions are on the same line. When I tried a similar idea out before, this caused almost all of the STORE_FAST__LOAD_FAST superinstructions to disappear (since they basically always span more than one line).

markshannon commented 1 year ago

Instructions don't really have lines, effects do. Since LOAD_FAST has no observable side effects, it can be moved from line to line. Consider

a = 1
return b

which compiles to:

1     LOAD_CONST               1 (1)
      STORE_FAST               0 (a)

2     LOAD_FAST                1 (b)
      RETURN_VALUE

Which implies that we cannot merge the STORE_FAST and LOAD_FAST. But as LOAD_FAST has no observable side effects, this is a legal translation:

1     LOAD_CONST               1 (1)
      STORE_FAST               0 (a)

      LOAD_FAST                1 (b)
2     RETURN_VALUE

Which can be transformed to:

1     LOAD_CONST               1 (1)
      STORE_FAST_LOAD_FAST_COMPACT  0 (a), 1 (b)

2     RETURN_VALUE

markshannon commented 1 year ago

Unfortunately, having superinstructions straddling lines prevents jumping to the line in question in a debugger, as the stack is inconsistent, which rules out the transformation above.

However, it appears that the reduction in code size more than compensates.

carljm commented 1 year ago

which rules out the transformation above.

Does this mean that in general we have to consider the state of the value stack as an "observable side effect" when making optimizations in the compiler? That seems new (relative to my previous understanding of what was safe to optimize), and unfortunate.

brandtbucher commented 1 year ago

Well, I would assume that jumping to a line would jump just prior to execution of every instruction on that line. It would be weird, I think, if a local variable load or store didn't happen after jumping to a line and and stepping through it.

brandtbucher commented 1 year ago

In the example above, I would expect that if I had two lines foo = 42 and return foo, jumping over the first line would return the prior value of foo (or None if unbound). You think that frame_setlineno should be allowed to refuse the jump?

carljm commented 1 year ago

I'm not proposing anything in particular; frame_setlineno is code I hadn't looked at before and didn't realize existed :) I've never used the capability to jump directly to a line in a debugger. So I'm just newly wrapping my head around the implications for optimization (and thinking through whether I know of any existing compiler optimizations that could already break this.)

It seems this does mean that LOAD_FAST (or anything that pushes to stack) isn't generally safe to shift to a different line. So for example this seems to invalidate the proposed superoptimizer design in https://github.com/python/cpython/issues/102869

brandtbucher commented 1 year ago

frame_setlineno is code I hadn't looked at before and didn't realize existed :)

Lucky! ;)

iritkatriel commented 1 year ago

Do we know anyone who uses this evil function?

JelleZijlstra commented 1 year ago

Here's a third-party debugger that uses it: https://github.com/web2py/web2py/blob/7685d373474378e93132f8916145fb11f84cec71/gluon/contrib/dbg.py#L307. pdb also supports it (https://github.com/python/cpython/blob/e01b04c9075c6468ed57bc883693ec2a06a6dd8e/Lib/pdb.py#L1202).

Seems like we should seriously consider deprecating support for this feature though; it feels very rarely useful and very hard to support.

brandtbucher commented 1 year ago

I mean, I'm pretty sure every Python debugger has a jump command, so we'd probably hear from them first if we removed it.

If we convinced them to drop the functionality too, then maybe we'd find out how many users are really relying on it. There are at least a handful that have filed various bug reports for CPython over the years.

I'm stuck somewhere between "I can see how it could be useful" and "it's too much pain to maintain and doesn't always work"... though, I have admittedly used jump commands in pdb a few times since learning about them (mostly if I mess up my debugging session somehow and don't want to re-run everything, but also sometimes to take a different branch or something).

iritkatriel commented 1 year ago

I know debuggers support it. But do we know of any people who use it?

brandtbucher commented 1 year ago

Not counting people who use the feature via debuggers?

I mean, you can only do it from within a trace function, so probably just a handful of people writing toys like this:

import sys

def goto(line: int) -> None:
    def jumper(frame, event, arg) -> None:
        if event == "line":
            frame.f_lineno = line
            frame.f_trace = None
    frame = sys._getframe(1)
    frame.f_trace = jumper
    sys.settrace(jumper)

# Prints "B" in an infinite loop:
goto(15)    # 13
print("A")  # 14
print("B")  # 15
goto(13)    # 16
print("C")  # 17

gvanrossum commented 1 year ago

There are already many complicated prohibitions in the f_lineno setattr -- there is no guarantee that it will work. So I don't believe it ought to be used as a reason to skip line-fusing optimizations.

Additionally, stepping through code already frequently jumps back and forth in ways that are hard to grasp intuitively. Again, if optimizations cause the sequence of observed "line events" to change, that feels like a minor inconvenience at best. (And even though PEP 626 claims otherwise, its motivation feels arbitrary to me and its specification is rather vague: "Line events and the f_lineno attribute should act as an experienced Python user would expect.")

markshannon commented 1 year ago

Let's not remove jumping in debuggers, or other features, that make our work more ~difficult~ interesting.

Many features that prevent optimizations in the bytecode compiler can be handled in the tier 1 optimizer. Many features that prevent optimizations in the tier 1 optimizer, can be handled in the tier 2 optimizer. In the tier 2 optimizer any call to a builtin function is a potential deoptimization. A single call to PyObject_GetAttr can do anything, so it makes almost no difference what features are available.

faster-cpython / ideas

Single instruction "superinstructions" #585