faster-cpython / ideas


Investigate "switched-goto" for compilers without computed gotos #537

Open lpereira opened 1 year ago

lpereira commented 1 year ago

Recently I came across this blog post, which shows a rather weird way of getting something in between standard switch dispatch for an eval loop and an eval loop with computed gotos. Now that we're experimenting with generating a lot of that code, maybe we could see if it makes any sense to adopt this strategy?

brandtbucher commented 1 year ago

Seems interesting!

We had a small related discussion here where the timings showed that computed gotos are only giving the eval loop a 1% speedup on Linux these days. (On the other hand, I got a 5-10% speedup when I added computed gotos to the re engine last year.)
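
For anyone who hasn't seen the trick before, here's a toy sketch of what computed-goto dispatch looks like compared to a plain switch (not CPython's actual code; the opcodes are made up): each handler ends in its own indirect jump through a table of label addresses, which is a GCC/Clang extension.

/* Toy computed-goto interpreter using GCC/Clang's labels-as-values
   extension.  Made-up opcodes, for illustration only. */
#include <stdio.h>

enum { OP_INCR, OP_DECR, OP_HALT };

static int
run(const unsigned char *code)
{
    /* Table of label addresses, indexed by opcode (GCC/Clang extension). */
    static void *targets[] = { &&do_incr, &&do_decr, &&do_halt };
    const unsigned char *next_instr = code;
    int acc = 0;

/* Each handler ends in its own indirect jump instead of looping back
   to one shared switch. */
#define DISPATCH() goto *targets[*next_instr++]

    DISPATCH();
do_incr:
    acc++;
    DISPATCH();
do_decr:
    acc--;
    DISPATCH();
do_halt:
    return acc;
#undef DISPATCH
}

int
main(void)
{
    const unsigned char code[] = { OP_INCR, OP_INCR, OP_DECR, OP_HALT };
    printf("%d\n", run(code));  /* prints 1 */
    return 0;
}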

gvanrossum commented 1 year ago

IIUC the blog post essentially changes the DISPATCH() macro to be a switch on the next opcode whose cases contain gotos. E.g.

...
TARGET(OP1) {
   ...
   DISPATCH();
}
TARGET(OP2) {
    ...
    DISPATCH();
}
...

would expand to

label_OP1: {
    ...
    switch (*next_instr++) {
        case OP1: goto label_OP1;
        case OP2: goto label_OP2;
        ...
    }
}
label_OP2: {
    ...
    switch (*next_instr++) {
        case OP1: goto label_OP1;
        case OP2: goto label_OP2;
        ...
    }
}
...

and that switch would be repeated in each instruction. (Exactly where and how next_instr is incremented is more complicated than shown here, but that doesn't affect reasoning about the scheme.)
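
In macro form, a switched-goto DISPATCH() could be sketched roughly like this (hypothetical and much simplified; the real macro also handles oparg decoding, instrumentation, and so on, and the case list would be generated):

/* Hypothetical switched-goto DISPATCH(): every expansion site gets its
   own private copy of the dispatch switch, so each opcode handler ends
   in its own jump table instead of sharing one.  "unknown_opcode" is a
   made-up error label. */
#define DISPATCH()                                \
    switch (*next_instr++) {                      \
        case OP1: goto label_OP1;                 \
        case OP2: goto label_OP2;                 \
        /* ... one case per opcode ... */         \
        default: goto unknown_opcode;             \
    }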

And the theory is that having N copies of the switch (one for each opcode) helps the CPU's branch predictor because it will learn the most likely branch taken at the end of each opcode, so we won't need PREDICT() macros any more. (See #496.)
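
(For reference, PREDICT(op) boils down to something like the simplified sketch below, with details differing across CPython versions: peek at the next instruction, and if it is the predicted opcode, jump straight to its handler, so that this particular opcode pair gets its own well-predicted branch.)

/* Simplified sketch of the PREDICT() idea; not the exact CPython macro. */
#define PREDICT_ID(op)  PRED_##op
#define PREDICT(op)                                 \
    do {                                             \
        if (_Py_OPCODE(*next_instr) == (op)) {       \
            oparg = _Py_OPARG(*next_instr);          \
            next_instr++;                            \
            goto PREDICT_ID(op);  /* skip dispatch */ \
        }                                            \
    } while (0)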

It would not be a very complicated experiment to carry out, except we'd need to wait until we have Windows benchmarking infrastructure in place, since on Linux/Mac we already have the computed goto.

One of my worries would be that the compiler notices you have the same big piece of code in many places and unifies it into a single copy that everything jumps to. Compilers are weird that way.
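
(GCC's name for that unification is cross-jumping, and its manual already recommends -fno-gcse when using computed gotos for a related reason, so one untested knob to try in the experiment would be something along these lines:)

/* Untested idea, assuming GCC/Clang: keep the compiler from merging the
   duplicated dispatch switches (cross-jumping) or the indirect jumps
   (gcse) back into a single copy. */
#pragma GCC optimize ("no-crossjumping", "no-gcse")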

neonene commented 1 year ago

As for Windows, _PyEval_EvalFrameDefault will hit MSVC getting stuck, or its C4883 ("function size suppresses optimizations") issue, if each instruction branch has a big jump table: https://developercommunity.visualstudio.com/t/vs-155-and-vs-14-c-optimizer-fails-in-a-function-c/201909#T-N218093

The current 3.12 eval function can already become less optimized, triggering the warning (/w14883) or the error (/we4883), just by adding something like 35000 if (0); statements. And the hidden /d2OptimizeHugeFunctions cl flag does not seem to help for PGO builds, in which _PyEval_EvalFrameDefault is not profiled at all.