Open lpereira opened 1 year ago
Seems interesting!
We had a small related discussion here where the timings showed that computed gotos are only giving the eval loop a 1% speedup on Linux these days. (On the other hand, I got a 5-10% speedup on when I added computed gotos to the re
engine last year.)
IIUC the blog post essentially changes the DISPATCH()
macro to be a switch on the next opcode whose cases contain gotos. E.g.
...
TARGET(OP1) {
...
DISPATCH();
}
TARGET(OP2) {
...
DISPATCH();
}
...
would expand to
label_OP1:
...
switch (*next_instr++) {
case OP1: goto label_OP1;
case OP2: goto label_OP2;
...
}
}
label_OP2:
...
switch (*next_instr++) {
case OP1: goto label_OP1;
case OP2: goto label_OP2;
...
}
}
...
and that switch would be repeated in each instruction. (Exactly where and hownext_instr
is incremented is more complicated than shown here but doesn't affect reasoning about the scheme.)
And the theory is that having N copies of the switch (one for each opcode) helps the CPU's branch predictor because it will learn the most likely branch taken at the end of each opcode, so we won't need PREDICT()
macros any more. (See #496.)
It would not be a very complicated experiment to carry out, except we'd need to wait until we have Windows benchmarking infrastructure in place, since on Linux/Mac we already have the computed goto.
One of my worries would be that the compiler sees that you have the same big piece of code in many places and it just unifies that into a single copy that it jumps to from everywhere. Compilers are weird that way.
As for Windows, _PyEval_EvalFrameDefault will hit MSVC's stuck or C4883
issue if each instruction branch has a big jump table:
https://developercommunity.visualstudio.com/t/vs-155-and-vs-14-c-optimizer-fails-in-a-function-c/201909#T-N218093
The current 3.12 eval function can be less optimized getting the warning (/w14883
) or the error (/we4883
), even adding some code like 35000 of if (0);
. And throwing /d2OptimizeHugeFunctions
hidden cl flag seems actually not effective for PGO products, in which _PyEval_EvalFrameDefault is not profiled at all.
Recently I came across this blog post that shows a rather weird way of having something between a standard switch dispatching for an eval loop and an eval loop with computed gotos. Now that we're experimenting with generating a lot of that code, we could maybe see if it makes any sense to adopt this strategy?