Closed by Fidget-Spinner 6 months ago
Hmm, more benchmarking reveals ~~it's~~ the optimizer is still slower on reduced trace length... the implication is the optimizer is doing nothing. (This is on a JIT build.)
Edit: I meant to say that the build with JIT is slower or the same on smaller trace lengths.
I decided to build without JIT because it was confounding the information. Here is just Tier 2 alone.

Long traces (800, current on main):
Optimizer off vs Optimizer on:
pyperf compare_to ../opt_off_long_t2.json ../opt_on_long_t2.json
Mean +- std dev: [opt_off_long_t2] 148 ms +- 2 ms -> [opt_on_long_t2] 143 ms +- 2 ms: 1.03x faster
Medium traces (400, set by me):
pyperf compare_to ../opt_off_medium_t2.json ../opt_on_medium_t2.json
Mean +- std dev: [opt_off_medium_t2] 151 ms +- 2 ms -> [opt_on_medium_t2] 138 ms +- 2 ms: 1.09x faster
It seems that medium traces are indeed faster for the optimizer than long traces. I thus propose we reduce trace length to 400.
Benchmarked with
./python ../bm_nbody.py -w 20 -o
(i.e. 20 warmup iterations)
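For reference, the "1.03x faster" / "1.09x faster" labels above are just the ratio of the two means. A quick sketch (plain Python, with the numbers copied from the results above) of how that label is derived:

```python
# Derive pyperf's "N.NNx faster/slower" label from two reported means.
def speedup(baseline_ms: float, changed_ms: float) -> str:
    """Format the ratio of means the way `pyperf compare_to` labels it."""
    ratio = baseline_ms / changed_ms
    if ratio >= 1.0:
        return f"{ratio:.2f}x faster"
    return f"{1 / ratio:.2f}x slower"

print(speedup(151, 138))  # medium traces, optimizer off -> on
print(speedup(148, 143))  # long traces, optimizer off -> on
```

With the medium-trace numbers this reproduces the 1.09x figure quoted above; with the long-trace numbers, the 1.03x figure.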
Sounds good to me. Which constant is currently 800? Last time I looked it was a power of two?
EDIT: I'm not sure I understand why shorter trace lengths make things faster. See https://github.com/faster-cpython/ideas/issues/669#issuecomment-2030307478 and ff.
> Sounds good to me. Which constant is currently 800? Last time I looked it was a power of two?
The trace length. It got changed to 800 after Mark introduced hot-cold splitting.
I wonder if we could prove that the optimizer is the bottleneck and what part of it. Maybe we could artificially increase the number of times we run the optimizer, and then apply perf to the process to get maybe a flame graph or something?
I don't think there's anything in the optimizer that's quadratic, but we could very well have long traces where execution quickly deoptimizes.
Another thought: Until gh-117144 is merged, we might be wasting a lot of time unsuccessfully optimizing the same thing over and over, for lack of exponential back-off for side exits.
> Another thought: Until gh-117144 is merged, we might be wasting a lot of time unsuccessfully optimizing the same thing over and over, for lack of exponential back-off for side exits.
These results are with your PR applied :). Without it, I suspect it's much worse.
> These results are with your PR applied :). Without it, I suspect it's much worse.
It was worse before, when there was an issue that would cause the abstract interpreter to fail when it encountered a generator expression. But that was fixed by gh-117028. So it might not actually be so bad (now it's mostly short traces that get done over and over).
Another approach would be to have a more realistic way to calculate the on-trace confidence while we are projecting. But this is tricky. E.g., what probability do you assign to `_GUARD_BOTH_INT` passing? Is that something we have in the pystats data?
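One way to sketch such a confidence calculation (purely illustrative -- the uop names are real, but the probabilities and threshold are invented, not pystats data):

```python
# Toy model of on-trace confidence during trace projection: multiply an
# assumed pass probability per guard, and stop projecting once confidence
# drops below a threshold. The probabilities are invented for illustration.
GUARD_PASS_PROB = {
    "_GUARD_BOTH_INT": 0.99,    # type guards in a warm trace almost always pass
    "_GUARD_IS_TRUE_POP": 0.5,  # a branch guard is treated as a coin flip
}

def project(uops, threshold=0.33):
    """Return the prefix of `uops` kept under the toy confidence model."""
    confidence = 1.0
    kept = []
    for uop in uops:
        confidence *= GUARD_PASS_PROB.get(uop, 1.0)
        if confidence < threshold:
            break  # too likely we've already left the trace; stop projecting
        kept.append(uop)
    return kept, confidence

trace = ["_LOAD_FAST", "_GUARD_IS_TRUE_POP", "_GUARD_IS_TRUE_POP", "_BINARY_OP"]
kept, confidence = project(trace)  # stops at the second branch guard
```

The open question in the comment above is exactly what numbers to put in that table -- whether pystats could supply measured pass rates per guard rather than guesses.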
We talked about this during our team meeting (sans Mark). I'm now very curious what's happening in nbody that is made faster by limiting the trace size. I'm going to spike on trying to find exactly which traces are being created and where they deoptimize in the stable state of the algorithm (which looks to be extremely stable -- not an `if` in sight. :-)
Lab notes from my M3 Mac.
I copied the nbody benchmark into a file in my repo root, `nbody.py` (this requires installing pyperf using the local `python.exe`), and ran variations of command lines like this:
PYTHON_UOPS=1 PYTHON_LLTRACE=1 ./python.exe nbody.py --inherit-environ PYTHON_UOPS,PYTHON_LLTRACE
First observations:
A lot of traces are created for loops in the pyperf infrastructure. Not sure if this matters -- presumably these are all outside the timed region. (Slight worry: possibly one of these could trace into the nbody code and cause a distraction there?)
It looks like each of 6 traces is created 21 times (presumably one warmup and 20 timed runs). Apparently the pyperf infrastructure actually creates 20 processes to run the benchmark. If I add `-p 1` I only see 12 traces created for nbody.py (each one twice -- I guess one for warmup, one timed run).
There are two traces that are (initially) longer than 200 instructions: the first and third one created for `advance()`. One is 269, one is 279. These would be truncated (and then presumably split) by setting the max trace length to 200. None would be affected by setting it to 400 (and there are no traces longer than 400 created for pyperf either).
With `PYTHON_LLTRACE=2` I see `DEOPT` log messages pointing to `STORE_SUBSCR_LIST_INT`.
DEOPT: [UOp _DEOPT (0, target=181, operand=0) -> STORE_SUBSCR_LIST_INT]
DEOPT: [UOp _DEOPT (0, target=17, operand=0) -> STORE_SUBSCR_LIST_INT]
(these two repeated many times)
This is suspicious! There are quite a few `STORE_SUBSCR_LIST_INT` opcodes in this trace:
Optimizing advance (/Users/guido/cpython/nbody.py:78) at byte offset 350
The first one is:
93: STORE_SUBSCR_LIST_INT(0)
137 ADD_TO_TRACE: _STORE_SUBSCR_LIST_INT (0, target=93, operand=0)
138 ADD_TO_TRACE: _CHECK_VALIDITY_AND_SET_IP (0, target=95, operand=0x12f8181a6)
Looking at the surrounding code in the disassembly:
88 158 LOAD_FAST 8 (v1)
160 LOAD_CONST 2 (0)
162 COPY 2
164 COPY 2
166 BINARY_SUBSCR
170 LOAD_FAST 15 (dx)
172 LOAD_FAST 20 (b2m)
174 BINARY_OP 5 (*)
178 BINARY_OP 23 (-=)
182 SWAP 3
184 SWAP 2
186 STORE_SUBSCR
Here's line 88 in nbody.py, in `advance()`:
v1[0] -= dx * b2m
Honestly that doesn't make a lot of sense to me, maybe I've got my offsets mixed up. More later.
Continued lab notes:
- `DEOPT: [UOp _DEOPT (0, target=17, operand=0) -> STORE_SUBSCR_LIST_INT]` comes from `_ITER_NEXT_LIST` (expanded from `FOR_ITER_LIST`) at the top of the loop at line 80 (the first inner loop in `advance()`, `for ... in pairs`).
- `DEOPT: [UOp _DEOPT (0, target=181, operand=0) -> STORE_SUBSCR_LIST_INT]` comes from the same sequence at a different bytecode offset -- the second inner loop, at line 94, `for ... in bodies`.
- The `STORE_SUBSCR_LIST_INT` in the log is probably a mistake in the debugging output, showing the last instruction that didn't hit a `DEOPT_IF` or `EXIT_IF`.
Going back to @Fidget-Spinner's original investigation, I don't see how reducing `UOP_MAX_TRACE_LENGTH` from 800 to 400 could make a difference, since the longest trace in nbody has an unoptimized length of 279. I do see how reducing it to 200 makes a difference, since that would truncate the two longest traces, and then additional traces would eventually be created -- but I don't see how that would make things slower unless the JIT is still slower than Tier 1. (I have to try this and check the debug output.)
I certainly now disagree with @Fidget-Spinner's hypothesis:
> My hunch is that the trace lengths are too long to be worth optimizing right now. What I think is happening is that we are falling off trace way too early. Say for a trace of 800, the optimizer abstract interprets all 800 bytecode instructions. Whereas in reality, maybe only 40 of the instructions are executed before we fall off in most cases. In the above example, the instructions that are worth optimizing are only 40 out of the 800 instructions.
Because looking at the traces and the `DEOPT` logs printed by `PYTHON_LLTRACE=2`, there's nothing that exits the trace prematurely (other than, eventually, falling off the end of a `for` loop over either `pairs` or `bodies`). There are no `if` statements, just float and list operations.
Next I'm going to change my nbody.py to turn the optimizer on only around `bench_nbody()`, since I am getting annoyed at the traces for the pyperf infra (which isn't so easy to filter out in the `PYTHON_LLTRACE` debug output).
However, the experiment shows otherwise. Reducing `UOP_MAX_TRACE_LENGTH` from 800 to 400 changes the created traces. Here's the original set of traces, with 800:
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 34 -- length 269
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 362 -- length 100
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 24 -- length 279
Created a proto-trace for report_energy (/Users/guido/cpython/nbody.py:116) at byte offset 6 -- length 101
Created a proto-trace for report_energy (/Users/guido/cpython/nbody.py:116) at byte offset 128 -- length 65
Created a proto-trace for offset_momentum (/Users/guido/cpython/nbody.py:128) at byte offset 6 -- length 59
And here's the new set, with 400:
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 34 -- length 260
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 338 -- length 11
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 362 -- length 100
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 24 -- length 259
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 324 -- length 23
Created a proto-trace for report_energy (/Users/guido/cpython/nbody.py:116) at byte offset 6 -- length 101
Created a proto-trace for report_energy (/Users/guido/cpython/nbody.py:116) at byte offset 128 -- length 65
Created a proto-trace for offset_momentum (/Users/guido/cpython/nbody.py:128) at byte offset 6 -- length 59
Note the two runts of length 11 and 23, which are undoubtedly created by the side exits.
So with a claimed max length of 400, the actual max length is roughly 260. This is surprising.
The explanation is that `translate_bytecode_to_trace()` deducts one from its `max_trace_length` for each input bytecode:
// Need space for _DEOPT
max_length--;
In addition, it deducts one for each opcode that has an exit or an error. See also this comment in optimizer.c:
/* It is impossible for the number of exits to reach 1/4 of the total length,
* as the number of exits cannot reach 1/3 of the number of non-exits, due to
* the presence of CHECK_VALIDITY checks and instructions to produce the values
* being checked in exits. */
#define COLD_EXIT_COUNT (UOP_MAX_TRACE_LENGTH/4)
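That bookkeeping can be modeled in a few lines. In this toy model (the per-bytecode uop count and exit flag are made-up parameters for illustration, not the real expansion tables), each input bytecode costs the uops it emits, plus one slot deducted for `_DEOPT` space, plus one more if it can exit:

```python
# Toy model of why a "max trace length" of 400 yields traces of only ~260:
# per input bytecode, the budget loses its emitted uops plus reserved slots.
def max_emitted_uops(max_length, uops_per_bytecode=2, has_exit=True):
    """How many uops fit in a trace buffer of `max_length` under the model."""
    budget = max_length
    emitted = 0
    # Cost per input bytecode: its uops, + 1 slot reserved for _DEOPT,
    # + 1 slot for its exit stub if it has one.
    cost = uops_per_bytecode + 1 + (1 if has_exit else 0)
    while budget >= cost:
        budget -= cost
        emitted += uops_per_bytecode
    return emitted

max_emitted_uops(400)                  # exits reserved too
max_emitted_uops(400, has_exit=False)  # only the _DEOPT deduction
```

With these made-up numbers, a claimed maximum of 400 yields only 200-266 emitted uops, in the same ballpark as the ~260 actually observed.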
Now let's repeat the experiment with `UOP_MAX_TRACE_LENGTH` set to 200.
Surprise! It crashes with:
Assertion failed: (trace_length < max_length), function translate_bytecode_to_trace, file optimizer.c, line 881.
This looks like a failure in the space reservation algorithm, which could hit us in other ways too, so I should figure out the root cause (probably an insufficient `RESERVE()` call). But it feels like a possible explanation of @Fidget-Spinner's comment:
> Hmm more benchmarking reveals it's still slower on reduce trace length....the implication is the optimizer is doing nothing. (This is on a JIT build).
Could it have silently overflowed the buffer (assuming his JIT build did not have `--with-pydebug`)? Or silently bailed completely?
I notice that in his later experiments he used 400, not 200.
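To illustrate the suspected failure mode (an entirely hypothetical toy, not the real `translate_bytecode_to_trace()` logic): if some expansion reserves fewer uop slots than it actually emits, a debug build trips the assertion near the limit, while a release build silently writes past it:

```python
# Toy translator: each expansion is (slots_reserved, uops_emitted).
# The bug being modeled: we bail based on `reserved` but emit `emitted` uops.
def translate(expansions, max_length, debug=True):
    trace_length = 0
    for reserved, emitted in expansions:
        if trace_length + reserved > max_length:
            break  # bail when the *reserved* space doesn't fit...
        for _ in range(emitted):  # ...but then emit more uops than reserved
            trace_length += 1
            if debug:
                # Mirrors the failing assert in translate_bytecode_to_trace()
                assert trace_length <= max_length, "reservation undercounted"
    return trace_length

translate([(2, 3)] * 10, max_length=200)   # plenty of headroom: no problem
overflowed = False
try:
    translate([(2, 3)] * 100, max_length=200)
except AssertionError:
    overflowed = True  # a debug build catches the undercount near the limit
```

In a release build (`debug=False`) the same input quietly returns a trace length past `max_length` -- which would match the "silently overflowed" scenario suggested above.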
Next, let me go back to 400 and try to understand how effective the side exit warming is. (Note: I'm doing everything in my `exp-backoff` branch, i.e. gh-117144.)
Okay, with 800, no message starting with `SIDE EXIT:` is ever printed, so we're not ever taking side exits. Instead, we're deoptimizing whenever a for-loop terminates.
Aside: what happens on `DEOPT_IF(x)`?

- The code generator translates it into `if (x) JUMP_TO_JUMP_TARGET()`.
- `JUMP_TO_JUMP_TARGET()` does `goto jump_to_jump_target`, which reads the `jump_target` field from the last uop, sets `next_uop` to that target in the current trace, and goes to `tier2_dispatch`, i.e. executes the stub instruction.
- The stub is `_DEOPT`, whose body is `EXIT_TO_TIER1()`.
- That does `goto exit_to_tier1`, which prints `DEOPT:` (etc.) and then we go back to Tier 1.

And what for `EXIT_IF(x)`?

- It is used by guards (e.g. `_GUARD_IS_TRUE_POP`), type guards (e.g. `_GUARD_BOTH_INT`), and `_EXIT_TRACE`.
- The code generator likewise translates `EXIT_IF(x)` into `if (x) JUMP_TO_JUMP_TARGET()`, i.e. `goto jump_to_jump_target`.
- The stub is `_SIDE_EXIT`, whose body is `EXIT_TO_TRACE()`.
- That does `goto exit_to_trace`, which prints `SIDE EXIT:` (etc.), and then we continue in Tier 2 at `_COLD_EXIT`, where we decrement the temperature and either go to Tier 1, or, if the temperature hits zero (in my PR), look for an `ENTER_EXECUTOR` bytecode, or try to optimize; then we set `exit->executor` to the new executor, and start executing it.

More later.
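The exponential back-off mentioned for gh-117144 can be sketched as follows. This is an invented Python model of the idea (the class name and encoding are made up; it is not CPython's actual counter representation): each failed attempt to optimize at a side exit doubles the wait before the next try.

```python
# Invented sketch of exponential back-off for side-exit warming:
# every failed optimization attempt doubles the wait before the next one.
class BackoffCounter:
    def __init__(self, initial=16, maximum=4096):
        self.backoff = initial    # current wait between optimization attempts
        self.maximum = maximum    # cap so the wait can't grow without bound
        self.remaining = initial

    def hit(self) -> bool:
        """Record one pass through the exit; True means 'try to optimize now'."""
        self.remaining -= 1
        return self.remaining <= 0

    def failed(self):
        """Optimization failed: wait twice as long before the next attempt."""
        self.backoff = min(self.backoff * 2, self.maximum)
        self.remaining = self.backoff

counter = BackoffCounter(initial=2, maximum=16)
attempts = 0
for _ in range(100):       # 100 passes through an exit that never optimizes
    if counter.hit():
        attempts += 1
        counter.failed()
# Without back-off we would attempt every 2 passes (50 times);
# with it, only a handful of attempts happen.
```

This is the mechanism that avoids "unsuccessfully optimizing the same thing over and over", as discussed earlier in the thread.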
I hesitate to change the issue title, but I think the situation is quite different.
I just spent an hour and a half with @brandtbucher going over this. We found a few things.
We should probably change `_GUARD_NOT_EXHAUSTED_LIST` (and the two other similar uops for range and tuple) to use `EXIT_IF` instead of `DEOPT_IF`. This requires changing something in `prepare_for_execution()`, because that currently assumes that there is only one stub (exit OR deopt) per Tier 1 bytecode -- there are other guards in expansions of `FOR_ITER` that use `DEOPT_IF` for a legitimate reason (e.g. `DEOPT_IF(Py_TYPE(iter) != &PyListIter_Type);` in `_ITER_CHECK_LIST`). (A quick hack replaces `if (target != current_jump_target)` with `if (1)`.)
At least when I run the nbody benchmark, I find that `loops = 1`, `reference = 'sun'`, and `iterations = 20000`. Moreover, the length of `pairs` is 10 and the length of `bodies` is 5.
Let's name the loops in `advance()` (A), (B) and (C): outer loop (A), first inner loop (B), second inner loop (C). There's an executor for each, starting at the `JUMP_BACKWARD` instruction.
The first loop to be optimized is (B), which iterates over the 10 pairs, doing a lot of float ops (no branches or calls). Next is (C), looping over the 5 bodies, doing less work; it easily fits in even the smallest trace size contemplated. Finally, (A) has many more iterations (20000), and soon it is optimized into something that also covers the first iteration.
With the larger `UOP_MAX_TRACE_LENGTH` value (800), all traces fit, and a single trace ending in `_JUMP_TO_TOP` is created for each.
With the smaller value (400), the traces for (A) and (B) don't fit, and each ends somewhere in the middle of the iteration; (A) a bit earlier than (B).
They no longer contain `_JUMP_TO_TOP`, but instead end in `_EXIT_TRACE`, which eventually is stitched to "remainder" traces (A1) and (B1). The remainder traces are then stitched to the top of the corresponding main trace. (We think.)
We explored a few ways that somehow the version using a single perfect large trace ending in `_JUMP_TO_TOP` is slower than the version that stitches (A) to (A1) back to (A).
Eventually @brandtbucher hypothesized that the version with a single perfect trace calls `CHECK_EVAL_BREAKER()` for every iteration of every loop (i.e., 20000 * (10 + 5) == 300000 times), whereas in the stitching version it is called basically never (only a few times while the first few outer iterations warm up).
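The hypothesized gap can be sanity-checked with trivial arithmetic (a toy count, assuming one `CHECK_EVAL_BREAKER()` per `_JUMP_TO_TOP`, i.e. one per inner-loop iteration, in the single-trace version):

```python
# Back-of-the-envelope count of eval-breaker checks under the two trace
# shapes, using nbody's loop counts from above.
iterations, pairs, bodies = 20000, 10, 5

# Single perfect trace per loop: _JUMP_TO_TOP (and its eval-breaker check)
# runs once per iteration of every inner loop.
single_trace_checks = iterations * (pairs + bodies)

# Stitched version: side exits route back without reaching _JUMP_TO_TOP,
# so (per the hypothesis) checks only happen during the few warmup passes.
stitched_checks = 0
```

So if the hypothesis were right, the single-trace version would pay for roughly 300000 extra checks; as the later measurements show, that turned out not to matter for nbody.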
This is just a theory so far. I will try to confirm by building with JIT, running the nbody benchmark using pyperformance, and then rebuilding with the `CHECK_EVAL_BREAKER()` simply removed from `_JUMP_TO_TOP`.
Alas, that theory (that the JIT is faster with stitched traces than with a single trace per loop because of `CHECK_EVAL_BREAKER()`) isn't borne out. I tried a trace max length of 800 with two variants of the code:
- Unmodified: my `exp-backoff` branch
- Modified: with the `CHECK_EVAL_BREAKER()` commented out in `_JUMP_TO_TOP`, and running `make regen-cases`

I then ran the nbody benchmark, like this, on my Mac M3 (EDIT: see later, I wasn't testing right):
pyperformance run -p ~/cpython/python.exe -b nbody
Alas, I get about the same speed (within variance):
Mean +- std dev: 54.0 ms +- 0.4 ms
Mean +- std dev: 53.9 ms +- 0.2 ms
Next I tried the "regular" version with `UOP_MAX_TRACE_LENGTH` set to 400 and 200, and it's slower:
Mean +- std dev: 55.9 ms +- 0.2 ms
Mean +- std dev: 56.8 ms +- 0.3 ms
Maybe @Fidget-Spinner can repeat these experiments? Perhaps things are different on Linux or on x86.
The experiments are on Tier 2 without JIT; Tier 2 with JIT shows no difference in the numbers.
Note: you can run nbody without full pyperformance installed to save yourself some time, as long as you have pyperf. Just run `./python.exe ./bm_nbody.py -w 20 -o foo.json` after running `pyperf system tune`.
I started this thread because I was curious why JIT with abstract interpreter off vs on had almost no difference. That whole issue seems to be fixed by Guido's current reimplemented tier 2 counters branch:
Main with abstract interpreter on vs off.
# warmups = 20
Mean +- std dev: [main_w_20_opt] 106 ms +- 1 ms -> [main_w_20_no_opt] 109 ms +- 2 ms: 1.03x slower
Which is basically no change (I take anything below 4-5% on benchmarks to be attributable to noise). So I was puzzled whether the abstract interpreter even does anything.
Guido's branch with reimplemented tier 2 counters. Abstract interpreter on vs off:
warmups = 20
Mean +- std dev: [backoff_w_20_opt] 103 ms +- 2 ms -> [backoff_w_20_no_opt] 109 ms +- 3 ms: 1.06x slower
So I guess we can close this issue with Guido's PR merging soon. I am really really sorry if I miscommunicated anything or led you down a rabbit hole @gvanrossum. Sorry that I wasn't clear either by vaguely saying "optimizer", which could mean tier 2, uops, abstract interpreter, or all optimization passes. It seems your PR does fix my issue.
I have tried to repro @Fidget-Spinner's results. In particular, all combinations of these two diffs:
For each combination, I ran with `PYTHON_UOPS=0` (Tier 1 only) and `PYTHON_UOPS=1` (Tier 2 interpreter).
The Tier 1 speed is unaffected by the various choices; here's a typical measurement:
Mean +- std dev: 57.2 ms +- 1.4 ms
The rest of my measurements are all for `PYTHON_UOPS=1` (Tier 2 interpreter).
With max trace 800:
Mean +- std dev: 90.3 ms +- 0.5 ms
Mean +- std dev: 95.4 ms +- 0.5 ms
With max trace 400:
Mean +- std dev: 93.0 ms +- 0.3 ms
Mean +- std dev: 98.7 ms +- 0.5 ms
My conclusion, limited to nbody:
I don't see any evidence that reducing the max trace length is beneficial, nor that the optimizer makes things worse.
PS. The JIT numbers I reported above show that the JIT has a slight positive effect for nbody:
Mean +- std dev: 54.0 ms +- 0.4 ms
Before I let it rest I'll have a look at whether the optimizer does much for the nbody benchmark.
A quick analysis shows that the optimizer removed a bunch of `_GUARD_BOTH_FLOAT` uops and their corresponding side exits. Where the optimized list has 193 entries, the unoptimized (but still peepholed) list has 215, including stubs.
So the optimizer does its job!
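As an illustration of the kind of win involved, here is a toy version of that redundancy elimination (a sketch only -- the real pass in the abstract interpreter tracks much richer symbolic information than a set of known-float names):

```python
# Toy guard elimination: walk a trace tracking which operands are already
# proven to be floats, and drop any _GUARD_BOTH_FLOAT whose operands were
# both checked earlier on the same path.
def remove_redundant_float_guards(trace):
    known_float = set()
    optimized = []
    for op, args in trace:
        if op == "_GUARD_BOTH_FLOAT" and all(a in known_float for a in args):
            continue  # both operands already proven float: drop the guard
        if op == "_GUARD_BOTH_FLOAT":
            known_float.update(args)  # after this guard, both are floats
        optimized.append((op, args))
    return optimized

trace = [
    ("_GUARD_BOTH_FLOAT", ("x", "y")),
    ("_BINARY_OP_MULTIPLY_FLOAT", ("x", "y")),
    ("_GUARD_BOTH_FLOAT", ("x", "y")),   # redundant: types already known
    ("_BINARY_OP_ADD_FLOAT", ("x", "y")),
]
optimized = remove_redundant_float_guards(trace)
```

Dropping a guard also drops its side-exit stub, which is where the 215-to-193 shrinkage reported above comes from.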
> - Modified: with the `CHECK_EVAL_BREAKER()` commented out in `_JUMP_TO_TOP`, and running `make regen-cases`
Oof, sorry. `_JUMP_TO_TOP` is handled as a special case in the JIT. So you'd actually need to comment out the eval breaker check in `template.c` and rebuild the JIT to actually skip it.
Okay, will try that.
JIT with this diff (and everything else unchanged):
diff --git a/Tools/jit/template.c b/Tools/jit/template.c
index 351bc2f3dd4..1b5ddf98565 100644
--- a/Tools/jit/template.c
+++ b/Tools/jit/template.c
@@ -97,7 +97,7 @@ _JIT_ENTRY(_PyInterpreterFrame *frame, PyObject **stack_pointer, PyThreadState *
// The actual instruction definitions (only one will be used):
if (opcode == _JUMP_TO_TOP) {
- CHECK_EVAL_BREAKER();
+ // CHECK_EVAL_BREAKER();
PATCH_JUMP(_JIT_TOP);
}
switch (opcode) {
First and second run (same setup):
Mean +- std dev: 52.9 ms +- 0.9 ms
Mean +- std dev: 52.3 ms +- 1.0 ms
With the eval breaker enabled (the patch removed, expected to be slightly slower), two runs:
Mean +- std dev: 53.0 ms +- 0.7 ms
Mean +- std dev: 52.7 ms +- 0.7 ms
That doesn't look like it made a difference (for nbody).
I'm guessing that there's enough "meat" in even the innermost loop in nbody that checking a few bits in `tstate->eval_breaker` doesn't matter. I will stop losing sleep over this one. (But we should still fix the two bugs linking to this issue.)
I decided to benchmark nbody using pyperf on my computer.
When I turn on the optimizer (`optimize_uops`) in CPython main + JIT + Guido's exponential backoff and GHCCC, optimizer on is 3-10% slower than optimizer off.
However, with above settings and reducing trace length from 800 to 200, optimizer on is 2% faster than off.
My hunch is that the trace lengths are too long to be worth optimizing right now. What I think is happening is that we are falling off trace way too early. Say for a trace of 800, the optimizer abstract interprets all 800 bytecode instructions. Whereas in reality, maybe only 40 of the instructions are executed before we fall off in most cases. In the above example, the instructions that are worth optimizing are only 40 out of the 800 instructions.
We should seriously reconsider reducing trace length.
In other news, I tried encouraging inlining of all optimizer functions by placing them in the same compilation unit. No speedup there with LTO. So that means LTO is already doing a good job.