mdboom opened 8 months ago
Drilling down on the individual benchmarks that are "heavily interpreter", these are the ones that get a lot slower with the JIT: unpack_sequence, hexiom, go, raytrace, chaos. Maybe looking at these more closely (with the pystats, perhaps) will reveal ways to improve the JIT.
Did those get slower with the JIT, or tier two in general?
This is all with the JIT vs. Tier 1.
Here are those same top 5 slowdowns (measured on the JIT), shown for both Tier 2 and the JIT:
Benchmark | Tier 2 (vs. Tier 1) | JIT (vs. Tier 1) |
---|---|---|
unpack_sequence | 1.02x | 1.89x |
hexiom | 1.48x | 1.14x |
go | 1.28x | 1.12x |
raytrace | 1.30x | 1.09x |
chaos | 1.27x | 1.09x |
So interestingly, `unpack_sequence` is the only one in this bunch where the JIT makes things slower than Tier 2. Yet another reason that this benchmark is a weird outlier. It does seem like detecting whatever causes this to get slower (too much code, maybe?) and bailing out of the JIT in that case might make sense.
For completeness, I did a full comparison of JIT vs. Tier 2 (not something we usually generate), and the only interpreter-heavy benchmark where the JIT actually makes things slower than Tier 2 is `unpack_sequence`.
Another interesting observation: the `go` and `hexiom` benchmarks both spend about 1.5% of their time in `uop_optimize`. Maybe too many failed optimization attempts? We currently see 99.5% of optimization attempts getting rejected as too short, so maybe if we can do less work before bailing out on those we could move the needle...
We might also want to shrink the size of the abstract interpreter -- probably 1024 for the max size, and 3× the trace length for the arena, is more than sufficient for everything.
@Fidget-Spinner
> We might also want to shrink the size of the abstract interpreter
That's not really going to make it faster though, right? It saves some C stack space, but these buffers aren't all that big (by modern standards -- 30 years ago I'd be horrified :-) and we don't ever iterate over the entire buffer. (Or do we? In which case I'd take it back.)
@mdboom
> Another interesting observation: the `go` and `hexiom` benchmarks both spend about 1.5% of their time in `uop_optimize`. Maybe too many failed optimization attempts? We currently see 99.5% of optimization attempts getting rejected as too short, so maybe if we can do less work before bailing out on those we could move the needle...
Optimization attempts that fail because the trace is too short don't take up much time though -- they only process a few bytecodes and skip all of the expensive work (peephole, abstract interpreter, executor construction). Is the 99.5% rejected across all benchmarks or specific to those two? I'd almost think that if a benchmark spends a lot of time in `uop_optimize` that points to very long traces in combination with some quadratic aspect of the optimizer.
> Is the 99.5% rejected across all benchmarks or specific to those two? I'd almost think that if a benchmark spends a lot of time in `uop_optimize` that points to very long traces in combination with some quadratic aspect of the optimizer.
The 99.5% rejected for being too short is across all benchmarks, but individually `go` and `hexiom` show that as well.
99.8% of `go` benchmark traces are between 8 and 16 instructions long. 96.7% of `hexiom` benchmark traces are between 16 and 32 instructions long. So these aren't especially long.
I'll see if I can get some more detail out of the profiler to see if there's anything in particular in `uop_optimize` to look at.
Did these numbers change recently? I wonder if side-exits (a relatively new feature) might be involved. Maybe we should count optimization attempts from side-exits separately from those initiated by `JUMP_BACKWARD`?
> @Fidget-Spinner
>
> > We might also want to shrink the size of the abstract interpreter
>
> That's not really going to make it faster though, right? It saves some C stack space, but these buffers aren't all that big (by modern standards -- 30 years ago I'd be horrified :-) and we don't ever iterate over the entire buffer. (Or do we? In which case I'd take it back.)
I had a talk with @brandtbucher and we thought that it might be possible that it's making stack-spilled variables extremely slow to access. Imagine having `rbp+0xbig` just to get a variable.
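To make that concrete, here's a tiny illustrative C sketch (the struct, the array size, and the function name are all made up, not the real abstract-interpreter limits): once a frame contains a couple of multi-kilobyte arrays, everything else the compiler spills to the stack ends up at a large frame-pointer offset.

```c
#include <stdint.h>

#define MAX_ABSTRACT_STACK 4096  /* hypothetical size, not CPython's actual limit */

typedef struct {
    uint64_t bits;
} AbstractValue;

uint64_t
frame_layout_demo(void)
{
    /* Two ~32 KiB arrays on the C stack: any scalar the compiler spills
       below them is addressed at a large offset from the frame pointer
       (something like [rbp - 0x10000]). */
    AbstractValue stack[MAX_ABSTRACT_STACK];
    AbstractValue locals[MAX_ABSTRACT_STACK];

    stack[0].bits = 1;
    locals[0].bits = 2;
    return stack[0].bits + locals[0].bits;
}
```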
Also, Brandt mentioned that maybe the small helper functions we have in the uops symbols are not being inlined, because they are in a separate compilation unit, so they won't automatically be inlined unless LTO is on (and even then, will the compiler always do it?). Maybe we want to mark them `static inline` and put them in a header file to hint to the C compiler?
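Roughly, the suggestion would look something like this sketch (hypothetical header and names; the real uop-symbol helpers have different names and signatures): move the definition into a shared header as `static inline`, so every translation unit that includes it can inline the call without relying on LTO.

```c
/* pycore_uop_sym_inline.h -- hypothetical header, illustrative only */
#ifndef PYCORE_UOP_SYM_INLINE_H
#define PYCORE_UOP_SYM_INLINE_H

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int flags;
    void *const_val;   /* stand-in for the real symbol fields */
} uop_sym;

/* Because the definition is visible in every includer, the compiler can
   inline this trivial predicate at each call site without LTO. */
static inline bool
uop_sym_is_known_const(const uop_sym *sym)
{
    return sym->const_val != NULL;
}

#endif /* PYCORE_UOP_SYM_INLINE_H */
```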
> Maybe we want to mark them `static inline` and put them in a header file to hint to the C compiler?
That seems worth the effort to at least try.
> I wonder if side-exits (a relatively new feature) might be involved. Maybe we should count optimization attempts from side-exits separately from those initiated by `JUMP_BACKWARD`?
The recent bug we had in the stats was that the side exit optimization attempts were not being counted (but all of the various error or success cases were). From that, we already know that the side exit stuff creates orders of magnitude more optimization attempts (1,500 traces vs. 1,267,520 side exits). As you say, that's all fine if most of them are short and bail out quickly, but it is a lot of work that ends up being thrown away if there's a faster way to know if we don't need to do it (I'm not sure there is...)
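As a sketch of the counting idea (made-up names, and ignoring how pystats actually wires its counters up), it would just be one counter per trigger site:

```c
#include <stdint.h>

/* Hypothetical split of the single "optimization attempts" counter. */
typedef struct {
    uint64_t attempts_from_jump_backward;  /* initiated at a loop edge */
    uint64_t attempts_from_side_exit;      /* initiated by a cold exit warming up */
} OptimizerAttemptStats;

static OptimizerAttemptStats attempt_stats;

static inline void
count_optimization_attempt(int from_side_exit)
{
    if (from_side_exit) {
        attempt_stats.attempts_from_side_exit++;
    }
    else {
        attempt_stats.attempts_from_jump_backward++;
    }
}
```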
I looked at hexiom in detail with uops debugging on, and it looks like the optimizer keeps creating executors for line 274:
```python
possible = sum((1 if v in cell else 0) for cell in done.cells)
```
That's a generator expression called via C (`sum()`). Maybe something's wrong there?
The message I see repeatedly is
```
Created a trace for constraint_pass (/Users/guido/pyperformance/pyperformance/data-files/benchmarks/bm_hexiom/run_benchmark.py:237) at byte offset 816 -- length 29
```
Looking at the disassembly incorporated into the executor, I think it is this:
```
274 L19:  816 LOAD_GLOBAL_BUILTIN 27 (sum + NULL)
          826 LOAD_FAST 19 (v)
          828 BUILD_TUPLE 1
          830 LOAD_CONST 7 (<code object <genexpr> at 0x10350bac0, file "/Users/guido/pyperformance/pyperformance/data-files/benchmarks/bm_hexiom/run_benchmark.py", line 274>)
          832 MAKE_FUNCTION
          834 SET_FUNCTION_ATTRIBUTE 8 (closure)
          836 LOAD_FAST 4 (done)
          838 LOAD_ATTR_INSTANCE_VALUE 20 (cells)
          858 GET_ITER
          860 CALL_PY_EXACT_ARGS 0
```
This is not the top of a loop, so it must be a side exit. Indeed, some more context shows:
```python
for v in range(8):
    # If there is none, remove the possibility from all tiles
    if (pos.tiles[v] > 0) and (left[v] == 0):
        if done.remove_unfixed(v):
            changed = True
    else:
        possible = sum((1 if v in cell else 0) for cell in done.cells)  # <----------- HERE
```
After turning on the right debug variable (`PYTHON_OPT_DEBUG=3`) I found the culprit:
```
Abstract interpreting _INIT_CALL_PY_EXACT_ARGS:0 Encountered error in abstract interpreter
```
Maybe @Fidget-Spinner can diagnose this part further?
There's more though. We should fail more loudly here -- `optimize_uops()` silently returns 0 in this case, and then `_Py_uop_analyze_and_optimize()` also returns 0, and then `uop_optimize()` also returns 0, and `_PyOptimizer_Optimize()` ditto. But... since we know a side exit is involved, I looked at the logic for `_COLD_EXIT` in bytecodes.c, where I find these two lines that I don't follow:
```c
int32_t new_temp = -1 * tstate->interp->optimizer_side_threshold;
exit->temperature = (new_temp < INT16_MIN) ? INT16_MIN : new_temp;
```
Maybe the temperature calculation is wrong? @markshannon, there definitely seems to be something that needs to be tuned better in case the abstract interpreter bails.
PS. The "Created a trace for ..." debug message does not indicate that an executor was successfully created. It only means that the initial step of that process, trace projection, succeeded with a not-too-short trace. We then do various other things which may fail (-1) or bail (0) before constructing the final executor. We should probably add level 1 or level 2 debug messages for the various possible bail points. (And these should get the level from `PYTHON_LLTRACE`, not from `PYTHON_OPT_DEBUG`.)
> Maybe @Fidget-Spinner can diagnose this part further?
The only real reason `_INIT_CALL_PY_EXACT_ARGS` can fail is the function object not being present/set during the trace (https://github.com/python/cpython/blob/main/Python/optimizer_bytecodes.c#L548). This can only happen if the function version isn't in the cache (https://github.com/python/cpython/blob/c432df6d56f3e02530132321b47dcc7b914a3660/Python/optimizer.c#L740).
In general, function versions are shared across code objects, with some checks. However, I don't know if that works for generator expressions. It might be that a brand-new function version is used every single time, causing it to deopt out of there.
We should just tell tier 2 to never optimize that trace.
There were some stats about the new optimizer that weren't added to the output tables. Once I did that and ran it locally, I see that out of 2,674,828 optimizer attempts, only 92,441 (3.5%) are successful. I don't know if that's just expected at this point. The `go` and `hexiom` benchmarks are good examples of ones that have an even lower success rate. I don't know if that's helpful, @Fidget-Spinner, but thought I'd share.
Some more lab notes:
For a smaller demo, one of the issues is that we can't translate `YIELD_VALUE` to Tier 2. In a toy function that repeatedly calls a small generator expression, e.g.
```python
def testfunc():
    for i in range(20):
        sum(i for i in range(20))
```
the following seems to happen:

- Trace projection for the loop stops when it reaches the generator's `YIELD_VALUE`, putting an `_EXIT_TRACE` there.
- Every time that side exit warms up again, the new optimization attempt immediately runs into the same `YIELD_VALUE`, so it fails.

It seems that we should implement some kind of exponential back-off to avoid such scenarios.
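As a rough sketch of what exponential back-off for a side exit could look like (hypothetical fields and constants, not the real exit data structure): every failed attempt makes the exit colder by a doubling amount, so a trace that can never be optimized gets retried less and less often.

```c
#include <stdint.h>

#define INITIAL_BACKOFF 16
#define MAX_BACKOFF     (1 << 14)

typedef struct {
    int32_t temperature;  /* counts up toward 0; try to optimize at 0 */
    int32_t backoff;      /* how far below 0 we reset after a failure */
} exit_counter;

static void
exit_counter_init(exit_counter *exit)
{
    exit->backoff = INITIAL_BACKOFF;
    exit->temperature = -INITIAL_BACKOFF;
}

/* Call when projecting/optimizing a trace from this exit failed or bailed. */
static void
exit_counter_on_failure(exit_counter *exit)
{
    if (exit->backoff < MAX_BACKOFF) {
        exit->backoff *= 2;   /* exponential: 16, 32, 64, ... up to the cap */
    }
    exit->temperature = -exit->backoff;
}

/* Call on every execution of the exit; returns nonzero when it is hot. */
static int
exit_counter_is_hot(exit_counter *exit)
{
    return ++exit->temperature >= 0;
}
```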
This is not actually unique to generator expressions -- I can reproduce it with a generator function too:
```python
def generator():
    for i in range(20):
        yield i

def testfunc():
    for i in range(20):
        x = sum(generator())
    return x
```
Are there really only 5 bits for the temperature? Or do we have a chance to implement exponential backoff here?
Another approach might be that if we know for sure that a trace can't be produced for a given side exit (like in this case), we generate a different exit instruction that uses `DEOPT_IF(1)` instead of `SIDE_EXIT(1)`. I'm not sure how to reliably decide this though.
The other thing I learned is that, indeed, somehow the generator expression's function object isn't in the function version cache. This seems to happen more for generator expressions, but I can reproduce it with nested generator functions too. It seems that frequently calling `LOAD_CONST; MAKE_FUNCTION` somehow evicts the original cache entry. However, for generator expressions it's worse -- these never seem to even enter the cache.
> out of 2,674,828 optimizer attempts, only 92,441 (3.5%) are successful
That's definitely not expected. I am on the trail of several issues (see my messages above).
Oh. The difference between generator expressions and generator functions is that a generator function (the way I wrote the code, anyway) is stored in a local, and thus has a lifetime that overlaps with the previous incarnation of the same function. And the reason this matters is that when a function object is deallocated, it is evicted from the version cache.
When using a generator function, the sequence of bytecode is
```
LOAD_CONST
MAKE_FUNCTION
STORE_FAST
```
and the function object thus created stays alive in the fast locals until the next iteration.
Here, `MAKE_FUNCTION` calls `_PyFunction_SetVersion()`, which puts the function object in the cache. The `STORE_FAST` decrefs the previous version of the function (created by the previous iteration), which attempts to evict that previous version from the version cache. However, since the cache was already overwritten with the new function object, it doesn't actually evict anything.
When projecting the trace, looking up the version number in the cache will then produce the correct function object, and all is well. (Except... At some later point it is also evicted. I don't understand that yet.)
OTOH, for a generator expression, the code sequence is
```
LOAD_CONST
MAKE_FUNCTION
```
i.e., the freshly created function object isn't stored in a local variable, just on the stack. When that stack entry is eventually popped, the function object is decrefed, which evicts it from the version cache, successfully.
When projecting the trace, we therefore find nothing in the cache!
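Here's a minimal sketch of the eviction behaviour described above, with made-up names and sizes (the real cache hangs off the interpreter state and is keyed by the function version): setting a version overwrites the slot unconditionally, while deallocation only clears the slot if it still points at the dying function. That's why the generator-function case still finds a live entry (the new function), while the generator-expression case finds nothing.

```c
#include <stddef.h>
#include <stdint.h>

#define FUNC_VERSION_CACHE_SIZE 256   /* made-up size */

typedef struct Func {
    uint32_t version;
} Func;

static Func *func_version_cache[FUNC_VERSION_CACHE_SIZE];

/* MAKE_FUNCTION path: unconditionally claim the slot for this version. */
static void
cache_set_version(Func *func, uint32_t version)
{
    func->version = version;
    func_version_cache[version % FUNC_VERSION_CACHE_SIZE] = func;
}

/* Function deallocation path: only evict if the slot still points at us. */
static void
cache_evict_on_dealloc(Func *func)
{
    Func **slot = &func_version_cache[func->version % FUNC_VERSION_CACHE_SIZE];
    if (*slot == func) {
        /* Generator-expression case: nothing replaced this entry, so it is
           really removed and trace projection later finds an empty slot. */
        *slot = NULL;
    }
    /* Generator-function case: the next iteration's MAKE_FUNCTION already
       overwrote the slot with the new function, so nothing is evicted and
       the lookup during trace projection still succeeds. */
}
```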
I'm not sure what to do about this. We may have to talk about it in person Monday. Could we perhaps make the version cache store code objects instead of function objects? The trace projection process ultimately needs a code object anyway. (But I understand there's a subtle issue with changes to the function object's `__code__` attribute.)
Separately, there's the mystery that after some number of successful iterations, the variant using a generator function also ends up finding nothing in the cache. I haven't looked into this yet -- I thought the issue with generator expressions was more pressing. (EDIT: See next comment; this is explainable.)
In any case I believe that the things I've found so far (generator expressions, and the lack of exponential backoff for side exits) explain the problem with the "hexiom" benchmark spending a lot of time in `uop_optimize`, and I suspect it's similar for "go", and possibly for the rest.
It looks like the issue with generator functions finding nothing in the cache is separate -- it's a loop in the module being run through the Tier 2 optimizer, tracing into `testfunc()`, and attempting to trace into `generator()`, but at that point there is no local variable keeping the generator function alive, so the cache miss is legitimate.
@mdboom, this issue is yours again -- I have created various cpython issues to deal with the things I found.
By "characterization", I mean which category of functions dominates the runtime of each benchmark.
If we organize them by the top category in each benchmark, we get the following:
Benchmark by top profiling category
```python
characterizations = {
    "interpreter": [
        "2to3", "aiohttp", "chameleon", "chaos", "comprehensions", "coroutines",
        "coverage", "crypto_pyaes", "dask", "deepcopy", "deltablue",
        "django_template", "djangocms", "docutils", "dulwich_log", "fannkuch",
        "float", "generators", "genshi", "go", "gunicorn", "hexiom", "html5lib",
        "logging", "mako", "mypy2", "nbody", "nqueens", "pickle_pure_python",
        "pprint", "pycparser", "pyflate", "pylint", "raytrace", "regex_compile",
        "richards", "richards_super", "scimark", "spectral_norm", "sqlglot",
        "sqlglot_optimize", "sqlglot_parse", "sqlglot_transpile", "sympy",
        "thrift", "tomli_loads", "tornado_http", "typing_runtime_protocols",
        "unpack_sequence", "unpickle_pure_python", "xml_etree",
    ],
    "memory": [
        "async_generators", "json_dumps", "python_startup",
        "python_startup_no_site", "unpickle_list",
    ],
    "gc": [
        "async_tree", "async_tree_cpu_io_mixed", "async_tree_cpu_io_mixed_tg",
        "async_tree_io", "async_tree_io_tg", "async_tree_memoization",
        "async_tree_memoization_tg", "async_tree_tg", "gc_collect",
        "gc_traversal",
    ],
    "kernel": ["asyncio_tcp", "concurrent_imap", "pathlib"],
    "libc": ["asyncio_tcp_ssl"],
    "library": [
        "asyncio_websockets", "json", "json_loads", "pickle", "pickle_dict",
        "pickle_list", "regex_dna", "regex_effbot", "regex_v8", "sqlite_synth",
        "telco",
    ],
    "tuple": ["mdp"],
    "miscobj": ["meteor_contest"],
    "int": ["pidigits"],
    "str": ["unpickle"],
}
```

If you refine this to only include a benchmark in a category if that category represents more than 50% of the runtime:
Benchmarks that are heavily (over 50%) in a particular category
```python
over_50 = {
    "kernel": ["asyncio_tcp"],
    "libc": ["asyncio_tcp_ssl"],
    "library": [
        "asyncio_websockets", "pickle", "pickle_dict", "pickle_list",
        "regex_dna", "regex_effbot",
    ],
    "interpreter": [
        "chaos", "coroutines", "deepcopy", "deltablue", "generators", "go",
        "hexiom", "logging", "nbody", "pickle_pure_python", "pprint",
        "raytrace", "richards", "richards_super", "sqlglot_parse",
        "tomli_loads", "unpack_sequence", "unpickle_pure_python",
    ],
    "gc": ["gc_collect", "gc_traversal"],
    "int": ["pidigits"],
}
```

Interestingly, this doesn't seem to reveal too much related to profiling. (Admittedly, the only category where we would expect significant change is "interpreter".) The following results are for JIT (main) vs. Tier 1 (same commit), HPT at the 99th percentile:
Using only benchmarks where 50% of time is in a single category: