faster-cpython / ideas


Investigation: Group benchmarks based on their profiling "characterization" #664


mdboom commented 8 months ago

By "characterization", I mean what category of functions dominate the runtime of each benchmark.

If we organize them by the top category in each benchmark, we get the following:

Benchmark by top profiling category:

```python
characterizations = {
    "interpreter": [
        "2to3", "aiohttp", "chameleon", "chaos", "comprehensions",
        "coroutines", "coverage", "crypto_pyaes", "dask", "deepcopy",
        "deltablue", "django_template", "djangocms", "docutils",
        "dulwich_log", "fannkuch", "float", "generators", "genshi", "go",
        "gunicorn", "hexiom", "html5lib", "logging", "mako", "mypy2",
        "nbody", "nqueens", "pickle_pure_python", "pprint", "pycparser",
        "pyflate", "pylint", "raytrace", "regex_compile", "richards",
        "richards_super", "scimark", "spectral_norm", "sqlglot",
        "sqlglot_optimize", "sqlglot_parse", "sqlglot_transpile", "sympy",
        "thrift", "tomli_loads", "tornado_http",
        "typing_runtime_protocols", "unpack_sequence",
        "unpickle_pure_python", "xml_etree",
    ],
    "memory": [
        "async_generators", "json_dumps", "python_startup",
        "python_startup_no_site", "unpickle_list",
    ],
    "gc": [
        "async_tree", "async_tree_cpu_io_mixed", "async_tree_cpu_io_mixed_tg",
        "async_tree_io", "async_tree_io_tg", "async_tree_memoization",
        "async_tree_memoization_tg", "async_tree_tg", "gc_collect",
        "gc_traversal",
    ],
    "kernel": ["asyncio_tcp", "concurrent_imap", "pathlib"],
    "libc": ["asyncio_tcp_ssl"],
    "library": [
        "asyncio_websockets", "json", "json_loads", "pickle", "pickle_dict",
        "pickle_list", "regex_dna", "regex_effbot", "regex_v8",
        "sqlite_synth", "telco",
    ],
    "tuple": ["mdp"],
    "miscobj": ["meteor_contest"],
    "int": ["pidigits"],
    "str": ["unpickle"],
}
```

If you refine this to only include a benchmark in a category if that category represents more than 50% of the runtime:

Benchmarks that are heavily (over 50%) in a particular category:

```python
over_50 = {
    "kernel": ["asyncio_tcp"],
    "libc": ["asyncio_tcp_ssl"],
    "library": [
        "asyncio_websockets", "pickle", "pickle_dict", "pickle_list",
        "regex_dna", "regex_effbot",
    ],
    "interpreter": [
        "chaos", "coroutines", "deepcopy", "deltablue", "generators", "go",
        "hexiom", "logging", "nbody", "pickle_pure_python", "pprint",
        "raytrace", "richards", "richards_super", "sqlglot_parse",
        "tomli_loads", "unpack_sequence", "unpickle_pure_python",
    ],
    "gc": ["gc_collect", "gc_traversal"],
    "int": ["pidigits"],
}
```
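(For reference, here is a minimal sketch of how groupings like the two above can be computed, assuming a `profiles` dict mapping each benchmark to its fraction of runtime per category; the variable names and sample data are illustrative, not the real profiling output format.)

```python
from collections import defaultdict

# Illustrative input: fraction of runtime per category for each benchmark.
profiles = {
    "nbody": {"interpreter": 0.62, "memory": 0.21, "library": 0.17},
    "json_loads": {"library": 0.48, "interpreter": 0.30, "memory": 0.22},
}

def group_by_top_category(profiles):
    """Group each benchmark under the category with the largest fraction."""
    groups = defaultdict(list)
    for benchmark, fractions in profiles.items():
        top = max(fractions, key=fractions.get)
        groups[top].append(benchmark)
    return {cat: sorted(names) for cat, names in groups.items()}

def group_over_threshold(profiles, threshold=0.5):
    """Only keep benchmarks whose top category exceeds the threshold."""
    groups = defaultdict(list)
    for benchmark, fractions in profiles.items():
        top = max(fractions, key=fractions.get)
        if fractions[top] > threshold:
            groups[top].append(benchmark)
    return {cat: sorted(names) for cat, names in groups.items()}
```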

Interestingly, this grouping doesn't seem to reveal too much. (Admittedly, the only category where we would expect significant change is "interpreter".) The following results are for the JIT (main) vs. Tier 1 (same commit), HPT at the 99th percentile:

interpreter: 1.01x slower
memory: 1.00x slower
gc: 1.01x slower
kernel: 1.00x slower
libc: 1.00x slower
library: 1.00x slower
tuple: 1.00x slower
miscobj: 1.00x slower
int: 1.00x faster
str: 1.00x faster

Using only benchmarks where more than 50% of the time is in a single category:

interpreter: 1.00x slower
kernel: 1.00x slower
libc: 1.00x slower
library: 1.00x slower
gc: 1.00x slower
int: 1.00x faster
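(For reference, a per-category roll-up like the two lists above can be approximated with a plain geometric mean of per-benchmark time ratios. The sketch below is only that approximation, not the actual HPT calculation at the 99th percentile; the function and sample numbers are made up.)

```python
import math

def per_category_speedup(ratios, groups):
    """Geometric mean of per-benchmark time ratios for each category.

    `ratios` maps benchmark name to a JIT/Tier-1 time ratio (> 1.0 means
    the JIT is slower); `groups` is a category -> [benchmark, ...] mapping
    like the ones above. Both inputs below are made up for the example.
    """
    result = {}
    for category, benchmarks in groups.items():
        rs = [ratios[b] for b in benchmarks if b in ratios]
        if rs:
            result[category] = math.exp(sum(map(math.log, rs)) / len(rs))
    return result

# Prints roughly {'interpreter': 1.08} for these made-up ratios.
print(per_category_speedup(
    {"go": 1.12, "hexiom": 1.14, "chaos": 0.98},
    {"interpreter": ["go", "hexiom", "chaos"]},
))
```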
mdboom commented 8 months ago

Drilling down on the individual benchmarks that are "heavily interpreter", these are the ones that get a lot slower with the JIT: unpack_sequence, hexiom, go, raytrace, chaos. Maybe looking at these more closely (with the pystats, perhaps) will reveal ways to improve the JIT.

brandtbucher commented 8 months ago

Did those get slower with the JIT, or tier two in general?

mdboom commented 8 months ago

This is all with the JIT vs. Tier 1.

Here are those same top-5 slowdowns (under the JIT), for both Tier 2 and the JIT:

| Benchmark | Tier 2 | JIT |
| --- | --- | --- |
| unpack_sequence | 1.02x | 1.89x |
| hexiom | 1.48x | 1.14x |
| go | 1.28x | 1.12x |
| raytrace | 1.30x | 1.09x |
| chaos | 1.27x | 1.09x |

So interestingly, unpack_sequence is the only one in this bunch where the JIT makes it slower. Yet another reason that this benchmark is a weird outlier. It does seem like detecting whatever causes it to get slower (too much code, maybe?) and bailing out of the JIT in that case might make sense.

For completeness, I did a full comparison of JIT vs Tier 2 (not something we usually generate), and the only interpreter-heavy benchmark where the JIT actually makes things slower than Tier 2 is unpack_sequence.

Comparison of JIT to Tier 2 ![image](https://github.com/faster-cpython/ideas/assets/38294/41bc8fc2-07b5-40c3-a8b7-ec336665c1d1)
mdboom commented 8 months ago

Another interesting observation: the go and hexiom benchmarks both spend about 1.5% of their time in uop_optimize. Maybe too many failed optimization attempts? We currently see 99.5% of optimization attempts getting rejected as too short, so maybe if we can do less work before bailing out on those we could move the needle...

Fidget-Spinner commented 8 months ago

We might also want to shrink the size of the abstract interpreter -- probably 1024 for the max size, and 3× the trace length for the arena, is more than sufficient for everything:

https://github.com/python/cpython/blob/c432df6d56f3e02530132321b47dcc7b914a3660/Include/internal/pycore_optimizer.h#L39

gvanrossum commented 8 months ago

@Fidget-Spinner

We might also want to shrink the size of the abstract interpreter

That's not really going to make it faster though, right? It saves some C stack space, but these buffers aren't all that big (by modern standards -- 30 years ago I'd be horrified :-) and we don't ever iterate over the entire buffer. (Or do we? In which case I'd take it back.)

@mdboom

Another interesting observation: the go and hexiom benchmarks both spend about 1.5% of their time in uop_optimize. Maybe too many failed optimization attempts? We currently see 99.5% of optimization attempts getting rejected as too short, so maybe if we can do less work before bailing out on those we could move the needle...

Optimization attempts that fail because the trace is too short don't take up much time though -- they only process a few bytecodes and skip all of the expensive work (peephole, abstract interpreter, executor construction). Is the 99.5% rejected across all benchmarks or specific to those two? I'd almost think that if a benchmark spends a lot of time in uop_optimize that points to very long traces in combination with some quadratic aspect of the optimizer.

mdboom commented 8 months ago

Is the 99.5% rejected across all benchmarks or specific to those two? I'd almost think that if a benchmark spends a lot of time in uop_optimize that points to very long traces in combination with some quadratic aspect of the optimizer.

The 99.5% rejected for being too short is across all benchmarks, but individually go and hexiom show that as well.

99.8% of go benchmark traces are between 8 and 16 instructions long. 96.7% of hexiom benchmark traces are between 16 and 32 instructions long. So these aren't especially long.

I'll see if I can get some more detail out of the profiler to see if there's anything in particular in uop_optimize to look at.

gvanrossum commented 8 months ago

Did these numbers change recently? I wonder if side-exits (a relatively new feature) might be involved. Maybe we should count optimization attempts from side-exits separately from those initiated by JUMP_BACKWARD?

Fidget-Spinner commented 8 months ago

@Fidget-Spinner

We might also want to shrink the size of the abstract interpreter

That's not really going to make it faster though, right? It saves some C stack space, but these buffers aren't all that big (by modern standards -- 30 years ago I'd be horrified :-) and we don't ever iterate over the entire buffer. (Or do we? In which case I'd take it back.)

I had a talk with @brandtbucher and we thought that it might be possible that it's making stack-spilled variables extremely slow to access. Imagine having rbp+0xbig just to get a variable.

Also, Brandt mentioned that the small helper functions we have in the uops symbols may not be getting inlined, because they are in a separate compilation unit, so they won't automatically be inlined unless LTO is on (and even then, will the compiler always do it?). Maybe we want to mark them static inline and put them in a header file to hint to the C compiler?

mdboom commented 8 months ago

Maybe we want to mark them static inline and put them in a header file to hint to the C compiler?

That seems worth the effort to at least try.

I wonder if side-exits (a relatively new feature) might be involved. Maybe we should count optimization attempts from side-exits separately from those initiated by JUMP_BACKWARD?

The recent bug we had in the stats was that the side-exit optimization attempts were not being counted (but all of the various error and success cases were). From that, we already know that the side-exit machinery creates orders of magnitude more optimization attempts (1,500 traces vs. 1,267,520 side exits). As you say, that's all fine if most of them are short and bail out quickly, but it is still a lot of work that ends up being thrown away; it would help if there were a faster way to know we don't need to do it (I'm not sure there is...)

gvanrossum commented 8 months ago

I looked at hexiom in detail with uops debugging on, and it looks like the optimizer keeps creating executors for line 274:

            possible = sum((1 if v in cell else 0) for cell in done.cells)

That's a generator expression called via C (sum()). Maybe something's wrong there?

The message I see repeatedly is

Created a trace for constraint_pass (/Users/guido/pyperformance/pyperformance/data-files/benchmarks/bm_hexiom/run_benchmark.py:237) at byte offset 816 -- length 29

Looking at the disassembly incorporated into the executor, I think it is this:

 274   L19:  816       LOAD_GLOBAL_BUILTIN     27 (sum + NULL)
             826       LOAD_FAST               19 (v)
             828       BUILD_TUPLE              1
             830       LOAD_CONST               7 (<code object <genexpr> at 0x10350bac0, file "/Users/guido/pyperformance/pyperformance/data-files/benchmarks/bm_hexiom/run_benchmark.py", line 274>)
             832       MAKE_FUNCTION
             834       SET_FUNCTION_ATTRIBUTE   8 (closure)
             836       LOAD_FAST                4 (done)
             838       LOAD_ATTR_INSTANCE_VALUE 20 (cells)
             858       GET_ITER
             860       CALL_PY_EXACT_ARGS       0

This is not the top of a loop, so it must be a side exit. Indeed, some more context shows:

    for v in range(8):
        # If there is none, remove the possibility from all tiles
        if (pos.tiles[v] > 0) and (left[v] == 0):
            if done.remove_unfixed(v):
                changed = True
        else:
            possible = sum((1 if v in cell else 0) for cell in done.cells)    # <----------- HERE

After turning on the right debug variable (PYTHON_OPT_DEBUG=3) I found the culprit:

Abstract interpreting _INIT_CALL_PY_EXACT_ARGS:0 Encountered error in abstract interpreter

Maybe @Fidget-Spinner can diagnose this part further?

There's more though. We should fail more loudly in this case -- optimize_uops() silently returns 0, then _Py_uop_analyze_and_optimize() also returns 0, then uop_optimize() also returns 0, and _PyOptimizer_Optimize() ditto. But... since we know a side exit is involved, I looked at the logic for _COLD_EXIT in bytecodes.c, where I find these two lines that I don't follow:

                    int32_t new_temp = -1 * tstate->interp->optimizer_side_threshold;
                    exit->temperature = (new_temp < INT16_MIN) ? INT16_MIN : new_temp;

Maybe the temperature calculation is wrong? @markshannon There definitely seems to be something that needs to be tuned better in case the abstract interpreter bails.

gvanrossum commented 8 months ago

PS. The "Created a trace for ..." debug message does not indicate that an executor was successfully created. It only means that the initial step of that process, trace projection, succeeded with a not-too-short trace. We then do various other things which may fail (-1) or bail (0) before constructing the final executor. We should probably add level 1 or level 2 debug messages for the various possible bail points. (And these should get the level from PYTHON_LLTRACE, not from PYTHON_OPT_DEBUG.)

Fidget-Spinner commented 8 months ago

Maybe @Fidget-Spinner can diagnose this part further?

The only real reason _INIT_CALL_PY_EXACT_ARGS can fail is that the function object is not present/set during the trace https://github.com/python/cpython/blob/main/Python/optimizer_bytecodes.c#L548. This can only happen if the function version isn't in the cache https://github.com/python/cpython/blob/c432df6d56f3e02530132321b47dcc7b914a3660/Python/optimizer.c#L740.

In general, function versions are shared across code objects, with some checks. However, I don't know if that works for generator expressions. It might be that a brand-new function version is used every single time, causing it to deopt out of there.

We should just tell tier 2 to never optimize that trace.

mdboom commented 8 months ago

There were some stats about the new optimizer that weren't added to the output tables. Once I added them and ran it locally, I see that out of 2,674,828 optimizer attempts, only 92,441 (3.5%) are successful. I don't know if that's just expected at this point. The go and hexiom benchmarks are good examples of ones that have an even lower success rate. I don't know if that's helpful, @Fidget-Spinner, but I thought I'd share.

gvanrossum commented 8 months ago

Some more lab notes:

For a smaller demo, one of the issues is that we can't translate YIELD_VALUE to Tier 2. In a toy function that repeatedly calls a small generator expression, e.g.

def testfunc():
    for i in range(20):
        sum(i for i in range(20))

the following seems to happen: the optimizer repeatedly tries to build a trace for the call into the generator expression, the attempt fails because YIELD_VALUE can't be translated, and nothing raises the threshold after a failure, so the whole attempt is repeated almost immediately.

It seems that we should implement some kind of exponential back-off to avoid such scenarios.

This is not actually unique to generator expressions -- I can reproduce it with a generator function too:

def generator():
    for i in range(20):
        yield i

def testfunc():
    for i in range(20):
        x = sum(generator())
    return x

Are there really only 5 bits for the temperature? Or do we have a chance to implement exponential backoff here?

Another approach might be that if we know for sure that a trace can't be produced for a given side exit (like in this case), we generate a different exit instruction that uses DEOPT_IF(1) instead of SIDE_EXIT(1). I'm not sure how to reliably decide this, though.
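To make the back-off idea concrete, here is a toy Python sketch of what exponential back-off for a side exit could look like; the class, constants, and `try_optimize` hook are made up for illustration and don't reflect CPython's actual counter layout:

```python
class SideExitCounter:
    """Toy model of a side exit whose optimization threshold backs off.

    Each time the exit gets hot we try to build an executor; if that fails
    (e.g. the trace hits YIELD_VALUE), we double the number of hits needed
    before the next attempt instead of retrying at a fixed interval.
    """

    INITIAL_THRESHOLD = 16
    MAX_THRESHOLD = 1 << 14  # cap so the value still fits in a small field

    def __init__(self):
        self.threshold = self.INITIAL_THRESHOLD
        self.hits = 0

    def on_exit_taken(self, try_optimize):
        self.hits += 1
        if self.hits < self.threshold:
            return None
        self.hits = 0
        executor = try_optimize()  # returns None on failure
        if executor is None:
            # Back off exponentially so repeated failures stop burning
            # time in the optimizer.
            self.threshold = min(self.threshold * 2, self.MAX_THRESHOLD)
        return executor
```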

gvanrossum commented 8 months ago

The other thing I learned is that, indeed, somehow the generator expression's function object isn't in the function version cache. This seems to happen more for generator expressions, but I can reproduce it with nested generator functions too. It seems that frequently calling LOAD_CONST; MAKE_FUNCTION somehow evicts the original cache entry. However, for generator expressions it's worse -- these never seem to even enter the cache.

gvanrossum commented 8 months ago

out of 2,674,828 optimizer attempts, only 92,441 (3.5%) are successful

That's definitely not expected. I am on the trail of several issues (see my messages above).

gvanrossum commented 8 months ago

Oh. The difference between generator expressions and generator functions is that a generator function (the way I wrote the code, anyway) is stored in a local, and thus has a lifetime that overlaps with the previous incarnation of the same function. And the reason this matters is that when a function object is deallocated, it is evicted from the version cache.

When using a generator function, the sequence of bytecode is

LOAD_CONST
MAKE_FUNCTION
STORE_FAST

and the function object thus created stays alive in the fast locals until the next iteration.

Here, MAKE_FUNCTION calls _PyFunction_SetVersion() which puts the function object in the cache. The STORE_FAST decrefs the previous version of the function (created by the previous iteration), which attempts to evict that previous version from the version cache. However, since the cache was already overwritten with the new function object, it doesn't actually evict anything.

When projecting the trace, looking up the version number in the cache will then produce the correct function object, and all is well. (Except... At some later point it is also evicted. I don't understand that yet.)

OTOH, for a generator expression, the code sequence is

LOAD_CONST
MAKE_FUNCTION

i.e., the freshly created function object isn't stored in a local variable, just on the stack. When that stack entry is eventually popped, the function object is decrefed, which evicts it from the version cache, successfully.

When projecting the trace, we therefore find nothing in the cache!
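A small sketch that makes the difference visible with `dis` (the two helper functions are made up just to show the contrasting instruction sequences):

```python
import dis

def uses_genexpr(values):
    # The genexpr's function object lives only on the stack:
    # LOAD_CONST <code>; MAKE_FUNCTION; ...; GET_ITER; CALL
    return sum(v * 2 for v in values)

def uses_nested_def(values):
    def gen(vs):
        for v in vs:
            yield v * 2
    # Here the sequence ends with STORE_FAST, so the freshly created
    # function object stays alive in the fast locals.
    return sum(gen(values))

dis.dis(uses_genexpr)
dis.dis(uses_nested_def)
```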

I'm not sure what to do about this. We may have to talk about it in person Monday. Could we perhaps make the version cache store code objects instead of function objects? The trace projection process ultimately needs a code object anyway. (But I understand there's a subtle issue of changes to the function object's __code__ attribute.)

Separately, there's the mystery that after some number of successful iterations, the variant using a generator function also ends up finding nothing in the cache. I haven't looked into this yet -- I thought the issue with generator expressions was more pressing. (EDIT: See next comment; this is explainable.)

In any case I believe that the things I've found so far (generator expressions, and the lack of exponential backoff for side exits) explain the problem with the "hexiom" benchmark spending a lot of time in uop_optimize, and I suspect it's similar for "go", and possibly for the rest.

gvanrossum commented 8 months ago

It looks like the issue with generator functions finding nothing in the cache is separate -- it's a loop in the module being run through the Tier 2 optimizer, tracing into testfunc(), and attempting to trace into generator(), but at that point there is no local variable keeping the generator function alive, so the cache miss is legitimate.

gvanrossum commented 8 months ago

@mdboom, this issue is yours again -- I have created various cpython issues to deal with the things I found.