A disassembler - Githubissues

gvanrossum commented 3 years ago

If we're going to be specializing bytecode, at some point we're going to need to fix up jumps. Dino's Shadowcode (see #3) solves this by only ever replacing a single opcode with another, inserting NOPs when the new opcode is shorter, so all jump targets are unchanged. But I expect that at some point we're going to need to replace a single opcode with several others.
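To make the trade-off concrete, here is a minimal toy sketch, not CPython's real bytecode format. It assumes instructions are `(opname, arg)` tuples and that jump arguments are absolute instruction indices. The helper names `replace_padded` and `replace_expanding`, and the `FUSED_LOAD_RETURN` opcode, are hypothetical. The padded variant leaves every jump target alone, while the expanding variant has to shift them.

```python
NOP = ("NOP", None)


def replace_padded(code, index, old_len, new_ops):
    """Overwrite old_len slots in place, padding with NOPs (Shadowcode-style)."""
    if len(new_ops) > old_len:
        raise ValueError("replacement does not fit in the original slots")
    padding = [NOP] * (old_len - len(new_ops))
    return code[:index] + list(new_ops) + padding + code[index + old_len:]


def replace_expanding(code, index, old_len, new_ops):
    """Splice in a sequence of any length and shift later jump targets."""
    delta = len(new_ops) - old_len

    def fix(instr):
        op, arg = instr
        # Only targets past the replaced span move; earlier ones stay put.
        if op.startswith("JUMP") and arg >= index + old_len:
            return (op, arg + delta)
        return instr

    head = [fix(i) for i in code[:index]]
    tail = [fix(i) for i in code[index + old_len:]]
    return head + list(new_ops) + tail


code = [
    ("LOAD_FAST", 0),
    ("JUMP_IF_FALSE", 4),   # jumps to the LOAD_CONST below
    ("LOAD_ATTR", 1),
    ("RETURN_VALUE", None),
    ("LOAD_CONST", 0),
    ("RETURN_VALUE", None),
]

# Same footprint: two slots become one instruction plus a NOP.
padded = replace_padded(code, 2, 2, [("FUSED_LOAD_RETURN", 1)])

# Larger footprint: one slot becomes two instructions.
expanded = replace_expanding(code, 2, 1, [("TYPEGUARD", 1), ("LOAD_ATTR_SLOT", 1)])
```

In `padded` the jump still points at index 4; in `expanded` it has to move to index 5 to keep landing on `LOAD_CONST`.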

So I did some work on disassembling bytecode back into the data structures used by the compiler/assembler. Given a bytecode array you'll get a series of basic blocks back, which you can modify (e.g. insert or delete instructions or add new basic blocks), and then present them to the assembler which will construct a new bytecode object for you, with jumps fixed.
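The branch does this in C, but the disassembly half can be approximated in pure Python with the standard `dis` module. The sketch below is only an illustration under simplifying assumptions: the function name `basic_blocks` is made up, a block is ended only by jump instructions and jump targets, and returns, raises and exception handlers are ignored.

```python
import dis


def basic_blocks(func):
    """Split a function's bytecode into lists of instructions (basic blocks)."""
    instructions = list(dis.get_instructions(func))
    jump_opcodes = set(dis.hasjrel) | set(dis.hasjabs)

    # A "leader" starts a new block: the first instruction, any jump
    # target, and whatever follows a jump instruction.
    leaders = {instructions[0].offset}
    for i, ins in enumerate(instructions):
        if ins.is_jump_target:
            leaders.add(ins.offset)
        if ins.opcode in jump_opcodes and i + 1 < len(instructions):
            leaders.add(instructions[i + 1].offset)

    blocks = []
    for ins in instructions:
        if ins.offset in leaders:
            blocks.append([])
        blocks[-1].append(ins)
    return blocks


def sample(x):
    if x:
        return 1
    return 2


for number, block in enumerate(basic_blocks(sample)):
    print(f"block {number}: {[ins.opname for ins in block]}")
```

Because the instruction after every jump starts a new block, a jump can only ever be the last instruction of a block in this representation.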

As a demonstration I added some opcodes for indexing slots (LOAD_ATTR_SLOT and STORE_ATTR_SLOT -- the latter was added by @ericsnowcurrently), but I suspect that the true value here is in the infrastructure for disassembling. (That particular optimization is questionable, since we already have an inline cache for slots in LOAD_ATTR.)
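For context, a slot attribute lives behind a `member_descriptor` on the class, and ordinary attribute access compiles to a generic `LOAD_ATTR`. The small sketch below shows both; the `Point` class and `get_x` function are made up for illustration. Presumably an opcode like LOAD_ATTR_SLOT would skip the generic lookup and read the slot directly.

```python
import dis


class Point:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


def get_x(p):
    return p.x


# The slot is exposed on the class as a descriptor object.
slot = Point.__dict__["x"]
print(type(slot).__name__)

# Unspecialized bytecode uses the generic attribute-load opcode.
print([ins.opname for ins in dis.get_instructions(get_x)])
```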

There are some unresolved issues, e.g. I assume that the constants table remains the same (and there's an implicit assumption that None is always present). We could probably refactor more of the assembler to check this.

ericsnowcurrently commented 3 years ago

Some notes about where things are at:

- the branch is looking pretty good (passes the test suite with a few tweaks)
- there may be a problem with reduce() for PyCodeObject and PyFrameObject (noticed via test_multiprocessing)
- co_linetable is still broken
- we need to validate the other co_* fields after optimizing
- generators caused problems (and may impact the opcache too)

I should have some rough benchmark numbers up tomorrow.

gvanrossum commented 3 years ago

Thanks for cleaning up some of my technical debt! After discussing things with Mark, I think this is probably not something we should pursue further in the short term (though I'd like to see those benchmark numbers!). Some issues:

- The basic block data structure is inefficient (example: only the last instruction can have a jump)
- The TYPEGUARD instruction makes things unsafe -- some joker could edit the class affecting e.g. LOAD_ATTR
- You can do a lot of bytecode rewriting without ever needing to fix jump offsets (e.g. super-instructions, and shadowcode #3)
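The TYPEGUARD concern can be illustrated in plain Python. The sketch assumes the guard only checks the identity of the object's type, which is a guess at how the instruction behaved on the branch; the `Point` class and variable names are hypothetical. After the class is edited, the guard still passes but a cached slot read no longer matches what a normal attribute lookup returns.

```python
class Point:
    __slots__ = ("x",)

    def __init__(self, x):
        self.x = x


p = Point(1)

# What a specialization might remember: the type and its slot descriptor.
guarded_type = type(p)
cached_slot = Point.__dict__["x"]

# "Some joker" edits the class after the code was specialized.
Point.x = property(lambda self: 42)

guard_passes = type(p) is guarded_type     # the type object is unchanged
fast_path = cached_slot.__get__(p, Point)  # reads the raw slot
real_lookup = p.x                          # goes through the new property

print(guard_passes, fast_path, real_lookup)
```

A safe guard would also need to notice that the class itself was modified, not just that the instance has the expected type.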

ericsnowcurrently commented 3 years ago

Here are the results of the benchmark run. Note that I ran it in a Linux VM on Windows, so the numbers should be considered very rough.

results-disassembler.tar.gz


```
2021-03-01_16-48-master-cc12888f9b4b.json.gz
============================================

Performance version: 1.0.2
Report on Linux-5.4.0-45-generic-x86_64-with-glibc2.27
Number of logical CPUs: 8
Start date: 2021-03-10 17:22:12.886957
End date: 2021-03-10 17:43:15.516662

2021-03-10_23-49-disassembler-cc9ac6a352fb.json.gz
==================================================

Performance version: 1.0.2
Report on Linux-5.4.0-45-generic-x86_64-with-glibc2.27
Number of logical CPUs: 8
Start date: 2021-03-11 10:27:09.280900
End date: 2021-03-11 10:48:10.529382

+-------------------------+--------------+--------------+--------------+-----------------------+
| Benchmark               | master       | disassembler | Change       | Significance          |
+=========================+==============+==============+==============+=======================+
| 2to3                    | 405 ms       | 403 ms       | 1.00x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| chameleon               | 11.7 ms      | 11.9 ms      | 1.02x slower | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| chaos                   | 135 ms       | 141 ms       | 1.04x slower | Significant (t=-5.33) |
+-------------------------+--------------+--------------+--------------+-----------------------+
| crypto_pyaes            | 148 ms       | 148 ms       | 1.00x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| deltablue               | 10.7 ms      | 10.4 ms      | 1.03x faster | Significant (t=4.58)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| django_template         | 76.6 ms      | 75.8 ms      | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| dulwich_log             | 87.8 ms      | 83.0 ms      | 1.06x faster | Significant (t=8.51)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| fannkuch                | 569 ms       | 546 ms       | 1.04x faster | Significant (t=5.10)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| float                   | 127 ms       | 118 ms       | 1.08x faster | Significant (t=14.91) |
+-------------------------+--------------+--------------+--------------+-----------------------+
| genshi_text             | 38.8 ms      | 37.8 ms      | 1.03x faster | Significant (t=2.83)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| genshi_xml              | 81.8 ms      | 82.2 ms      | 1.01x slower | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| go                      | 316 ms       | 308 ms       | 1.03x faster | Significant (t=4.99)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| hexiom                  | 12.5 ms      | 12.1 ms      | 1.03x faster | Significant (t=3.97)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| json_dumps              | 16.6 ms      | 16.4 ms      | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| json_loads              | 32.1 us      | 31.7 us      | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| logging_format          | 14.2 us      | 14.0 us      | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| logging_silent          | 254 ns       | 270 ns       | 1.06x slower | Significant (t=-3.89) |
+-------------------------+--------------+--------------+--------------+-----------------------+
| logging_simple          | 12.8 us      | 12.6 us      | 1.02x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| mako                    | 18.8 ms      | 18.8 ms      | 1.00x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| meteor_contest          | 123 ms       | 120 ms       | 1.03x faster | Significant (t=6.90)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| nbody                   | 162 ms       | 162 ms       | 1.00x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| nqueens                 | 126 ms       | 123 ms       | 1.02x faster | Significant (t=3.98)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| pathlib                 | 25.8 ms      | 26.0 ms      | 1.01x slower | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| pickle                  | 12.7 us      | 12.7 us      | 1.00x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| pickle_dict             | 27.2 us      | 27.1 us      | 1.00x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| pickle_list             | 4.25 us      | 4.26 us      | 1.00x slower | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| pickle_pure_python      | 657 us       | 655 us       | 1.00x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| pidigits                | 188 ms       | 186 ms       | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| pyflate                 | 847 ms       | 839 ms       | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| python_startup          | 10.3 ms      | 10.3 ms      | 1.00x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| python_startup_no_site  | 6.81 ms      | 6.82 ms      | 1.00x slower | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| raytrace                | 733 ms       | 723 ms       | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| regex_compile           | 231 ms       | 227 ms       | 1.02x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| regex_dna               | 193 ms       | 192 ms       | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| regex_effbot            | 3.01 ms      | 2.92 ms      | 1.03x faster | Significant (t=9.16)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| regex_v8                | 25.8 ms      | 25.4 ms      | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| richards                | 111 ms       | 111 ms       | 1.01x slower | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| scimark_fft             | 485 ms       | 481 ms       | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| scimark_lu              | 198 ms       | 203 ms       | 1.03x slower | Significant (t=-5.46) |
+-------------------------+--------------+--------------+--------------+-----------------------+
| scimark_monte_carlo     | 128 ms       | 131 ms       | 1.02x slower | Significant (t=-5.16) |
+-------------------------+--------------+--------------+--------------+-----------------------+
| scimark_sor             | 265 ms       | 253 ms       | 1.05x faster | Significant (t=6.29)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| scimark_sparse_mat_mult | 6.77 ms      | 6.72 ms      | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| spectral_norm           | 172 ms       | 169 ms       | 1.02x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| sqlalchemy_declarative  | 199 ms       | 196 ms       | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| sqlalchemy_imperative   | 36.7 ms      | 35.4 ms      | 1.04x faster | Significant (t=4.90)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| sqlite_synth            | 3.06 us      | 3.13 us      | 1.02x slower | Significant (t=-4.64) |
+-------------------------+--------------+--------------+--------------+-----------------------+
| sympy_expand            | 665 ms       | 635 ms       | 1.05x faster | Significant (t=9.99)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| sympy_integrate         | 31.7 ms      | 30.5 ms      | 1.04x faster | Significant (t=8.38)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| sympy_str               | 435 ms       | 391 ms       | 1.11x faster | Significant (t=5.78)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| sympy_sum               | 265 ms       | 256 ms       | 1.03x faster | Significant (t=4.68)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| telco                   | 8.04 ms      | 7.78 ms      | 1.03x faster | Significant (t=5.59)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| tornado_http            | 242 ms       | 248 ms       | 1.03x slower | Significant (t=-2.72) |
+-------------------------+--------------+--------------+--------------+-----------------------+
| unpack_sequence         | 63.4 ns      | 62.8 ns      | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| unpickle                | 19.7 us      | 20.6 us      | 1.04x slower | Significant (t=-3.83) |
+-------------------------+--------------+--------------+--------------+-----------------------+
| unpickle_list           | 5.33 us      | 5.23 us      | 1.02x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| unpickle_pure_python    | 430 us       | 424 us       | 1.01x faster | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| xml_etree_generate      | 119 ms       | 114 ms       | 1.04x faster | Significant (t=8.18)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| xml_etree_iterparse     | 125 ms       | 126 ms       | 1.01x slower | Not significant       |
+-------------------------+--------------+--------------+--------------+-----------------------+
| xml_etree_parse         | 170 ms       | 165 ms       | 1.03x faster | Significant (t=2.69)  |
+-------------------------+--------------+--------------+--------------+-----------------------+
| xml_etree_process       | 99.9 ms      | 95.6 ms      | 1.05x faster | Significant (t=12.17) |
+-------------------------+--------------+--------------+--------------+-----------------------+
```

For the most part the disassembler branch improved performance, though a few benchmarks were slower, and the mean was slightly faster. However, I don't trust the results much given that I used a VM: both python3 -m pyperf stats and python3 -m pyperf hist show a relatively high standard deviation. The nice thing is that there were no massive outliers and the outcome wasn't clearly negative. 🙂
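One common way to summarize per-benchmark ratios is a geometric mean of the speedups; the comment does not say which mean was used, so treat this as an assumption. The sketch below uses only four rows copied from the table (two faster, two slower), so its result says nothing about the suite as a whole.

```python
from statistics import geometric_mean

# (master, disassembler) timings copied from four rows of the table.
# Each pair shares a unit, so the units cancel in the ratio.
timings = {
    "float": (127, 118),           # ms
    "sympy_str": (435, 391),       # ms
    "chaos": (135, 141),           # ms
    "logging_silent": (254, 270),  # ns
}

# A ratio above 1.0 means the disassembler branch was faster.
speedups = {name: master / branch for name, (master, branch) in timings.items()}
overall = geometric_mean(list(speedups.values()))

print(f"geometric mean speedup over {len(speedups)} benchmarks: {overall:.3f}x")
```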

The directory structure I used is the following:

./cpython-perf/
    cpython/  # repo
        .perf/  # data dir for benchmark runs
    pyperformance/  # repo
    results-disassembler/
    ms-perf.ini

The config file looks like this:

ms-perf.ini

```ini
[config]
# Directory where JSON files are written.
# - uploaded files are moved to json_dir/uploaded/
# - results of patched Python are written into json_dir/patch/
json_dir = ~/cpython-perf/cpython/.perf

# If True, compile CPython is debug mode (LTO and PGO disabled),
# run benchmarks with --debug-single-sample, and disable upload.
#
# Use this option used to quickly test a configuration.
debug = False

# Name of the Git remote, used to create revision of
# the Git branch. For example, use revision 'remotes/origin/3.6'
# for the branch '3.6'.
#git_remote = remotes/origin
git_remote = remotes/ericsnowcurrently


[scm]
# Directory of CPython source code (Git repository)
repo_dir = ~/cpython-perf/cpython

# Update the Git repository (git fetch)?
update = False
#update = True

# Name of the Git remote, used to create revision of
# the Git branch. For example, use revision 'remotes/origin/3.6'
# for the branch '3.6'.
#git_remote = remotes/origin
git_remote = remotes/ericsnowcurrently


[compile]
# Create files into bench_dir:
# - bench_dir/bench-xxx.log
# - bench_dir/prefix/: where Python is installed
# - bench_dir/venv/: Virtual environment used by pyperformance
bench_dir = ~/cpython-perf/cpython/.perf/bench_tmpdir

# Link Time Optimization (LTO)?
lto = False
#lto = True

# Profiled Guided Optimization (PGO)?
pgo = False
#pgo = True

# The space-separated list of libraries that are package-only,
# i.e., locally installed but not on header and library paths.
# For each such library, determine the install path and add an
# appropriate subpath to CFLAGS and LDFLAGS declarations passed
# to configure. As an exception, the prefix for openssl, if that
# library is present here, is passed via the --with-openssl
# option. Currently, this only works with Homebrew on macOS.
# If running on macOS with Homebrew, you probably want to use:
#   pkg_only = openssl readline sqlite3 xz zlib
# The version of zlib shipping with macOS probably works as well,
# as long as Apple's SDK headers are installed.
pkg_only =

# Install Python? If false, run Python from the build directory
#
# WARNING: Running Python from the build directory introduces subtle changes
# compared to running an installed Python. Moreover, creating a virtual
# environment using a Python run from the build directory fails in many cases,
# especially on Python older than 3.4. Only disable installation if you
# really understand what you are doing!
install = True


[run_benchmark]
# Run "sudo python3 -m pyperf system tune" before running benchmarks?
system_tune = False
#system_tune = True

# --benchmarks option for 'pyperformance run'
benchmarks =

# --affinity option for 'pyperf system tune' and 'pyperformance run'
affinity =

# Upload generated JSON file?
#
# Upload is disabled on patched Python, in debug mode or if install is
# disabled.
upload = False


# Configuration to upload results to a Codespeed website
[upload]
url =
environment =
executable =
project =


[compile_all]
# List of CPython Git branches
branches = default 3.6 3.5 2.7


# List of revisions to benchmark by compile_all
[compile_all_revisions]
# list of 'sha1=' (default branch: 'master') or 'sha1=branch'
# used by the "pyperformance compile_all" command
```

To run the suite I did the following:

cd pyperformance
py3 -m pyperformance compile ../ms-perf.ini cc12888f9b master
mv ../cpython/.perf/*.json.gz ../results-disassembler
py3 -m pyperformance compile ../ms-perf.ini disassembler
mv ../cpython/.perf/*.json.gz ../results-disassembler
py3 -m pyperformance compare --csv ../disassembler-delta.csv ../results-disassembler/*.json.gz
py3 -m pyperformance compare -O table ../results-disassembler/*.json.gz

gvanrossum commented 3 years ago

Thanks! I agree that the variance is a bit high to put much faith in these. One conclusion is that the existing LOAD_ATTR optimization for slots isn't so bad! Another is that probably most of the benchmark suite doesn't use slots a lot. The one that I know is sensitive to slots performance ("float") also enjoyed the highest speedup, validating that we're indeed seeing the effect of the work on slots in this branch.

faster-cpython / ideas

A disassembler #8