faster-cpython / ideas


Is `PREDICT()` worth it any more? #496

Closed: gvanrossum closed this issue 1 year ago

gvanrossum commented 1 year ago

The PREDICT() macro is already a no-op on compilers with computed GOTO. So it's only used on Windows (and a few esoteric platforms). Is it worth the bother? We should ideally benchmark this. In the past it's been claimed to have substantial benefits.
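
For reference, a simplified sketch of what the macro pair looks like (paraphrased from memory, not the literal ceval.c text; details like `INSTRUCTION_START` and the stats hooks are elided): on computed-goto builds `PREDICT()` expands to nothing, while on switch-based builds it peeks at the next instruction and, if it matches the predicted opcode, jumps straight to that opcode's `PREDICTED()` label instead of going back through the dispatch switch.

```
/* Simplified sketch, not the exact ceval.c source. */
#define PREDICT_ID(op)          PRED_##op

#if USE_COMPUTED_GOTOS
#define PREDICT(op)             ((void)0)   /* computed gotos already jump directly */
#else
#define PREDICT(op) \
    do { \
        _Py_CODEUNIT word = *next_instr; \
        opcode = _Py_OPCODE(word); \
        if (opcode == op) { \
            oparg = _Py_OPARG(word); \
            next_instr++;   /* the real macro also updates stats, etc. */ \
            goto PREDICT_ID(op); \
        } \
    } while (0)
#endif

/* Placed at the top of the predicted opcode's handler: */
#define PREDICTED(op)           PREDICT_ID(op):
```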

brandtbucher commented 1 year ago

I've wondered the same thing. We could benchmark this on Linux pretty easily by disabling computed gotos and timing the results with and without PREDICT turned on. I imagine that would be very close to whatever the Windows speedup is (but without requiring a Windows benchmarking setup).

brandtbucher commented 1 year ago

Based on Linux builds on our benchmarking machine, disabling USE_COMPUTED_GOTOS makes things ~1% slower (I thought it would be more, but it could be that our new superinstructions do most of the heavy lifting in reducing this part of the dispatch overhead):

```
Slower (45):
- logging_format: 6.25 us +- 0.09 us -> 6.79 us +- 0.12 us: 1.09x slower
- logging_simple: 5.71 us +- 0.09 us -> 6.07 us +- 0.07 us: 1.06x slower
- genshi_xml: 46.1 ms +- 0.6 ms -> 48.8 ms +- 1.9 ms: 1.06x slower
- html5lib: 58.5 ms +- 2.6 ms -> 61.8 ms +- 2.7 ms: 1.06x slower
- deltablue: 3.19 ms +- 0.04 ms -> 3.37 ms +- 0.07 ms: 1.06x slower
- dulwich_log: 61.1 ms +- 0.4 ms -> 64.3 ms +- 0.8 ms: 1.05x slower
- scimark_lu: 104 ms +- 2 ms -> 110 ms +- 2 ms: 1.05x slower
- regex_compile: 128 ms +- 1 ms -> 135 ms +- 2 ms: 1.05x slower
- raytrace: 278 ms +- 4 ms -> 292 ms +- 9 ms: 1.05x slower
- deepcopy: 324 us +- 2 us -> 338 us +- 3 us: 1.05x slower
- pidigits: 189 ms +- 0 ms -> 198 ms +- 0 ms: 1.04x slower
- tornado_http: 92.9 ms +- 1.4 ms -> 97.1 ms +- 1.4 ms: 1.04x slower
- sympy_str: 282 ms +- 3 ms -> 294 ms +- 3 ms: 1.04x slower
- sqlglot_normalize: 105 ms +- 1 ms -> 109 ms +- 1 ms: 1.04x slower
- sympy_expand: 459 ms +- 5 ms -> 479 ms +- 3 ms: 1.04x slower
- unpickle: 12.9 us +- 0.1 us -> 13.4 us +- 1.4 us: 1.04x slower
- sympy_sum: 164 ms +- 2 ms -> 169 ms +- 2 ms: 1.03x slower
- gunicorn: 1.07 ms +- 0.01 ms -> 1.11 ms +- 0.01 ms: 1.03x slower
- sqlglot_optimize: 50.4 ms +- 0.2 ms -> 52.1 ms +- 0.2 ms: 1.03x slower
- sqlglot_parse: 1.32 ms +- 0.01 ms -> 1.36 ms +- 0.01 ms: 1.03x slower
- django_template: 32.3 ms +- 0.3 ms -> 33.3 ms +- 0.4 ms: 1.03x slower
- unpickle_pure_python: 201 us +- 2 us -> 207 us +- 2 us: 1.03x slower
- sqlglot_transpile: 1.61 ms +- 0.02 ms -> 1.66 ms +- 0.02 ms: 1.03x slower
- pprint_safe_repr: 672 ms +- 6 ms -> 691 ms +- 7 ms: 1.03x slower
- pprint_pformat: 1.38 sec +- 0.01 sec -> 1.41 sec +- 0.01 sec: 1.03x slower
- aiohttp: 994 us +- 9 us -> 1.02 ms +- 0.01 ms: 1.03x slower
- sympy_integrate: 20.4 ms +- 0.1 ms -> 20.9 ms +- 0.2 ms: 1.02x slower
- coroutines: 25.0 ms +- 0.2 ms -> 25.5 ms +- 0.5 ms: 1.02x slower
- scimark_sor: 103 ms +- 1 ms -> 105 ms +- 1 ms: 1.02x slower
- 2to3: 246 ms +- 1 ms -> 250 ms +- 1 ms: 1.02x slower
- chaos: 66.9 ms +- 0.8 ms -> 68.0 ms +- 0.6 ms: 1.02x slower
- scimark_fft: 309 ms +- 3 ms -> 314 ms +- 3 ms: 1.01x slower
- pickle_pure_python: 284 us +- 4 us -> 288 us +- 2 us: 1.01x slower
- deepcopy_memo: 33.7 us +- 0.4 us -> 34.2 us +- 0.3 us: 1.01x slower
- float: 73.0 ms +- 1.0 ms -> 74.0 ms +- 0.9 ms: 1.01x slower
- deepcopy_reduce: 2.90 us +- 0.04 us -> 2.94 us +- 0.04 us: 1.01x slower
- json_loads: 24.1 us +- 0.9 us -> 24.3 us +- 0.3 us: 1.01x slower
- xml_etree_process: 53.0 ms +- 0.7 ms -> 53.6 ms +- 0.4 ms: 1.01x slower
- unpickle_list: 4.92 us +- 0.10 us -> 4.97 us +- 0.04 us: 1.01x slower
- xml_etree_generate: 76.8 ms +- 0.7 ms -> 77.6 ms +- 0.7 ms: 1.01x slower
- xml_etree_iterparse: 102 ms +- 2 ms -> 103 ms +- 2 ms: 1.01x slower
- python_startup_no_site: 6.23 ms +- 0.01 ms -> 6.28 ms +- 0.01 ms: 1.01x slower
- python_startup: 8.56 ms +- 0.01 ms -> 8.62 ms +- 0.01 ms: 1.01x slower
- coverage: 98.2 ms +- 1.3 ms -> 98.9 ms +- 1.3 ms: 1.01x slower
- hexiom: 6.13 ms +- 0.03 ms -> 6.16 ms +- 0.05 ms: 1.01x slower

Faster (22):
- nbody: 94.7 ms +- 2.3 ms -> 88.6 ms +- 1.5 ms: 1.07x faster
- pycparser: 1.13 sec +- 0.02 sec -> 1.09 sec +- 0.02 sec: 1.04x faster
- pathlib: 18.0 ms +- 0.3 ms -> 17.4 ms +- 0.3 ms: 1.03x faster
- mdp: 2.65 sec +- 0.02 sec -> 2.57 sec +- 0.01 sec: 1.03x faster
- regex_v8: 22.2 ms +- 0.3 ms -> 21.6 ms +- 0.3 ms: 1.03x faster
- pickle_list: 4.16 us +- 0.06 us -> 4.06 us +- 0.04 us: 1.03x faster
- pickle: 10.3 us +- 0.1 us -> 10.0 us +- 0.1 us: 1.03x faster
- generators: 76.4 ms +- 0.7 ms -> 74.6 ms +- 0.5 ms: 1.02x faster
- genshi_text: 21.1 ms +- 0.3 ms -> 20.7 ms +- 0.3 ms: 1.02x faster
- pyflate: 402 ms +- 7 ms -> 393 ms +- 4 ms: 1.02x faster
- chameleon: 6.50 ms +- 0.07 ms -> 6.38 ms +- 0.07 ms: 1.02x faster
- scimark_sparse_mat_mult: 4.12 ms +- 0.08 ms -> 4.04 ms +- 0.05 ms: 1.02x faster
- unpack_sequence: 48.7 ns +- 0.6 ns -> 48.0 ns +- 0.8 ns: 1.01x faster
- fannkuch: 380 ms +- 2 ms -> 375 ms +- 3 ms: 1.01x faster
- mako: 9.64 ms +- 0.07 ms -> 9.53 ms +- 0.06 ms: 1.01x faster
- logging_silent: 91.3 ns +- 0.9 ns -> 90.3 ns +- 1.5 ns: 1.01x faster
- richards: 41.7 ms +- 0.6 ms -> 41.3 ms +- 1.0 ms: 1.01x faster
- go: 134 ms +- 1 ms -> 132 ms +- 1 ms: 1.01x faster
- json_dumps: 9.48 ms +- 0.14 ms -> 9.40 ms +- 0.10 ms: 1.01x faster
- regex_effbot: 3.60 ms +- 0.02 ms -> 3.58 ms +- 0.02 ms: 1.01x faster
- scimark_monte_carlo: 65.6 ms +- 0.8 ms -> 65.3 ms +- 0.8 ms: 1.00x faster
- regex_dna: 203 ms +- 0 ms -> 202 ms +- 1 ms: 1.00x faster

Benchmark hidden because not significant (15): async_tree_none, async_tree_cpu_io_mixed, async_tree_io, async_tree_memoization, crypto_pyaes, json, meteor_contest, mypy, nqueens, pickle_dict, spectral_norm, sqlite_synth, telco, thrift, xml_etree_parse

Geometric mean: 1.01x slower
```

When USE_COMPUTED_GOTOS is turned off, disabling PREDICT doesn't appear to have any effect:

```
Slower (28):
- generators: 74.6 ms +- 0.5 ms -> 84.4 ms +- 0.3 ms: 1.13x slower
- scimark_sparse_mat_mult: 4.04 ms +- 0.05 ms -> 4.27 ms +- 0.06 ms: 1.06x slower
- mdp: 2.57 sec +- 0.01 sec -> 2.68 sec +- 0.02 sec: 1.04x slower
- telco: 6.30 ms +- 0.15 ms -> 6.54 ms +- 0.18 ms: 1.04x slower
- pickle_list: 4.06 us +- 0.04 us -> 4.19 us +- 0.06 us: 1.03x slower
- float: 74.0 ms +- 0.9 ms -> 76.3 ms +- 1.1 ms: 1.03x slower
- nbody: 88.6 ms +- 1.5 ms -> 91.3 ms +- 2.8 ms: 1.03x slower
- json: 4.70 ms +- 0.11 ms -> 4.84 ms +- 0.17 ms: 1.03x slower
- pickle: 10.0 us +- 0.1 us -> 10.3 us +- 0.1 us: 1.02x slower
- json_dumps: 9.40 ms +- 0.10 ms -> 9.56 ms +- 0.14 ms: 1.02x slower
- pickle_dict: 30.8 us +- 0.1 us -> 31.2 us +- 0.1 us: 1.01x slower
- hexiom: 6.16 ms +- 0.05 ms -> 6.24 ms +- 0.04 ms: 1.01x slower
- pathlib: 17.4 ms +- 0.3 ms -> 17.6 ms +- 0.3 ms: 1.01x slower
- deepcopy_memo: 34.2 us +- 0.3 us -> 34.5 us +- 0.6 us: 1.01x slower
- go: 132 ms +- 1 ms -> 134 ms +- 1 ms: 1.01x slower
- sqlglot_transpile: 1.66 ms +- 0.02 ms -> 1.67 ms +- 0.02 ms: 1.01x slower
- sqlglot_parse: 1.36 ms +- 0.01 ms -> 1.37 ms +- 0.02 ms: 1.01x slower
- json_loads: 24.3 us +- 0.3 us -> 24.6 us +- 0.2 us: 1.01x slower
- pyflate: 393 ms +- 4 ms -> 396 ms +- 4 ms: 1.01x slower
- sympy_expand: 479 ms +- 3 ms -> 482 ms +- 5 ms: 1.01x slower
- sqlite_synth: 2.60 us +- 0.04 us -> 2.62 us +- 0.04 us: 1.01x slower
- mako: 9.53 ms +- 0.06 ms -> 9.57 ms +- 0.04 ms: 1.00x slower
- nqueens: 81.9 ms +- 0.8 ms -> 82.3 ms +- 0.7 ms: 1.00x slower
- regex_v8: 21.6 ms +- 0.3 ms -> 21.7 ms +- 0.2 ms: 1.00x slower
- unpickle_pure_python: 207 us +- 2 us -> 208 us +- 2 us: 1.00x slower
- 2to3: 250 ms +- 1 ms -> 251 ms +- 1 ms: 1.00x slower
- sqlglot_optimize: 52.1 ms +- 0.2 ms -> 52.2 ms +- 0.3 ms: 1.00x slower
- regex_dna: 202 ms +- 1 ms -> 202 ms +- 1 ms: 1.00x slower

Faster (23):
- regex_effbot: 3.58 ms +- 0.02 ms -> 3.32 ms +- 0.02 ms: 1.08x faster
- unpack_sequence: 48.0 ns +- 0.8 ns -> 44.9 ns +- 0.6 ns: 1.07x faster
- pidigits: 198 ms +- 0 ms -> 189 ms +- 0 ms: 1.04x faster
- xml_etree_process: 53.6 ms +- 0.4 ms -> 52.8 ms +- 0.4 ms: 1.01x faster
- chaos: 68.0 ms +- 0.6 ms -> 67.1 ms +- 0.7 ms: 1.01x faster
- pprint_pformat: 1.41 sec +- 0.01 sec -> 1.40 sec +- 0.01 sec: 1.01x faster
- django_template: 33.3 ms +- 0.4 ms -> 32.9 ms +- 0.3 ms: 1.01x faster
- pycparser: 1.09 sec +- 0.02 sec -> 1.07 sec +- 0.02 sec: 1.01x faster
- thrift: 755 us +- 27 us -> 746 us +- 9 us: 1.01x faster
- scimark_sor: 105 ms +- 1 ms -> 104 ms +- 1 ms: 1.01x faster
- logging_format: 6.79 us +- 0.12 us -> 6.72 us +- 0.11 us: 1.01x faster
- scimark_fft: 314 ms +- 3 ms -> 311 ms +- 3 ms: 1.01x faster
- unpickle_list: 4.97 us +- 0.04 us -> 4.92 us +- 0.12 us: 1.01x faster
- logging_simple: 6.07 us +- 0.07 us -> 6.01 us +- 0.10 us: 1.01x faster
- crypto_pyaes: 75.1 ms +- 1.3 ms -> 74.5 ms +- 0.8 ms: 1.01x faster
- pickle_pure_python: 288 us +- 2 us -> 285 us +- 4 us: 1.01x faster
- pprint_safe_repr: 691 ms +- 7 ms -> 686 ms +- 5 ms: 1.01x faster
- sympy_str: 294 ms +- 3 ms -> 292 ms +- 2 ms: 1.01x faster
- python_startup_no_site: 6.28 ms +- 0.01 ms -> 6.25 ms +- 0.01 ms: 1.00x faster
- aiohttp: 1.02 ms +- 0.01 ms -> 1.01 ms +- 0.01 ms: 1.00x faster
- sqlglot_normalize: 109 ms +- 1 ms -> 109 ms +- 1 ms: 1.00x faster
- python_startup: 8.62 ms +- 0.01 ms -> 8.59 ms +- 0.01 ms: 1.00x faster
- deepcopy: 338 us +- 3 us -> 337 us +- 3 us: 1.00x faster

Benchmark hidden because not significant (31): async_tree_none, async_tree_cpu_io_mixed, async_tree_io, async_tree_memoization, chameleon, coroutines, coverage, deepcopy_reduce, deltablue, dulwich_log, fannkuch, genshi_text, genshi_xml, gunicorn, html5lib, logging_silent, meteor_contest, mypy, raytrace, regex_compile, richards, scimark_lu, scimark_monte_carlo, spectral_norm, sympy_integrate, sympy_sum, tornado_http, unpickle, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

Geometric mean: 1.00x slower
```

So this could be worth exploring as a cleanup opportunity (probably after confirming similar results with MSVC first).

gvanrossum commented 1 year ago

Agreed, this looks somewhat promising. Let's wait until we have a reliable way to benchmark this on Windows. (Didn't I read somewhere that we need 2.5% perf difference to be sure a benchmark result isn't noise?)

ericsnowcurrently commented 1 year ago

> (Didn't I read somewhere that we need 2.5% perf difference to be sure a benchmark result isn't noise?)

https://github.com/faster-cpython/ideas/issues/480

gvanrossum commented 1 year ago

Another way to evaluate the effectiveness of PREDICT() without benchmarking: look at opcode counts and successors for opcodes that have a PREDICT() in them. We should be able to tell how often the PREDICT() is effective from that.
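
As a toy illustration of that analysis (this is not the real Faster CPython stats tooling, and the opcode numbers and trace below are made up), counting (opcode, successor) pairs gives a direct hit rate for each PREDICT():

```
#include <stdio.h>

#define NUM_OPCODES 256

static unsigned long pair_counts[NUM_OPCODES][NUM_OPCODES];
static unsigned long totals[NUM_OPCODES];

/* Count (opcode, successor) pairs in a recorded execution trace. */
static void record_trace(const unsigned char *trace, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        pair_counts[trace[i]][trace[i + 1]]++;
        totals[trace[i]]++;
    }
}

/* How often was `predicted` really the next opcode after `op`? */
static double prediction_hit_rate(int op, int predicted)
{
    return totals[op] ? (double)pair_counts[op][predicted] / totals[op] : 0.0;
}

int main(void)
{
    /* Fabricated trace: opcode 1 is followed by opcode 2 four times out of five. */
    unsigned char trace[] = {1, 2, 1, 2, 1, 3, 1, 2, 1, 2};
    record_trace(trace, sizeof trace);
    printf("hit rate for PREDICT of 2 after 1: %.0f%%\n",
           100.0 * prediction_hit_rate(1, 2));
    return 0;
}
```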

Separately, PREDICT() calls in opcodes that may be specialized aren't very effective unless the specialized versions also use PREDICT(). So we should either add PREDICT() to the specialized versions or remove it from the base opcode.

gvanrossum commented 1 year ago

Conclusion: no need to worry about specializations here; half of these opcodes don't show up in the stats (because they're not used in any of the benchmarks); four of the others are clearly effective; MAP_ADD seems questionable. Perhaps we can just drop the PREDICT() calls from the 5 that don't show up (being very specialized opcodes) and from MAP_ADD?

earonesty commented 1 year ago

Windows is far from esoteric, though! It seems like it's still worth it on Windows, which is a critical OS.

markshannon commented 1 year ago

One thing we could do is strip the PREDICT() macros from bytecodes.c and automatically insert the prediction logic using data from the stats. The current, manually inserted predictions aren't very good, so it should be easy to do better.

E.g. according to the stats, BEFORE_WITH is followed by POP_TOP about 80% of the time, STORE_FAST__LOAD_FAST the remaining 20%, and everything else some negligible fraction.

We could generate this code at the end of BEFORE_WITH:

```
if (opcode == POP_TOP) goto target_POP_TOP;
if (opcode == STORE_FAST__LOAD_FAST) goto target_STORE_FAST__LOAD_FAST;
GOTO_DISPATCH();
```

(The if ... goto ... will need to be wrapped in a macro, for portability.)
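
One possible shape for that wrapper (the `PREDICT_GOTO` name and `target_<OPCODE>` labels here are hypothetical, just to sketch the idea, not existing CPython macros): the generator emits `PREDICT_GOTO(POP_TOP); PREDICT_GOTO(STORE_FAST__LOAD_FAST);` at the end of BEFORE_WITH, and the macro hides how the jump is spelled on each kind of build, or expands to nothing where predictions aren't wanted.

```
/* Hypothetical sketch only; not an existing CPython macro. */
#define PREDICT_GOTO(op) \
    do { \
        if (opcode == (op)) { \
            goto target_##op; \
        } \
    } while (0)
```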

gvanrossum commented 1 year ago

Since PREDICT() only matters on the MS compiler, maybe we could just manually insert more appropriate PREDICT() calls based on the stats? Where are the most recent stats?

markshannon commented 1 year ago

I think we should automate this. The stats live at https://github.com/faster-cpython/ideas/tree/main/stats; look for the one from the most recent Sunday.

markshannon commented 1 year ago

The PREDICT macros are gone.