Would it be a viable option to use Clang instead of MSVC for Windows?
I have got it compiling under Clang on Windows in a debug build.
FWIW my own measurements on Windows show much better perf than that, but are inconclusive due to noise (we're waiting for a dedicated perf machine running Windows). See https://github.com/faster-cpython/ideas/discussions/315#discussioncomment-2376669 .
@zooba Do you see a problem with using clang instead of MSVC on Windows, assuming we can get it working?
Wouldn't that mean that anyone building C extensions would also have to switch over to Clang?
If so, that would require some coordination (e.g. cibuildwheel, conda-forge, numpy/scipy) to make such a switch and support two compilers if you want to make C extensions available for multiple Python versions.
@ivoflipse
> Wouldn't that mean that anyone building C extensions would also have to switch over to Clang?
I don't know. I have no experience with this topic at all, which is why I'm asking @zooba for advice.
@kumaraditya303 Can you point to your branch here?
Yeah, that's a massive ecosystem change that we're not ready for. It almost certainly will affect the ABI in some way, though it might be manageable, and it'll definitely impact/break any build scripts that implicitly or explicitly assume MSVC. As I work through ARM64 support, I'm finding the answer is "many"... (unless, of course, it turns out to be perfectly safe to link between the two compilers, in which case it'll only affect static lib users, which is practically nobody AFAIK).
Figuring out for certain if the size of ceval is causing it to de-optimise and then sensibly breaking it up is much easier.
FWIW @neonene came up with a PR that temporarily enables a flag that shows the inlining decisions. I'll look into this.
I have a branch https://github.com/python/cpython/compare/main...sweeneyde:ops_to_funcs that converts several of the longer opcodes into functions, in case that is useful for comparison.
Macro benchmark improvements are unstable for me, ranging from 8% to 20% compared to 3.10.1+.
When building with PGO (only), changing the definition of `Py_UNREACHABLE()` could be an improvement -- for example, replacing `__assume(0)` with a no-return no-op function in `pymacro.h`. But `__assume(0)` is not the only suspect that stops the optimizer.

Regarding `__assume(0)`, recent PGO builds show the following pyperformance results. `__assume(0)` can block the project's efforts, especially on x64.
Sample patch:
```diff
--- Include\pymacro.h
+++ Include\pymacro.h
@@ -126 +126 @@
-#  define Py_UNREACHABLE() __assume(0)
+static inline void _Py_NO_RETURN Py_UNREACHABLE(void) {}
```
| f4b328e (20220408) | Py_UNREACHABLE | x64 PGO | x86 PGO |
|---|---|---|---|
| original | `__assume(0)` | 1.00 | 1.00 |
| | `void foo(void) {}` | 1.01x faster | 1.05x faster |
| fork (forceinline, etc.) | `__assume(0)` | 1.02x faster | 1.08x faster |
| | `void foo(void) {}` | 1.12x faster | 1.16x faster |
`Py_FatalError()` can be used instead:

| f4b328e | Py_UNREACHABLE | x64 PGO | |
|---|---|---|---|
| fork | `void foo(void) {}` | 1.00 | |
| | `Py_FatalError` | 1.01x slower | too common to be distinguished |
Release builds:

| f4b328e | Py_UNREACHABLE | x64 Release | x86 Release |
|---|---|---|---|
| original (/Ob3) | `__assume(0)` | 1.00 | 1.00 |
| | `void foo(void) {}` | 1.01x slower | 1.02x faster |
It is possible that another commit would reverse which variant is slower or faster.
Release vs PGO, leaving `__assume(0)` as it is:

| f4b328e | x64 | x86 |
|---|---|---|
| Release (/Ob3) | 1.00 | 1.00 |
| PGO | 1.02x faster | 1.02x slower |
3.10.1 (x64) could show a 17% gap between Release (/Ob3) and PGO.
I've figured out how to get assembly code from after the PGO/LTO passes run, but the output is hard to understand -- there are almost no symbols in the resulting file and absolutely no line numbers. I have to do this for python311.dll, and the output is 630,000 lines. The function `_PyEval_EvalFrameDefault` alone is over 8400 lines.

Anyway, the key thing I learned here is

```
dumpbin /disasm:nobytes python311.dll >disasm.txt
```

and then searching the output for `_PyEval_EvalFrameDefault:`.
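Since `dumpbin` has no per-function filter, locating that label in the 630,000-line file comes down to a plain text search, e.g. (a hypothetical invocation; `/n` prints line numbers, `/c:` searches for the literal string):

```
findstr /n /c:"_PyEval_EvalFrameDefault:" disasm.txt
```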
I was hoping that looking at that disassembly would answer questions like "did the switch get compiled to the optimal form" but I can barely even find where the switch starts. :-(
I found one trick that seems to help: looking at the same code in the Visual Studio debugger's Disassembly window. Using search on the raw disassembly I was able to confirm that the code there looked the same. I now believe that the top of the switch compiles to this code (on my machine, with VS 2019, in PGO+LTO mode):
```
    switch (opcode) {
00007FFE120A0C53  je          _PyEval_EvalFrameDefault+1CC9h (07FFE120A2659h)
00007FFE120A0C59  mov         rax,qword ptr [rsp+78h]
00007FFE120A0C5E  mov         eax,dword ptr [rax]
00007FFE120A0C60  test        eax,eax
00007FFE120A0C62  jne         _PyEval_EvalFrameDefault+51Eh (07FFE120A0EAEh)
00007FFE120A0C68  movzx       ecx,word ptr [r14]
00007FFE120A0C6C  mov         esi,ecx
00007FFE120A0C6E  movzx       edi,cl
00007FFE120A0C71  shr         esi,8
00007FFE120A0C74  or          edi,dword ptr [cframe]
00007FFE120A0C78  mov         qword ptr [rsp+30h],rsi
00007FFE120A0C7D  mov         qword ptr [rbp-50h],r15
00007FFE120A0C81  mov         r10,r14
00007FFE120A0C84  mov         r13,r15
00007FFE120A0C87  mov         r9,r14
00007FFE120A0C8A  mov         r12,r15
00007FFE120A0C8D  mov         rdx,r14
00007FFE120A0C90  mov         r8,r14
00007FFE120A0C93  mov         r11,r15
00007FFE120A0C96  cmp         edi,0FFh
00007FFE120A0C9C  ja          _guard_xfg_dispatch_icall_nop+2DCEh (07FFE121F1B0Eh)
00007FFE120A0CA2  lea         rcx,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE12060000h)]
00007FFE120A0CA9  movsxd      rax,edi
00007FFE120A0CAC  movsxd      rsi,esi
00007FFE120A0CAF  movzx       eax,byte ptr [rcx+rax+490E0h]
00007FFE120A0CB7  mov         ecx,dword ptr [rcx+rax*4+48E1Ch]
00007FFE120A0CBE  lea         rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE12060000h)]
00007FFE120A0CC5  add         rcx,rax
00007FFE120A0CC8  jmp         rcx
00007FFE120A0CCA  mov         r13,qword ptr [rsp+48h]
00007FFE120A0CCF  mov         rax,r14
00007FFE120A0CD2  sub         rax,qword ptr [rsp+40h]
00007FFE120A0CD7  add         r14,2
00007FFE120A0CDB  sar         rax,1
00007FFE120A0CDE  mov         dword ptr [r13+38h],eax
00007FFE120A0CE2  mov         rax,qword ptr [r13+rsi*8+48h]
00007FFE120A0CE7  test        rax,rax
00007FFE120A0CEA  je          _guard_xfg_dispatch_icall_nop+1AF7h (07FFE121F0837h)
00007FFE120A0CF0  inc         qword ptr [rax]
00007FFE120A0CF3  mov         qword ptr [r15],rax
00007FFE120A0CF6  add         r15,8
00007FFE120A0CFA  mov         qword ptr [rsp+50h],r15
00007FFE120A0CFF  movzx       eax,word ptr [r14]
00007FFE120A0D03  movzx       edi,al
00007FFE120A0D06  mov         esi,eax
00007FFE120A0D08  jmp         _PyEval_EvalFrameDefault+2E1h (07FFE120A0C71h)
    goto handle_eval_breaker;
```
(Which is weird, because that last line is not directly following the switch.)
Unfortunately, I don't know x86/x64 assembly, so I have no idea what this means.
Okay, the Disassembly window works a bit differently than I had understood before. It does not necessarily display all source lines, and/or it doesn't display them in order. I guess that's why it has to be so deeply integrated with the debugger. A more illustrative snippet:
```
    TARGET(LOAD_CONST) {
00007FFE120A0D0D  mov         rax,qword ptr [rsp+30h]
00007FFE120A0D12  mov         rsi,r14
        PREDICTED(LOAD_CONST);
        PyObject *value = GETITEM(consts, oparg);
00007FFE120A0D15  sub         rsi,qword ptr [rsp+40h]
00007FFE120A0D1A  mov         r13,qword ptr [rsp+48h]
00007FFE120A0D1F  sar         rsi,1
00007FFE120A0D22  mov         rcx,qword ptr [rbp-38h]
00007FFE120A0D26  add         r14,2
00007FFE120A0D2A  mov         dword ptr [r13+38h],esi
00007FFE120A0D2E  cdqe
00007FFE120A0D30  mov         rcx,qword ptr [rcx+rax*8+18h]
        Py_INCREF(value);
00007FFE120A0D35  inc         qword ptr [rcx]
        PUSH(value);
00007FFE120A0D38  mov         qword ptr [r15],rcx
00007FFE120A0D3B  add         r15,8
00007FFE120A0D3F  mov         qword ptr [rsp+50h],r15
        DISPATCH();
00007FFE120A0D44  jmp         _PyEval_EvalFrameDefault+2D8h (07FFE120A0C68h)
        DISPATCH();
    }
```
That corresponds to this source code:
```c
TARGET(LOAD_CONST) {
    PREDICTED(LOAD_CONST);
    PyObject *value = GETITEM(consts, oparg);
    Py_INCREF(value);
    PUSH(value);
    DISPATCH();
}
```
(I'm not sure why the `DISPATCH()` call is shown twice; probably a small bug.)

Tracing through the `DISPATCH()` call one instruction at a time, I land in the big block from the previous comment. I wonder if that macro just does too much and all of it has been consolidated into that big block.
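For reference, in the switch-based build `DISPATCH()` boils down to roughly this (a simplified sketch, not the exact ceval.c definition, which also deals with tracing and the eval breaker):

```c
#define DISPATCH() \
    do { \
        NEXTOPARG();           /* fetch the next opcode and oparg */ \
        goto dispatch_opcode;  /* jump back to the switch at the top */ \
    } while (0)
```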
Also note that I did all this in the PR branch where I am experimenting with turning INCREF/DECREF back into macros, in ceval.c only.
@neonene I am not ignoring you, far from it! I would like to dig into the `Py_UNREACHABLE()` situation, but the issue is, as you have said yourself, that the benchmarks are very noisy.
There are only two calls to it in ceval.c. How can we prove that the optimizer really does get distracted by those? Note that the two are not similar:
One call is here:
```c
TARGET(CACHE) {
    Py_UNREACHABLE();
}
```
which is only unreachable because the bytecode compiler never generates this opcode -- but the optimizer cannot know that (though PGO can deduce it, since it's never executed).
The other is here:
```c
    goto error;
} /* End instructions */

/* This should never be reached. Every opcode should end with DISPATCH()
   or goto error. */
Py_UNREACHABLE();
```
Here the optimizer should know the code is really unreachable, if all cases indeed do end with `DISPATCH()` or `goto`.

Can we use this knowledge somehow? Maybe manually put calls to `Py_FatalError()` instead?
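Concretely, that suggestion would look something like this sketch (the message string is made up):

```c
TARGET(CACHE) {
    /* Crash loudly instead of telling the optimizer this is unreachable;
       unlike __assume(0), this gives the optimizer an ordinary cold call
       to reason about. */
    Py_FatalError("executed a CACHE pseudo-instruction");
}
```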
PS. Can you link us to your branch?
Could anyone run benchmarks on the same version (timing) without code changes, putting my posts aside? My English reading/writing is too slow to reply about `__assume(0)` at this time.

Comparing Release with PGO should be easier than 3.10 PGO vs 3.11 PGO. Using `/Ob3` would help keep the inlining and the performance consistent on Release builds, which should already be far faster than 3.10 PGO:
```diff
--- PCbuild/pythoncore.vcxproj
+++ PCbuild/pythoncore.vcxproj
@@ -102 +102 @@
-    <AdditionalOptions>/Zm200 %(AdditionalOptions)</AdditionalOptions>
+    <AdditionalOptions>/Zm200 /Ob3 %(AdditionalOptions)</AdditionalOptions>
```
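For anyone reproducing these comparisons, both configurations come from CPython's standard Windows build script (these are existing `build.bat` options, shown here as a usage sketch):

```
PCbuild\build.bat -c Release   :: plain Release build
PCbuild\build.bat --pgo        :: profile-guided optimized build
```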
> on the same version

Any commit is fine. I don't mean the same version as mine.
@neonene I still don't have the capacity to run benchmarks without a lot of noise.
The rest here is lab notes:
I learned a bit of x64 assembler and I think I have a better grip on how `switch (opcode)` is translated. Here's the blob of code:
```
00007FFE0DA50CA2  lea         rcx,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE0DA10000h)]
00007FFE0DA50CA9  movsxd      rax,edi
00007FFE0DA50CAC  movsxd      rsi,esi
00007FFE0DA50CAF  movzx       eax,byte ptr [rcx+rax+490E0h]
00007FFE0DA50CB7  mov         ecx,dword ptr [rcx+rax*4+48E1Ch]
00007FFE0DA50CBE  lea         rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE0DA10000h)]
00007FFE0DA50CC5  add         rcx,rax
00007FFE0DA50CC8  jmp         rcx
```
Let me narrate that. (Register names like EAX are aliases for the lower 32 bits of RAX -- the latter is a 64-bit register.) At the start, register EDI contains the opcode and ESI has the oparg (the result of the `NEXTOPARG()` macro).
```
00007FFE0DA50CA2  lea rcx,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE0DA10000h)]  ; rcx = a base address in the DLL
00007FFE0DA50CA9  movsxd rax,edi                        ; rax = opcode, sign-extended to 64 bits
00007FFE0DA50CAC  movsxd rsi,esi                        ; rsi = oparg, sign-extended to 64 bits
00007FFE0DA50CAF  movzx eax,byte ptr [rcx+rax+490E0h]   ; first load: byte table indexed by opcode
00007FFE0DA50CB7  mov ecx,dword ptr [rcx+rax*4+48E1Ch]  ; second load: dword table indexed by that byte
00007FFE0DA50CBE  lea rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE0DA10000h)]  ; rax = the same base address
00007FFE0DA50CC5  add rcx,rax                           ; rcx = base + offset
00007FFE0DA50CC8  jmp rcx                               ; jump to the opcode handler
```
So we have two tables, call them `uchar blob1[]` and `uint32 blob2[]`. We do effectively this:

```c
goto anchor + blob2[blob1[opcode]];
```
The critical point here is that we have two memory loads, where the second one is dependent on the first one:

```
00007FFE0DA50CAF  movzx       eax,byte ptr [rcx+rax+490E0h]
00007FFE0DA50CB7  mov         ecx,dword ptr [rcx+rax*4+48E1Ch]
```
So this is very far from the "computed goto" we have with GCC and Clang, nor is it anywhere near the "use opcode as jump index" we were hoping for (that would be using `blob2[opcode]` rather than `blob2[blob1[opcode]]`).
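As a minimal C model of what the generated code computes (all names, table contents, and addresses invented for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Model of MSVC's two-level switch dispatch: blob1 compacts the 256
 * possible opcode values into dense case indices, and blob2 maps a case
 * index to the handler's offset from the image base. The second load
 * depends on the result of the first, which is the cost discussed above. */
static const uint8_t blob1[256] = { [9] = 1, [100] = 2 };  /* others -> 0 (default) */
static const int32_t blob2[3] = { 0x100, 0x250, 0x4A0 };   /* fake handler offsets */

int main(void)
{
    int opcode = 100;
    uintptr_t image_base = 0x7FFE0DA10000ULL;  /* pretend DLL base */
    uintptr_t target = image_base + (uintptr_t)blob2[blob1[opcode]];
    printf("jmp 0x%llx\n", (unsigned long long)target);  /* the final `jmp rcx` */
    return 0;
}
```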
Given enough time and monkeys I can now answer any question about optimization. :-)
Another observation relates to `DISPATCH()`: it seems that whenever the optimizer sees this macro it just jumps to a location a bit above the code I just narrated. That's probably fine.
Next I'm going to see what things look like on the main branch (so far I used my "inline" branch from https://github.com/python/cpython/pull/32387).
With a Release build from main, the switch(op) code looks pretty much the same.
With a Debug build a bit of clarity is obtained, but not much -- now we see this:
```
00007FFE0D8CFC21  lea         rcx,[__ImageBase (07FFE0D2D0000h)]
00007FFE0D8CFC28  movzx       eax,byte ptr [rcx+rax+61BFCCh]
00007FFE0D8CFC30  mov         eax,dword ptr [rcx+rax*4+61BD04h]
00007FFE0D8CFC37  add         rax,rcx
00007FFE0D8CFC3A  jmp         rax
```
This confirms my hunch that the `lea` instruction loads a base offset for some big block (now named `__ImageBase` -- the LTO seems to just have moved the base around so the offsets in the `mov` instructions fit in fewer bits), but the rest is the same -- we're still effectively doing `goto imagebase + blob2[blob1[opcode]]`.
Thanks for digging into it! I'll also look into it further with my debugger.
So the dispatch is effectively (using gcc syntax) `goto *(base + offset_table[compact_table[opcode]])`, where `compact_table` maps all opcodes to themselves and all other values to the default?
If we were to drop the `default:` and generate cases for all undefined opcodes:

```c
case X:
    oparg = X;  /* X needs to be a constant, not `opcode`, so that the cases cannot be trivially merged */
    goto unknown_opcode;
```

then would the compiler drop the `compact_table`, giving us `goto *(base + offset_table[opcode])`?
@gvanrossum Could you try this out?
I wonder whether `goto *(base + offset_table[opcode])` is faster or slower than `goto *address_table[opcode]`? The `offset_table` is half the size, reducing cache pressure, but it needs `base` to be loaded.

If it turns out that `goto *(base + offset_table[opcode])` is faster, then we can use that for non-Windows builds.
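For the non-Windows side of that question, here is a compilable sketch of the address-table form (GCC/Clang labels-as-values, the same pattern CPython's computed-goto build uses; the offset-table variant would instead store 32-bit label offsets and add `base` back before the jump):

```c
#include <stdint.h>
#include <stdio.h>

/* Tiny three-opcode interpreter dispatching through a full-size address
 * table: one 8-byte load per dispatch and no base add. */
static int run(const uint8_t *code)
{
    static const void *address_table[3] = { &&op_nop, &&op_inc, &&op_halt };
    int acc = 0;
    goto *address_table[*code];
op_nop:
    code++;
    goto *address_table[*code];
op_inc:
    acc++;
    code++;
    goto *address_table[*code];
op_halt:
    return acc;
}

int main(void)
{
    static const uint8_t prog[] = { 1, 1, 0, 1, 2 };  /* inc, inc, nop, inc, halt */
    printf("%d\n", run(prog));                        /* prints 3 */
    return 0;
}
```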
One thing that is not clear. Is the computed jmp performed at the end of each instruction, or just at one location?
> One thing that is not clear. Is the computed jmp performed at the end of each instruction, or just at one location?

See Guido's comment above:

> Another observation relates to `DISPATCH()`: it seems that whenever the optimizer sees this macro it just jumps to a location a bit above the code I just narrated. That's probably fine.
It looks like it follows the C code pretty literally, jumping up to `dispatch_opcode`.
Half-baked idea: see how the different compilers handle a hack-y form of computed goto, where a switch over the opcode labels is stuffed into the `DISPATCH()` macro:
```c
#define DISPATCH_GOTO() \
    switch (opcode) { \
    case NOP: goto TARGET_NOP; \
    case RESUME: goto TARGET_RESUME; \
    ... \
    }
```
I'm not sure I like it, though, since small changes (in the code or the compiler) could easily result in code size explosion.
I added dummy cases for all non-existent opcodes, but sadly this did not convince the compiler to skip the indirection table. The switch code looks the same as before; the only difference is that there are a few other instructions between the first and the second memory load:
```
00007FF92E838B09  lea         ecx,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FF92E820000h)]
00007FF92E838B0F  movsxd      rax,edi
00007FF92E838B12  movsxd      r14,r15d
00007FF92E838B15  mov         r11,r12
00007FF92E838B18  mov         qword ptr [rbp-68h],r12
00007FF92E838B1C  mov         r9,r13
00007FF92E838B1F  mov         qword ptr [rbp-60h],r12
00007FF92E838B23  mov         rdx,r13
00007FF92E838B26  movzx       eax,byte ptr [rcx+rax+1F778h]   ; *** FIRST LOAD ***
00007FF92E838B2E  mov         r8,r13
00007FF92E838B31  mov         qword ptr [rsp+50h],r12
00007FF92E838B36  mov         rsi,r12
00007FF92E838B39  mov         qword ptr [rbp-50h],r14
00007FF92E838B3D  mov         ecx,dword ptr [rcx+rax*4+1F4B4h] ; *** SECOND LOAD ***
00007FF92E838B44  lea         rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FF92E820000h)]
00007FF92E838B4B  add         rcx,rax
00007FF92E838B4E  jmp         rcx
```
I don't think this is worth the hassle.
> Half-baked idea: see how the different compilers handle a hack-y form of computed goto, where a switch over the opcode labels is stuffed into the `DISPATCH()` macro.

This worries me because we'd have 150+ copies of that switch.
At this point I don't think it's very useful to try and argue with the compiler, PGO, LTO, or the CPU -- the one area where I think we have some leverage is some common inline functions that fail to be inlined (IS_TYPE, DECREF, and load_atomic_relaxed[int32]). For that I think we need https://github.com/python/cpython/pull/32387.
@markshannon
(So I tried out the dummy cases but it didn't get rid of the offset_table, it just moved the memory access up a bit. See previous comment.)
> I wonder whether `goto *(base + offset_table[opcode])` is faster or slower than `goto *address_table[opcode]`? The offset_table is half the size, reducing cache pressure, but it needs base to be loaded.
Here `base` is actually a constant; it uses a `lea` instruction (Load Effective Address) to get it into a register, like this (though modified by PGO+LTO):

```
00007FFE0D8CFC21  lea         rcx,[__ImageBase (07FFE0D2D0000h)]
```

That's just a constant load, and while I suspect (since this is in a DLL) it's actually IP-relative, the CPU probably has a dedicated adder to compute that. So my guess is that the compiler writers know what they are doing, using 32-bit offsets. But they still want to avoid dummy entries there, hence the offset_table.
Hmm, maybe we can avoid that by making all the cases actually do something different.
> Hmm, maybe we can avoid that by making all the cases actually do something different.
Nope, it still uses the offset_table. I'm giving up on this. Sorry to get your hopes up during our meeting earlier today.
> I'm giving up on this.
Could you publish the branches with your code and the resulting assembler output, so that others can pick up where you left off?
I will try to create branches later this week.
Looking at this again I realized that `opcode` is declared as `int`, so maybe that's why the switch doesn't optimize as expected. I'll look into this some more, starting by changing the switch line to this:

```c
switch ((unsigned char)opcode) {
```
Well I'll be darned. Now the code effectively boils down to this, where initially `rcx` is the opcode:
```
00007FF8FF9A7854  lea         rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FF8FF970000h)]
00007FF8FF9A7871  mov         eax,dword ptr [rax+rcx*4+3F9D4h]
00007FF8FF9A787B  lea         rcx,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FF8FF970000h)]
00007FF8FF9A7886  add         rax,rcx
00007FF8FF9A788F  jmp         rax
```
(I left some instructions out that are interleaved but don't touch any of these and seem to be related to flushing registers to the stack or moving them around.)
Crucially, the kind of instruction I found in last week's investigations is nowhere to be seen:

```
movzx       eax,byte ptr [rcx+rax+1F778h]
```

(which was the "compact_table" from above).
Thus the new code is now equivalent to

```c
goto *(base + offset_table[opcode]);
```

which is one memory load less than before (where opcode was an `int`).
Let me see if we actually need the dummy cases.
Yeah, with just a `default:` label, the extra memory load is back:
```
00007FF8F031789A  movzx       eax,byte ptr [rcx+rax+3DD70h]
00007FF8F03178A2  mov         ecx,dword ptr [rcx+rax*4+3DA9Ch]
00007FF8F03178A9  lea         rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FF8F02E0000h)]
00007FF8F03178B0  add         rcx,rax
00007FF8F03178B3  jmp         rcx
```
Next I'll investigate whether the dummy cases need to contain any code.
Fortunately, the dummy cases don't need to do anything. I will prepare a separate PR for this.
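Putting Mark's earlier suggestion together with this finding, the resulting shape is presumably something like the following sketch (illustrative opcode values only; see the PR below for the real change):

```c
switch ((unsigned char)opcode) {
    /* ... all the real TARGET(...) cases ... */

    /* Explicit empty cases for every otherwise-unused opcode value
       replace the `default:` label, so all 256 values are covered and
       the compiler can index the jump table by opcode directly. */
    case 170:
    case 171:
    /* ... */
    case 255:
        goto unknown_opcode;
}
```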
Here's the PR. (Code is included in comments above.) https://github.com/python/cpython/pull/91718
This approach makes the `PREDICT()` macro unnecessary for me.

3.10.x PGO does not need a fix around `switch(opcode)`. We can just compare 3.11's performance with current 3.10.
I've done the best measurements I can on my corporate laptop, and I conclude that on Windows 3.11 is now ~20% faster than 3.10. This is main at revision 2f233fceae.
Things I did to stabilize the benchmarks:
"I close the issue."
PS. I now also benchmarked 3.11 just before my first Windows perf commit (i.e., just before GH-91719). That one was only 7% faster than 3.10.4, and current head is 13% faster than that. So one or both of these did have a big effect!
I ran one more set of benchmarks, and conclude that the switch improvement (GH-91719) contributed 3% and the macro-fied inline functions (GH-89279) contributed the remaining 10%.
This is assuming none of the intervening commits contributed performance. The list of commits including the endpoints:
- `locale.getencoding()` instead of getpreferredencoding (GH-91732)
There was a report that on Windows the performance on the PyPerformance benchmark is only 3% better than for 3.10. We should try to confirm or refute this result, and if confirmed, attempt to improve the situation.
The likely culprit is the optimizer (or PGO or LTO) giving up on inlining in the main function in ceval.c. In particular, it might be the INCREF/DECREF inline functions. A possible hack would be to turn these back into macros again at the top of ceval.c (if no debugging options are defined).
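As an illustration of that hack, shadowing the inline function with a classic macro at the top of ceval.c might look roughly like this (a sketch of the idea only, not the actual patch in GH-32387):

```c
/* Only in release builds, where no refcount-debugging hooks are active. */
#if !defined(Py_DEBUG) && !defined(Py_REF_DEBUG)
#  undef Py_DECREF
#  define Py_DECREF(op) \
    do { \
        PyObject *_tmp = _PyObject_CAST(op); \
        if (--_tmp->ob_refcnt == 0) { \
            _Py_Dealloc(_tmp); \
        } \
    } while (0)
#endif
```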