Can you show the code you used to measure this?
env: Win10 cmd
video (just uploaded): https://www.youtube.com/watch?v=zhdF9eyi10M
result (several runs each):

| Python | empty | a=i | sin(i) |
|---|---|---|---|
| 3.11.0b3 | 216, 193, 216, 204 | 388, 357, 431, 391 | 1115, 1050, 1093, 1066 |
| 3.10.5 | 168, 176, 177, 178, 171 | 290, 309, 308, 315, 292 | 1100, 1081, 1069, 1073, 1095 |
| 3.9.13, 3.8.10 | (just like 3.10) | | |
code:

```python
import time
import math

t0 = time.perf_counter_ns()
for i in range(10_000_000):
    ...
t1 = time.perf_counter_ns()
print(t1 - t0)

t0 = time.perf_counter_ns()
for i in range(10_000_000):
    pass
t1 = time.perf_counter_ns()
print(t1 - t0)

t0 = time.perf_counter_ns()
for i in range(10_000_000):
    a = i
t1 = time.perf_counter_ns()
print(t1 - t0)

t0 = time.perf_counter_ns()
for i in range(10_000_000):
    a = math.sin(i)
t1 = time.perf_counter_ns()
print(t1 - t0)
```
Can you try again with the whole thing inside a function (which you call just once)? Your current code writes globals (`i` and `a`), and that may be slowing things down.
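As a minimal sketch of the function-wrapped variant (the name `bench` is arbitrary, shown here with the `sin` body):

```python
import math
import time

def bench():
    t0 = time.perf_counter_ns()
    for i in range(10_000_000):
        a = math.sin(i)
    t1 = time.perf_counter_ns()
    print((t1 - t0) / 1e9, "s")

bench()  # i and a are now fast function locals, not module globals
```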
They're all faster, but Python 3.11 is still 10-20% slower than Python 3.10 for empty loops.

- Empty loop: Python 3.11: 0.096 s; Python 3.10: 0.089 s
- a=i: Python 3.11: 0.123 s; Python 3.10: 0.122-0.132 s
- sin: Python 3.11: 0.70 s; Python 3.10: 0.850 s
Please use pyperformance or a similar tool that calls the function hundreds of times. Many optimizations will not kick in for functions that are called only once.
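For illustration, a minimal pyperf sketch (assumes `pip install pyperf`; `loop_body` is a placeholder name) that lets the tool call the benchmarked function many times:

```python
import math
import pyperf

def loop_body():
    for i in range(10_000):
        a = math.sin(i)

runner = pyperf.Runner()
runner.bench_func("sin loop", loop_body)  # pyperf calls loop_body repeatedly
```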
I think it would still be good to know if Python 3.11 has performance regressions for code that doesn't call the same function multiple times (such as in simple scripts), so I tried recreating the OP's test cases in a way that others can hopefully reproduce more easily.
Test environment: AMD Ryzen 9 5900X, 2 x 32 GB DDR4 clocked at 3200 MT/s, Windows 10 Version 21H2 (OS Build 19044.1766), Python 3.10.5 and Python 3.11.0b3 (both clean installs).
I have not set up anything special in terms of performance control; I am just running the tests several times to see if there is a consistent and big enough difference to draw any conclusions (my PC is well ventilated, so there is no thermal throttling).
Test execution format:
```
py -{python_version} -m timeit [-s "{setup_code}"] "{run_code}"
```
Test 1
Run code: `for i in range(10_000_000): pass`
High-level result: Python 3.11 is consistently 5-10% slower.
Test 2
Run code: `for i in range(10_000_000): a=i`
High-level result: after several dozen runs, no statistically significant difference.
Test 3
Setup code: `import math`
Run code: `for i in range(10_000_000): a=math.sin(i)`
High-level result: Python 3.11 is consistently 10-20% faster.
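For example, Test 3 expands to:

```
py -3.11 -m timeit -s "import math" "for i in range(10_000_000): a=math.sin(i)"
```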
Not sure this information helps anyone, but there it is.
The `math.sin()` test is pretty uninteresting, but the `pass` case deserves more investigation. @sweeneyde?
It always seems to be very hard to get stable Windows benchmarks. I got these results:
## Python 3.10.4 ##

```
py -3.10 -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
```

(I did that 5 times.)

```
Mean +- std dev: 315 ms +- 7 ms
Mean +- std dev: 320 ms +- 10 ms
Mean +- std dev: 317 ms +- 10 ms
Mean +- std dev: 321 ms +- 10 ms
Mean +- std dev: 320 ms +- 8 ms
```
## Python 3.11.0b3 ##

```
py -3.11 -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
```

(I did that 5 times.)

```
Mean +- std dev: 322 ms +- 12 ms
Mean +- std dev: 320 ms +- 8 ms
Mean +- std dev: 322 ms +- 7 ms
Mean +- std dev: 323 ms +- 6 ms
Mean +- std dev: 316 ms +- 14 ms
```
AMD Ryzen 9 5900HX with Radeon Graphics @ 3.30 GHz, Windows 10 Pro 21H2, 19044.1766.
I got "WARNING: unable to increase process priority" while measuring with py311.
## Python 3.10.5 / 3.11.0b3 ##

```
py -3.10 -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
Mean +- std dev: 195 ms +- 9 ms
Mean +- std dev: 193 ms +- 6 ms

py -3.11 -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
WARNING: unable to increase process priority
Mean +- std dev: 216 ms +- 9 ms
Mean +- std dev: 214 ms +- 6 ms
```
Does installing psutil for 3.11 make a difference?
I can repro the results from @wangyi041228 on my Windows laptop (although I wasn't very rigorous). I'm not sure why we see such different results where @sweeneyde saw basically no difference (at least, the difference of the means was a fraction of the std dev) -- it'd be surprising if the background noise always affected 3.11 more than 3.10.
We talked about this in the Faster CPython perf team and did not come up with any good explanation (although we expect the issue doesn't affect real code, since there's always some code in the loop). (@markshannon @brandtbucher)
I'm sure my Python 3.10.5 and 3.11.0b3 binaries all came from python.org, and I have to assume they were built using the same toolchain.
I reproduced (at least some of) the difference on my Mac as well, so I don't think it's the MSVC compiler.
I looked at the disassembly, and the heart of the loop is equivalent in 3.10 and 3.11:
3.10:

```
>>    8 FOR_ITER         2 (to 14)
     10 STORE_GLOBAL     1 (i)
     12 JUMP_ABSOLUTE    4 (to 8)
```

3.11:

```
>>   32 FOR_ITER         2 (to 38)
     34 STORE_GLOBAL     1 (i)
     36 JUMP_BACKWARD    3 (to 32)
```
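For reference, a sketch that reproduces this disassembly (the `global i` mirrors what `timeit -s "global i"` does; 3.10 prints `JUMP_ABSOLUTE`, 3.11 prints `JUMP_BACKWARD`):

```python
import dis

def f():
    global i
    for i in range(10_000_000):
        pass

dis.dis(f)
```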
Interestingly, getting rid of `global i` did not change things. So it's also not something in the global dict that slows down 3.11 here.
Here's a thought. Could it be the `_PyCode_Warmup()` call in `JUMP_BACKWARD`? That's an inline function, but still, it could be an extra memory load?
```c
static inline void
_PyCode_Warmup(PyCodeObject *code)
{
    if (code->co_warmup != 0) {
        code->co_warmup++;
        if (code->co_warmup == 0) {
            _PyCode_Quicken(code);
        }
    }
}
```
> It always seems to be very hard to get stable Windows benchmarks.

Agreed, though I think it helps that my machine has a high count of homogeneous CPU cores, there is no thermal throttling, and I have all virus scanning turned off. I created two virtual environments with pyperf and psutil (both of my Python installs are from python.org), ran pyperf several times, and interleaved the runs between Python 3.10 and 3.11; I never saw the mean change more than 2% for any of the runs:
Python 3.10

```
(.venv310) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 168 ms +- 5 ms
(.venv310) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 169 ms +- 6 ms
(.venv310) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 168 ms +- 5 ms
```

Python 3.11

```
(.venv311) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 187 ms +- 12 ms
(.venv311) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 188 ms +- 16 ms
(.venv311) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 186 ms +- 13 ms
```
Though on my machine, the empty loop is the only case where I can reproduce 3.11 being slower; as soon as the loop has code in it, I don't see any performance regression.
> Here's a thought. Could it be the `_PyCode_Warmup()` call in `JUMP_BACKWARD`? That's an inline function, but still, it could be an extra memory load?
To see what's actually being run on 3.11+, pass `adaptive=True` to `dis`. The `JUMP_BACKWARD` is quickened into a `JUMP_BACKWARD_QUICK`, which skips the `_PyCode_Warmup` dance.
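For example, a sketch on 3.11 (where `dis` grew the `adaptive` parameter):

```python
import dis

def f():
    global i
    for i in range(10_000_000):
        pass

f()  # run once so the backward jumps warm up and quicken the code object
dis.dis(f, adaptive=True)  # on 3.11b3 this shows JUMP_BACKWARD_QUICK
```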
For those who can reproduce: how does the slowdown vary with the number of iterations? Is it constant, or does it grow larger/smaller for different range sizes?
The single-digit fast path for `PyLong_FromLong` did change slightly between 3.10 and 3.11. I don't see any obvious reason why the new one would be slower, though.
Do we see a similar slowdown when dropping the loop into C with:

```
pyperf timeit -s "from collections import deque; r = range(10_000_000)" "deque(r, 0)" --rigorous
```
This version simply exhausts the iterator with as little overhead as possible. It might give us an idea if this slowdown is happening in the VM or not.
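For context on the trick: a `deque` with `maxlen=0` consumes an iterable entirely in C while storing nothing, e.g.:

```python
from collections import deque

r = range(10_000_000)
deque(r, 0)  # the second argument is maxlen; every item is discarded immediately
```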
I tried it on Windows with 10k iterations and with 10M iterations, and the % difference between 3.10 and 3.11 is the same for each (9% or so).
The variant using deque also shows a comparable slowdown.
Hm, so that would seem to point to a slowdown in `range` iteration at the C level. I'm not aware of any changes to that in 3.11, so perhaps the construction of single-digit longs got a bit slower on certain builds?
Wouldn't it have had to be a lot slower to show a 10% slowdown on those three bytecode instructions?
Is the allocator at all under suspicion at this point?
> Wouldn't it have had to be a lot slower to show a 10% slowdown on those three bytecode instructions?
Depends what you consider “a lot slower”, I guess. All this code is doing is creating and destroying tons of single-digit integers. If that specific code got 10-15% slower, it could possibly make this code 5-10% slower without really affecting anything else.
> Is the allocator at all under suspicion at this point?
Maybe. I’m not sure I know enough about the object allocator to have any ideas there, but I would assume that a degradation like that would affect pretty much everything. Unless this is some edge case where we’re repeatedly allocating and freeing pools of memory or something.
Okay, I came back to this for a bit and I too can reproduce it on a Windows machine. For the record, I only tried the python.org installers.
I ran more experiments (using the `deque(r, 0)` trick to iterate cheaply), and here's what I found. The timings were pretty consistent across several runs:
- `range(2**29, 2**29 + 10_000_000)` (fast `range_iterator` yielding single-digit values)
- `range(2**30, 2**30 + 10_000_000)` (fast `range_iterator` yielding double-digit values)
- `range(2**31, 2**31 + 10_000_000)` (slow `longrange_iterator` yielding double-digit values)
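To see why those boundaries matter: with 30-bit digits, values below 2**30 fit in a single digit. A quick sketch (`ndigits` is a hypothetical helper, not a CPython API):

```python
import sys

BITS = sys.int_info.bits_per_digit  # 30 on these builds

def ndigits(n):
    # how many 30-bit digits CPython needs to store a positive int
    return max(1, -(-n.bit_length() // BITS))

print(ndigits(2**29))  # 1 -> single-digit values
print(ndigits(2**30))  # 2 -> double-digit values
print(ndigits(2**31))  # 2 -> double-digit values
```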
My best theory is still that single-digit `int` construction is the culprit. Why?
Anyways, I'll stop here, because I'm not really sure how to proceed further. Also, I'm not really convinced that this is severe enough to spend more time investigating. 🙃
If you want to isolate the cost of iterating over a `range`, try `all(range(1, 10_000_001))` (starting at 1, since 0 is falsy and would short-circuit `all()`), as it doesn't need to do any allocation.
There have been changes to `PyLongObject` as well. The code is a bit messy because `sizeof(long)` is 4 on Windows, instead of 8 on other systems. This might be related.
So does this mean that DIGIT is only 15 bits on Windows? That might explain a lot. (Though note that I also saw a 6% slowdown on my Mac -- I wasn't very rigorous there, though.)
> The code is a bit messy because `sizeof(long)` is 4 on Windows, instead of 8 on other systems. This might be related.
Can we use `int64_t` from C99 `<stdint.h>` then?
This might be a relevant change in 3.11: https://github.com/python/cpython/issues/89732 (from 15 bits sometimes to always 30 bits by default)
Can someone with access to the Windows tooling verify that there's a perf drop for this example around the commit that changed the digit size default?
Um, sorry: according to `sys.int_info`, the number of bits per digit hasn't changed on Windows between 3.10 and 3.11, so that's not it.
@brandtbucher is looking into definitive proof that it's in `PyLong_FromLong()` for single-digit values.
> This might be a relevant change in 3.11: https://github.com/python/cpython/issues/89732 (from 15 bits sometimes to always 30 bits by default)
Just checked, and both builds I was using to get these numbers have the same `sys.int_info`:
```
PS C:\Users\brandtbucher> py -3.11
Python 3.11.0b3 (main, Jun 1 2022, 13:29:14) [MSC v.1932 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.int_info
sys.int_info(bits_per_digit=30, sizeof_digit=4)
>>> exit()
PS C:\Users\brandtbucher> py -3.10
Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.int_info
sys.int_info(bits_per_digit=30, sizeof_digit=4)
>>> exit()
```
Oh, whoops. Crossed posts. Yeah, I'm going to try making the new `PyLong_FromLong` path for single-digit ints as close as possible to the old 3.10 one, and see if I can still reproduce.
As for Windows, the v143 tools in VS2022 cause this issue: the PGO build gets the same performance as the release build. It seems to me that the official binaries have been built with them since 3.11b1.
> As for Windows, the v143 tools in VS2022 cause this issue: the PGO build gets the same performance as the release build. It seems to me that the official binaries have been built with them since 3.11b1.
@zooba Can you confirm that?
I tried exercising the `deque` cases below additionally, to absorb the differences between v142 and v143, which did not go well. Those tools do not affect the macro benchmarks, though.

- `n = 2**29; deque(range(n, n + 10_000_000), 0)`
- `n = 2**30; deque(range(n, n + 10_000_000), 0)`
- `n = 2**31; deque(range(n, n + 10_000_000), 0)`
`n = 2**29; deque(range(n, n + 10_000_000), 0)`

| py | v143 | v142 | v143 trained | v142 trained |
|---|---|---|---|---|
| 3.11 | 120 ms (base) | 100 ms (80%) | 112 ms (93%) | 98.8 ms (78%) |
| 3.10 | 123 ms (102%) | 106 ms (87%) | 109 ms (89%) | 97.8 ms (77%) |

`n = 2**30; deque(range(n, n + 10_000_000), 0)`

| py | v143 | v142 | v143 trained | v142 trained |
|---|---|---|---|---|
| 3.11 | 153 ms (base) | 138 ms (89%) | 142 ms (92%) | 127 ms (79%) |
| 3.10 | 155 ms (101%) | 136 ms (88%) | 139 ms (90%) | 125 ms (78%) |

`n = 2**31; deque(range(n, n + 10_000_000), 0)`

| py | v143 | v142 | v143 trained | v142 trained |
|---|---|---|---|---|
| 3.11 | 714 ms (base) | 675 ms (94%) | 599 ms (81%) | 558 ms (72%) |
| 3.10 | 679 ms (95%) | 639 ms (88%) | 615 ms (84%) | 573 ms (75%) |
This is off-topic, but the coverage.py C tracer can also improve performance (5-7%) with some exercise on Windows.
> As for Windows, the v143 tools in VS2022 cause this issue: the PGO build gets the same performance as the release build. It seems to me that the official binaries have been built with them since 3.11b1.
Can confirm that 3.11 is using the latest tools, yeah, and I think 3.10 still tries to use the ones it originally released with. It shows up in the sys.version string: `MSC v.1932` vs `MSC v.1929` (the compiler version doesn't match the overall toolset version).
There are a number of other possibilities though. It may be that we need to adjust the profile to account for the changed interpreter loop. I assume it's not going to have the same proportion of "normal" vs. "unlikely" paths as it used to, which could affect training decisions. But that requires macro-benchmarks, not microbenchmarks, so it's less interesting to discuss ;-)
It would be easy enough to rule out the profile as the reason for the slowdown: Compile 3.11 with v142 and v143 tools and compare the result. IIUC that's exactly what @neonene did, and the "v143 trained" vs. "v142 trained" columns seem to indicate that code compiled with v143 is always slower than v142, both for 3.10 and for 3.11. IIUC this is with the same profile. Also, comparing the (untrained) "v142" and "v143" columns, we see the same thing -- for both 3.10 and 3.11, code compiled with v143 is slower than v142.
So that rather clearly points to v143 as the culprit. What am I missing that sheds doubt on this conclusion? Since we both work for Microsoft at this point it would make sense to invite some internal compiler backend expert to this discussion, right? And it's not unreasonable that there might be some regression here, given that the VS2022 compiler is still relatively young.
> So that rather clearly points to v143 as the culprit.
Good point, you're right.
We've had such a terrible time getting any traction from the compiler team on this. I suspect they're so overwhelmed with actual errors that a regression without a micro-reproduction can't justify the attention. Our best bet is probably to convince the team to include a CPython build and perf suite in their test runs.
I'm away from work for 2 weeks now, but Graham and/or Mike should be able to make the connections to start that discussion.
3.10.0 does not show this issue on my local PGO build with the v143 tool set, so I tried bisecting. Related commits on main (backported before 3.10.1):
`n = 2**29; deque(range(n, n + 10_000_000), 0)`

| 3.10.5+ (Jul 1) | as-is | without commit 1 |
|---|---|---|
| v143 PGO | 123 ms (base) | 105 ms (83%) |
| v142 PGO | 106 ms (84%) | 105 ms (83%) |

| 3.12a0 (Jul 5) | as-is | without 1,2 | without 1 | without 2 |
|---|---|---|---|---|
| v143 PGO | 125 ms (base) | 100 ms (75%) | 118 ms (94%) | 118 ms (94%) |
| v142 PGO | 99.3 ms (74%) | 99.8 ms (74%) | - | - |
| v143 release | 112 ms (88%) | 111 ms (87%) | - | - |
| v142 release | 117 ms (93%) | 118 ms (93%) | - | - |
Possibly naive question: could we make python.org 3.11 releases with the v142 compiler? Does that commit us to using that compiler for the life of the release, or are v142 and v143 ABI-compatible? The reason I'm asking is to understand the urgency: is this something that needs to be resolved before 3.11rc1? I imagine not, both because (a) using an older compiler is fine and (b) this benchmark isn't terribly representative of real-world code, so we'd be OK with this specific regression (even if it is a bona fide regression in the C compiler).
> are 142 and 143 ABI-compatible?

Yes, otherwise wheels on Windows would not work.
> Could we make python.org 3.11 releases with the v142 compiler?
This is possible although it would require changing the build files, basically a revert of https://github.com/python/cpython/pull/29577 would suffice.
How hard would it be to create a standalone C program to reproduce the regression? That would probably be the most helpful for the MSVC folks. If we suspect the allocator, maybe just something that simulates what the Python memory pools are doing in this instance?
> How hard would it be to create a standalone C program to reproduce the regression?

I suppose checking whether a simple static inline function with one or more branches, used heavily throughout a C file, gets inlined or not would be a good indication of MSVC's PGO behavior.
> If we suspect the allocator, maybe just something that simulates what the Python memory pools are doing in this instance?

You can set `PYTHONMALLOC=malloc` as an env var to make CPython use the libc malloc and avoid obmalloc being used.
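For example, a sketch that re-runs the deque benchmark with obmalloc bypassed (assumes pyperf is installed; the env var applies to the spawned interpreter):

```python
import os
import subprocess
import sys

# PYTHONMALLOC=malloc makes CPython use the libc allocator instead of obmalloc
env = dict(os.environ, PYTHONMALLOC="malloc")
subprocess.run([sys.executable, "-m", "pyperf", "timeit",
                "-s", "from collections import deque; r = range(10_000_000)",
                "deque(r, 0)"], env=env)
```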
`arena_map_*()` in `obmalloc.c` are the most frequent functions in the `deque` cases above. I think they feed into the heuristic MSVC uses to optimize `_PyEval_EvalFrameDefault()` when they are inlined there, via the chain below:
```
object_dealloc
  PyObject_Free
    _PyObject_Free
      pymalloc_free
        address_in_range      -> bool
          arena_map_is_used   -> int
            arena_map_get
              - PyMem_RawCalloc (optimized away)
              - PyMem_RawCalloc (optimized away)
```
```c
static bool
address_in_range(void *p, poolp pool)
{
    return arena_map_is_used(p);
}
```
Workaround to reduce the cost of the optimization: make `address_in_range()` and `arena_map_is_used()` return `int` consistently. Unfortunately, there was no improvement when I tried making them both return `bool`; it did save the profiling cost, though, by omitting `address_in_range()` from the PGO instrumentation build.
Alternative: make `arena_map_is_used()` not share `arena_map_get()` with the other functions, duplicating it in `obmalloc.c`. Isolating the PGO profile helps MSVC optimize away (smoothly) the `PyMem_RawCalloc()` calls that `arena_map_is_used()` never wants.
Maybe there are more effective ways to help MSVC v.1932.

EDIT: It seems that this issue is about whether or not `PyObject_Free()` inlines `arena_map_get()`, so specifying `__forceinline` is another way to fix it.
@neonene: What type of CPU have you been measuring on? Our internal compiler contact is seeing something different on AMD vs. Intel (and maybe even the exact SKU may matter).
A MS engineer suggested:

> Maybe forcing an alignment of the `usedpools` global from obmalloc.c helps, that is, the array indexed by the `LEA rcx, [rip + X]`:
>
> ```c
> static poolp usedpools[2 * ((NB_SMALL_SIZE_CLASSES + 7) / 8) * 8] = {
> ```
>
> Can try something like a 64-byte alignment, for example.

I'm not sure what the pure-C equivalent would be.
@mdboom Sorry, I keep details private, but I'm an Intel user.
I'm not sure my patch is correct, but I can build 3.11 with:

```c
__declspec(align(64)) static poolp usedpools[2 * ((NB_SMALL_SIZE_CLASSES ...
```

The performance did not change for me with `arena_map_get()` un-inlined.
> The performance did not change for me with `arena_map_get()` un-inlined.

Can you clarify -- did the 64-byte alignment not change performance, or did un-inlining `arena_map_get()` not change performance?
My 3.11 got faster when I marked `arena_map_get()` with `__forceinline`. And the binary code of the `PyObject_Free()` chain did not change with `__declspec(align(64))`; I think that patch is not related to the `deque` performance.

As for v142, I can reproduce the slowdown by marking `arena_map_get()` with `__declspec(noinline)`.
Hi,
I'm the "compiler contact" :) The builds I did before for Python were using the 3.10.5 tag, and with that, arena_map_get() is not getting inlined by either MSVC version; a particular load instruction taking longer seemed to be the only slowdown there, but I think it's not relevant here.
Tried now with 3.12a0, and I can confirm what @neonene noticed: __forceinlining arena_map_get with v143 shows a 15% speedup. There is some CSE of loads (store-to-load value propagation) that is enabled by the inlining.
Looking at the number of instructions in PyObject_Free also shows the benefit of the inlining: without inlining, PyObject_Free + arena_map_get = 82 + 48 = 130 instructions; with inlining, PyObject_Free = 75 instructions.
From what I know, there is currently no estimate of how much inlining a function helps an optimization like CSE; with PGO, inlining relies mostly on the PGO counters plus a few other simple heuristics.
There is a more recent compiler flag that is also worth trying, /Ob3 -- it increases the budget of allowed inlining and could help in other spots. For arena_map_get, __forceinline is for now still the best option to make sure it always gets inlined.
Thanks, Gratian
Thanks @gratianlup!
@neonene could you prepare a patch that adds the force inline to arena_map_get? Then we can put this to rest.
This is now fixed in main and in 3.11 (will be in 3.11b5). The backport to 3.10 was a little problematic, and won't be necessary unless we also switch compiler versions there.
EDITED: Py311 is 10-20% slower than Py38, Py39, and Py310 in big loops with empty bodies or assignments to a global variable, measured with time or timeit, even though it's faster in other functions and libraries.