Can you show the code you used to measure this?
env: Win10 cmd
video (just uploaded): https://www.youtube.com/watch?v=zhdF9eyi10M
result (several runs each):

| Python | empty | a=i | sin(i) |
|---|---|---|---|
| 3.11.0b3 | 216, 193, 216, 204 | 388, 357, 431, 391 | 1115, 1050, 1093, 1066 |
| 3.10.5 | 168, 176, 177, 178, 171 | 290, 309, 308, 315, 292 | 1100, 1081, 1069, 1073, 1095 |
| 3.9.13, 3.8.10 | (just like 3.10) | | |
code:

```python
import time
import math

t0 = time.perf_counter_ns()
for i in range(10_000_000):
    ...
t1 = time.perf_counter_ns()
print(t1 - t0)

t0 = time.perf_counter_ns()
for i in range(10_000_000):
    pass
t1 = time.perf_counter_ns()
print(t1 - t0)

t0 = time.perf_counter_ns()
for i in range(10_000_000):
    a = i
t1 = time.perf_counter_ns()
print(t1 - t0)

t0 = time.perf_counter_ns()
for i in range(10_000_000):
    a = math.sin(i)
t1 = time.perf_counter_ns()
print(t1 - t0)
```
Can you try again with the whole thing inside a function (which you call just once)? Your current code writes globals (`i` and `a`), and that may be slowing things down.
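As a minimal sketch of the function-wrapped variant (the name `bench` is arbitrary, shown here with the `sin` body):

```python
import math
import time

def bench():
    t0 = time.perf_counter_ns()
    for i in range(10_000_000):
        a = math.sin(i)
    t1 = time.perf_counter_ns()
    print((t1 - t0) / 1e9, "s")

bench()  # i and a are now fast function locals, not module globals
```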
They're all faster, but Python 3.11 is still 10-20% slower than Python 3.10 for empty loops.

- Empty loop: Python 3.11: 0.096 s; Python 3.10: 0.089 s
- a=i: Python 3.11: 0.123 s; Python 3.10: 0.122-0.132 s
- sin: Python 3.11: 0.70 s; Python 3.10: 0.850 s
Please use pyperformance or a similar tool that calls the function hundreds of times. Many optimizations will not kick in for functions that are called only once.
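For illustration, a minimal pyperf sketch (assumes `pip install pyperf`; `loop_body` is a placeholder name) that lets the tool call the benchmarked function many times:

```python
import math
import pyperf

def loop_body():
    for i in range(10_000):
        a = math.sin(i)

runner = pyperf.Runner()
runner.bench_func("sin loop", loop_body)  # pyperf calls loop_body repeatedly
```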
I think it would still be good to know if Python 3.11 has performance regressions for code that doesn't call the same function multiple times (such as in simple scripts), so I tried recreating the OP's test cases in a way that others can hopefully reproduce more easily.
Test environment: AMD Ryzen 9 5900X, 2 x 32 GB DDR4 clocked at 3200 MT/s, Windows 10 Version 21H2 (OS Build 19044.1766), Python 3.10.5 and Python 3.11.0b3 (both clean installs).
I have not set up anything special in terms of performance control; I am just running the tests several times to see if there is a consistent and big enough difference to draw any conclusions (my PC is well ventilated, so there is no thermal throttling).
Test execution format:
```
py -{python_version} -m timeit [-s "{setup_code}"] "{run_code}"
```
Test 1
Run code: `for i in range(10_000_000): pass`
High-level result: Python 3.11 is consistently 5-10% slower.
Test 2
Run code: `for i in range(10_000_000): a=i`
High-level result: after several dozen runs, no statistically significant difference.
Test 3
Setup code: `import math`
Run code: `for i in range(10_000_000): a=math.sin(i)`
High-level result: Python 3.11 is consistently 10-20% faster.
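For example, Test 3 expands to:

```
py -3.11 -m timeit -s "import math" "for i in range(10_000_000): a=math.sin(i)"
```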
Not sure this information helps anyone, but there it is.
The `math.sin()` test is pretty uninteresting, but the `pass` case deserves more investigation. @sweeneyde?
It always seems to be very hard to get stable Windows benchmarks. I got these results:
## Python 3.10.4 ##

```
py -3.10 -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
```

(I did that 5 times.)

```
Mean +- std dev: 315 ms +- 7 ms
Mean +- std dev: 320 ms +- 10 ms
Mean +- std dev: 317 ms +- 10 ms
Mean +- std dev: 321 ms +- 10 ms
Mean +- std dev: 320 ms +- 8 ms
```
## Python 3.11.0b3 ##

```
py -3.11 -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
```

(I did that 5 times.)

```
Mean +- std dev: 322 ms +- 12 ms
Mean +- std dev: 320 ms +- 8 ms
Mean +- std dev: 322 ms +- 7 ms
Mean +- std dev: 323 ms +- 6 ms
Mean +- std dev: 316 ms +- 14 ms
```
AMD Ryzen 9 5900HX with Radeon Graphics @ 3.30 GHz, Windows 10 Pro 21H2, 19044.1766.
I got "WARNING: unable to increase process priority" while measuring with py311.
## Python 3.10.5 / 3.11.0b3 ##

```
py -3.10 -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
Mean +- std dev: 195 ms +- 9 ms
Mean +- std dev: 193 ms +- 6 ms

py -3.11 -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
WARNING: unable to increase process priority
Mean +- std dev: 216 ms +- 9 ms
Mean +- std dev: 214 ms +- 6 ms
```
Does installing psutil for 3.11 make a difference?
I can repro the results from @wangyi041228 on my Windows laptop (although I wasn't very rigorous). I'm not sure why we see such different results where @sweeneyde saw basically no difference (at least, the difference of the means was a fraction of the std dev) -- it'd be surprising if the background noise always affected 3.11 more than 3.10.
We talked about this in the Faster CPython perf team and did not come up with any good explanation (although we expect the issue doesn't affect real code, since there's always some code in the loop). (@markshannon @brandtbucher)
I'm sure my Python 3.10.5 and 3.11.0b3 binaries all came from python.org, and I have to assume they were built using the same toolchain.
I reproduced (at least some of) the difference on my Mac as well, so I don't think it's the MSVC compiler.
I looked at the disassembly, and the heart of the loop is equivalent in 3.10 and 3.11:
3.10:

```
>>    8 FOR_ITER         2 (to 14)
     10 STORE_GLOBAL     1 (i)
     12 JUMP_ABSOLUTE    4 (to 8)
```

3.11:

```
>>   32 FOR_ITER         2 (to 38)
     34 STORE_GLOBAL     1 (i)
     36 JUMP_BACKWARD    3 (to 32)
```
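For reference, a sketch that reproduces this disassembly (the `global i` mirrors what `timeit -s "global i"` does; 3.10 prints `JUMP_ABSOLUTE`, 3.11 prints `JUMP_BACKWARD`):

```python
import dis

def f():
    global i
    for i in range(10_000_000):
        pass

dis.dis(f)
```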
Interestingly, getting rid of `global i` did not change things. So it's also not something in the global dict that slows down 3.11 here.
Here's a thought. Could it be the `_PyCode_Warmup()` call in `JUMP_BACKWARD`? That's an inline function, but still, it could be an extra memory load?
```c
static inline void
_PyCode_Warmup(PyCodeObject *code)
{
    if (code->co_warmup != 0) {
        code->co_warmup++;
        if (code->co_warmup == 0) {
            _PyCode_Quicken(code);
        }
    }
}
```
> It always seems to be very hard to get stable Windows benchmarks.

Agreed, though I think it helps that my machine has a high count of homogeneous CPU cores, there is no thermal throttling, and I have all virus scanning turned off. I created two virtual environments with pyperf and psutil (both of my Python installs are from python.org), ran pyperf several times, and interleaved the runs between Python 3.10 and 3.11; I never saw the mean change more than 2% for any of the runs:
Python 3.10

```
(.venv310) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 168 ms +- 5 ms
(.venv310) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 169 ms +- 6 ms
(.venv310) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 168 ms +- 5 ms
```

Python 3.11

```
(.venv311) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 187 ms +- 12 ms
(.venv311) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 188 ms +- 16 ms
(.venv311) PS C:\Users\damia> py -m pyperf timeit -s "global i" "for i in range(10_000_000): pass" --rigorous
.........................................
Mean +- std dev: 186 ms +- 13 ms
```
Though on my machine, the empty loop is the only case where I can reproduce 3.11 being slower; as soon as the loop has code in it, I don't see any performance regression.
> Here's a thought. Could it be the `_PyCode_Warmup()` call in `JUMP_BACKWARD`? That's an inline function, but still, it could be an extra memory load?
To see what's actually being run on 3.11+, pass `adaptive=True` to `dis`. The `JUMP_BACKWARD` is quickened into a `JUMP_BACKWARD_QUICK`, which skips the `_PyCode_Warmup` dance.
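For example, a sketch on 3.11 (where `dis` grew the `adaptive` parameter):

```python
import dis

def f():
    global i
    for i in range(10_000_000):
        pass

f()  # run once so the backward jumps warm up and quicken the code object
dis.dis(f, adaptive=True)  # on 3.11b3 this shows JUMP_BACKWARD_QUICK
```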
For those who can reproduce: how does the slowdown vary with the number of iterations? Is it constant, or does it grow larger/smaller for different range sizes?
The single-digit fast path for `PyLong_FromLong` did change slightly between 3.10 and 3.11. I don't see any obvious reason why the new one would be slower, though.
Do we see a similar slowdown when dropping the loop into C with:

```
pyperf timeit -s "from collections import deque; r = range(10_000_000)" "deque(r, 0)" --rigorous
```
This version simply exhausts the iterator with as little overhead as possible. It might give us an idea if this slowdown is happening in the VM or not.
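For context on the trick: a `deque` with `maxlen=0` consumes an iterable entirely in C while storing nothing, e.g.:

```python
from collections import deque

r = range(10_000_000)
deque(r, 0)  # the second argument is maxlen; every item is discarded immediately
```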
I tried it on Windows with 10k iterations and with 10M iterations, and the % difference between 3.10 and 3.11 is the same for each (9% or so).
The variant using deque also shows a comparable slowdown.
Hm, so that would seem to point to a slowdown in `range` iteration at the C level. I'm not aware of any changes to that in 3.11, so perhaps the construction of single-digit longs got a bit slower on certain builds?
Wouldn't it have had to be a lot slower to show a 10% slowdown on those three bytecode instructions?
Is the allocator at all under suspicion at this point?
> Wouldn't it have had to be a lot slower to show a 10% slowdown on those three bytecode instructions?
Depends what you consider “a lot slower”, I guess. All this code is doing is creating and destroying tons of single-digit integers. If that specific code got 10-15% slower, it could possibly make this code 5-10% slower without really affecting anything else.
> Is the allocator at all under suspicion at this point?
Maybe. I’m not sure I know enough about the object allocator to have any ideas there, but I would assume that a degradation like that would affect pretty much everything. Unless this is some edge case where we’re repeatedly allocating and freeing pools of memory or something.
Okay, I came back to this for a bit and I too can reproduce it on a Windows machine. For the record, I only tried the python.org installers.
I ran more experiments (using the `deque(r, 0)` trick to iterate cheaply), and here's what I found. The timings were pretty consistent across several runs:
- `range(2**29, 2**29 + 10_000_000)` (fast `range_iterator` yielding single-digit values)
- `range(2**30, 2**30 + 10_000_000)` (fast `range_iterator` yielding double-digit values)
- `range(2**31, 2**31 + 10_000_000)` (slow `longrange_iterator` yielding double-digit values)
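To see why those boundaries matter: with 30-bit digits, values below 2**30 fit in a single digit. A quick sketch (`ndigits` is a hypothetical helper, not a CPython API):

```python
import sys

BITS = sys.int_info.bits_per_digit  # 30 on these builds

def ndigits(n):
    # how many 30-bit digits CPython needs to store a positive int
    return max(1, -(-n.bit_length() // BITS))

print(ndigits(2**29))  # 1 -> single-digit values
print(ndigits(2**30))  # 2 -> double-digit values
print(ndigits(2**31))  # 2 -> double-digit values
```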
My best theory is still that single-digit `int` construction is the culprit. Why?
Anyways, I'll stop here, because I'm not really sure how to proceed further. Also, I'm not really convinced that this is severe enough to spend more time investigating. 🙃
If you want to isolate the cost of iterating over a `range`, try `all(range(1, 10_000_001))` (starting at 1, since 0 is falsy and would short-circuit `all()`), as it doesn't need to do any allocation.
There have been changes to `PyLongObject` as well. The code is a bit messy because `sizeof(long)` is 4 on Windows, instead of 8 on other systems. This might be related.
So does this mean that DIGIT is only 15 bits on Windows? That might explain a lot. (Though note that I also saw a 6% slowdown on my Mac -- I wasn't very rigorous there, though.)
> The code is a bit messy because `sizeof(long)` is 4 on Windows, instead of 8 on other systems. This might be related.
Can we use `int64_t` from C99 `<stdint.h>` then?
This might be a relevant change in 3.11: https://github.com/python/cpython/issues/89732 (from 15 bits sometimes to always 30 bits by default)
Can someone with access to the Windows tooling verify that there's a perf drop for this example around the commit that changed the digit size default?
Um, sorry: according to `sys.int_info`, the number of bits per digit hasn't changed on Windows between 3.10 and 3.11, so that's not it.
@brandtbucher is looking into definitive proof that it's in `PyLong_FromLong()` for single-digit values.
> This might be a relevant change in 3.11: https://github.com/python/cpython/issues/89732 (from 15 bits sometimes to always 30 bits by default)
Just checked, and both builds I was using to get these numbers have the same `sys.int_info`:
```
PS C:\Users\brandtbucher> py -3.11
Python 3.11.0b3 (main, Jun 1 2022, 13:29:14) [MSC v.1932 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.int_info
sys.int_info(bits_per_digit=30, sizeof_digit=4)
>>> exit()
PS C:\Users\brandtbucher> py -3.10
Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.int_info
sys.int_info(bits_per_digit=30, sizeof_digit=4)
>>> exit()
```
Oh, whoops. Crossed posts. Yeah, I'm going to try making the new `PyLong_FromLong` path for single-digit ints as close as possible to the old 3.10 one, and see if I can still reproduce.
As for Windows, the v143 tools in VS2022 cause this issue: the PGO build gets the same performance as the release build. It seems to me that the official binaries have been built with them since 3.11b1.
> As for Windows, the v143 tools in VS2022 cause this issue: the PGO build gets the same performance as the release build. It seems to me that the official binaries have been built with them since 3.11b1.
@zooba Can you confirm that?
I tried exercising the `deque` cases below additionally, to absorb the differences between v142 and v143, which did not go well. Those tools do not affect the macro benchmarks, though.

- `n = 2**29; deque(range(n, n + 10_000_000), 0)`
- `n = 2**30; deque(range(n, n + 10_000_000), 0)`
- `n = 2**31; deque(range(n, n + 10_000_000), 0)`
`n = 2**29; deque(range(n, n + 10_000_000), 0)`

| py | v143 | v142 | v143 trained | v142 trained |
|---|---|---|---|---|
| 3.11 | 120 ms (base) | 100 ms (80%) | 112 ms (93%) | 98.8 ms (78%) |
| 3.10 | 123 ms (102%) | 106 ms (87%) | 109 ms (89%) | 97.8 ms (77%) |

`n = 2**30; deque(range(n, n + 10_000_000), 0)`

| py | v143 | v142 | v143 trained | v142 trained |
|---|---|---|---|---|
| 3.11 | 153 ms (base) | 138 ms (89%) | 142 ms (92%) | 127 ms (79%) |
| 3.10 | 155 ms (101%) | 136 ms (88%) | 139 ms (90%) | 125 ms (78%) |

`n = 2**31; deque(range(n, n + 10_000_000), 0)`

| py | v143 | v142 | v143 trained | v142 trained |
|---|---|---|---|---|
| 3.11 | 714 ms (base) | 675 ms (94%) | 599 ms (81%) | 558 ms (72%) |
| 3.10 | 679 ms (95%) | 639 ms (88%) | 615 ms (84%) | 573 ms (75%) |
This is off-topic, but the coverage.py C tracer can also improve performance (5-7%) with some exercise on Windows.
> As for Windows, the v143 tools in VS2022 cause this issue: the PGO build gets the same performance as the release build. It seems to me that the official binaries have been built with them since 3.11b1.
Can confirm that 3.11 is using the latest tools, yeah, and I think 3.10 still tries to use the ones it originally released with. It shows up in the sys.version string: `MSC v.1932` vs `MSC v.1929` (the compiler version doesn't match the overall toolset version).
There are a number of other possibilities though. It may be that we need to adjust the profile to account for the changed interpreter loop. I assume it's not going to have the same proportion of "normal" vs. "unlikely" paths as it used to, which could affect training decisions. But that requires macro-benchmarks, not microbenchmarks, so it's less interesting to discuss ;-)
It would be easy enough to rule out the profile as the reason for the slowdown: Compile 3.11 with v142 and v143 tools and compare the result. IIUC that's exactly what @neonene did, and the "v143 trained" vs. "v142 trained" columns seem to indicate that code compiled with v143 is always slower than v142, both for 3.10 and for 3.11. IIUC this is with the same profile. Also, comparing the (untrained) "v142" and "v143" columns, we see the same thing -- for both 3.10 and 3.11, code compiled with v143 is slower than v142.
So that rather clearly points to v143 as the culprit. What am I missing that sheds doubt on this conclusion? Since we both work for Microsoft at this point it would make sense to invite some internal compiler backend expert to this discussion, right? And it's not unreasonable that there might be some regression here, given that the VS2022 compiler is still relatively young.
> So that rather clearly points to v143 as the culprit.
Good point, you're right.
We've had such a terrible time getting any traction from the compiler team on this. I suspect they're so overwhelmed with actual errors that a regression without a micro-reproduction can't justify the attention. Our best bet is probably to convince the team to include a CPython build and perf suite in their test runs.
I'm away from work for 2 weeks now, but Graham and/or Mike should be able to make the connections to start that discussion.
3.10.0 does not show this issue on my local PGO build with the v143 tool set, so I tried bisecting. Related commits on main (backported before 3.10.1):
`n = 2**29; deque(range(n, n + 10_000_000), 0)`

| 3.10.5+ (Jul 1) | as-is | without commit 1 |
|---|---|---|
| v143 PGO | 123 ms (base) | 105 ms (83%) |
| v142 PGO | 106 ms (84%) | 105 ms (83%) |

| 3.12a0 (Jul 5) | as-is | without 1,2 | without 1 | without 2 |
|---|---|---|---|---|
| v143 PGO | 125 ms (base) | 100 ms (75%) | 118 ms (94%) | 118 ms (94%) |
| v142 PGO | 99.3 ms (74%) | 99.8 ms (74%) | - | - |
| v143 release | 112 ms (88%) | 111 ms (87%) | - | - |
| v142 release | 117 ms (93%) | 118 ms (93%) | - | - |
Possibly naive question: could we make python.org 3.11 releases with the v142 compiler? Does that commit us to using that compiler for the life of the release, or are v142 and v143 ABI-compatible? The reason I'm asking is to understand the urgency: is this something that needs to be resolved before 3.11rc1? I imagine not, both because (a) using an older compiler is fine and (b) this benchmark isn't terribly representative of real-world code, so we'd be OK with this specific regression (even if it is a bona fide regression in the C compiler).
> are 142 and 143 ABI-compatible?

Yes, otherwise wheels on Windows would not work.
> Could we make python.org 3.11 releases with the v142 compiler?
This is possible although it would require changing the build files, basically a revert of https://github.com/python/cpython/pull/29577 would suffice.
How hard would it be to create a standalone C program to reproduce the regression? That would probably be the most helpful for the MSVC folks. If we suspect the allocator, maybe just something that simulates what the Python memory pools are doing in this instance?
> How hard would it be to create a standalone C program to reproduce the regression?

I suppose checking whether a simple static inline function with one or more branches, used heavily throughout a C file, gets inlined or not would be a good indication of MSVC's PGO behavior.
> If we suspect the allocator, maybe just something that simulates what the Python memory pools are doing in this instance?

You can set `PYTHONMALLOC=malloc` as an env var to make CPython use the libc malloc and avoid obmalloc being used.
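For example, a sketch that re-runs the deque benchmark with obmalloc bypassed (assumes pyperf is installed; the env var applies to the spawned interpreter):

```python
import os
import subprocess
import sys

# PYTHONMALLOC=malloc makes CPython use the libc allocator instead of obmalloc
env = dict(os.environ, PYTHONMALLOC="malloc")
subprocess.run([sys.executable, "-m", "pyperf", "timeit",
                "-s", "from collections import deque; r = range(10_000_000)",
                "deque(r, 0)"], env=env)
```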
`arena_map_*()` in `obmalloc.c` are the most frequent functions in the `deque` cases above. I think they feed into the heuristic MSVC uses to optimize `_PyEval_EvalFrameDefault()` when they are inlined there, via the chain below:
```
object_dealloc
  PyObject_Free
    _PyObject_Free
      pymalloc_free
        address_in_range      -> bool
          arena_map_is_used   -> int
            arena_map_get
              - PyMem_RawCalloc (optimized away)
              - PyMem_RawCalloc (optimized away)
```
```c
static bool
address_in_range(void *p, poolp pool)
{
    return arena_map_is_used(p);
}
```
Workaround to reduce the cost of the optimization: make `address_in_range()` and `arena_map_is_used()` return `int` consistently. Unfortunately, there was no improvement when I tried making them both return `bool`; it did save the profiling cost, though, by omitting `address_in_range()` from the PGO instrumentation build.
Alternative: make `arena_map_is_used()` not share `arena_map_get()` with the other functions, duplicating it in `obmalloc.c`. Isolating the PGO profile helps MSVC optimize away (smoothly) the `PyMem_RawCalloc()` calls that `arena_map_is_used()` never wants.
Maybe there are more effective ways to help MSVC v.1932.

EDIT: It seems that this issue is about whether or not `PyObject_Free()` inlines `arena_map_get()`, so specifying `__forceinline` is another way to fix it.
@neonene: What type of CPU have you been measuring on? Our internal compiler contact is seeing something different on AMD vs. Intel (and maybe even the exact SKU may matter).
A MS engineer suggested:

> Maybe forcing an alignment of the `usedpools` global from obmalloc.c helps, that is, the array indexed by the `LEA rcx, [rip + X]`:
>
> ```c
> static poolp usedpools[2 * ((NB_SMALL_SIZE_CLASSES + 7) / 8) * 8] = {
> ```
>
> Can try something like a 64-byte alignment, for example.

I'm not sure what the pure-C equivalent would be.
@mdboom Sorry, I keep details private, but I'm an Intel user.
I'm not sure my patch is correct, but I can build 3.11 with:

```c
__declspec(align(64)) static poolp usedpools[2 * ((NB_SMALL_SIZE_CLASSES ...
```

The performance did not change for me with `arena_map_get()` un-inlined.
> The performance did not change for me with `arena_map_get()` un-inlined.

Can you clarify -- did the 64-byte alignment not change performance, or did un-inlining `arena_map_get()` not change performance?
My 3.11 got faster when I marked `arena_map_get()` with `__forceinline`. And the binary code of the `PyObject_Free()` chain did not change with `__declspec(align(64))`; I think that patch is not related to the `deque` performance.

As for v142, I can reproduce the slowdown by marking `arena_map_get()` with `__declspec(noinline)`.
Hi,
I'm the "compiler contact" :) The builds I did before for Python were using the 3.10.5 tag, and with that, arena_map_get() is not getting inlined by either MSVC version; a particular load instruction taking longer seemed to be the only slowdown there, but I think it's not relevant here.
Tried now with 3.12a0, and I can confirm what @neonene noticed: __forceinlining arena_map_get with v143 shows a 15% speedup. There is some CSE of loads (store-to-load value propagation) that is enabled by the inlining.
Looking at the number of instructions in PyObject_Free also shows the benefit of the inlining: without inlining, PyObject_Free + arena_map_get = 82 + 48 = 130 instructions; with inlining, PyObject_Free = 75 instructions.
From what I know, there is currently no estimate of how much inlining a function helps an optimization like CSE; with PGO, inlining relies mostly on the PGO counters plus a few other simple heuristics.
There is a more recent compiler flag that is also worth trying, /Ob3 -- it increases the budget of allowed inlining and could help in other spots. For arena_map_get, __forceinline is for now still the best option to make sure it always gets inlined.
Thanks, Gratian
Thanks @gratianlup!
@neonene could you prepare a patch that adds the force inline to arena_map_get? Then we can put this to rest.
This is now fixed in main and in 3.11 (will be in 3.11b5). The backport to 3.10 was a little problematic, and won't be necessary unless we also switch compiler versions there.
EDITED: Py311 is 10-20% slower than Py38, Py39, and Py310 in big loops with empty bodies or assignments to a global variable, measured with time or timeit, even though it's faster in other functions and libraries.