Would it be a viable option to use Clang instead of MSVC for Windows?
I have got it compiling under Clang on Windows in a debug build.
FWIW my own measurements on Windows show much better perf than that, but are inconclusive due to noise (we're waiting for a dedicated perf machine running Windows). See https://github.com/faster-cpython/ideas/discussions/315#discussioncomment-2376669 .
@zooba Do you see a problem with using clang instead of MSVC on Windows, assuming we can get it working?
Wouldn't that mean that anyone building C extensions would also have to switch over to Clang?
If so, that would require some coordination (e.g. cibuildwheel, conda-forge, numpy/scipy) to make such a switch and support two compilers if you want to make C extensions available for multiple Python versions.
@ivoflipse
> Wouldn't that mean that anyone building C extensions would also have to switch over to Clang?
I don't know. I have no experience with this topic at all, which is why I'm asking @zooba for advice.
@kumaraditya303 Can you point to your branch here?
Yeah, that's a massive ecosystem change that we're not ready for. It almost certainly will affect the ABI in some way, though it might be manageable, and it'll definitely impact/break any build scripts that implicitly or explicitly assume MSVC. As I work through ARM64 support, I'm finding the answer is "many"... (unless, of course, it turns out to be perfectly safe to link between the two compilers, in which case it'll only affect static lib users, which is practically nobody AFAIK).
Figuring out for certain if the size of ceval is causing it to de-optimise and then sensibly breaking it up is much easier.
FWIW @neonene came up with a PR that temporarily enables a flag that shows the inlining decisions. I'll look into this.
I have a branch https://github.com/python/cpython/compare/main...sweeneyde:ops_to_funcs that converts several of the longer opcodes into functions, in case that is useful for comparison.
Macro benchmark improvements are unstable for me, ranging from 8% to 20% compared to 3.10.1+.
When building with PGO (only), changing the definition of `Py_UNREACHABLE()` could be an improvement -- for example, replacing `__assume(0)` with a no-return no-op function in `pymacro.h`. But `__assume(0)` is not the only suspect that stops the optimizer.

Regarding `__assume(0)`, recent PGO builds show the following pyperformance results. `__assume(0)` can block the project's efforts, especially on x64.
Sample patch:
```diff
--- Include\pymacro.h
+++ Include\pymacro.h
@@ -126 +126 @@
-#  define Py_UNREACHABLE() __assume(0)
+static inline void _Py_NO_RETURN Py_UNREACHABLE(void) {}
```
| f4b328e (20220408) | Py_UNREACHABLE | x64 PGO | x86 PGO |
|---|---|---|---|
| original | `__assume(0)` | 1.00 | 1.00 |
| | `void foo(void) {}` | 1.01x faster | 1.05x faster |
| fork (forceinline, etc.) | `__assume(0)` | 1.02x faster | 1.08x faster |
| | `void foo(void) {}` | 1.12x faster | 1.16x faster |
`Py_FatalError()` can be used instead:

| f4b328e | Py_UNREACHABLE | x64 PGO | |
|---|---|---|---|
| fork | `void foo(void) {}` | 1.00 | |
| | `Py_FatalError` | 1.01x slower | too common to be distinguished |
Release builds:

| f4b328e | Py_UNREACHABLE | x64 Release | x86 Release |
|---|---|---|---|
| original (/Ob3) | `__assume(0)` | 1.00 | 1.00 |
| | `void foo(void) {}` | 1.01x slower | 1.02x faster |
It is possible that another commit would reverse which variant is slower or faster.
Release vs PGO, leaving `__assume(0)` as it is:

| f4b328e | x64 | x86 |
|---|---|---|
| Release (/Ob3) | 1.00 | 1.00 |
| PGO | 1.02x faster | 1.02x slower |
3.10.1 (x64) could show a 17% gap between Release (/Ob3) and PGO.
I've figured out how to get assembly code from after the PGO/LTO passes run, but the output is hard to understand -- there are almost no symbols in the resulting file and absolutely no line numbers. I have to do this for python311.dll, and the output is 630,000 lines. The function `_PyEval_EvalFrameDefault` alone is over 8400 lines.

Anyway, the key thing I learned here is

```
dumpbin /disasm:nobytes python311.dll >disasm.txt
```

and then searching the output for `_PyEval_EvalFrameDefault:`.
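Since `dumpbin` has no per-function filter, locating that label in the 630,000-line file comes down to a plain text search, e.g. (a hypothetical invocation; `/n` prints line numbers, `/c:` searches for the literal string):

```
findstr /n /c:"_PyEval_EvalFrameDefault:" disasm.txt
```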
I was hoping that looking at that disassembly would answer questions like "did the switch get compiled to the optimal form" but I can barely even find where the switch starts. :-(
I found one trick that seems to help: looking at the same code in the Visual Studio debugger's Disassembly window. Using search on the raw disassembly I was able to confirm that the code there looked the same. I now believe that the top of the switch compiles to this code (on my machine, with VS 2019, in PGO+LTO mode):
```
    switch (opcode) {
00007FFE120A0C53  je          _PyEval_EvalFrameDefault+1CC9h (07FFE120A2659h)
00007FFE120A0C59  mov         rax,qword ptr [rsp+78h]
00007FFE120A0C5E  mov         eax,dword ptr [rax]
00007FFE120A0C60  test        eax,eax
00007FFE120A0C62  jne         _PyEval_EvalFrameDefault+51Eh (07FFE120A0EAEh)
00007FFE120A0C68  movzx       ecx,word ptr [r14]
00007FFE120A0C6C  mov         esi,ecx
00007FFE120A0C6E  movzx       edi,cl
00007FFE120A0C71  shr         esi,8
00007FFE120A0C74  or          edi,dword ptr [cframe]
00007FFE120A0C78  mov         qword ptr [rsp+30h],rsi
00007FFE120A0C7D  mov         qword ptr [rbp-50h],r15
00007FFE120A0C81  mov         r10,r14
00007FFE120A0C84  mov         r13,r15
00007FFE120A0C87  mov         r9,r14
00007FFE120A0C8A  mov         r12,r15
00007FFE120A0C8D  mov         rdx,r14
00007FFE120A0C90  mov         r8,r14
00007FFE120A0C93  mov         r11,r15
00007FFE120A0C96  cmp         edi,0FFh
00007FFE120A0C9C  ja          _guard_xfg_dispatch_icall_nop+2DCEh (07FFE121F1B0Eh)
00007FFE120A0CA2  lea         rcx,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE12060000h)]
00007FFE120A0CA9  movsxd      rax,edi
00007FFE120A0CAC  movsxd      rsi,esi
00007FFE120A0CAF  movzx       eax,byte ptr [rcx+rax+490E0h]
00007FFE120A0CB7  mov         ecx,dword ptr [rcx+rax*4+48E1Ch]
00007FFE120A0CBE  lea         rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE12060000h)]
00007FFE120A0CC5  add         rcx,rax
00007FFE120A0CC8  jmp         rcx
00007FFE120A0CCA  mov         r13,qword ptr [rsp+48h]
00007FFE120A0CCF  mov         rax,r14
00007FFE120A0CD2  sub         rax,qword ptr [rsp+40h]
00007FFE120A0CD7  add         r14,2
00007FFE120A0CDB  sar         rax,1
00007FFE120A0CDE  mov         dword ptr [r13+38h],eax
00007FFE120A0CE2  mov         rax,qword ptr [r13+rsi*8+48h]
00007FFE120A0CE7  test        rax,rax
00007FFE120A0CEA  je          _guard_xfg_dispatch_icall_nop+1AF7h (07FFE121F0837h)
00007FFE120A0CF0  inc         qword ptr [rax]
00007FFE120A0CF3  mov         qword ptr [r15],rax
00007FFE120A0CF6  add         r15,8
00007FFE120A0CFA  mov         qword ptr [rsp+50h],r15
00007FFE120A0CFF  movzx       eax,word ptr [r14]
00007FFE120A0D03  movzx       edi,al
00007FFE120A0D06  mov         esi,eax
00007FFE120A0D08  jmp         _PyEval_EvalFrameDefault+2E1h (07FFE120A0C71h)
    goto handle_eval_breaker;
```
(Which is weird, because that last line is not directly following the switch.)
Unfortunately, I don't know x86/x64 assembly, so I have no idea what this means.
Okay, the Disassembly window works a bit differently than I had understood before. It does not necessarily display all source lines, and/or it doesn't display them in order. I guess that's why it has to be so deeply integrated with the debugger. A more illustrative snippet:
```
    TARGET(LOAD_CONST) {
00007FFE120A0D0D  mov         rax,qword ptr [rsp+30h]
00007FFE120A0D12  mov         rsi,r14
        PREDICTED(LOAD_CONST);
        PyObject *value = GETITEM(consts, oparg);
00007FFE120A0D15  sub         rsi,qword ptr [rsp+40h]
00007FFE120A0D1A  mov         r13,qword ptr [rsp+48h]
00007FFE120A0D1F  sar         rsi,1
00007FFE120A0D22  mov         rcx,qword ptr [rbp-38h]
00007FFE120A0D26  add         r14,2
00007FFE120A0D2A  mov         dword ptr [r13+38h],esi
00007FFE120A0D2E  cdqe
00007FFE120A0D30  mov         rcx,qword ptr [rcx+rax*8+18h]
        Py_INCREF(value);
00007FFE120A0D35  inc         qword ptr [rcx]
        PUSH(value);
00007FFE120A0D38  mov         qword ptr [r15],rcx
00007FFE120A0D3B  add         r15,8
00007FFE120A0D3F  mov         qword ptr [rsp+50h],r15
        DISPATCH();
00007FFE120A0D44  jmp         _PyEval_EvalFrameDefault+2D8h (07FFE120A0C68h)
        DISPATCH();
    }
```
That corresponds to this source code:
```c
TARGET(LOAD_CONST) {
    PREDICTED(LOAD_CONST);
    PyObject *value = GETITEM(consts, oparg);
    Py_INCREF(value);
    PUSH(value);
    DISPATCH();
}
```
(I'm not sure why the `DISPATCH()` call is shown twice; probably a small bug.)

Tracing through the `DISPATCH()` call one instruction at a time, I land in the big block from the previous comment. I wonder if that macro just does too much and all of it has been consolidated into that big block.
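For reference, in the switch-based build `DISPATCH()` boils down to roughly this (a simplified sketch, not the exact ceval.c definition, which also deals with tracing and the eval breaker):

```c
#define DISPATCH() \
    do { \
        NEXTOPARG();           /* fetch the next opcode and oparg */ \
        goto dispatch_opcode;  /* jump back to the switch at the top */ \
    } while (0)
```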
Also note that I did all this in the PR branch where I am experimenting with turning INCREF/DECREF back into macros, in ceval.c only.
@neonene I am not ignoring you, far from it! I would like to dig into the `Py_UNREACHABLE()` situation, but the issue is, as you have said yourself, that the benchmarks are very noisy.
There are only two calls to it in ceval.c. How can we prove that the optimizer really does get distracted by those? Note that the two are not similar:
One call is here:
```c
TARGET(CACHE) {
    Py_UNREACHABLE();
}
```
which is only unreachable because the bytecode compiler never generates this opcode -- but the optimizer cannot know that (though PGO can deduce it, since it's never executed).
The other is here:
```c
    goto error;
} /* End instructions */

/* This should never be reached. Every opcode should end with DISPATCH()
   or goto error. */
Py_UNREACHABLE();
```
Here the optimizer should know the code is really unreachable, if all cases indeed do end with `DISPATCH()` or `goto`.

Can we use this knowledge somehow? Maybe manually put calls to `Py_FatalError()` instead?
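Concretely, that suggestion would look something like this sketch (the message string is made up):

```c
TARGET(CACHE) {
    /* Crash loudly instead of telling the optimizer this is unreachable;
       unlike __assume(0), this gives the optimizer an ordinary cold call
       to reason about. */
    Py_FatalError("executed a CACHE pseudo-instruction");
}
```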
PS. Can you link us to your branch?
Could anyone run benchmarks on the same version (timing) without code changes, putting my posts aside? My English reading/writing is too slow to reply about `__assume(0)` at this time.

Comparing Release with PGO should be easier than 3.10 PGO vs 3.11 PGO. Using `/Ob3` would help keep the inlining and the performance consistent on Release builds, which should already be far faster than 3.10 PGO:
```diff
--- PCbuild/pythoncore.vcxproj
+++ PCbuild/pythoncore.vcxproj
@@ -102 +102 @@
-    <AdditionalOptions>/Zm200 %(AdditionalOptions)</AdditionalOptions>
+    <AdditionalOptions>/Zm200 /Ob3 %(AdditionalOptions)</AdditionalOptions>
```
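For anyone reproducing these comparisons, both configurations come from CPython's standard Windows build script (these are existing `build.bat` options, shown here as a usage sketch):

```
PCbuild\build.bat -c Release   :: plain Release build
PCbuild\build.bat --pgo        :: profile-guided optimized build
```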
> on the same version

Any commit is fine. I don't mean the same version as mine.
@neonene I still don't have the capacity to run benchmarks without a lot of noise.
The rest here is lab notes:
I learned a bit of x64 assembler and I think I have a better grip on how `switch (opcode)` is translated. Here's the blob of code:
```
00007FFE0DA50CA2  lea         rcx,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE0DA10000h)]
00007FFE0DA50CA9  movsxd      rax,edi
00007FFE0DA50CAC  movsxd      rsi,esi
00007FFE0DA50CAF  movzx       eax,byte ptr [rcx+rax+490E0h]
00007FFE0DA50CB7  mov         ecx,dword ptr [rcx+rax*4+48E1Ch]
00007FFE0DA50CBE  lea         rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE0DA10000h)]
00007FFE0DA50CC5  add         rcx,rax
00007FFE0DA50CC8  jmp         rcx
```
Let me narrate that. (Register names like EAX are aliases for the lower 32 bits of RAX -- the latter is a 64-bit register.) At the start, register EDI contains the opcode and ESI has the oparg (the result of the `NEXTOPARG()` macro).
```
00007FFE0DA50CA2  lea rcx,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE0DA10000h)]  ; rcx = a base address in the DLL
00007FFE0DA50CA9  movsxd rax,edi                        ; rax = opcode, sign-extended to 64 bits
00007FFE0DA50CAC  movsxd rsi,esi                        ; rsi = oparg, sign-extended to 64 bits
00007FFE0DA50CAF  movzx eax,byte ptr [rcx+rax+490E0h]   ; first load: byte table indexed by opcode
00007FFE0DA50CB7  mov ecx,dword ptr [rcx+rax*4+48E1Ch]  ; second load: dword table indexed by that byte
00007FFE0DA50CBE  lea rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FFE0DA10000h)]  ; rax = the same base address
00007FFE0DA50CC5  add rcx,rax                           ; rcx = base + offset
00007FFE0DA50CC8  jmp rcx                               ; jump to the opcode handler
```
So we have two tables, call them `uchar blob1[]` and `uint32 blob2[]`. We do effectively this:

```c
goto anchor + blob2[blob1[opcode]];
```
The critical point here is that we have two memory loads, where the second one is dependent on the first one:

```
00007FFE0DA50CAF  movzx       eax,byte ptr [rcx+rax+490E0h]
00007FFE0DA50CB7  mov         ecx,dword ptr [rcx+rax*4+48E1Ch]
```
So this is very far from the "computed goto" we have with GCC and Clang, nor is it anywhere near the "use opcode as jump index" we were hoping for (that would be using `blob2[opcode]` rather than `blob2[blob1[opcode]]`).
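As a minimal C model of what the generated code computes (all names, table contents, and addresses invented for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Model of MSVC's two-level switch dispatch: blob1 compacts the 256
 * possible opcode values into dense case indices, and blob2 maps a case
 * index to the handler's offset from the image base. The second load
 * depends on the result of the first, which is the cost discussed above. */
static const uint8_t blob1[256] = { [9] = 1, [100] = 2 };  /* others -> 0 (default) */
static const int32_t blob2[3] = { 0x100, 0x250, 0x4A0 };   /* fake handler offsets */

int main(void)
{
    int opcode = 100;
    uintptr_t image_base = 0x7FFE0DA10000ULL;  /* pretend DLL base */
    uintptr_t target = image_base + (uintptr_t)blob2[blob1[opcode]];
    printf("jmp 0x%llx\n", (unsigned long long)target);  /* the final `jmp rcx` */
    return 0;
}
```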
Given enough time and monkeys I can now answer any question about optimization. :-)
Another observation relates to `DISPATCH()`: it seems that whenever the optimizer sees this macro it just jumps to a location a bit above the code I just narrated. That's probably fine.
Next I'm going to see what things look like on the main branch (so far I used my "inline" branch from https://github.com/python/cpython/pull/32387).
With a Release build from main, the switch(op) code looks pretty much the same.
With a Debug build a bit of clarity is obtained, but not much -- now we see this:
```
00007FFE0D8CFC21  lea         rcx,[__ImageBase (07FFE0D2D0000h)]
00007FFE0D8CFC28  movzx       eax,byte ptr [rcx+rax+61BFCCh]
00007FFE0D8CFC30  mov         eax,dword ptr [rcx+rax*4+61BD04h]
00007FFE0D8CFC37  add         rax,rcx
00007FFE0D8CFC3A  jmp         rax
```
This confirms my hunch that the `lea` instruction loads a base offset for some big block (now named `__ImageBase` -- the LTO seems to just have moved the base around so the offsets in the `mov` instructions fit in fewer bits), but the rest is the same -- we're still effectively doing `goto imagebase + blob2[blob1[opcode]]`.
Thanks for digging into it! I'll also look into it further with my debugger.
So the dispatch is effectively (using gcc syntax) `goto *(base + offset_table[compact_table[opcode]])`, where `compact_table` maps all opcodes to themselves and all other values to the default?
If we were to drop the `default:` and generate cases for all undefined opcodes:

```c
case X:
    oparg = X;  /* X needs to be a constant, not `opcode`, so that the cases cannot be trivially merged */
    goto unknown_opcode;
```

then would the compiler drop the `compact_table`, giving us `goto *(base + offset_table[opcode])`?
@gvanrossum Could you try this out?
I wonder whether `goto *(base + offset_table[opcode])` is faster or slower than `goto *address_table[opcode]`? The `offset_table` is half the size, reducing cache pressure, but it needs `base` to be loaded.

If it turns out that `goto *(base + offset_table[opcode])` is faster, then we can use that for non-Windows builds.
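For the non-Windows side of that question, here is a compilable sketch of the address-table form (GCC/Clang labels-as-values, the same pattern CPython's computed-goto build uses; the offset-table variant would instead store 32-bit label offsets and add `base` back before the jump):

```c
#include <stdint.h>
#include <stdio.h>

/* Tiny three-opcode interpreter dispatching through a full-size address
 * table: one 8-byte load per dispatch and no base add. */
static int run(const uint8_t *code)
{
    static const void *address_table[3] = { &&op_nop, &&op_inc, &&op_halt };
    int acc = 0;
    goto *address_table[*code];
op_nop:
    code++;
    goto *address_table[*code];
op_inc:
    acc++;
    code++;
    goto *address_table[*code];
op_halt:
    return acc;
}

int main(void)
{
    static const uint8_t prog[] = { 1, 1, 0, 1, 2 };  /* inc, inc, nop, inc, halt */
    printf("%d\n", run(prog));                        /* prints 3 */
    return 0;
}
```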
One thing that is not clear. Is the computed jmp performed at the end of each instruction, or just at one location?
> One thing that is not clear. Is the computed jmp performed at the end of each instruction, or just at one location?

See Guido's comment above:

> Another observation relates to `DISPATCH()`: it seems that whenever the optimizer sees this macro it just jumps to a location a bit above the code I just narrated. That's probably fine.
It looks like it follows the C code pretty literally, jumping up to `dispatch_opcode`.
Half-baked idea: see how the different compilers handle a hack-y form of computed goto, where a switch over the opcode labels is stuffed into the `DISPATCH()` macro:
```c
#define DISPATCH_GOTO() \
    switch (opcode) { \
    case NOP: goto TARGET_NOP; \
    case RESUME: goto TARGET_RESUME; \
    ... \
    }
```
I'm not sure I like it, though, since small changes (in the code or the compiler) could easily result in code size explosion.
I added dummy cases for all non-existent opcodes, but sadly this did not convince the compiler to skip the indirection table. The switch code looks the same as before; the only difference is that there are a few other instructions between the first and the second memory load:
```
00007FF92E838B09  lea         ecx,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FF92E820000h)]
00007FF92E838B0F  movsxd      rax,edi
00007FF92E838B12  movsxd      r14,r15d
00007FF92E838B15  mov         r11,r12
00007FF92E838B18  mov         qword ptr [rbp-68h],r12
00007FF92E838B1C  mov         r9,r13
00007FF92E838B1F  mov         qword ptr [rbp-60h],r12
00007FF92E838B23  mov         rdx,r13
00007FF92E838B26  movzx       eax,byte ptr [rcx+rax+1F778h]   ; *** FIRST LOAD ***
00007FF92E838B2E  mov         r8,r13
00007FF92E838B31  mov         qword ptr [rsp+50h],r12
00007FF92E838B36  mov         rsi,r12
00007FF92E838B39  mov         qword ptr [rbp-50h],r14
00007FF92E838B3D  mov         ecx,dword ptr [rcx+rax*4+1F4B4h] ; *** SECOND LOAD ***
00007FF92E838B44  lea         rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FF92E820000h)]
00007FF92E838B4B  add         rcx,rax
00007FF92E838B4E  jmp         rcx
```
I don't think this is worth the hassle.
> Half-baked idea: see how the different compilers handle a hack-y form of computed goto, where a switch over the opcode labels is stuffed into the `DISPATCH()` macro.

This worries me because we'd have 150+ copies of that switch.
At this point I don't think it's very useful to try and argue with the compiler, PGO, LTO, or the CPU -- the one area where I think we have some leverage is some common inline functions that fail to be inlined (IS_TYPE, DECREF, and load_atomic_relaxed[int32]). For that I think we need https://github.com/python/cpython/pull/32387.
@markshannon
(So I tried out the dummy cases but it didn't get rid of the offset_table, it just moved the memory access up a bit. See previous comment.)
> I wonder whether `goto *(base + offset_table[opcode])` is faster or slower than `goto *address_table[opcode]`? The offset_table is half the size, reducing cache pressure, but it needs base to be loaded.
Here `base` is actually a constant; it uses a `lea` instruction (Load Effective Address) to get it into a register, like this (though modified by PGO+LTO):

```
00007FFE0D8CFC21  lea         rcx,[__ImageBase (07FFE0D2D0000h)]
```

That's just a constant load, and while I suspect (since this is in a DLL) it's actually IP-relative, the CPU probably has a dedicated adder to compute that. So my guess is that the compiler writers know what they are doing, using 32-bit offsets. But they still want to avoid dummy entries there, hence the offset_table.
Hmm, maybe we can avoid that by making all the cases actually do something different.
> Hmm, maybe we can avoid that by making all the cases actually do something different.
Nope, it still uses the offset_table. I'm giving up on this. Sorry to get your hopes up during our meeting earlier today.
> I'm giving up on this.
Could you publish the branches with your code and the resulting assembler output, so that others can pick up where you left off?
I will try to create branches later this week.
Looking at this again I realized that `opcode` is declared as `int`, so maybe that's why the switch doesn't optimize as expected. I'll look into this some more, starting by changing the switch line to this:

```c
switch ((unsigned char)opcode) {
```
Well I'll be darned. Now the code effectively boils down to this, where initially `rcx` is the opcode:
```
00007FF8FF9A7854  lea         rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FF8FF970000h)]
00007FF8FF9A7871  mov         eax,dword ptr [rax+rcx*4+3F9D4h]
00007FF8FF9A787B  lea         rcx,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FF8FF970000h)]
00007FF8FF9A7886  add         rax,rcx
00007FF8FF9A788F  jmp         rax
```
(I left some instructions out that are interleaved but don't touch any of these and seem to be related to flushing registers to the stack or moving them around.)
Crucially, the kind of instruction I found in last week's investigations is nowhere to be seen:

```
movzx       eax,byte ptr [rcx+rax+1F778h]
```

(which was the "compact_table" from above).
Thus the new code is now equivalent to

```c
goto *(base + offset_table[opcode]);
```

which is one memory load less than before (where opcode was an `int`).
Let me see if we actually need the dummy cases.
Yeah, with just a `default:` label, the extra memory load is back:
```
00007FF8F031789A  movzx       eax,byte ptr [rcx+rax+3DD70h]
00007FF8F03178A2  mov         ecx,dword ptr [rcx+rax*4+3DA9Ch]
00007FF8F03178A9  lea         rax,[`__local_stdio_scanf_options'::`2'::_OptionsStorage (07FF8F02E0000h)]
00007FF8F03178B0  add         rcx,rax
00007FF8F03178B3  jmp         rcx
```
Next I'll investigate whether the dummy cases need to contain any code.
Fortunately, the dummy cases don't need to do anything. I will prepare a separate PR for this.
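Putting Mark's earlier suggestion together with this finding, the resulting shape is presumably something like the following sketch (illustrative opcode values only; see the PR below for the real change):

```c
switch ((unsigned char)opcode) {
    /* ... all the real TARGET(...) cases ... */

    /* Explicit empty cases for every otherwise-unused opcode value
       replace the `default:` label, so all 256 values are covered and
       the compiler can index the jump table by opcode directly. */
    case 170:
    case 171:
    /* ... */
    case 255:
        goto unknown_opcode;
}
```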
Here's the PR. (Code is included in comments above.) https://github.com/python/cpython/pull/91718
This approach makes the `PREDICT()` macro unnecessary for me.

3.10.x PGO does not need a fix around `switch(opcode)`. We can just compare 3.11's performance with current 3.10.
I've done the best measurements I can on my corporate laptop, and I conclude that on Windows 3.11 is now ~20% faster than 3.10. This is main at revision 2f233fceae.
Things I did to stabilize the benchmarks:
"I close the issue."
PS. I now also benchmarked 3.11 just before my first Windows perf commit (i.e., just before GH-91719). That one was only 7% faster than 3.10.4, and current head is 13% faster than that. So one or both of these did have a big effect!
I ran one more set of benchmarks, and conclude that the switch improvement (GH-91719) contributed 3% and the macro-fied inline functions (GH-89279) contributed the remaining 10%.
This is assuming none of the intervening commits contributed performance. The list of commits including the endpoints:
- `locale.getencoding()` instead of getpreferredencoding (GH-91732)
There was a report that on Windows the performance on the PyPerformance benchmark is only 3% better than for 3.10. We should try to confirm or refute this result, and if confirmed, attempt to improve the situation.
The likely culprit is the optimizer (or PGO or LTO) giving up on inlining in the main function in ceval.c. In particular, it might be the INCREF/DECREF inline functions. A possible hack would be to turn these back into macros again at the top of ceval.c (if no debugging options are defined).
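As an illustration of that hack, shadowing the inline function with a classic macro at the top of ceval.c might look roughly like this (a sketch of the idea only, not the actual patch in GH-32387):

```c
/* Only in release builds, where no refcount-debugging hooks are active. */
#if !defined(Py_DEBUG) && !defined(Py_REF_DEBUG)
#  undef Py_DECREF
#  define Py_DECREF(op) \
    do { \
        PyObject *_tmp = _PyObject_CAST(op); \
        if (--_tmp->ob_refcnt == 0) { \
            _Py_Dealloc(_tmp); \
        } \
    } while (0)
#endif
```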