Closed adamsitnik closed 2 years ago
Tagging subscribers to this area: @maryamariyan See info in area-owners.md if you want to be subscribed.
Author: | adamsitnik |
---|---|
Assignees: | - |
Labels: | `os-windows`, `tenet-performance`, `area-Extensions-Logging` |
Milestone: | - |
that spike has too many commits to identify 😢 (missing data for ~7 days)
After doing git bisect, this seems to be the offending commit https://github.com/dotnet/runtime/pull/41897
@jkoritzinsky @AndyAyersMS do you think this regression is related to #41897?
As part of that issue, our PGO data got out of date, so I would not be amazed to have seen a regression then. However, now that we've collected new PGO data for coreclr and clrjit, I would expect that any regression from #41897 would have been fixed since then when we finally got the PGO collection pipeline back online.
@DrewScoggins could you please validate if the regression in this issue is gone after PGO fix?
Seems like we need more analysis here. Does anyone have a profile indicating where this test spends its time? I can't tell offhand if it is in jitted code or the native parts of the runtime.
PGO information got updated in January so if this was PGO related it would have cleared up long ago. So it is probably something else.
@jkoritzinsky I'll assign this issue to you as remarked before it is a regression caused by #41897. Please let's know if you think anything related to logging here. Thanks!
Tagging subscribers to this area: @dotnet/runtime-infrastructure See info in area-owners.md if you want to be subscribed.
Author: | adamsitnik |
---|---|
Assignees: | jkoritzinsky |
Labels: | `os-windows`, `tenet-performance`, `area-Infrastructure` |
Milestone: | 6.0.0 |
There's nothing related to logging in that PR. That PR was a large infrastructure change to use a different native build orchestrator (Ninja instead of MSBuild). There's little to no managed build changes there. I agree with @AndyAyersMS that we'll need a trace to see where the benchmark spends its time to be able to dig down into why there is still a regression today.
...we'll need a trace to see where the benchmark spends its time to be able to dig down into why there is still a regression today.
@jkoritzinsky Is the action on this in your court, or do you need this trace from someone else?
I can try to get the trace, but I'd prefer if someone who is more familiar with how we have the process set up to run against a recent runtime build could collect it instead.
I can try to get the trace, but I'd prefer if someone who is more familiar with how we have the process set up to run against a recent runtime build could collect it instead.
@adamsitnik Can you help out with getting this started please?
I can get some traces collected and take a first cut at them.
So I have collected traces, and you can find them here along with the copies of the runtime used to generate them, including PDBs. https://pvscmdupload.blob.core.windows.net/drewtest/59267.zip
My initial investigation shows that the regression is coming from more time spent in JIT_VirtualFunctionPointer
and GenericHandleCacheTraits::CompareKeys
. I dug into the disassembly of coreclr.dll for both, and I noticed a few thing. First we are allocating 0x20 more bytes on the stack than we were before. We are also now directly calling JitGenericHandleCacheTraits::CompareKeys
as part of the execution of the JIT_VirtualFunctionPointer
. Also looking at the code before we get into the search loop we seem to setup the cache more efficiently, with fewer moves.
I am not certain as to why this orchestration change caused this. My best guess is that we are not passing the same set of parameters to the native compiler any longer, and as such are getting different code out. Maybe someone else can speak with more authority on that.
Baseline Disasm
*************************************************************
* FUNCTION
*************************************************************
void * __fastcall JIT_VirtualFunctionPointer (Object *
void * RAX:8 <RETURN>
Object * RCX:8 param_1
CORINFO_CLASS_ RDX:8 param_2
CORINFO_METHOD R8:8 param_3
?JIT_VirtualFunctionPointer@@YAPEAXPEAVObject@ XREF[3]: 1803da588 (*) , 180400848 (*) ,
JIT_VirtualFunctionPointer 1804b7f50 (*)
1801083c0 48 89 5c MOV qword ptr [RSP + 0x8 ],RBX
24 08
1801083c5 48 89 74 MOV qword ptr [RSP + 0x10 ],RSI
24 10
1801083ca 57 PUSH RDI
1801083cb 48 83 ec SUB RSP ,0x20
20
1801083cf 49 8b f0 MOV RSI ,R8
1801083d2 4c 8b da MOV R11 ,RDX
1801083d5 48 85 c9 TEST RCX ,RCX
1801083d8 74 44 JZ LAB_18010841e
1801083da 48 8b 19 MOV RBX ,qword ptr [RCX ]
1801083dd 33 d2 XOR EDX ,EDX
1801083df 8b c6 MOV EAX ,ESI
1801083e1 45 8b d3 MOV R10D ,R11D
1801083e4 c1 c8 05 ROR EAX ,0x5
1801083e7 03 c3 ADD EAX ,EBX
1801083e9 41 c1 c2 ROL R10D ,0x5
05
1801083ed 44 03 d0 ADD R10D ,EAX
1801083f0 48 8b 05 MOV RAX ,qword ptr [g_pJitGenericHandleCache ] = NaP
61 56 39
00
1801083f7 4c 8b 40 MOV R8 ,qword ptr [RAX + 0x20 ]
20
1801083fb 41 8b c2 MOV EAX ,R10D
1801083fe 41 f7 70 DIV dword ptr [R8 + 0x8 ]
08
180108402 49 8b 00 MOV RAX ,qword ptr [R8 ]
180108405 48 8b 04 MOV RAX ,qword ptr [RAX + RDX *0x8 ]
d0
180108409 48 85 c0 TEST RAX ,RAX
18010840c 74 10 JZ LAB_18010841e
18010840e 66 90 NOP
LAB_180108410 XREF[1]: 18010841c (j)
180108410 44 39 50 CMP dword ptr [RAX + 0x8 ],R10D
08
180108414 74 23 JZ LAB_180108439
LAB_180108416 XREF[4]: 18010843d (j) , 180108443 (j) ,
180108449 (j) , 18010844f (j)
180108416 48 8b 00 MOV RAX ,qword ptr [RAX ]
180108419 48 85 c0 TEST RAX ,RAX
18010841c 75 f2 JNZ LAB_180108410
LAB_18010841e XREF[2]: 1801083d8 (j) , 18010840c (j)
18010841e 4c 8b c6 MOV R8 ,RSI
180108421 49 8b d3 MOV RDX ,R11
180108424 e8 87 00 CALL JIT_VirtualFunctionPointer_Framed void * JIT_VirtualFunctionPointe
00 00
180108429 48 8b 5c MOV RBX ,qword ptr [RSP + 0x30 ]
24 30
18010842e 48 8b 74 MOV RSI ,qword ptr [RSP + 0x38 ]
24 38
180108433 48 83 c4 ADD RSP ,0x20
20
180108437 5f POP RDI
180108438 c3 RET
LAB_180108439 XREF[1]: 180108414 (j)
180108439 48 39 58 CMP qword ptr [RAX + 0x18 ],RBX
18
18010843d 75 d7 JNZ LAB_180108416
18010843f 4c 39 58 CMP qword ptr [RAX + 0x20 ],R11
20
180108443 75 d1 JNZ LAB_180108416
180108445 48 39 70 CMP qword ptr [RAX + 0x28 ],RSI
28
180108449 75 cb JNZ LAB_180108416
18010844b f6 40 30 TEST byte ptr [RAX + 0x30 ],0x1
01
18010844f 74 c5 JZ LAB_180108416
180108451 48 8b 40 MOV RAX ,qword ptr [RAX + 0x10 ]
10
180108455 48 8b 5c MOV RBX ,qword ptr [RSP + 0x30 ]
24 30
18010845a 48 8b 74 MOV RSI ,qword ptr [RSP + 0x38 ]
24 38
18010845f 48 83 c4 ADD RSP ,0x20
20
180108463 5f POP RDI
180108464 c3 RET
Compare Disasm
*************************************************************
* FUNCTION
*************************************************************
void * __fastcall JIT_VirtualFunctionPointer (Object *
void * RAX:8 <RETURN>
Object * RCX:8 param_1
CORINFO_CLASS_ RDX:8 param_2
CORINFO_METHOD R8:8 param_3
?JIT_VirtualFunctionPointer@@YAPEAXPEAVObject@ XREF[3]: 1803b75a6 (*) , 1803ebf28 (*) ,
JIT_VirtualFunctionPointer 1804a4a04 (*)
1800f8190 48 8b c4 MOV RAX ,RSP
1800f8193 48 89 58 MOV qword ptr [RAX + 0x8 ],RBX
08
1800f8197 48 89 70 MOV qword ptr [RAX + 0x10 ],RSI
10
1800f819b 57 PUSH RDI
1800f819c 48 83 ec SUB RSP ,0x40
40
1800f81a0 49 8b f8 MOV RDI ,R8
1800f81a3 48 8b f2 MOV RSI ,RDX
1800f81a6 4c 8b d9 MOV R11 ,RCX
1800f81a9 48 85 c9 TEST RCX ,RCX
1800f81ac 74 7c JZ LAB_1800f822a
1800f81ae 4c 8b 09 MOV R9 ,qword ptr [RCX ]
1800f81b1 8b df MOV EBX ,EDI
1800f81b3 48 89 50 MOV qword ptr [RAX + -0x20 ],RDX
e0
1800f81b7 33 d2 XOR EDX ,EDX
1800f81b9 4c 89 40 MOV qword ptr [RAX + -0x18 ],R8
e8
1800f81bd 4c 89 48 MOV qword ptr [RAX + -0x28 ],R9
d8
1800f81c1 48 c7 40 MOV qword ptr [RAX + -0x10 ],0x1
f0 01 00
00 00
1800f81c9 8b c6 MOV EAX ,ESI
1800f81cb c1 c0 05 ROL EAX ,0x5
1800f81ce 41 03 c1 ADD EAX ,R9D
1800f81d1 c1 cb 05 ROR EBX ,0x5
1800f81d4 03 d8 ADD EBX ,EAX
1800f81d6 48 8b 05 MOV RAX ,qword ptr [g_pJitGenericHandleCache ] = NaP
8b 27 39
00
1800f81dd 4c 8b 40 MOV R8 ,qword ptr [RAX + 0x20 ]
20
1800f81e1 8b c3 MOV EAX ,EBX
1800f81e3 41 f7 70 DIV dword ptr [R8 + 0x8 ]
08
1800f81e7 49 8b 00 MOV RAX ,qword ptr [R8 ]
1800f81ea 4c 8b 14 MOV R10 ,qword ptr [RAX + RDX *0x8 ]
d0
1800f81ee eb 14 JMP LAB_1800f8204
LAB_1800f81f0 XREF[1]: 1800f820f (j)
1800f81f0 48 8d 54 LEA RDX ,[RSP + 0x20 ]
24 20
1800f81f5 49 8b ca MOV RCX ,R10
1800f81f8 e8 3f 00 CALL JitGenericHandleCacheTraits::CompareKeys int CompareKeys(EEHashEntry * pa
00 00
1800f81fd 85 c0 TEST EAX ,EAX
1800f81ff 75 10 JNZ LAB_1800f8211
LAB_1800f8201 XREF[1]: 1800f820d (j)
1800f8201 4d 8b 12 MOV R10 ,qword ptr [R10 ]
LAB_1800f8204 XREF[1]: 1800f81ee (j)
1800f8204 4d 85 d2 TEST R10 ,R10
1800f8207 74 08 JZ LAB_1800f8211
1800f8209 41 39 5a CMP dword ptr [R10 + 0x8 ],EBX
08
1800f820d 75 f2 JNZ LAB_1800f8201
1800f820f eb df JMP LAB_1800f81f0
LAB_1800f8211 XREF[2]: 1800f81ff (j) , 1800f8207 (j)
1800f8211 4d 85 d2 TEST R10 ,R10
1800f8214 74 14 JZ LAB_1800f822a
1800f8216 49 8b 42 MOV RAX ,qword ptr [R10 + 0x10 ]
10
LAB_1800f821a XREF[1]: 1800f8238 (j)
1800f821a 48 8b 5c MOV RBX ,qword ptr [RSP + 0x50 ]
24 50
1800f821f 48 8b 74 MOV RSI ,qword ptr [RSP + 0x58 ]
24 58
1800f8224 48 83 c4 ADD RSP ,0x40
40
1800f8228 5f POP RDI
1800f8229 c3 RET
LAB_1800f822a XREF[2]: 1800f81ac (j) , 1800f8214 (j)
1800f822a 4c 8b c7 MOV R8 ,RDI
1800f822d 48 8b d6 MOV RDX ,RSI
1800f8230 49 8b cb MOV RCX ,R11
1800f8233 e8 4c 00 CALL JIT_VirtualFunctionPointer_Framed void * JIT_VirtualFunctionPointe
00 00
1800f8238 eb e0 JMP LAB_1800f821a
From my selfish perspective the good news is that this regression apparently hasn't been triggered by my recent test build script cleanup change ;-). Other than that, these days I spend hours poring over iDNA analyses of CoreCLR runtime behavior and I'm trying to use them to better understand CoreCLR runtime startup costs; from that perspective these are certainly super-interesting pointers, thanks Drew! And have a great weekend too, by the way.
Tomas
Thanks, @DrewScoggins!
@trylek -- does this give you any concern for 6.0 GA, or can we go ahead and move this to the 7.0.0 milestone for further investigation?
Frankly speaking I don't think we have too much of a wiggle room. Naturally putting new features in incurs some perf penalties and making code paths more complex in the CoreCLR runtime always gnaws at our perf, I recall David Wrighton had to pull off a couple tricks to compensate for the runtime slowdown due to the newly introduced static virtual methods. The results of Drew's investigation seem to point at subtle changes in JIT behavior so I guess someone like @AndyAyersMS should chime in as to what is expected vs. unexpected; my personal belief is that .NET Core 6 pulled off a bunch of cool tricks to boost perf in spite of getting more complex and the last time I heard that was also reflected in various benchmarks and 3rd party developer findings so from that perspective I believe it's reasonable to postpone completion of the perf regression investigation to .NET 7.
Thanks! I'll move this to 7.0.0 and put it into the codegen area for further triage/investigation. If this is a recognized but acceptable trade-off, that's perfectly fine.
Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.
Author: | adamsitnik |
---|---|
Assignees: | adamsitnik |
Labels: | `os-windows`, `tenet-performance`, `area-CodeGen-coreclr`, `needs further triage` |
Milestone: | 6.0.0 |
That matches my understanding unless Andy Ayers notices something unexpected w.r.t. JIT behavior.
That matches my understanding unless Andy Ayers notices something unexpected w.r.t. JIT behavior.
The only thing I would add I this change is in native code (coreclr.dll). I am a little suspicious that a change that looks like different codegen from the native compiler happened just after a big orchestrator change.
I assume it was checked during the change, but the flags passed to the native compiler didn't change, right?
I did a little more digging this morning. If you look at the options that we are using for doing the compile for coreclr.dll they are different from before and after this change.
The key difference is that we used to use these opt flags /O1 /Ob2 /Oi /Oy-
and we now use /Ox /Ob2 /Oi /Oy-
. Looking at https://github.com/dotnet/runtime/pull/41897 and the discussion that happened there, it does not look like this was intended, but I am not sure. @jkoritzinsky can you chime in here?
Baseline:
ClCompile:
C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.28.29910\bin\HostX64\x64\CL.exe /c
/ID:\git\runtime\src\coreclr\src\pal\prebuilt\inc /ID:\git\runtime\src\coreclr\..\..\artifacts\obj\coreclr
/ID:\git\runtime\src\coreclr\src\inc /ID:\git\runtime\src\coreclr\src\debug\inc /ID:\git\runtime\src\coreclr\src\debug\inc\amd64
/ID:\git\runtime\src\coreclr\src\debug\inc\dump /ID:\git\runtime\src\coreclr\src\md\inc
/ID:\git\runtime\src\coreclr\src\classlibnative\bcltype /ID:\git\runtime\src\coreclr\src\classlibnative\cryptography
/ID:\git\runtime\src\coreclr\src\classlibnative\inc /ID:\git\runtime\artifacts\obj\coreclr\Windows_NT.x64.Release\src\inc
/ID:\git\runtime\artifacts\obj\coreclr\Windows_NT.x64.Release\src\inc\etw /ID:\git\runtime\src\coreclr\src\dlls\mscoree\..\..\inc
/Zi /nologo /W3 /WX /diagnostics:column /MP /O1 /Ob2 /Oi /Oy- /GL /D _WINDLL /D _UNICODE /D UNICODE /D WIN32 /D
_WINDOWS /D NDEBUG /D URTBLDENV_FRIENDLY=Retail /D HOST_AMD64 /D HOST_64BIT /D HOST_WINDOWS /D
TARGET_AMD64 /D TARGET_64BIT /D TARGET_WINDOWS /D FEATURE_UTF8STRING /D _BLD_CLR /D DEBUGGING_SUPPORTED
/D PROFILING_SUPPORTED /D _WIN32 /D WINVER=0x0602 /D _WIN32_WINNT=0x0602 /D WIN32_LEAN_AND_MEAN /D
_CRT_SECURE_NO_WARNINGS /D EnC_SUPPORTED /D FEATURE_ARRAYSTUB_AS_IL /D FEATURE_MULTICASTSTUB_AS_IL /D
FEATURE_PORTABLE_SHUFFLE_THUNKS /D FEATURE_INSTANTIATINGSTUB_AS_IL /D FEATURE_CODE_VERSIONING /D
FEATURE_COLLECTIBLE_TYPES /D FEATURE_COMWRAPPERS /D FEATURE_COMINTEROP /D
FEATURE_COMINTEROP_APARTMENT_SUPPORT /D FEATURE_COMINTEROP_UNMANAGED_ACTIVATION /D
FEATURE_BASICFREEZE /D FEATURE_CORECLR /D FEATURE_CORESYSTEM /D FEATURE_DEFAULT_INTERFACES /D
FEATURE_EVENT_TRACE /D FEATURE_PERFTRACING /D FEATURE_HIJACK /D FEATURE_ICASTABLE /D
FEATURE_INTEROP_DEBUGGING /D FEATURE_ISYM_READER /D FEATURE_MANAGED_ETW /D
FEATURE_MANAGED_ETW_CHANNELS /D FEATURE_MULTICOREJIT /D FEATURE_PROFAPI_ATTACH_DETACH /D
FEATURE_READYTORUN /D FEATURE_REJIT /D FEATURE_SVR_GC /D FEATURE_SYMDIFF /D FEATURE_TIERED_COMPILATION /D
FEATURE_ON_STACK_REPLACEMENT /D FEATURE_PGO /D FEATURE_TYPEEQUIVALENCE /D UNIX_AMD64_ABI_ITF /D
FEATURE_USE_ASM_GC_WRITE_BARRIERS /D FEATURE_USE_SOFTWARE_WRITE_WATCH_FOR_GC_HEAP /D
FEATURE_MANUALLY_MANAGED_CARD_BUNDLES /D FEATURE_WIN32_REGISTRY /D _SECURE_SCL=0 /D UNICODE /D _UNICODE
/D FEATURE_DATABREAKPOINT /D FEATURE_EH_FUNCLETS /D FEATURE_STANDALONE_GC /D
FX_VER_INTERNALNAME_STR=CoreCLR.dll /D "CMAKE_INTDIR=\"Release\"" /D coreclr_EXPORTS /Gm- /EHa /MT /Zp8 /GS
/guard:cf /Gy /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inline /GR- /Fo"coreclr.dir\Release\\" /Fd"coreclr.dir\Release\vc142.pdb"
/Gd /TP /wd4960 /wd4961 /wd4603 /wd4627 /wd4838 /wd4456 /wd4457 /wd4458 /wd4459 /wd4091 /wd4291 /wd5105 /FC /Zl
/errorReport:queue /we4640 /we4007 /we4013 /we4102 /we4551 /we4700 /we4806 /Zm200 /Zc:strictStrings /w34092 /w34121
/w34125 /w34130 /w34132 /w34212 /w34530 /w35038 /w44177 /ZH:SHA_256 /source-charset:utf-8
D:\git\runtime\src\coreclr\src\dlls\mscoree\mscoree.cpp D:\git\runtime\src\coreclr\src\dlls\mscoree\unixinterface.cpp
D:\git\runtime\src\coreclr\src\dlls\mscoree\delayloadhook.cpp
Compare
ClCompile:
C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.28.29910\bin\HostX64\x64\CL.exe /c
/ID:\git\runtime\src\coreclr\src\pal\prebuilt\inc /ID:\git\runtime\src\coreclr\..\..\artifacts\obj\coreclr
/ID:\git\runtime\src\coreclr\src\inc /ID:\git\runtime\src\coreclr\src\debug\inc /ID:\git\runtime\src\coreclr\src\debug\inc\amd64
/ID:\git\runtime\src\coreclr\src\debug\inc\dump /ID:\git\runtime\src\coreclr\src\md\inc
/ID:\git\runtime\src\coreclr\src\classlibnative\bcltype /ID:\git\runtime\src\coreclr\src\classlibnative\cryptography
/ID:\git\runtime\src\coreclr\src\classlibnative\inc /ID:\git\runtime\artifacts\obj\coreclr\Windows_NT.x64.Release\src\inc
/ID:\git\runtime\artifacts\obj\coreclr\Windows_NT.x64.Release\src\inc\etw /ID:\git\runtime\src\coreclr\src\dlls\mscoree\..\..\inc
/Zi /nologo /W3 /WX /diagnostics:column /MP /Ox /Ob2 /Oi /Oy- /GL /D _WINDLL /D _UNICODE /D UNICODE /D WIN32 /D
_WINDOWS /D NDEBUG /D URTBLDENV_FRIENDLY=Retail /D HOST_AMD64 /D HOST_64BIT /D HOST_WINDOWS /D
TARGET_AMD64 /D TARGET_64BIT /D TARGET_WINDOWS /D FEATURE_UTF8STRING /D _BLD_CLR /D DEBUGGING_SUPPORTED
/D PROFILING_SUPPORTED /D _WIN32 /D WINVER=0x0602 /D _WIN32_WINNT=0x0602 /D WIN32_LEAN_AND_MEAN /D
_CRT_SECURE_NO_WARNINGS /D EnC_SUPPORTED /D FEATURE_ARRAYSTUB_AS_IL /D FEATURE_MULTICASTSTUB_AS_IL /D
FEATURE_PORTABLE_SHUFFLE_THUNKS /D FEATURE_INSTANTIATINGSTUB_AS_IL /D FEATURE_CODE_VERSIONING /D
FEATURE_COLLECTIBLE_TYPES /D FEATURE_COMWRAPPERS /D FEATURE_COMINTEROP /D
FEATURE_COMINTEROP_APARTMENT_SUPPORT /D FEATURE_COMINTEROP_UNMANAGED_ACTIVATION /D
FEATURE_BASICFREEZE /D FEATURE_CORECLR /D FEATURE_CORESYSTEM /D FEATURE_DEFAULT_INTERFACES /D
FEATURE_EVENT_TRACE /D FEATURE_PERFTRACING /D FEATURE_HIJACK /D FEATURE_ICASTABLE /D
FEATURE_INTEROP_DEBUGGING /D FEATURE_ISYM_READER /D FEATURE_MANAGED_ETW /D
FEATURE_MANAGED_ETW_CHANNELS /D FEATURE_MULTICOREJIT /D FEATURE_PROFAPI_ATTACH_DETACH /D
FEATURE_READYTORUN /D FEATURE_REJIT /D FEATURE_SVR_GC /D FEATURE_SYMDIFF /D FEATURE_TIERED_COMPILATION /D
FEATURE_ON_STACK_REPLACEMENT /D FEATURE_PGO /D FEATURE_TYPEEQUIVALENCE /D UNIX_AMD64_ABI_ITF /D
FEATURE_USE_ASM_GC_WRITE_BARRIERS /D FEATURE_USE_SOFTWARE_WRITE_WATCH_FOR_GC_HEAP /D
FEATURE_MANUALLY_MANAGED_CARD_BUNDLES /D FEATURE_WIN32_REGISTRY /D _SECURE_SCL=0 /D UNICODE /D _UNICODE
/D FEATURE_DATABREAKPOINT /D FEATURE_EH_FUNCLETS /D FEATURE_STANDALONE_GC /D
FX_VER_INTERNALNAME_STR=CoreCLR.dll /D "CMAKE_INTDIR=\"Release\"" /D coreclr_EXPORTS /Gm- /EHa /MT /Zp8 /GS
/guard:cf /Gy /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inline /GR- /Fo"coreclr.dir\Release\\" /Fd"coreclr.dir\Release\vc142.pdb"
/Gd /TP /wd4960 /wd4961 /wd4603 /wd4627 /wd4838 /wd4456 /wd4457 /wd4458 /wd4459 /wd4091 /wd4291 /wd5105 /FC /Zl
/errorReport:queue /we4640 /we4007 /we4013 /we4102 /we4551 /we4700 /we4806 /Zm200 /Zc:strictStrings /w34092 /w34121
/w34125 /w34130 /w34132 /w34212 /w34530 /w35038 /w44177 /ZH:SHA_256 /source-charset:utf-8
D:\git\runtime\src\coreclr\src\dlls\mscoree\mscoree.cpp D:\git\runtime\src\coreclr\src\dlls\mscoree\unixinterface.cpp
D:\git\runtime\src\coreclr\src\dlls\mscoree\delayloadhook.cpp
That change looks to be unintentional.
IIRC, there were multiple conflicting optimization flags in the old build (/Ox was specified by CMake, /O1 by our build). The CMake generator for Ninja would put all of the optimization flags on the command line (including the conflicting ones), but the CMake generator for MSBuild looks to always pick the last specified one. Looks like when I was resolving this, I accidentally picked the wrong setting (what the command line from the Ninja and NMake generators would use instead of what the MSBuild generator used).
The results of Drew's investigation seem to point at subtle changes in JIT behavior
Per Drew's further analysis this looks like a change in the way a native part of the windows runtime was optimized, either directly from the flags change above, or indirectly via the impact that change in conjunction with PGO.
@jkoritzinsky does this change impact the entire set of runtime files?
Per this doc and subsequent, /O1
is equivalent to /Og /Os /Oy /Ob2 /GF /Gy
and /Ox
to /Ob2 /Oi /Ot /Oy
.
So the main difference (ignoring /Og
for now, which seems to be dead) is /Os
vs /Ot
, /Oi
, and the extra /G*
options.
Normally /Os
(opt for size) vs /Ot
(opt for speed) should not matter that much if you're also using PGO, as PGO will make its own speed/size determination. But there could be subtleties here as both these are "bundle options" and we later on tweak some of the things set by the bundles.
We explicitly set /Oi
(intrinsic opts) after these so it should be the same either way. And likewise we explicitly turn on /Gy
(function level linking).
So perhaps we lost /GF
-- string pooling. This does not seem like it would explain the diff above.
At any rate I don't have an good explanation for this, but it seems like we ought to look into changing the opt options back to how it was.
This setting does impact the entire set of runtime files.
I tried changing the options back and doing a local run and I didn't see an appreciable performance improvement across the board. I saw a few of these benchmarks get better and a few get worse (some significantly, like 20% worse). I don't have enough confidence that reverting the optimization change will substantially improve the benchmarks without causing issues in others.
Here's my results:
main:
Method | HasISupportLoggingScopeLogger | CaptureScopes | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FilteredByLevel | False | False | 12.89 ns | 0.172 ns | 0.152 ns | 12.88 ns | 12.70 ns | 13.15 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | False | False | 1,435.48 ns | 10.315 ns | 9.144 ns | 1,436.77 ns | 1,417.73 ns | 1,449.82 ns | 111.38 | 1.33 | - | - | - | - |
NotFiltered | False | False | 13,977.78 ns | 1,574.853 ns | 1,685.075 ns | 13,600.00 ns | 11,500.00 ns | 17,100.00 ns | 1,096.13 | 146.02 | - | - | - | 96 B |
NotFiltered_InsideScope | False | False | 16,800.00 ns | 3,121.156 ns | 3,339.602 ns | 16,450.00 ns | 12,600.00 ns | 24,000.00 ns | 1,336.98 | 284.93 | - | - | - | 96 B |
FilteredByLevel | False | True | 12.89 ns | 0.100 ns | 0.088 ns | 12.90 ns | 12.74 ns | 13.04 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | False | True | 2,868.11 ns | 92.471 ns | 106.490 ns | 2,820.34 ns | 2,774.83 ns | 3,090.93 ns | 223.36 | 8.99 | - | - | - | - |
NotFiltered | False | True | 13,833.33 ns | 947.692 ns | 1,014.019 ns | 14,050.00 ns | 11,800.00 ns | 15,200.00 ns | 1,072.48 | 86.74 | - | - | - | 384 B |
NotFiltered_InsideScope | False | True | 17,633.33 ns | 2,387.268 ns | 2,554.350 ns | 18,500.00 ns | 12,400.00 ns | 21,400.00 ns | 1,340.50 | 212.74 | - | - | - | 384 B |
FilteredByLevel | True | False | 12.37 ns | 0.062 ns | 0.055 ns | 12.35 ns | 12.27 ns | 12.48 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | True | False | 1,406.04 ns | 16.370 ns | 14.512 ns | 1,401.81 ns | 1,390.54 ns | 1,441.66 ns | 113.71 | 1.17 | - | - | - | - |
NotFiltered | True | False | 14,888.89 ns | 3,088.538 ns | 3,304.701 ns | 14,100.00 ns | 11,000.00 ns | 22,000.00 ns | 1,230.75 | 297.83 | - | - | - | 384 B |
NotFiltered_InsideScope | True | False | 15,505.56 ns | 1,351.109 ns | 1,445.672 ns | 15,650.00 ns | 12,900.00 ns | 18,200.00 ns | 1,267.84 | 117.57 | - | - | - | 384 B |
FilteredByLevel | True | True | 14.08 ns | 0.573 ns | 0.660 ns | 13.86 ns | 13.28 ns | 15.51 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | True | True | 16,838.24 ns | 332.393 ns | 341.343 ns | 16,690.68 ns | 16,477.62 ns | 17,473.89 ns | 1,202.93 | 60.30 | - | - | - | 120 B |
NotFiltered | True | True | 14,000.00 ns | 1,782.794 ns | 1,907.570 ns | 13,900.00 ns | 10,900.00 ns | 17,200.00 ns | 995.15 | 146.26 | - | - | - | 384 B |
NotFiltered_InsideScope | True | True | 56,922.22 ns | 4,698.774 ns | 5,027.636 ns | 56,850.00 ns | 48,900.00 ns | 66,500.00 ns | 4,050.30 | 456.65 | - | - | - | 504 B |
Using /O2 across the board.
Method | HasISupportLoggingScopeLogger | CaptureScopes | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FilteredByLevel | False | False | 12.93 ns | 0.165 ns | 0.147 ns | 12.89 ns | 12.76 ns | 13.19 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | False | False | 1,409.52 ns | 13.509 ns | 12.636 ns | 1,410.69 ns | 1,391.21 ns | 1,433.44 ns | 108.99 | 1.69 | - | - | - | - |
NotFiltered | False | False | 13,888.24 ns | 1,722.752 ns | 1,769.139 ns | 13,300.00 ns | 10,700.00 ns | 17,500.00 ns | 1,070.69 | 137.79 | - | - | - | 384 B |
NotFiltered_InsideScope | False | False | 16,425.00 ns | 1,479.476 ns | 1,453.043 ns | 17,100.00 ns | 13,900.00 ns | 19,200.00 ns | 1,269.06 | 115.05 | - | - | - | 384 B |
FilteredByLevel | False | True | 12.14 ns | 0.086 ns | 0.071 ns | 12.12 ns | 12.05 ns | 12.30 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | False | True | 2,850.19 ns | 39.585 ns | 37.028 ns | 2,837.73 ns | 2,799.48 ns | 2,934.18 ns | 235.17 | 3.60 | - | - | - | - |
NotFiltered | False | True | 13,894.74 ns | 1,744.423 ns | 1,938.921 ns | 14,100.00 ns | 10,600.00 ns | 17,800.00 ns | 1,132.08 | 182.99 | - | - | - | 384 B |
NotFiltered_InsideScope | False | True | 18,627.78 ns | 1,743.102 ns | 1,865.099 ns | 18,000.00 ns | 16,100.00 ns | 23,000.00 ns | 1,535.65 | 169.00 | - | - | - | 672 B |
FilteredByLevel | True | False | 11.56 ns | 0.058 ns | 0.051 ns | 11.53 ns | 11.50 ns | 11.67 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | True | False | 1,410.14 ns | 11.084 ns | 9.825 ns | 1,410.25 ns | 1,392.11 ns | 1,427.61 ns | 122.04 | 1.16 | - | - | - | - |
NotFiltered | True | False | 14,227.78 ns | 1,327.065 ns | 1,419.944 ns | 14,150.00 ns | 11,600.00 ns | 16,700.00 ns | 1,223.73 | 116.98 | - | - | - | 384 B |
NotFiltered_InsideScope | True | False | 16,663.16 ns | 2,706.251 ns | 3,007.991 ns | 16,800.00 ns | 12,000.00 ns | 23,000.00 ns | 1,392.56 | 257.14 | - | - | - | 384 B |
FilteredByLevel | True | True | 12.59 ns | 0.124 ns | 0.110 ns | 12.59 ns | 12.43 ns | 12.80 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | True | True | 16,971.99 ns | 491.024 ns | 525.390 ns | 16,848.39 ns | 16,404.33 ns | 18,168.06 ns | 1,356.26 | 37.07 | - | - | - | 120 B |
NotFiltered | True | True | 14,044.44 ns | 1,285.936 ns | 1,375.937 ns | 14,400.00 ns | 10,800.00 ns | 16,200.00 ns | 1,114.45 | 111.85 | - | - | - | 384 B |
NotFiltered_InsideScope | True | True | 62,122.22 ns | 6,162.752 ns | 6,594.075 ns | 60,750.00 ns | 52,100.00 ns | 78,600.00 ns | 4,923.45 | 578.02 | - | - | - | 504 B |
Using /O1 across the board.
Method | HasISupportLoggingScopeLogger | CaptureScopes | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FilteredByLevel | False | False | 12.81 ns | 0.090 ns | 0.080 ns | 12.79 ns | 12.67 ns | 12.95 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | False | False | 1,418.64 ns | 9.310 ns | 7.775 ns | 1,418.64 ns | 1,407.11 ns | 1,435.51 ns | 110.72 | 1.02 | - | - | - | - |
NotFiltered | False | False | 13,827.78 ns | 2,919.012 ns | 3,123.310 ns | 12,750.00 ns | 10,200.00 ns | 21,700.00 ns | 1,069.83 | 245.67 | - | - | - | 384 B |
NotFiltered_InsideScope | False | False | 16,173.68 ns | 1,839.376 ns | 2,044.462 ns | 16,000.00 ns | 13,300.00 ns | 21,800.00 ns | 1,221.86 | 132.57 | - | - | - | 384 B |
FilteredByLevel | False | True | 12.82 ns | 0.129 ns | 0.115 ns | 12.78 ns | 12.70 ns | 13.04 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | False | True | 2,809.46 ns | 20.251 ns | 18.943 ns | 2,808.26 ns | 2,772.07 ns | 2,837.51 ns | 219.32 | 1.75 | - | - | - | - |
NotFiltered | False | True | 13,921.05 ns | 1,531.401 ns | 1,702.149 ns | 14,100.00 ns | 11,400.00 ns | 17,100.00 ns | 1,081.61 | 141.83 | - | - | - | 384 B |
NotFiltered_InsideScope | False | True | 18,047.06 ns | 2,218.454 ns | 2,278.189 ns | 17,100.00 ns | 15,100.00 ns | 21,600.00 ns | 1,417.68 | 166.08 | - | - | - | 384 B |
FilteredByLevel | True | False | 12.86 ns | 0.170 ns | 0.159 ns | 12.84 ns | 12.65 ns | 13.22 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | True | False | 1,515.62 ns | 82.576 ns | 95.095 ns | 1,510.77 ns | 1,393.06 ns | 1,684.38 ns | 119.34 | 7.47 | - | - | - | - |
NotFiltered | True | False | 13,383.33 ns | 2,101.634 ns | 2,248.725 ns | 13,700.00 ns | 9,400.00 ns | 17,800.00 ns | 1,025.76 | 182.43 | - | - | - | 384 B |
NotFiltered_InsideScope | True | False | 15,433.33 ns | 2,106.701 ns | 2,254.147 ns | 15,050.00 ns | 12,200.00 ns | 20,100.00 ns | 1,165.73 | 156.40 | - | - | - | 96 B |
FilteredByLevel | True | True | 12.24 ns | 0.161 ns | 0.150 ns | 12.25 ns | 11.99 ns | 12.55 ns | 1.00 | 0.00 | - | - | - | - |
FilteredByLevel_InsideScope | True | True | 16,779.27 ns | 363.143 ns | 388.559 ns | 16,659.58 ns | 16,425.86 ns | 17,829.21 ns | 1,374.81 | 32.18 | - | - | - | 120 B |
NotFiltered | True | True | 14,015.79 ns | 1,539.217 ns | 1,710.836 ns | 13,800.00 ns | 11,500.00 ns | 17,600.00 ns | 1,125.32 | 126.53 | - | - | - | 96 B |
NotFiltered_InsideScope | True | True | 64,489.47 ns | 10,687.735 ns | 11,879.389 ns | 59,300.00 ns | 50,000.00 ns | 89,200.00 ns | 5,433.07 | 1,031.12 | - | - | - | 504 B |
@AndyAyersMS any suggestions on next steps here?
Recent perf history shows we've regained most of the missing perf (at least on this one test):
So if it was me I'd just close this.
FWIW the big drop in March 2022 seems to have come from https://github.com/dotnet/runtime/pull/65926.
I'll close this then.
It seems to be affecting only Windows (-10% on average). Was detected in https://github.com/DrewScoggins/performance-2/issues/3245 but not reported in runtime repo (cc @DrewScoggins).
Repro:
https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362%2fMicrosoft.Extensions.Logging.ScopesOverheadBenchmark.FilteredByLevel_InsideScope(HasISupportLoggingScopeLogger%3a%20False%2c%20CaptureScopes%3a%20True).html