Open oleg-loop54 opened 3 months ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.
Do you know what hardware this runs on? Certain versions of certain processors are very sensitive to the alignment of branches in memory, and perhaps that is what is leading to the variation you see.
cc @dotnet/jit-contrib
@AndyAyersMS we're running it on a 1U Ultra Dual AMD EPYC 7002 server with two "AMD EPYC 7343 16C/32T 3.2/3.9GHz 128MB 190W" CPUs.
We need more info and to work with the author to understand what we should do. Moving this to Future because we don't have time for .NET 9.
What other info do you need, @JulieLeeMSFT?
Would it be possible for you to run this on Intel HW or maybe an AMD Zen4? Wondering if we're seeing something specific to the Zen3 microarchitecture here.
Ran the test on Intel Xeon 6210U (20 cores in total):
first run: avg service times per instance ranging from 146 ms to 163 ms
second run: avg service times from 171 ms to 180 ms
Also ran it on 2x EPYC 9274F (24 cores each, so 48 cores in total):
first run: avg service times from 88 ms to 115 ms
second run: avg service times from 90 ms to 105 ms
So it looks like there is still some variability, but not nearly as bad as on the AMD64 Zen3 HW?
The disassembly above is indeed identical modulo layout. Given that this method does a fair amount of calling and does not contain a loop it is hard to see why the layout would matter all that much.
@tannergooding do we know of any microarchitectural oddities around Zen3?
One thing that surprises me a little is that the profile data is pretty "thin" -- we are only seeing ~30 calls, which is the minimum we'd ever see. It's also a little unusual to see a class init call in a Tier1 method, since usually classes get initialized by the Tier0 versions.
Can you share out (privately if necessary) the full DOTNET_JitDisasmSummary output?
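For reference, a minimal sketch of capturing the full summary for one instance (the output path and executable name are just placeholders):

```bash
# Write the full JIT summary for one instance to a file (path is just an example).
export DOTNET_JitDisasmSummary=1
export DOTNET_JitStdOutFile=/tmp/engineX_jit_summary.txt
./YourApp              # placeholder for the actual service executable
```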
As an experiment, you might try increasing the tiering threshold some, say DOTNET_TC_CallCountThreshold=200 (the value is parsed as hex, so this means 512 calls), to see if that leads to more predictable results.
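For example, roughly (the launch command is just a placeholder, not your actual service name):

```bash
# 0x200 == 512 calls before a method gets promoted to Tier1.
export DOTNET_TC_CallCountThreshold=200
./YourApp              # placeholder for the actual service executable
```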
> do we know of any microarchitectural oddities around Zen3?
Like you said, the code between the fast and slow versions isn't really that much different; it's primarily that G_M000_IG03 is instead placed at G_M000_IG11, and so 3 jumps become long jumps rather than the short jumps they are in the fast version.
The most notable potential quirk is that Zen3 takes a perf hit if there are more than two jump instructions in a 16-byte aligned window, but it's not a significant hit, and there isn't a loop here that would massively amplify it (for Zen4 it's more than three in the same window).
My guess is the long jumps are impacting the Zen3 decode/prefetch pipeline and causing a throughput hit. It's also possible that this is a general code-alignment issue. The latest CPUs fetch in 32-64 byte aligned blocks (typically a fetch window can be completed every 1-2 cycles) and then typically decode in 16-32 byte windows (physical cores typically have two decode units, one for each logical core in a hyperthreaded system). So even if the method is 32-byte aligned, if the jump-target differences change whether a given target is 16-byte aligned, that can negatively impact throughput.
That's just a guess though; I don't see anything "obvious" here and would recommend profiling with something like AMD uProf (https://www.amd.com/en/developer/uprof.html) to see if it highlights any particular part of the assembly as the bottleneck.
What's puzzling is that there is no fast path through this method, every call does a fair amount of work, including making numerous other calls. Even if the method is called a lot (which seems likely), the impact of fetch windows or branch density delays would not be anywhere near this bad.
I wonder if dotnet-trace is giving a misleading picture. For instance, it shows one of the dominant callees is IsInstanceOfClass, but from the disasm it's not an immediate callee. Maybe one of the immediate callees tail calls into this helper, but that seems a bit unlikely. On the other hand, some bits here are clearly from F#, so perhaps there are tail calls.
Since this is on Linux, perhaps we can use perf to drill in...?
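A rough sketch of what that could look like, assuming DOTNET_PerfMapEnabled=1 is set before the process starts so jitted frames resolve to method names; the executable name, sampling frequency, and duration are illustrative:

```bash
# Enable perf maps before starting the instance so jitted frames resolve to method names.
export DOTNET_PerfMapEnabled=1
./YourApp &            # placeholder for the actual service executable
APP_PID=$!

# Sample the running instance for ~15 seconds, with call stacks.
perf record -F 997 -g -p "$APP_PID" -- sleep 15
perf report --stdio | head -n 50
```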
Did a little bit more testing with the suggested DOTNET_TC_CallCountThreshold=200:
So, on Zen3 the differences are still there, but the generated assembly code now looks pretty much identical. See the JIT summaries: zen3_callcount512.zip
I also ran it on another Zen3 machine at our disposal to rule out a machine-specific hardware fault; the results show the same behavior.
@oleg-loop54 as I was saying above, I wonder if the performance issue is somewhere else.
You might consider, if possible, capturing slow and fast run profiles with perfcollect and sharing those (we can find ways to keep the files private if that's a concern).
Likely 15 seconds or so should be enough.
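A minimal sketch of such a capture (the trace name is arbitrary, and the environment variables are the ones the perfcollect docs suggest for resolving .NET symbols; they must be set before the app starts):

```bash
# One-time setup of the perf/LTTng prerequisites.
sudo ./perfcollect install

# Set these in the app's environment *before* starting it so managed frames resolve.
export DOTNET_PerfMapEnabled=1
export DOTNET_EnableEventLog=1

# Capture while the slow (or fast) instance is processing traffic; stop with Ctrl+C after ~15 s.
sudo ./perfcollect collect slow-instance-trace
```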
Here are some other things you might try, if you have the time to experiment. These may reduce overall performance but may also reduce variability:
DOTNET_TieredPGO=0
DOTNET_TC_OnStackReplacement=0
DOTNET_TieredCompilation=0 (this implicitly disables both PGO and OSR, since they rely on tiering)
If one or more of these configurations turns out to be stable, that might help us focus subsequent investigations (see the sketch below).
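For example, trying one knob at a time (the launch command is a placeholder):

```bash
# Run 1: disable dynamic PGO only.
DOTNET_TieredPGO=0 ./YourApp

# Run 2: disable on-stack replacement only.
DOTNET_TC_OnStackReplacement=0 ./YourApp

# Run 3: disable tiered compilation entirely (implies no dynamic PGO and no OSR).
DOTNET_TieredCompilation=0 ./YourApp
```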
Description
We have an application that processes some network traffic. When several instances of that application are run simultaneously on an on-premises server, processing the very same requests takes different amounts of time.
There is plenty of RAM and CPU available on the server while these application instances are running. The code is the same, built on the same framework version, published as self-contained; the same executable is started several times under the same user, and each instance is fed the same traffic.
Configuration
Built and published using .NET SDK 8.0.303.
OS: Linux <..> 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux, Ubuntu 24.04 LTS
Main executable is built with these:
Data
Average response times for 4 instances running simultaneously on the same server. The instances were started at roughly the same time; traffic was switched on at exactly the same time:
Did some profiling with dotnet-trace (using sampling mode); worst and best are shown below:
So, the MatchQueryWord function took ~13s in one case and ~19s in another, processing the same ~1500 requests.
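For reference, a sampling trace like this can be collected with dotnet-trace roughly as follows; the PID and duration are illustrative:

```bash
# Attach to one running instance and collect ~15 seconds of CPU samples.
dotnet-trace collect -p <pid> --profile cpu-sampling --duration 00:00:15
```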
Also did
export DOTNET_JitDisasm=MatchQueryWord DOTNET_JitDisasmDiffable=1 DOTNET_JitDisasmSummary=1 DOTNET_JitStdOutFile=/tmp/engineX_jit.txt
The final code for the most performant instance is:
and for the worst-performing instance:
The biggest difference I see here is the use of je vs jne just before label G_M000_IG03, and the use of a short jump. I find it hard to believe that this causes such a huge impact (13s vs 19s).
Another thing that is strange is that IsInstanceOfClass has different run times, although the generated code size (and I assume the code itself) is the same for both the best and worst instances:
<..> JIT compiled System.Runtime.CompilerServices.CastHelpers:IsInstanceOfClass(ulong,System.Object) [Tier1 with Dynamic PGO, IL size=97, code size=88]
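One way to compare them would be to dump diffable disassembly for both methods from each instance and diff the files; a sketch, assuming DOTNET_JitDisasm accepts a space-separated method list and with illustrative output paths:

```bash
# Per instance: dump diffable disassembly for both methods, then diff the two files.
export DOTNET_JitDisasm="MatchQueryWord IsInstanceOfClass"
export DOTNET_JitDisasmDiffable=1
export DOTNET_JitStdOutFile=/tmp/engine_best_jit.txt    # use a different path per instance
```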
It's as if one instance just "got into a wrong place" (??) and consistently runs slower than another.
If I set up a separate dedicated cloud server per instance, then repeating the same scenario yields identical processing times.
I'm stumped as to where to look next, as this behavior hinders performance testing of our solution, so any suggestions would be most welcome! Thanks in advance!