Open adamsitnik opened 2 years ago
Tagging subscribers to this area: @dotnet/area-meta See info in area-owners.md if you want to be subscribed.
Author: adamsitnik
Assignees: -
Labels: `area-Meta`, `tenet-performance`, `tracking`
Milestone: -
Nice! I did a similar report last week and shared it at our perf meeting last Monday.
> A lot of Base64Encode benchmarks like System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000) are 6 to 16 times slower (most likely due to lack of vectorization). @tannergooding @GrabYourPitchforks is it expected?
Base64 (for UTF-8) is only vectorized on x64; there is an issue for arm64: https://github.com/dotnet/runtime/issues/35033 (I think we wanted to assign it to someone to ramp up).
> System.Numerics.Tests.Perf_BitOperations.PopCount_ulong is 5-8 times slower (most likely due to lack of vectorization).
It is properly accelerated (I compared it with __builtin_popcount in LLVM); the problem is that popcnt is vector-only on arm64, so we have some overhead on packing/extracting: 5 instructions vs 1 on x64.
> Some RentReturnArrayPoolTests benchmarks are up to a few times slower
My guess is that Rent/Return is most likely bottlenecked on TLS access speed; it can be improved with https://github.com/dotnet/runtime/issues/63619 if arm64 has special registers for that.
> A lot of System.Collections.Contains benchmarks are 2-3 times slower (most likely due to lack of vectorization).
> A lot of SequenceCompareTo benchmarks are 30% up to 4 times slower (most likely due to lack of vectorization).
That is expected due to the lack of Vector256, I believe. I proposed adding dual-Vector128 support for arm64 here: https://github.com/dotnet/runtime/pull/66993
> Burgers.Test3 is 12-59% slower (most likely it's using a method that has not been vectorized)
> SIMD.ConsoleMandel benchmarks are 40% slower
Same here, it uses Vector&lt;T&gt;, so it's Vector256 on x64 vs Vector128 on arm64.
> Various Perf_Interlocked benchmarks are slower, but this is expected due to memory model differences.
Correct, the codegen for interlocked ops is completely fine on arm64, both v8.0 and v8.1 (atomics).
> System.MathBenchmarks.Double.Exp and System.MathBenchmarks.Single.Exp are 35% slower.
If the arm64 machine was an M1, then it's the jump-stubs issue; see https://github.com/dotnet/runtime/issues/62302#issuecomment-1013874430
> PerfLabTests.LowLevelPerf.GenericClassGenericStaticField benchmark can be from 16% to 3x slower. Same goes for PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod. @jkotas @AndyAyersMS is it expected?
My guess is that it's because we don't use relocs on arm64 and have to compose the full 64-bit address using several instructions to access a static field. E.g.:
```csharp
static int field;
void IncrementField() => field++;
```

x64:

```asm
FF05C6CC4200    inc dword ptr [(reloc 0x7ffeb73eac3c)]
```

arm64:

```asm
D2958780    movz x0, #0xac3c
F2B6E760    movk x0, #0xb73b LSL #16
F2CFFFC0    movk x0, #0x7ffe LSL #32
B9400001    ldr  w1, [x0]
11000421    add  w1, w1, #1
B9000001    str  w1, [x0]
```
Overall, I have a feeling that we might get a very nice boost for many benchmarks/GC if we integrate PGO for native code (VM/GC)
> System.Security.Cryptography.Tests.Perf_Hashing.Sha1 is 17-55% slower (most likely due to lack of vectorization). @dotnet/jit-contrib is it expected?
SHA1.ComputeHash is going to be backed by the platform's SHA1 implementation (OpenSSL, CNG, SecurityTransforms) and doesn't do any vectorization itself. It's possible that the platform the tests were run on doesn't have an optimized ARM64 implementation of SHA1.
> Nice! I did a similar report last week and shared on our perf meeting last Monday
@EgorBo that data seems like something you could share on a gist for everyone? (Or perhaps just the scenarios with unusual ratios)
The System.Drawing ones may just be a difference in Windows GDI+ performance since it's largely a wrapper.
> PerfLabTests.LowLevelPerf.GenericClassGenericStaticField benchmark can be from 16% to 3x slower. Same goes for PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod. @jkotas @AndyAyersMS is it expected?
> My guess is that it's because we don't use relocs on arm64 and have to compose the full 64-bit address using several instructions to access a static field.
Access for generic statics (for shared generics at least, maybe for all?) can be more complicated -- the address must be looked up in runtime data structures. Worth investigating.
> System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja) benchmark can be from 20% to 7x slower (it's most likely an ICU problem).
Most likely it is because of ICU. We already have issue https://github.com/dotnet/runtime/issues/31273 tracking that. I don't know, though, why the ARM64 runs are even slower.
> Access for generic statics (for shared generics at least, maybe for all?) can be more complicated -- the address must be looked up in runtime data structures. Worth investigating.
@EgorBo perhaps you could open an issue and update the top post?
> @EgorBo perhaps you could open an issue and update the top post?
> Access for generic statics (for shared generics at least, maybe for all?) can be more complicated -- the address must be looked up in runtime data structures. Worth investigating.
Right, but it doesn't look to be the case here, since it's not shared.
> @EgorBo that data seems like something you could share on a gist for everyone?
Sure, let me see how to export an Excel sheet to a gist 😄
> The System.Drawing ones may just be a difference in Windows GDI+ performance since it's largely a wrapper.
There is a lot of interop in this scenario. It could be differences in interop, or the performance of this callback: https://github.com/dotnet/runtime/blob/3ae87395f638a533f37b8e3385f6d3f199a72f4f/src/libraries/System.Drawing.Common/src/System/Drawing/Internal/GPStream.COMWrappers.cs#L29. One could compare against the performance of a load that doesn't use a stream, which would be more of a GDI+ baseline. cc @eerhardt
@jkoritzinsky for that interop possibility. Jeremy, anything notable in the interop here -- any potentially relevant known issue on Arm64?
> System.Text.Json.Serialization.Tests.WriteJson&lt;BinaryData&gt;.SerializeToStream benchmark can be from 16% to 4x slower.
This one serializes an array of bytes, so it spends most of the time encoding data into base64 -- the same as https://github.com/dotnet/runtime/issues/35033.
> for that interop possibility. Jeremy anything notable in the interop here - any potentially relevant known issue on Arm64?
We don't have any notable differences (or even any differences I can think of) in the portion of interop used there for ARM64 vs x64. I definitely wouldn't be amazed at all if some portion of GDI+ is better optimized for x64 and we're just seeing that here. @dotnet/interop-contrib if anyone else on the interop team has any issues that come to mind.
For the regex ones -- do we know we have vectorization gaps that are specific to Arm64 in any areas like -- StartsWith, IndexOf, IndexOfAny - @EgorBo ? (For char, not byte)
> A few RegularExpressions benchmarks like System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled) are 40-50% slower (most likely it's using a method that has not been vectorized).
> For the regex ones -- do we know we have vectorization gaps that are specific to Arm64 in any areas like StartsWith, IndexOf, IndexOfAny - @EgorBo? (For char, not byte)
The cited pattern will use IndexOfAny("HOho") to find the next possible match location. It has a 256-bit vectorization path on x64 but only 128-bit on ARM64.
@EgorBo is that IndexOfAny(char, char..) work part of https://github.com/dotnet/runtime/pull/66993 ?
> @EgorBo is that IndexOfAny(char, char..) work part of #66993?
It is, but I'm starting to think that we won't be able to properly lower Vector256 to two Vector128s in the JIT, so I wonder if we should do that at the C#/IL level instead, e.g. with source generators, if we really want to. Some say that these APIs generally work with small data, and cases where we need to open a 0.5 MB book and find a word in it are rare.
I really don't think it's worth focusing on or investing in that.
Like you mentioned, doing it in the JIT is somewhat problematic because you have to take `Vector256<T>`, which is a user-defined non-HVA struct (not equivalent to `struct Hva256<T> { Vector128<T> _lower; Vector128<T> _upper; }`), and then decompose it into 2x efficient 128-bit operations.
Decomposition here isn't necessarily trivial and has questionable perf throughput for various operations leading users to a potential pit of failure, particularly when running on low-power devices (may negatively impact Mobile).
We could do some clever things here and various other optimizations to make it work nicely (including treating it as an HVA), but it's not a small amount of work.
On top of that, it won't really "close" the gap. The places where doing 2x 128-bit ops on ARM64 helps are likely the same places where doing 2x 256-bit ops on x64 would provide similar gains.
We simply shouldn't be trying to compare 128-bit Arm64 vs 256-bit x64, just like we shouldn't compare 256-bit x64 to 512-bit x64 (or 128-bit x64 to 256-bit x64); nor should we try to compare ARM SVE (if/when we get that support) against x64.
Instead, when doing x64 vs Arm64 comparisons, we should compare 128-bit Arm64 to 128-bit x64. The "simplest" way to do that here is generally `COMPlus_EnableAVX2=0`, but ideally we'd have a way to force 128-bit code paths without disabling any ISAs.
> some say that generally these APIs mostly work with small data and cases when we need to open a 0.5 MB book and find a word in it are rare
I don't think you can assume this given they're critical to regex matching. @stephentoub @joperezr may have a better sense of typical regex text lengths (of course it also depends on how common hits are)
> We simply shouldn't be trying to compare 128-bit Arm64 vs 256-bit x64
Comparing across hardware is inevitably bogus -- I thought the purpose of this exercise was to look for unusual ratios that might suggest room for targeted improvement by whatever means. Just sounds like there may not be a means, in this case.
> On top of that, it won't really "close" the gap. The places where doing 2x 128-bit ops on ARM64 helps are likely the same places where doing 2x 256-bit ops on x64 would provide similar gains.
I support your point; however, I think the SpanHelpers methods are core performance primitives (just like memset and memcpy) in many things, especially IndexOf, IndexOfAny and SequenceEqual. I've seen these three in a lot of profiles across different apps (though I've not measured the average input size they worked on), so they might deserve a 2x256 path or even a 4x256 one -- that's what native compilers do when you ask them to unroll a loop on e.g. Skylake; they will even do 2x (4x256) per iteration. Although, in order to close the gap here for arm64, we'd need SVE2 😄
We could add JIT support here, e.g. the JIT would be responsible for replacing SpanHelpers.IndexOf with a call to a heavily optimized, pipelined version if inputs are usually big (PGO).
https://godbolt.org/z/MxhGPPvaj -- here I wrote a simple loop to add 2 to all elements in an array of integers:
1) arm64 with all ISAs available - two SVE2 vectors
2) arm64 for Apple-M1 - two Vector128 operations
3) x64 Skylake - 2 groups of 4 Vector256 operations
I didn't even use -O3 here 😐
> I support your point; however, I think the SpanHelpers methods are core performance primitives (just like memset and memcpy) in many things, especially IndexOf, IndexOfAny and SequenceEqual. I've seen these three in a lot of profiles across different apps (though I've not measured the average input size they worked on), so they might deserve a 2x256 path or even a 4x256 one -- that's what native compilers do when you ask them to unroll a loop on e.g. Skylake; they will even do 2x (4x256) per iteration. Although, in order to close the gap here for arm64, we'd need SVE2 😄
Right. My point is that we shouldn't drive the work solely based on closing some non-representative Arm64 vs x64 perf gap, because that will be impossible given the two sets of hardware we have (particularly if we actually try and do our best for each platform).
If it is perf critical, we should be hand tuning this to fit our needs for all the relevant platforms. If that includes manually unrolling and pipelining, then that's fine (assuming numbers across the hardware we care about show the respective gains).
These APIs are perf critical (certainly for char, if it matters) -- if we think it's feasible at reasonable cost to make them significantly faster on this architecture by whatever means, can we get an issue open for that?
> These APIs are perf critical (certainly for char, if it matters) -- if we think it's feasible at reasonable cost to make them significantly faster on this architecture by whatever means, can we get an issue open for that?
Sure, but I'd love to mine some data first from some apps, 1st parties, and benchmarks to understand typical inputs better.
Recently @kunalspathak asked me if I could produce a report similar to https://github.com/dotnet/runtime/issues/66848 for x64 vs arm64 comparison.
I took the .NET 7 Preview 2 results provided by @AndyAyersMS, @kunalspathak and myself for https://github.com/dotnet/runtime/issues/66848, hacked the tool a little bit (it was not designed to compare results from different architectures) and compared x64 vs arm64 using the following configs:
Of course it was not an apples-to-apples comparison, just the best thing we could do right now.
Full public results (without absolute values, as I don't have permission to share them) can be found here. Internal MS results (with absolute values) can be found here. If you don't have access, please ping me on Teams.
As usual, I've focused on the benchmarks that take longer to execute on arm64 than on x64. If you are interested in the benchmarks that take less time, read the report linked above in reverse order.
Benchmarks:
@kunalspathak

- [ ] `System.Numerics.Tests.Perf_BitOperations.PopCount_ulong` is 5-8 times slower (most likely due to lack of vectorization). `PopCount_uint` is slower only on Windows.

@tannergooding @GrabYourPitchforks

- [ ] `Base64Encode` benchmarks like `System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000)` are 6 to 16 times slower. #35033

@stephentoub @kouvel

- [ ] `RentReturnArrayPoolTests` benchmarks are up to a few times slower, but these are multi-threaded and very often multimodal benchmarks. #63619
- [ ] `System.Threading.Tests.Perf_Timer.AsynchronousContention` is 2-3 times slower.

@wfurt @MihaZupan

- [ ] `SocketSendReceivePerfTest` benchmarks like `System.Net.WebSockets.Tests.SocketSendReceivePerfTest.ReceiveSend` are 2 times slower.

@dotnet/area-system-drawing

- [ ] `System.Drawing.Tests.Perf_Image_Load.Image_FromStream_NoValidation` benchmarks are a few times slower on Windows. Only the `NoValidation` benchmarks seem to run slower.

@stephentoub

- [ ] `RegularExpressions` benchmarks like `System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled)` are 40-50% slower. This pattern uses `IndexOfAny("HOho")` to find the next possible match location. It has a 256-bit vectorization path on x64 but only 128-bit on ARM64.

@jkotas @AndyAyersMS

- [ ] `PerfLabTests.LowLevelPerf.GenericClassGenericStaticField` benchmark can be from 16% to 3x slower. Same goes for `PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod`.

@dotnet/jit-contrib

- [ ] `System.Security.Cryptography.Tests.Perf_Hashing.Sha1` is 17-55% slower. (Potentially differences in the GDI+ code)
- [ ] `System.IO.Tests.Perf_StreamWriter.WriteString(writeLength: 100)` is 21-46% slower.
- [ ] `System.Text.Json.Serialization.Tests.WriteJson<BinaryData>.SerializeToStream` benchmark can be from 16% to 4x slower. #35033
- [ ] `SIMD.ConsoleMandel` benchmarks are 40% slower. #66993
- [ ] `Burgers.Test3` is 12-59% slower. #66993
- [ ] `System.Collections.Contains` benchmarks are 2-3 times slower (most likely due to lack of vectorization). Same goes for `System.Memory.Span<Char>.IndexOfValue`, `System.Memory.Span<Char>.Fill`, `System.Memory.Span<Int32>.StartsWith`, `System.Memory.Span<Byte>.IndexOfAnyTwoValues` and `System.Memory.ReadOnlySpan.IndexOfString(Ordinal)`. #66993
- [ ] `SequenceCompareTo` benchmarks are 30% up to 4 times slower. #66993

@tannergooding

- [ ] `System.MathBenchmarks.Double.Exp` and `System.MathBenchmarks.Single.Exp` are 35% slower. #62302

@dotnet/area-system-globalization

- [ ] `System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja)` benchmark can be from 20% to 7x slower (it's most likely an ICU problem). #31273
- [x] Various `Perf_Interlocked` benchmarks are slower, but this is expected due to memory model differences.
- [ ] Various `Perf_Process.Start` benchmarks are slower, but only on macOS, so it's most likely a macOS issue.