dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

x64 vs ARM64 Microbenchmarks Performance Study Report #67339

Open adamsitnik opened 2 years ago

adamsitnik commented 2 years ago

Recently @kunalspathak asked me if I could produce a report similar to https://github.com/dotnet/runtime/issues/66848 for x64 vs arm64 comparison.

I took .NET 7 Preview2 results provided by @AndyAyersMS, @kunalspathak and myself for https://github.com/dotnet/runtime/issues/66848, hacked the tool a little bit (it was not designed to compare results across architectures) and compared x64 vs arm64 using the following configs:

* my 4-year-old MacBook Pro x64: macOS Monterey 12.2.1, Intel Core i7-5557U CPU 3.10GHz (Broadwell), 1 CPU, 4 logical and 2 physical cores vs @AndyAyersMS's M1 Max arm64: macOS Monterey 12.2.1, Apple M1 Max 2.40GHz, 1 CPU, 10 logical and 10 physical cores
* @kunalspathak's Windows 10 (10.0.20348.587) Intel Xeon Platinum 8272CL CPU 2.60GHz, 2 CPU, 104 logical and 52 physical cores vs @kunalspathak's Windows 11 (10.0.25058.1000) ARM64 machine with lots of cores

Of course this was not an apples-to-apples comparison, just the best thing we could do right now.

Full public results (without absolute values, as I don't have permission to share them) can be found [here](https://gist.github.com/adamsitnik/3df04e23d5a88806204153593bc5f420). Internal MS results (with absolute values) can be found [here](https://microsofteur-my.sharepoint.com/:t:/g/personal/adsitnik_microsoft_com/ESIzrKQkyZdHhnrdw_utqzsBVRhvNQpxXFRTI57V2D7TxA?e=mjbwcC). If you don't have access, please ping me on Teams.

As usual, I've focused on the benchmarks that take longer to execute on arm64 than on x64. If you are interested in benchmarks that take less time to execute, read the report linked above in reverse order.

Benchmarks:

* A lot of `Base64Encode` benchmarks like `System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000)` are 6 to 16 times slower (most likely due to lack of vectorization). @tannergooding @GrabYourPitchforks is it expected?
* `System.Numerics.Tests.Perf_BitOperations.PopCount_ulong` is 5-8 times slower (most likely due to lack of vectorization). `PopCount_uint` is slower only on Windows. @kunalspathak is this expected?
* Some `RentReturnArrayPoolTests` benchmarks are up to a few times slower, but these are multi-threaded and very often multimodal benchmarks. @stephentoub @kouvel is it expected?
* The `System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja)` benchmark can be from 20% to 7 times slower (it's most likely an ICU problem). @dotnet/area-system-globalization is it expected?
* A lot of `System.Collections.Contains` benchmarks are 2-3 times slower (most likely due to lack of vectorization). Same goes for `System.Memory.Span.IndexOfValue`, `System.Memory.Span.Fill`, `System.Memory.Span.StartsWith`, `System.Memory.Span.IndexOfAnyTwoValues` and `System.Memory.ReadOnlySpan.IndexOfString(Ordinal)`. @tannergooding @EgorBo is it expected?
* A lot of `SequenceCompareTo` benchmarks are 30% to 4 times slower (most likely due to lack of vectorization). @tannergooding @EgorBo is it expected?
* The `System.Text.Json.Serialization.Tests.WriteJson.SerializeToStream` benchmark can be from 16% to 4 times slower. @dotnet/jit-contrib is this expected?
* `System.Threading.Tests.Perf_Timer.AsynchronousContention` is 2-3 times slower. @stephentoub @kouvel is it expected?
* A lot of `SocketSendReceivePerfTest` benchmarks like `System.Net.WebSockets.Tests.SocketSendReceivePerfTest.ReceiveSend` are 2 times slower. @wfurt @MihaZupan is it expected?
* `System.Drawing.Tests.Perf_Image_Load.Image_FromStream_NoValidation` is a few times slower on Windows; only the `NoValidation` benchmarks seem to run slower. @dotnet/area-system-drawing is it expected?
* The `PerfLabTests.LowLevelPerf.GenericClassGenericStaticField` benchmark can be from 16% to 3 times slower. Same goes for `PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod`. @jkotas @AndyAyersMS is it expected?
* A few `RegularExpressions` benchmarks like `System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled)` are 40-50% slower (most likely using a method that has not been vectorized). @stephentoub is it expected?
* `Burgers.Test3` is 12-59% slower (most likely using a method that has not been vectorized). @dotnet/jit-contrib is it expected?
* `System.Security.Cryptography.Tests.Perf_Hashing.Sha1` is 17-55% slower (most likely due to lack of vectorization). @dotnet/jit-contrib is it expected?
* `SIMD.ConsoleMandel` benchmarks are 40% slower (most likely due to lack of vectorization). @dotnet/jit-contrib is it expected?
* `System.IO.Tests.Perf_StreamWriter.WriteString(writeLength: 100)` is 21-46% slower (most likely due to lack of vectorization). @dotnet/jit-contrib is it expected?
* `System.MathBenchmarks.Double.Exp` and `System.MathBenchmarks.Single.Exp` are 35% slower. @tannergooding is it expected?
* Various `Perf_Interlocked` benchmarks are slower, but this is expected due to memory model differences.
* Various `Perf_Process.Start` benchmarks are slower, but only on macOS, so it's most likely a macOS issue.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-meta See info in area-owners.md if you want to be subscribed.

Issue Details
Recently @kunalspathak asked me if I could produce a report similar to https://github.com/dotnet/runtime/issues/66848 for an x64 vs arm64 comparison. I took .NET 7 Preview2 results provided by @AndyAyersMS, @kunalspathak and myself for https://github.com/dotnet/runtime/issues/66848, hacked the tool a little bit (it was not designed to compare results across architectures) and compared x64 vs arm64 using the following configs:

* my 4-year-old MacBook Pro x64: macOS Monterey 12.2.1, Intel Core i7-5557U CPU 3.10GHz (Broadwell), 1 CPU, 4 logical and 2 physical cores vs @AndyAyersMS's M1 Max arm64: macOS Monterey 12.2.1, Apple M1 Max 2.40GHz, 1 CPU, 10 logical and 10 physical cores
* @kunalspathak's Windows 10 (10.0.20348.587) Intel Xeon Platinum 8272CL CPU 2.60GHz, 2 CPU, 104 logical and 52 physical cores vs @kunalspathak's Windows 11 (10.0.25058.1000) ARM64 machine with lots of cores

Of course this was not an apples-to-apples comparison, just the best thing we could do right now.

Full public results (without absolute values, as I don't have permission to share them) can be found [here](https://gist.github.com/adamsitnik/3df04e23d5a88806204153593bc5f420). Internal MS results (with absolute values) can be found [here](https://microsofteur-my.sharepoint.com/:t:/g/personal/adsitnik_microsoft_com/ESIzrKQkyZdHhnrdw_utqzsBVRhvNQpxXFRTI57V2D7TxA?e=mjbwcC). If you don't have access, please ping me on Teams.

As usual, I've focused on the benchmarks that take longer to execute on arm64 than on x64. If you are interested in benchmarks that take less time to execute, read the report linked above in reverse order.

Benchmarks:

* A lot of `Base64Encode` benchmarks like `System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000)` are 6 to 16 times slower (most likely due to lack of vectorization). @tannergooding @GrabYourPitchforks is it expected?
* `System.Numerics.Tests.Perf_BitOperations.PopCount_ulong` is 5-8 times slower (most likely due to lack of vectorization). `PopCount_uint` is slower only on Windows. @kunalspathak is this expected?
* Some `RentReturnArrayPoolTests` benchmarks are up to a few times slower, but these are multi-threaded and very often multimodal benchmarks. @stephentoub @kouvel is it expected?
* The `System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja)` benchmark can be from 20% to 7 times slower (it's most likely an ICU problem). @dotnet/area-system-globalization is it expected?
* A lot of `System.Collections.Contains` benchmarks are 2-3 times slower (most likely due to lack of vectorization). Same goes for `System.Memory.Span.IndexOfValue`, `System.Memory.Span.Fill`, `System.Memory.Span.StartsWith`, `System.Memory.Span.IndexOfAnyTwoValues` and `System.Memory.ReadOnlySpan.IndexOfString(Ordinal)`. @tannergooding @EgorBo is it expected?
* A lot of `SequenceCompareTo` benchmarks are 30% to 4 times slower (most likely due to lack of vectorization). @tannergooding @EgorBo is it expected?
* The `System.Text.Json.Serialization.Tests.WriteJson.SerializeToStream` benchmark can be from 16% to 4 times slower. @dotnet/jit-contrib is this expected?
* `System.Threading.Tests.Perf_Timer.AsynchronousContention` is 2-3 times slower. @stephentoub @kouvel is it expected?
* A lot of `SocketSendReceivePerfTest` benchmarks like `System.Net.WebSockets.Tests.SocketSendReceivePerfTest.ReceiveSend` are 2 times slower. @wfurt @MihaZupan is it expected?
* `System.Drawing.Tests.Perf_Image_Load.Image_FromStream_NoValidation` is a few times slower on Windows; only the `NoValidation` benchmarks seem to run slower. @dotnet/area-system-drawing is it expected?
* The `PerfLabTests.LowLevelPerf.GenericClassGenericStaticField` benchmark can be from 16% to 3 times slower. Same goes for `PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod`. @jkotas @AndyAyersMS is it expected?
* A few `RegularExpressions` benchmarks like `System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled)` are 40-50% slower (most likely using a method that has not been vectorized). @stephentoub is it expected?
* `Burgers.Test3` is 12-59% slower (most likely using a method that has not been vectorized). @dotnet/jit-contrib is it expected?
* `System.Security.Cryptography.Tests.Perf_Hashing.Sha1` is 17-55% slower (most likely due to lack of vectorization). @dotnet/jit-contrib is it expected?
* `SIMD.ConsoleMandel` benchmarks are 40% slower (most likely due to lack of vectorization). @dotnet/jit-contrib is it expected?
* `System.IO.Tests.Perf_StreamWriter.WriteString(writeLength: 100)` is 21-46% slower (most likely due to lack of vectorization). @dotnet/jit-contrib is it expected?
* `System.MathBenchmarks.Double.Exp` and `System.MathBenchmarks.Single.Exp` are 35% slower. @tannergooding is it expected?
* Various `Perf_Interlocked` benchmarks are slower, but this is expected due to memory model differences.
* Various `Perf_Process.Start` benchmarks are slower, but only on macOS, so it's most likely a macOS issue.
Author: adamsitnik
Assignees: -
Labels: `area-Meta`, `tenet-performance`, `tracking`
Milestone: -
EgorBo commented 2 years ago

Nice! I did a similar report last week and shared it at our perf meeting last Monday.

A lot of Base64Encode benchmarks like System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000) are 6 to 16 times slower (most likely due to lack of vectorization). @tannergooding @GrabYourPitchforks is it expected?

Base64 (for utf8) is only vectorized on x64; there is an issue tracking it for arm64: https://github.com/dotnet/runtime/issues/35033 (I think we wanted to assign it to someone as a ramp-up task).
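For context, the scalar path that arm64 currently falls back to turns every 3 input bytes into 4 output characters, roughly like this (a simplified C sketch, not the actual System.Buffers.Text implementation, and without padding handling):

```c
#include <stddef.h>

static const char kBase64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode full 3-byte groups only, one group per iteration. The x64
 * path replaces this loop with SSSE3/AVX2 shuffles that process 12/24
 * input bytes per iteration, which is where the 6-16x gap comes from. */
size_t base64_encode_blocks(const unsigned char *src, size_t len, char *dst)
{
    size_t o = 0;
    for (size_t i = 0; i + 3 <= len; i += 3) {
        unsigned v = (unsigned)src[i] << 16 | (unsigned)src[i + 1] << 8 | src[i + 2];
        dst[o++] = kBase64[(v >> 18) & 0x3F];
        dst[o++] = kBase64[(v >> 12) & 0x3F];
        dst[o++] = kBase64[(v >> 6) & 0x3F];
        dst[o++] = kBase64[v & 0x3F];
    }
    return o;
}
```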


System.Numerics.Tests.Perf_BitOperations.PopCount_ulong is 5-8 times slower (most likely due to lack of vectorization).

It is properly accelerated (I compared it with __builtin_popcnt in LLVM); the problem is that popcnt is vector-only on arm64, so we have some packing/extracting overhead: 5 instructions vs 1 on x64.
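The LLVM comparison mentioned above can be reproduced with the compiler builtin; on x64 it compiles to a single `popcnt`, while on arm64 the scalar value is moved into a SIMD register, counted per byte, and summed back (roughly `fmov`/`cnt`/`addv`/`fmov`):

```c
#include <stdint.h>

/* Equivalent of BitOperations.PopCount(ulong). Clang/GCC emit one
 * `popcnt` instruction on x64; on arm64 the same builtin lowers to a
 * vector `cnt` plus `addv`, i.e. the packing/extracting overhead
 * described above. */
static inline int popcount64(uint64_t x)
{
    return __builtin_popcountll(x);
}
```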


Some RentReturnArrayPoolTests benchmarks are up to few times slower

My guess is that Rent/Return is most likely bottlenecked on TLS access speed; it can be improved with https://github.com/dotnet/runtime/issues/63619 if arm64 has special registers for that.


A lot of System.Collections.Contains benchmarks are 2-3 times slower (most likely due to lack of vectorization).

A lot of SequenceCompareTo benchmarks are 30% up to 4 times slower (most likely due to lack of vectorization)

That is expected due to the lack of Vector256, I believe; I proposed adding a dual-Vector128 path for arm64 here: https://github.com/dotnet/runtime/pull/66993

Burgers.Test3 is 12-59% slower (most likely it's using a method that has not been vectorized)

SIMD.ConsoleMandel benchmarks are 40% slower

Same here, it uses Vector<T> so it's Vector256 on x64 vs Vector128 on arm64


Various Perf_Interlocked benchmarks are slower, but this is expected due to memory model differences.

Correct, the codegen for interlocked ops is completely fine on both armv8.0 and armv8.1 (LSE atomics).
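The cost difference is inherent to the ISA rather than to the codegen: a sequentially consistent fetch-and-add is a single `lock xadd` on x64, while pre-8.1 arm64 needs a `ldaxr`/`stlxr` retry loop (armv8.1 LSE adds a single `ldaddal`). In C11 terms, the operation both paths implement is (a sketch, assuming the same semantics as Interlocked.Increment):

```c
#include <stdatomic.h>

/* Same semantics as Interlocked.Increment: atomic, sequentially
 * consistent, returns the incremented value. */
long atomic_increment(_Atomic long *p)
{
    /* atomic_fetch_add returns the value *before* the add. */
    return atomic_fetch_add(p, 1) + 1;
}
```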


System.MathBenchmarks.Double.Exp and System.MathBenchmarks.Single.Exp are 35% slower.

If the arm64 machine was an M1, then it's the jump-stubs issue; see https://github.com/dotnet/runtime/issues/62302#issuecomment-1013874430


PerfLabTests.LowLevelPerf.GenericClassGenericStaticField benchmark can be from 16% to x3 times slower. Same goes for PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod. @jkotas @AndyAyersMS is it expected?

My guess is that it's because we don't use relocs on arm64 and have to compose the full 64-bit address using several instructions to access a static field. E.g.:

static int field;

void IncrementField() => field++;

X64:

       FF05C6CC4200         inc      dword ptr [(reloc 0x7ffeb73eac3c)]

arm64:

        D2958780          movz    x0, #0xac3c
        F2B6E760          movk    x0, #0xb73b LSL #16
        F2CFFFC0          movk    x0, #0x7ffe LSL #32
        B9400001          ldr     w1, [x0]
        11000421          add     w1, w1, #1
        B9000001          str     w1, [x0]

Overall, I have a feeling that we might get a very nice boost for many benchmarks/GC if we integrate PGO for native code (VM/GC)

vcsjones commented 2 years ago

System.Security.Cryptography.Tests.Perf_Hashing.Sha1 is 17-55% slower (most likely due to lack of vectorization). jit-contrib is it expected?

SHA1.ComputeHash is backed by the platform's SHA1 implementation (OpenSSL, CNG, SecurityTransforms) and doesn't do any vectorization itself. It's possible that the platforms the tests were run on don't have optimized ARM64 implementations of SHA1.

danmoseley commented 2 years ago

Nice! I did a similar report last week and shared on our perf meeting last Monday

@EgorBo that data seems like something you could share on a gist for everyone? (Or perhaps just the scenarios with unusual ratios)

danmoseley commented 2 years ago

The System.Drawing ones may just be a difference in Windows GDI+ performance since it's largely a wrapper.

AndyAyersMS commented 2 years ago

PerfLabTests.LowLevelPerf.GenericClassGenericStaticField benchmark can be from 16% to x3 times slower. Same goes for PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod. @jkotas @AndyAyersMS is it expected?

My guess that it's because we don't use relocs on arm64 and have to compose full 64bit address using several instructions to access a static field.

https://github.com/dotnet/performance/blob/d7dac8a7ca12a28d099192f8a901cf8e30361384/src/benchmarks/micro/runtime/perflab/LowLevelPerf.cs#L320-L325

Access for generic statics (for shared generics at least, maybe for all?) can be more complicated: the address must be looked up in runtime data structures. Worth investigating.

tarekgh commented 2 years ago

System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja) benchmark can be from 20% to x7 times slower (it's most likely an ICU problem).

Most likely it is because of ICU. We already have issue https://github.com/dotnet/runtime/issues/31273 tracking that. I don't know, though, why the ARM64 runs are even slower.

danmoseley commented 2 years ago

Access for generic statics (for shared generics at least, maybe for all?) can be more complicated: the address must be looked up in runtime data structures. Worth investigating.

@EgorBo perhaps you could open an issue and update the top post?

EgorBo commented 2 years ago

@EgorBo perhaps you could open an issue and update the top post?

Access for generic statics (for shared generics at least, maybe for all?) can be more complicated: the address must be looked up in runtime data structures. Worth investigating.

right, but it doesn't look to be the case here since it's not shared

@EgorBo that data seems like something you could share on a gist for everyone?

Sure, let me see how to export an Excel sheet to a gist 😄

ericstj commented 2 years ago

The System.Drawing ones may just be a difference in Windows GDI+ performance since it's largely a wrapper.

There is a lot of interop in this scenario. It could be differences in interop, or the performance of this callback: https://github.com/dotnet/runtime/blob/3ae87395f638a533f37b8e3385f6d3f199a72f4f/src/libraries/System.Drawing.Common/src/System/Drawing/Internal/GPStream.COMWrappers.cs#L29. We could compare against the performance of a load that doesn't use a stream, which would be more of a GDI+ baseline. cc @eerhardt

danmoseley commented 2 years ago

@jkoritzinsky for that interop possibility. Jeremy, anything notable in the interop here? Any potentially relevant known issues on Arm64?

EgorBo commented 2 years ago

System.Text.Json.Serialization.Tests.WriteJson.SerializeToStream benchmark can be from 16% to x4 times slower.

This one serializes an array of bytes, so it spends most of its time encoding data into base64. So it's the same as https://github.com/dotnet/runtime/issues/35033


jkoritzinsky commented 2 years ago

for that interop possibility. Jeremy anything notable in the interop here - any potentially relevant known issue on Arm64?

We don't have any notable differences (or even any differences I can think of) between ARM64 and x64 in the portion of interop used there. I wouldn't be amazed at all if some portion of GDI+ is better optimized for x64 and we're just seeing that here. @dotnet/interop-contrib in case anyone else on the interop team has issues that come to mind.

danmoseley commented 2 years ago

For the regex ones: do we know whether we have vectorization gaps specific to Arm64 in areas like StartsWith, IndexOf, IndexOfAny, @EgorBo? (For char, not byte.)

stephentoub commented 2 years ago

Few RegularExpressions benchmarks like System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled) are 40-50% slower (most likely it's using a method that has not been vectorized).

For the regex ones -- do we know we have vectorization gaps that are specific to Arm64 in any areas like -- StartsWith, IndexOf, IndexOfAny - @EgorBo ? (For char, not byte)

The cited pattern will use IndexOfAny("HOho") to find the next possible match location. It has a 256-bit vectorization path on x64 but only 128-bit on ARM64.
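The scalar shape being vectorized here is just a multi-way comparison per character; the 256-bit x64 path checks 16 UTF-16 code units per iteration, while the 128-bit ARM64 path checks 8. Conceptually (a simplified sketch, not the actual SpanHelpers code):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar IndexOfAny over UTF-16 code units, matching any of 4 values.
 * The vectorized versions broadcast v0..v3 into SIMD registers and
 * compare a whole vector of characters per iteration: 16 chars with
 * AVX2 (Vector256) on x64, 8 with NEON (Vector128) on arm64. */
ptrdiff_t index_of_any4(const uint16_t *s, size_t len,
                        uint16_t v0, uint16_t v1, uint16_t v2, uint16_t v3)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t c = s[i];
        if (c == v0 || c == v1 || c == v2 || c == v3)
            return (ptrdiff_t)i;
    }
    return -1;
}
```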

danmoseley commented 2 years ago

@EgorBo is that IndexOfAny(char, char..) work part of https://github.com/dotnet/runtime/pull/66993 ?

EgorBo commented 2 years ago

@EgorBo is that IndexOfAny(char, char..) work part of #66993 ?

It is, but I'm starting to think that we won't be able to properly lower Vector256 to two Vector128s in the JIT, so I wonder if we should do that at the C#/IL level instead (e.g. via source generators) if we really want to. Some say that these APIs mostly work with small data, and cases where we need to open a 0.5MB book and find a word in it are rare.

tannergooding commented 2 years ago

I really don't think it's worth focusing on or investing in that.


Like you mentioned, doing it in the JIT is somewhat problematic because you have to take Vector256<T>, which is a user-defined non-HVA struct (not equivalent to struct Hva256<T> { Vector128<T> _lower; Vector128<T> _upper; }), and then decompose it into 2x efficient 128-bit operations.

Decomposition here isn't necessarily trivial and has questionable throughput for various operations, leading users toward a potential pit of failure, particularly when running on low-power devices (it may negatively impact mobile).

We could do some clever things here and various other optimizations to make it work nicely (including treating it as an HVA), but it's not a small amount of work.


On top of that, it won't really "close" the gap. The places where doing 2x 128-bit ops on ARM64 helps are likely the same places where doing 2x 256-bit ops on x64 would provide similar gains.

We simply shouldn't be trying to compare 128-bit Arm64 vs 256-bit x64, just like we shouldn't compare 256-bit x64 to 512-bit x64 (or 128-bit x64 to 256-bit x64); nor should we try to compare ARM SVE (if/when we get that support) against x64.

We should instead, when doing x64 vs Arm64 comparisons, compare 128-bit Arm64 to 128-bit x64. The simplest way to do that is generally COMPlus_EnableAVX2=0, but ideally we'd have a way to force 128-bit code paths without disabling any ISAs.

danmoseley commented 2 years ago

some say that generally these APIs mostly work with small data and cases when we need to open a 0.5Mb book and find a word in it are rare..

I don't think you can assume this given they're critical to regex matching. @stephentoub @joperezr may have a better sense of typical regex text lengths (of course it also depends on how common hits are)

We simply shouldn't be trying to compare 128-bit Arm64 vs 256-bit x64

Comparing across hardware is inevitably bogus. I thought the purpose of this exercise was to look for unusual ratios that might suggest room for targeted improvement by whatever means. It just sounds like there may not be a means in this case.

EgorBo commented 2 years ago

On top of that, it won't really "close" the gap. The places where doing 2x 128-bit ops on ARM64 are likely the same places where doing 2x 256-bit ops on x64 would provide similar gains.

I support your point. However, I think the SpanHelpers methods are core performance primitives (just like memset and memcpy), especially IndexOf, IndexOfAny and SequenceEqual; I've seen these three in a lot of profiles across different apps (though I've not measured the average input size they worked on), so they might deserve a 2x256 or even 4x256 path. That's what native compilers do when you ask them to unroll a loop for e.g. Skylake: they will even do two groups of 4x256 per iteration. Although, in order to close the gap here for arm64, we need SVE2 😄

We could add JIT support here too: e.g. the JIT would be responsible for replacing SpanHelpers.IndexOf with a call to a heavily optimized, pipelined version if inputs are usually big (via PGO).

EgorBo commented 2 years ago

https://godbolt.org/z/MxhGPPvaj

Here I wrote a simple loop to add 2 to all elements in an array of integers:

1. arm64 with all ISAs available: two SVE2 vectors
2. arm64 for Apple M1: two Vector128 operations
3. x64 Skylake: 2 groups of 4 Vector256 operations

I didn't even use -O3 here 😐
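The loop behind the Godbolt link is essentially this (a reconstruction from the description above, since only the link was shared); the comments paraphrase the three codegen outcomes listed:

```c
#include <stddef.h>

/* Add 2 to every element of an int array. Per the Godbolt comparison
 * above, LLVM auto-vectorizes this at -O2: SVE2 predicated vectors on
 * generic arm64, two NEON (Vector128) ops per iteration on Apple M1,
 * and two groups of four AVX2 (Vector256) ops on Skylake. */
void add_two(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += 2;
}
```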

tannergooding commented 2 years ago

I support your point. However, I think the SpanHelpers methods are core performance primitives (just like memset and memcpy), especially IndexOf, IndexOfAny and SequenceEqual; I've seen these three in a lot of profiles across different apps (though I've not measured the average input size they worked on), so they might deserve a 2x256 or even 4x256 path. That's what native compilers do when you ask them to unroll a loop for e.g. Skylake: they will even do two groups of 4x256 per iteration. Although, in order to close the gap here for arm64, we need SVE2 😄

Right. My point is that we shouldn't drive the work solely based on closing a non-representative Arm64 vs x64 perf gap, because that will be impossible given the two sets of hardware we have (particularly if we actually try to do our best on each platform).

If it is perf-critical, we should be hand-tuning this to fit our needs on all the relevant platforms. If that includes manual unrolling and pipelining, then that's fine (assuming numbers across the hardware we care about show the respective gains).

danmoseley commented 2 years ago

These APIs are perf-critical (certainly for char, if it matters). If we think it's feasible, at reasonable cost, to make them significantly faster on this architecture by whatever means, can we get an issue opened for that?

EgorBo commented 2 years ago

These APIs are perf-critical (certainly for char, if it matters). If we think it's feasible, at reasonable cost, to make them significantly faster on this architecture by whatever means, can we get an issue opened for that?

Sure, but I'd love to mine some data first from some apps, first parties, and benchmarks to understand typical inputs better.