dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Significant Performance Disparity Between Arm64 and x64 Write Barriers #106051

Open ebepho opened 3 months ago

ebepho commented 3 months ago

Description

We observed a significant performance disparity between the Arm64 and x64 write barriers. When running a program that does not go through the write barrier, Arm64 was about 3x slower than x64; with the write barrier in play, Arm64 became about 10x slower. This suggests that Arm64's handling of the write barrier is less optimized than x64's.

Data

Performance Counter Stats without the Write Barrier

To test the performance of the write barrier, we used Crank to run a simple program 10 times on each of the two machines. Notice that even when the program does not go through the write barrier, it is approximately 3x slower on the Arm64 machine.

This is the simple program, which does not go through the write barrier, that we measured with crank:

int[] foo = new int[1];
for (long i = 0; i < 100_000_000; i++)
{
   foo[0]++;
}

Table 1: Average Performance Counter Stats without the write barrier.

| Architecture | x64 | x64 | Arm64 | Arm64 |
|--------------|----:|----:|------:|------:|
| # of iterations | 100,000,000 | 200,000,000 | 100,000,000 | 200,000,000 |
| cache-references | 7199555 | 7210098 | 266711905 | 467403412.6 |
| cache-misses | 1673444 | 1673888 | 1021946.5 | 1042045.5 |
| cycles | 812275185 | 1513438858 | 831957725 | 1517325563 |
| instructions | 656685121 | 1156933373.4 | 881350905 | 1583055913 |
| branches | 131173961 | 231219510.1 | 121014944 | 221181620.1 |
| faults | 2123.4 | 2123.2 | 3290.1 | 3290.9 |
| migrations | 50.9 | 51.7 | 71.1 | 84.8 |
| Time elapsed (seconds) | 0.26562 | 0.47812 | 0.82561 | 1.4412 |
| User (seconds) | 0.24808 | 0.46158 | 0.74556 | 1.3178 |
| Sys (seconds) | 0.00801 | 0.00946 | 0.16161 | 0.20523 |

Performance Counter Stats with the Write Barrier

When we do go through the write barrier, performance degrades further, with the Arm64 machine becoming roughly 10x slower than x64.

This is the simple program, which does go through the write barrier, that we measured with crank:

Foo foo = new Foo();
for (long i = 0; i < iterations; i++) // iterations = 100,000,000 or 200,000,000 (see Table 2)
{
    foo.x = foo;
}
internal class Foo
{
    public volatile Foo x;
}

Table 2: Performance Counter Stats with the write barrier.

| Architecture | x64 | x64 | Arm64 | Arm64 |
|--------------|----:|----:|------:|------:|
| # of iterations | 100,000,000 | 200,000,000 | 100,000,000 | 200,000,000 |
| cache-references | 7252140 | 7178833 | 568014397 | 1068659425 |
| cache-misses | 1697333 | 1684188 | 1025013 | 1012689 |
| cycles | 713364359 | 1313245706 | 2756710296 | 5360611600 |
| instructions | 1456194567 | 2756823577 | 1983627681 | 3785656008 |
| branches | 431088498 | 831198368 | 621239460 | 1221448774 |
| faults | 2116 | 2124 | 3291 | 3296 |
| migrations | 50.9 | 52.3 | 72.7 | 61.6 |
| Time elapsed (seconds) | 0.23283 | 0.41492 | 2.6058 | 4.2126 |
| User (seconds) | 0.21495 | 0.39656 | 2.5438 | 4.0788 |
| Sys (seconds) | 0.01169 | 0.01188 | 0.14361 | 0.1984 |
dotnet-policy-service[bot] commented 3 months ago

Tagging subscribers to this area: @dotnet/gc. See info in area-owners.md if you want to be subscribed.

teo-tsirpanis commented 3 months ago

Is it certain that the write barrier is to blame? volatile writes have release semantics which I think adds an overhead on ARM architectures.

ebepho commented 3 months ago

> Is it certain that the write barrier is to blame? volatile writes have release semantics which I think adds an overhead on ARM architectures.

The volatile overhead is not significant enough to explain the performance regressions observed. The numbers were roughly the same with and without it.
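
One way to sanity-check that claim (a minimal sketch, not from the original report; the `Holder` type and benchmark names are made up here) is to measure the three combinations separately: a volatile reference write (write barrier + release fence), a plain reference write (write barrier only), and a volatile int write (release fence only, no write barrier).

```csharp
using BenchmarkDotNet.Attributes;

public class BarrierVsVolatileBench
{
    private readonly Holder _holder = new Holder();

    [Benchmark] // GC write barrier + release semantics
    public void VolatileRefWrite()
    {
        for (long i = 0; i < 200_000_000; i++)
            _holder.VolatileRef = _holder;
    }

    [Benchmark] // GC write barrier only, no release fence
    public void PlainRefWrite()
    {
        for (long i = 0; i < 200_000_000; i++)
            _holder.PlainRef = _holder;
    }

    [Benchmark] // release semantics only, no GC write barrier (not a reference)
    public void VolatileIntWrite()
    {
        for (long i = 0; i < 200_000_000; i++)
            _holder.VolatileInt = 1;
    }
}

public class Holder
{
    public volatile Holder VolatileRef;
    public Holder PlainRef;
    public volatile int VolatileInt;
}
```

If the volatile-only case is cheap while both reference-write cases are slow, the cost is in the barrier rather than in the release semantics.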

EgorBo commented 3 months ago

@EgorBot -arm64 -amd -perf -commit 55987917ad1ff6ac3f3f49d32b1624196d17a27a vs 55987917ad1ff6ac3f3f49d32b1624196d17a27a

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Bench
{
    [Benchmark]
    public void WB()
    {
        Foo foo = new Foo();
        for (long i = 0; i < 200000000; i++)
            foo.x = foo;
    }
}

internal class Foo
{
    public volatile Foo x;
}
neon-sunset commented 3 months ago

9.0.100-rc.1.24406.4, M1 Pro, osx-arm64 compiled with dotnet publish -p:PublishAot=true

var foo = new Foo();
for (long i = 0; i < 200_000_000; i++) {
    foo.x = foo;
}

class Foo {
    public volatile Foo? x;
}
time ./wbcost
________________________________________________________
Executed in  425.01 millis    fish           external
   usr time  404.48 millis    0.07 millis  404.41 millis
   sys time   18.57 millis    1.02 millis   17.55 millis
EgorBot commented 3 months ago
Benchmark results on Amd

```
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 8 logical and 4 physical cores
Job-FTDGMO : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-TBVJKS : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
```

| Method | Toolchain | Mean | Error | Ratio |
|------- |---------- |---------:|--------:|------:|
| WB | Main | 433.0 ms | 0.22 ms | 1.00 |
| WB | PR | 432.7 ms | 0.08 ms | 1.00 |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_89bc6b54.zip)

Flame graphs: [Main](https://telegafiles.blob.core.windows.net/telega/base_flamegraph_89bc6b54.svg) vs [PR](https://telegafiles.blob.core.windows.net/telega/diff_flamegraph_89bc6b54.svg)

πŸ”₯ Hot asm: [Main](https://gist.github.com/EgorBot/3df0288c3687846548d794954d349a81) vs [PR](https://gist.github.com/EgorBot/54227a2bc9031241cd9ac32687d32453)

Hot functions: [Main](https://gist.github.com/EgorBot/ce3d19d03b3681d21a2c0452db7c184c) vs [PR](https://gist.github.com/EgorBot/0a103a84e2288773e25a8c8260d64c52)

_For clean `perf` results, make sure you have just one `[Benchmark]` in your app._
EgorBot commented 3 months ago
Benchmark results on Arm64

```
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
Job-YWJZIJ : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-YRDIGZ : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
```

| Method | Toolchain | Mean | Error | Ratio |
|------- |---------- |---------:|--------:|------:|
| WB | Main | 468.3 ms | 0.27 ms | 1.00 |
| WB | PR | 468.7 ms | 0.49 ms | 1.00 |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_2c2b8931.zip)

Flame graphs: [Main](https://telegafiles.blob.core.windows.net/telega/base_flamegraph_2c2b8931.svg) vs [PR](https://telegafiles.blob.core.windows.net/telega/diff_flamegraph_2c2b8931.svg)

πŸ”₯ Hot asm: [Main](https://gist.github.com/EgorBot/4a0219bc901b9656142fe0fba9aebd14) vs [PR](https://gist.github.com/EgorBot/8e50929611b893b4a8e6c8297ebf5def)

Hot functions: [Main](https://gist.github.com/EgorBot/9052f6bc3662b68ac24996c242cc7bbb) vs [PR](https://gist.github.com/EgorBot/c7a6974ef9b3a4a1df9bbd79254ded4d)

_For clean `perf` results, make sure you have just one `[Benchmark]` in your app._
EgorBo commented 3 months ago

@EgorBot -arm64 -amd -perf -commit 55987917ad1ff6ac3f3f49d32b1624196d17a27a vs 55987917ad1ff6ac3f3f49d32b1624196d17a27a --envvars DOTNET_TieredCompilation:0 DOTNET_ReadyToRun:0

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Bench
{
    [Benchmark]
    public void WB()
    {
        Foo foo = new Foo();
        for (long i = 0; i < 200000000; i++)
            foo.x = foo;
    }
}

internal class Foo
{
    public volatile Foo x;
}
EgorBot commented 3 months ago
Benchmark results on Amd

```
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 8 logical and 4 physical cores
Job-LUJGBA : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-XLQIIV : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_TieredCompilation=0,DOTNET_ReadyToRun=0
```

| Method | Toolchain | Mean | Error | Ratio |
|------- |---------- |---------:|--------:|------:|
| WB | Main | 370.8 ms | 0.04 ms | 1.00 |
| WB | PR | 370.9 ms | 0.07 ms | 1.00 |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_b235bb2b.zip)

Flame graphs: [Main](https://telegafiles.blob.core.windows.net/telega/base_flamegraph_b235bb2b.svg) vs [PR](https://telegafiles.blob.core.windows.net/telega/diff_flamegraph_b235bb2b.svg)

πŸ”₯ Hot asm: [Main](https://gist.github.com/EgorBot/85e78eea163ce7ed997f1ac8b8505ec5) vs [PR](https://gist.github.com/EgorBot/7a747712719086644aefd3a3c1c15e1e)

Hot functions: [Main](https://gist.github.com/EgorBot/b888f2f39f4fb58fe9cb651840dccf04) vs [PR](https://gist.github.com/EgorBot/43913f521bb5ba6353ebf3220cdba106)

_For clean `perf` results, make sure you have just one `[Benchmark]` in your app._
EgorBot commented 3 months ago
Benchmark results on Arm64

```
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
Job-HCAGWK : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-RPUMUX : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_TieredCompilation=0,DOTNET_ReadyToRun=0
```

| Method | Toolchain | Mean | Error | Ratio |
|------- |---------- |---------:|--------:|------:|
| WB | Main | 467.4 ms | 0.07 ms | 1.00 |
| WB | PR | 467.4 ms | 0.05 ms | 1.00 |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_89409087.zip)

Flame graphs: [Main](https://telegafiles.blob.core.windows.net/telega/base_flamegraph_89409087.svg) vs [PR](https://telegafiles.blob.core.windows.net/telega/diff_flamegraph_89409087.svg)

πŸ”₯ Hot asm: [Main](https://gist.github.com/EgorBot/ab14df84a549c1cdff3cb6224fcf23da) vs [PR](https://gist.github.com/EgorBot/7fad7efb7094971bc80562159584e1c6)

Hot functions: [Main](https://gist.github.com/EgorBot/c97d25b764dcc0410447fb2377cb1877) vs [PR](https://gist.github.com/EgorBot/a09f11ca8ce05d0a595423565f1b2b66)

_For clean `perf` results, make sure you have just one `[Benchmark]` in your app._
EgorBo commented 3 months ago

I cannot reproduce your numbers; I suspect you might be measuring a difference in OSR pacing (consider running with DOTNET_TieredCompilation=0).

Although, arm64 is still slower due to:

Also, we might want to have a more complicated benchmark where objects aren't ephemeral as well?
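
For anyone reproducing this locally, here is a minimal sketch of pinning the benchmark to fully optimized code via BenchmarkDotNet (this assumes the `WithEnvironmentVariable` job API and mirrors the `--envvars` passed to the bot above):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class NoTieringConfig : ManualConfig
{
    public NoTieringConfig()
    {
        // Disable tiered compilation and ReadyToRun so the loop runs as
        // fully optimized code from the start (no OSR transitions).
        AddJob(Job.Default
            .WithEnvironmentVariable("DOTNET_TieredCompilation", "0")
            .WithEnvironmentVariable("DOTNET_ReadyToRun", "0"));
    }
}

[Config(typeof(NoTieringConfig))]
public class Bench
{
    [Benchmark]
    public void WB()
    {
        Foo foo = new Foo();
        for (long i = 0; i < 200_000_000; i++)
            foo.x = foo;
    }
}

internal class Foo
{
    public volatile Foo x;
}
```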

EgorBo commented 3 months ago

@jkotas @cshung If you're not busy - do you have any idea why the "is card table already updated" check is so expensive on arm64? πŸ™‚
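
(For context, the check in question looks roughly like the following. This is an illustrative C#-syntax sketch of a generational card-marking barrier, not the actual CoreCLR code, which is hand-written assembly; the globals and the 2KB card granule are simplifying assumptions.)

```csharp
// Illustrative sketch only; the real barrier differs per platform.
static unsafe class CardMarkingSketch
{
    static byte* g_card_table;      // one byte per (assumed) 2KB of heap
    static nuint g_ephemeral_low;   // start of the ephemeral (young gen) range
    static nuint g_ephemeral_high;  // end of the ephemeral range

    static void WriteBarrier(nuint* dst, nuint src)
    {
        *dst = src;                                 // the reference store itself

        // Stores of non-ephemeral (or null) references don't need a card.
        if (src < g_ephemeral_low || src >= g_ephemeral_high)
            return;

        byte* card = g_card_table + ((nuint)dst >> 11);

        // The "is card table already updated?" check: only write the card
        // byte if it isn't already set, to avoid dirtying the cache line
        // on every single reference store.
        if (*card != 0xFF)
            *card = 0xFF;
    }
}
```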

can it be some false sharing etc?

Another thing I noticed is that the arm64 WB is so expensive that we can add yet another branch ("is the object reference null? Exit") and the regression will be <1%, while giving us a 2X improvement when we actually write null.
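
For reference, here is a sketch of what that null-write scenario could look like as a benchmark (illustrative, not from this thread). The null is read from a field rather than written as a literal, since the JIT may elide the barrier for a store it can see is a constant null:

```csharp
using BenchmarkDotNet.Attributes;

public class NullWriteBench
{
    private readonly Foo _foo = new Foo();
    private readonly Foo _nullFoo = null; // null at run time, but not a constant at the store site

    [Benchmark] // every iteration stores a live reference: full barrier path
    public void WriteRef()
    {
        for (long i = 0; i < 200_000_000; i++)
            _foo.x = _foo;
    }

    [Benchmark] // stores a reference that happens to be null; an early
                // "is the reference null? exit" in the barrier would mostly help here
    public void WriteNull()
    {
        for (long i = 0; i < 200_000_000; i++)
            _foo.x = _nullFoo;
    }
}

internal class Foo
{
    public volatile Foo x;
}
```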

cshung commented 3 months ago

> Also, we might want to have a more complicated benchmark where objects aren't ephemeral as well?

Yes, we should totally understand the performance of the write barrier function under other execution paths - for example, when we miss the cache, or when we branch away because of the heap range, generations, and so on. The initial benchmark was designed to be easy to understand: I wanted to make sure the cache always hits and we read exactly the same location, so that we don't run into any cache effects. As we can see, even in this trivial scenario the data shows surprising results; making it more varied would only make it harder to interpret.
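
As one possible starting point, here is a hedged sketch of a non-ephemeral variant (the promotion via GC.Collect and all names here are illustrative, not something measured in this issue): promote the object to an older generation before the loop so the barrier's ephemeral-range check branches away instead of touching the card table.

```csharp
using System;
using BenchmarkDotNet.Attributes;

public class NonEphemeralBench
{
    private Foo _old;

    [GlobalSetup]
    public void Setup()
    {
        _old = new Foo();
        // Two blocking collections should promote the object out of the
        // ephemeral generations (gen0 -> gen1 -> gen2), so the barrier's
        // range check branches away instead of marking a card.
        GC.Collect();
        GC.Collect();
    }

    [Benchmark]
    public void WriteOldToOld()
    {
        Foo foo = _old;
        for (long i = 0; i < 200_000_000; i++)
            foo.x = foo;
    }
}

internal class Foo
{
    public volatile Foo x;
}
```

Comparing this against the original ephemeral case would give a rough split between the card-table update and the rest of the barrier, under the assumptions above.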

> can it be some false sharing etc?

I doubt it is false sharing. Since we aren't allocating, the GC should not be running, and no other thread should be accessing the card table, so the core should have exclusive access to the cache entry.

Besides the obvious fact that this "slow load" uses a different instruction, the slow load is also loading from a computed address. Does the ARM architecture do anything special with respect to loading from a hard-coded address? I don't know.

I wonder if tools like this can give us more insight on what is going on. https://learn.arm.com/learning-paths/servers-and-cloud-computing/top-down-n1/analysis-1/

jkotas commented 3 months ago

My bet would be sampling bias or some micro-architecture issue. I think it would be best to ask Arm hw engineers to replicate this on a simulator and tell us what's actually going on.