List Add is way slower (almost 3 times) in `net9.0 preview 3` than with `net8.0`

linkdotnet commented 7 months ago

With the latest preview 3 (net9) there seems to be a major performance regression on MacOS 14.4 (M2 Pro) with lists. Given that they are one of the most used types, it should receive special treatment.

Benchmark

[Benchmark]
public List<int> ListAdd10000()
{
    var list = new List<int>();
    for (var i = 0; i < 10_000; i++)
    {
        list.Add(i);
    }

    return list;
}

[Benchmark]
public List<int> ListAdd10000PreAlloc()
{
    var list = new List<int>(10_000);
    for (var i = 0; i < 10_000; i++)
    {
        list.Add(i);
    }

    return list;
}

Results:

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
Apple M2 Pro, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]   : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD
  .NET 8.0 : .NET 8.0.2 (8.0.224.6711), Arm64 RyuJIT AdvSIMD
  .NET 9.0 : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD

| Method               | Runtime  | Mean      | Error     | StdDev    | Ratio | 
|--------------------- |--------- |----------:|----------:|----------:|------:|-
| ListAdd10000         | .NET 8.0 |  9.739 us | 0.0236 us | 0.0209 us |  1.00 | 
| ListAdd10000         | .NET 9.0 | 25.646 us | 0.1531 us | 0.1432 us |  2.63 | 
|                      |          |           |           |           |       | 
| ListAdd10000PreAlloc | .NET 8.0 |  6.701 us | 0.0175 us | 0.0146 us |  1.00 | 
| ListAdd10000PreAlloc | .NET 9.0 | 22.562 us | 0.1429 us | 0.1336 us |  3.37 |

Interestingly even the pre-allocated list is comparably slow.

`dotnet --info`

Output

dotnet --info .NET SDK: Version: 9.0.100-preview.3.24204.13 Commit: 81f61d8290 Workload version: 9.0.100-manifests.77bb7ba9 MSBuild version: 17.11.0-preview-24178-16+7ca3c98fa Runtime Environment: OS Name: Mac OS X OS Version: 14.4 OS Platform: Darwin RID: osx-arm64 Base Path: /usr/local/share/dotnet/sdk/9.0.100-preview.3.24204.13/ .NET workloads installed: There are no installed workloads to display. Host: Version: 9.0.0-preview.3.24172.9 Architecture: arm64 Commit: 9e6ba1f68c .NET SDKs installed: 6.0.417 [/usr/local/share/dotnet/sdk] 7.0.306 [/usr/local/share/dotnet/sdk] 7.0.404 [/usr/local/share/dotnet/sdk] 8.0.100 [/usr/local/share/dotnet/sdk] 8.0.101 [/usr/local/share/dotnet/sdk] 8.0.200 [/usr/local/share/dotnet/sdk] 9.0.100-preview.2.24121.2 [/usr/local/share/dotnet/sdk] 9.0.100-preview.2.24157.14 [/usr/local/share/dotnet/sdk] 9.0.100-preview.3.24204.13 [/usr/local/share/dotnet/sdk] .NET runtimes installed: Microsoft.AspNetCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 8.0.1 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 8.0.2 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 9.0.0-preview.2.24120.6 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 9.0.0-preview.2.24128.4 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 9.0.0-preview.3.24172.13 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App] Microsoft.NETCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App] Microsoft.NETCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App] Microsoft.NETCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App] Microsoft.NETCore.App 8.0.1 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App] Microsoft.NETCore.App 8.0.2 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App] Microsoft.NETCore.App 9.0.0-preview.2.24120.11 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App] Microsoft.NETCore.App 9.0.0-preview.2.24128.5 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App] Microsoft.NETCore.App 9.0.0-preview.3.24172.9 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App] Other architectures found: x64 [/usr/local/share/dotnet/x64] Environment variables: Not set global.json file: Not found Learn more: https://aka.ms/dotnet/info Download .NET: https://aka.ms/dotnet/download

dotnet-policy-service[bot] commented 7 months ago

Tagging subscribers to this area: @dotnet/area-system-collections See info in area-owners.md if you want to be subscribed.

EgorBo commented 7 months ago

Doesn't repro for me on x64-windows:


BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]   : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  .NET 8.0 : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  .NET 9.0 : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method               | Job      | Runtime  | Mean     | Error     | StdDev    |
|--------------------- |--------- |--------- |---------:|----------:|----------:|
| ListAdd10000         | .NET 8.0 | .NET 8.0 | 7.339 us | 0.0598 us | 0.0559 us |
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 | 5.002 us | 0.0988 us | 0.1508 us |
| ListAdd10000         | .NET 9.0 | .NET 9.0 | 7.472 us | 0.1401 us | 0.2597 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 4.992 us | 0.0987 us | 0.1730 us |

don't have an macos-arm64 machine to test

vcsjones commented 7 months ago

@EgorBo I can repro on my M1.

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0] Apple M1 Ultra, 1 CPU, 20 logical and 20 physical cores .NET SDK 9.0.100-preview.3.24204.13 [Host] : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD Job-OISOJI : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD Job-YEXJFM : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD

Method	Runtime	Mean	Error	StdDev	Ratio
ListAdd10000	.NET 8.0	10.488 us	0.0102 us	0.0080 us	1.00
ListAdd10000	.NET 9.0	28.429 us	0.0501 us	0.0469 us	2.71

ListAdd10000PreAlloc	.NET 8.0	7.137 us	0.0147 us	0.0138 us	1.00
ListAdd10000PreAlloc	.NET 9.0	24.054 us	0.0877 us	0.0732 us	3.37

linkdotnet commented 7 months ago

The same applies to 9.0.100-preview.2.24157.14 as well - so the regression might be since preview.1 or preview.2.

EgorBo commented 7 months ago

Interesting! Reproduces for me on M2 Max as well (just found it 🙂):

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]   : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
  .NET 8.0 : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
  .NET 9.0 : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD

| Method               | Job      | Runtime  | Mean      | Error     | StdDev    |
|--------------------- |--------- |--------- |----------:|----------:|----------:|
| ListAdd10000         | .NET 8.0 | .NET 8.0 |  9.633 us | 0.0906 us | 0.0848 us |
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 |  6.688 us | 0.0389 us | 0.0364 us |
| ListAdd10000         | .NET 9.0 | .NET 9.0 | 24.315 us | 0.1284 us | 0.1201 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 21.207 us | 0.0854 us | 0.0757 us |

tannergooding commented 7 months ago

Didn't we change Arm64 to start using the newer Arm64 barrier intstructions for the GC if they existed around that point?

IIRC, @kunalspathak did some of that work.

tannergooding commented 7 months ago

Not finding the barrier PR I was remembering, but there was https://github.com/dotnet/runtime/pull/97953 too

EgorBo commented 7 months ago

Doesn't repro on Linux-arm64

EgorBo commented 7 months ago

codegen diff: https://www.diffchecker.com/6tDgYQ8w/ - the only difference is ldp on net9 vs 2 ldr on net8. Hard to imagine it being a culprit 🤔

tannergooding commented 7 months ago

Could be something more subtle has changed and is impacting the general measurements or iteration count?

ldr and ldp are both supposed to be 3 cycle, 3 latency. But alternatively perhaps there's some weird data dependency or other issue here causing the delay?

I don't think Apple or Arm64 in general has anything like Intel VTune or AMD uProf, so seeing where the stalls are happening isn't as easy, unfortunately.

EgorBo commented 7 months ago

Could be something more subtle has changed and is impacting the general measurements or iteration count?

ldr and ldp are both supposed to be 3 cycle, 3 latency. But alternatively perhaps there's some weird data dependency or other issue here causing the delay?

I don't think Apple or Arm64 in general has anything like Intel VTune or AMD uProf, so seeing where the stalls are happening isn't as easy, unfortunately.

I've just checked it locally - I've built a runtime that exactly matches 9.0-preview3 and removed the Ldr->Ldp optimization - it has fixed the perf 😐

EgorBo commented 7 months ago

Here is the asm diff for ReplaceLdrStrWithPairInstr being there (left) and removed (right) on the same runtime: https://www.diffchecker.com/8C38Yoor/

EgorBo commented 7 months ago

After some brainstorming in our Community Discord:

1) The loop alignment is the same in both cases - we even tried to put an nop after ldp so the overall codegen's (and loop) size matches -- no change 2) The issue was introduced in https://github.com/dotnet/runtime/pull/92768 in .NET 9.0. That PR allows pre-existing ReplaceLdrStrWithPairInstr optimization to recognize more pairs of loads as ldp. 3) Linux-arm64 Ampere is not affected it seems (if I correctly measured) so for now it's macOS arm64 specific. Perhaps someone can try it on some other non-mac arm64 machine? The benchmark is:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[SimpleJob(runtimeMoniker: RuntimeMoniker.Net80)]
[SimpleJob(runtimeMoniker: RuntimeMoniker.Net90)]
public class MyClass
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(MyClass).Assembly).Run(args);
    }

    [Benchmark]
    public List<int> ListAdd10000PreAlloc()
    {
        var list = new List<int>(10_000);
        for (var i = 0; i < 10_000; i++)
            list.Add(i);
        return list;
    }
}

(with <TargetFrameworks>net8.0;net9.0</TargetFrameworks> in the csproj)

tannergooding commented 7 months ago

It could potentially be related to behavioral changes that exist for LDP when LSE2 is supported.

In particular

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load or store two 64-bit registers are single-copy atomic when all of the following conditions are true: • The overall memory access is aligned to 16 bytes. • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access fewer than 16 bytes are single-copy atomic when all of the following conditions are true: • All bytes being accessed are within a 16-byte quantity aligned to 16 bytes. • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memor

The actual operation (or a snippet of it) is then described loosely as:

if HaveLSE2Ext() && !signed then
    bits(2*datasize) full_data;
    full_data = Mem[address, 2*dbytes, accdesc, TRUE];
    if BigEndian(accdesc.acctype) then
        data2 = full_data<(datasize-1):0>;
        data1 = full_data<(2*datasize-1):datasize>;
    else
        data1 = full_data<(datasize-1):0>;
        data2 = full_data<(2*datasize-1):datasize>;
else
    data1 = Mem[address, dbytes, accdesc];
    data2 = Mem[address+dbytes, dbytes, accdesc];

So it may be beneficial to test this on a Linux machine with LSE2 to see if that makes a difference or not.

vcsjones commented 7 months ago

Graviton 3:

Method	Job	Runtime	Mean	Error	StdDev
ListAdd10000PreAlloc	.NET 8.0	.NET 8.0	23.84 us	0.422 us	0.414 us
ListAdd10000PreAlloc	.NET 9.0	.NET 9.0	23.75 us	0.270 us	0.252 us

To @tannergooding's point though, either the hardware or the kernel I am using do not support LSE2 based on dmesg output.

EgorBo commented 7 months ago

this change fixes the perf too. the loads we're looking at are _size and _version inside List<T> so we're likely dealing with some sort of stall here probably caused by LSE2 that Tanner mentioned

tannergooding commented 7 months ago

My guess, given the above, is then that since LSE2 makes ldp atomic for loads that exist within a 16-byte window (and we know these will be since the they'll be 8-byte aligned), that the write to _version is placing a stall against the read for _size, which wouldn't exist for two independent ldr.

EgorBo commented 7 months ago

I like that guess. Although, the fact that on Graviton 3 we're seeing 23 us for both cases (and I am seeing similar numbers on Ampere-linux-arm64) while x64 and apple m1 show 7us for the best case, might be hinting something else

EgorBo commented 7 months ago

a dummy field between _size and _version in List<> fixes the issue as well (since it breaks the ldp optimization too)

EgorBo commented 7 months ago

Minimal repro:

using System.Diagnostics;
using System.Runtime.CompilerServices;

public class MyClass
{
    static void Main()
    {
        var mc = new MyClass();
        Stopwatch sw = Stopwatch.StartNew();
        while (true)
        {
            sw.Restart();
            for (int i = 0; i < 10000; i++)
                mc.Test();
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }

    int field1;
    int field2;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Test()
    {
        for (int i = 0; i < 10000; i++)
        {
            field1++;
            field2++;
        }
    }
}

Codegen diff for Test(): https://www.diffchecker.com/72No1VJ6/ Left - Main, ~191 ms per iteration Right - LDP optimization disabled, ~67 ms per iteration (~3x faster)

EgorBo commented 7 months ago

cc @a74nh @TamarChristinaArm Perhaps, you know why we could see such a terrible perf hit from ldp on Apple arm64 CPUs

EgorBo commented 7 months ago

The fun part that this benchmark is 3x faster under x64 emulation (Rosetta), e.g.:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System.Runtime.Intrinsics;

BenchmarkRunner.Run<MyClass>();

[InProcess]
public class MyClass
{
    [Benchmark]
    public List<int> ListAdd10000PreAlloc()
    {
        var list = new List<int>(10_000);
        for (var i = 0; i < 10_000; i++)
            list.Add(i);
        return list;
    }
}

Then:

dotnet publish --sc -f net9.0 -r osx-x64 
dotnet publish --sc -f net9.0 -r osx-arm64

Native (arm64):

Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
  [Host] : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD

| Method               | Mean     | Error    | StdDev   |
|--------------------- |---------:|---------:|---------:|
| ListAdd10000PreAlloc | 22.07 us | 0.258 us | 0.229 us |

Rosetta (x64) on the same hw

Apple M2 Max 2.40GHz, 1 CPU, 12 logical and 12 physical cores
  [Host] : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT SSE4.2

| Method               | Mean     | Error     | StdDev    |
|--------------------- |---------:|----------:|----------:|
| ListAdd10000PreAlloc | 7.724 us | 0.0862 us | 0.0764 us |

linkdotnet commented 7 months ago

On an arm64 RPi 4B there is no regression as well:

BenchmarkDotNet v0.13.12, Debian GNU/Linux 12 (bookworm) Unknown processor .NET SDK 9.0.100-preview.3.24204.13 [Host] : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD .NET 8.0 : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD .NET 9.0 : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD

Method	Job	Runtime	Mean	Error	StdDev
ListAdd10000PreAlloc	.NET 8.0	.NET 8.0	74.73 us	0.164 us	0.128 us
ListAdd10000PreAlloc	.NET 9.0	.NET 9.0	64.74 us	0.468 us	0.415 us

tannergooding commented 7 months ago

My expectation remains that this is due to LSE2 making ldp atomic if it doesn't split a 16-byte window and that's causing the write to stall the otherwise unrelated read.

I expect that if someone has an LSE2 capable Linux box, they'd see the same perf hit. The hard problem is that not many easily testable A64 CPUs have LSE2 currently outside Apple M1 and later.

jakobbotsch commented 7 months ago

Are you saying that LSE2 makes it impossible for the hardware to implement ldp as efficiently as before, leading to this 3.5x slowdown simply from updating to LSE2 capable ARM64 hardware? It feels hard to believe. I'm hopeful this is some poorly handled case in Apple Silicon that we'll be able to understand better. Regardless it seems we'll need to figure out how exactly we want to disable this peephole.

EgorBo commented 7 months ago

I've checked that my linux-arm64 box has lse2 (via hwCap & HWCAP_USCAT) and it doesn't have this problem.

I'm hopeful this is some poorly handled case in Apple Silicon

... or just nicely handled case for ldr case there 🙂

tannergooding commented 7 months ago

Are you saying that LSE2 makes it impossible for the hardware to implement ldp as efficiently as before, leading to this 3.5x slowdown simply from updating to LSE2 capable ARM64 hardware?

Yes, for some scenarios. For most cases ldp will be strictly better/more efficient, but for others it has the chance to introduce a new data dependency due to the LSE2 requirement that it be atomic in some scenarios.

If you have (semi pseudo-code):

label: 
    ldr w0, [addr1]
    ldr w1, [addr1 + #4]
    ; do something with w1
    add w0, w0, #1
    str [addr1], w0
    br label

Then the code will first iterate through the loop normally, and the second iteration ldr w0, [addr1] will be scheduled but will be dependent on str [addr1], w0 from the previous iteration completing first. However, ldr w1, [addr2] is a separate load and therefore can happen independently of the first load and so we can continue onto executing do something with w1 even if the str [addr1], w0 hasn't completed yet.

This all follows the documented Arm64 memory model, which is a weakly ordered model. Where by default:

A read that is generated by a load instruction that loads a single general-purpose register and is aligned to the size of the read in the instruction is single-copy atomic.
A write that is generated by a store instruction that stores a single general-purpose register and is aligned to the size of the write in the instruction is single-copy atomic.
Reads that are generated by a Load Pair instruction that loads two general-purpose registers and are aligned to the size of the load to each register are treated as two single-copy atomic reads, one for each register being loaded.

and where the important part of the general semantics under B2.3 Definition of the Arm memory model can basically be summarized as a read effect E2 will occur after write effect E1 if they are to the same location and cannot be reordered with regards to eachother (it is formally defined as the Local read successor).

Thus, given the above the follow program is strictly equivalent:

label: 
    ldp w0, w1, [addr1]
    ; do something with w1
    add w0, w0, #1
    str [addr1], w0
    br label

LSE2, however, changes the semantics for ldp such that

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load or store two 64-bit registers are single-copy atomic when all of the following conditions are true:

The overall memory access is aligned to 16 bytes.

Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access fewer than 16 bytes are single-copy atomic when all of the following conditions are true:

All bytes being accessed are within a 16-byte quantity aligned to 16 bytes.

Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

This change to make the load a single-copy atomic that encompasses both registers, rather than it being 2 separate single-copy atomic operations (one for each register), really flips how ldp works (at least based on my understanding of the rules) and it means that due to the memory ordering rules of read effects to write effects and for local read successors, that the load of data into w1 can no longer happen independently of the str [addr], w0

tannergooding commented 7 months ago

There's notably, beyond the LSE2 changes, a number of additional changes to the memory model specifically for Armv8.4+ (detailed in B2.2.1.1 Changes to single-copy atomicity in Armv8.4), including in how coherence works for some types of overlapping reads/writes that are single-copy atomic.

tamlin-mike commented 7 months ago

I think @EgorBo hit the nail on the head with moving the increment of _version last in the Add method.

Not only does it (seemingly) avoid the CPU stall, I'd argue it's also correct. Having the increment at the top means it always increments version, even if something later goes wrong, such as terminating with an OOM exception. Now you have placed the list in a conceptually inconsistent state; the List is not modified, but _version claims it is modified. Placing the increment last makes sure it's only incremented once everything else is completed.

Performace-wise, having that read-modify-write just before method return could allow the CPU decode/execute the return (even if it may be only be a conceptual return instruction, due to the AggressiveInlining tag) in parallel with the RMW (assuming superscalar CPU).

Provided that logic is valid, perhaps it could be worth running comparative perf-tests for every ISA supported, with the RMW at head, vs. moved to the very end?

The core issue with the JIT obviously remains, but it would at least fix a logic error in the runtime.

tannergooding commented 7 months ago

Having the increment at the top means it always increments version, even if something later goes wrong, such as terminating with an OOM exception. Now you have placed the list in a conceptually inconsistent state; the List is not modified, but _version claims it is modified.

If an exception is thrown, you generally have to assume the potential for torn/corrupted state. The same general issue exists for every single one of the field mutations.

You typically want to try and avoid such cases where possible, but this is one of the cases where multi-threaded access is already considered UB and where incrementing the version where the only normal error would be an OOM, which is already considered a "catastrophic failure", is fine.

Clockwork-Muse commented 7 months ago

Now you have placed the list in a conceptually inconsistent state; the List is not modified, but _version claims it is modified.

So if I perform two modifications that net to no changes in the list, should the version be put back?

neon-sunset commented 7 months ago

Perhaps this could also serve as a tipping point to finally remove the version checks from List<T> altogether (or leave them in debug only)? 😄

EgorBo commented 7 months ago

Perhaps this could also serve as a tipping point to finally remove the version checks from List<T> altogether (or leave them in debug only)? 😄

For reference: https://github.com/dotnet/runtime/issues/81523

tamlin-mike commented 7 months ago

If an exception is thrown, you generally have to assume the potential for torn/corrupted state.

Nonsense. If you are dealing with "remote" resources (remote from the CPU's POV), e.g. graphics cards, network connections, USB devices et.c. it can be valid, but not for local memory, and certainly not for collection classes. This problem was solved decades ago in C++ std library.

The same general issue exists for every single one of the field mutations.

Again, nonsense. This is a simple matter of order of mutation. You simply do not modify the object until you have performed all potentially-exception-throwing operations, collecting their results in local variable(s). Once that is done, you can mutate this object fields in whatever order you like.

You typically want to try and avoid such cases where possible, but this is one of the cases where multi-threaded access is already considered UB

I didn't mention multi-threading, since the current design breaks even single-threaded.

Grab iterator.
Try to Add() something.
Add() fails with OOM, i.e. throws exception.

Here you expect the iterator to still be valid, and the collection object to be unmodified. With current implementation it is not. With the proposed change it is.

I hope the vast majority of people would prefer at least the basic level of exception-safe code in the core library.

So if I perform two modifications that net to no changes in the list, should the version be put back?

Since that's so silly, I expected to find a smiley at the end. Lacking that, I have to assume it's a serious question. No, obviously not. _version is a monotonic counter, intended to keep count of number-of-internal-state-modifications to the object. Let us try an analogy:

You query your bank account balance, calling it X.
You deposit 100 units of money to your bank account.
You withdraw 100 units of money from your bank account.
You query your bank account balance, calling it Y.

At this time, disregarding potential transaction fees, interest and so on, you expect X == Y. However, what's the bank internal transaction counter for your account (the equivalent of List._version)? Would you excpt that to roll back on the withdrawal? No, of course not. Heck, it's probably illegal in most civilized nations to do such a rollback.

tannergooding commented 7 months ago

This problem was solved decades ago in C++ std library.

There are different classifications of errors/exceptions, ranging from "the application messed up" to "the system messed up". While code ideally tries to order things such that you avoid the potential for torn state, you can not always perform updates atomically and may not be able to ensure all exceptions are done in a way that avoids the ability to have torn state. Accordingly, it is not safe to make the assumption that all exceptions are safe to catch. It is not safe to assume how a given piece of external to the module code does its operations and whether or not it could safely handle torn state.

Most code is a best effort and cannot necessarily capture all possible nuance. This is accordingly why C++ explicitly documents behavior as undefined or implementation defined in many cases. It is why the C++ spec explicitly defines valid but unspecified state early and refers to it regularly. Such a definition is important as it is very different from something which is in a fully unspecified state and therefore may be invalid.

To quote a part of the spec directly, insert, as it appears for std::vector<T> in 26.3.11.5 vector modifiers, explicitly documents that there are cases where there must be no effects, but that there is potential for there to be effects in other scenarios as it is impossible to definitively assert that it can be safe:

If an exception is thrown other than by the copy constructor, move constructor, assignment operator, or move assignment operator of T or by any InputIterator operation there are no effects. If an exception is thrown while inserting a single element at the end and T is CopyInsertable or is_nothrow_move_constructible_v is true, there are no effects. Otherwise, if an exception is thrown by the move constructor of a non-CopyInsertable T, the effects are unspecified.

Given the above, there are many more cases of potential torn state for inserting into a std::vector then there are for a List<T> which namely comes about from the existence of copy/move constructors and assignment operator support. Since .NET doesn't itself have those concerns for arbitrary T, and since Add functions as inserting a single element at the end, there are no effects other than an OOM that can occur in single-threaded code and thus no potential for torn state.

Here you expect the iterator to still be valid, and the collection object to be unmodified.

OOM, as thrown by the GC, is explicitly documented as a catastrophic exception. The docs further cover that if you were to catch the exception, you are really expected to do the bare minimal handling and then fail fast. Much like other types of catastrophic exception (such as StackOverflow), it is not expected to be something an application can reasonably recover from nor is something that you may guaranteed to be able to catch and handle.

This documentation is what makes when _version is updated for Add, not really matter. One could argue its better for it to happen later since its done after the necessary mutations have been made. One could alternatively argue its better for it to happen sooner, as it may help expose multithreaded access issues sooner. But neither is more or less correct, what happens is undefined behavior for any scenario where it might be an issue.

It is important to note that there are other types of "system thrown exceptions" which are representative of catastrophic failures, such as ExecutionEngineException or StackOverflowException. There are also exception types (such as OverflowException, IndexOutOfRangeException, and NullReferenceException) which are not catastrophic but are thrown by the system as an implicit side effect an operation. These all fall into the same general bucket, which is to say they are often representative that it may be unsafe to assume the state of the object is "valid".

We then have the more typical application thrown exceptions, such as ArgumentException, where it is only ever explicitly thrown by the developer. It is extremely typical for this type of exception to happen up front before any kinds of state mutations have happened and so its often considered okay to presume the state remains "valid, but unspecified" when one is thrown.

There is, of course, code out there that violates these presumptions. There is code that will manually throw system exceptions themselves (despite analyzers existing telling users to not do that) and there is code that will throw application exceptions after state mutations have occurred. In many cases this may be unintentional and could be caused by a myriad of different considerations, such as behavioral or breaking changes made by a dependency. However, it just reiterates that all of this remains a "best effort", that exceptions in general are meant to be exceptional, and that one needs to be considerate when catching or otherwise handling exceptions.

Clockwork-Muse commented 7 months ago

Since that's so silly, I expected to find a smiley at the end. Lacking that, I have to assume it's a serious question.

I guess I should have added the smiley.

We then have the more typical application thrown exceptions, such as ArgumentException, where it is only ever explicitly thrown by the developer. It is extremely typical for this type of exception to happen up front before any kinds of state mutations have happened and so its often considered okay to presume the state remains "valid, but unspecified" when one is thrown.

Frankly, anything that hits an ArgumentException should likely be fail-fast too, since it means the programmer has done something catastrophic themselves.

tannergooding commented 7 months ago

Frankly, anything that hits an ArgumentException should likely be fail-fast too, since it means the programmer has done something catastrophic themselves.

Right, it's mostly meant to tell the developer "you passed something incorrectly here" and therefore that they should go and fix their code. Things where exceptions might be expected, such as T Parse(string) where the string is likely user input typically have Try variants (i.e. bool TryParse(string, out T)). This doesn't block all exceptions, since we still want to differentiate between "the developer messed up" (passing in completely incompatible parse options) from "the user messed up" (typically some form of typo or not understanding the prompt).

Since it's mostly meant to be to tell the developer, catching the exception is typically not what you want. Catching it should really be a last resort workaround because a relevant Try API doesn't exist or there is some quirk/bug/etc. Scenarios exist, they should just be very rare

mqudsi commented 7 months ago

Frankly, anything that hits an ArgumentException should likely be fail-fast too, since it means the programmer has done something catastrophic themselves.

You would think so, though as @tannergooding mentions, there are reasons why you might be handling it. One from the comments (and my own experience): Process.GetProcessById() throws ArgumentException if the id doesn't exist.

tannergooding commented 7 months ago

I'd suggest people open API suggestions for cases like this:

Process.GetProcessById() throws ArgumentException if the id doesn't exist.

The general API review process is detailed here: https://github.com/dotnet/runtime/blob/main/docs/project/api-review-process.md

The template for a new API proposal is then: https://github.com/dotnet/runtime/issues/new?assignees=&labels=api-suggestion&projects=&template=02_api_proposal.yml&title=%5BAPI+Proposal%5D%3A+

It really doesn't need to be that complex and can really just detail that something like Process.TryGetProcessById would be beneficial because the current version throws ArgumentException if the id doesn't exist and there is no safe way to deterministically know if a process id remains valid since a separate process is strictly running concurrently and could shut down in between some check and query.

jkotas commented 7 months ago

FWIW, proposals for various Try* methods have been rejected by API review in the past https://github.com/dotnet/runtime/issues/21785#issuecomment-330615069 .

mqudsi commented 7 months ago

I submitted it as #101582, for whatever it is worth. If anyone sees anything amiss with the proposal, please do feel free to correct me!

EgorBo commented 7 months ago

So we have a good guess (discussed internally) what's going on here, but it's not clear how to detect bad patterns and disable ldr->ldp optimization. The PRs which added ldr->ldp optimization only have a very few improvement reports attached to them (except ldp for SIMD regs), so we can, probably, just conservatively disable it for Apple platforms (we've confirmed that this problem is Apple specific at this point).

I think the first thing we have to do is to add macOS-arm64 platform to our dotnet/performance runs (it's currently Linux and Windows only) before we make any changes.

marcan commented 7 months ago

Reminder that Linux on Apple Silicon is a thing, both virtualized and bare metal, and virtual Windows ARM64 on Apple Silicon is also a thing. Please do not gate this on macOS only.

Given the LSE2 theory, I think it makes sense to treat this as a performance issue potentially affecting more CPU platforms (especially future ones), given that there is a plausible explanation for the mechanism and why the code is, for at least some logical implementations, actually sub-optimal. If the gains of this peephole opt are otherwise minimal, and it's not easy to detect the problem code sequences (the ones where it triggers extra stalls) then I think the logical thing to do is just disable it across the board. Alternatively, if you think this is rare enough and the List.Add case is an outlier, then just work around it there (e.g. with the version change).

But please don't use OS platform as a gate for this, since that makes no sense given multiple OSes exist (you'd have to use CPU implementer instead, and I assume that's not practical given ahead-of-time compilation?).

neon-sunset commented 7 months ago

It looks like part of the discussion was marked off-topic.

Just wanted to re-iterate that this particular regression is still solvable by removing _version when users target Release altogether, and it might be a good opportunity to finally explore this option as described in https://github.com/dotnet/runtime/issues/81523

linkdotnet commented 7 months ago

It looks like part of the discussion was marked off-topic.

Just wanted to re-iterate that this particular regression is still solvable by removing _version when users target Release altogether, and it might be a good opportunity to finally explore this option as described in #81523

Wouldn’t that only solve the problem for List? Leaving an area for potential other regressions now and the future!?

linkdotnet commented 7 months ago

Don’t get me wrong - one can still lead this discussion but I do feel it should be handled separately (and not only because of performance characteristics)

EgorBo commented 7 months ago

Minimal repro for the Apple issue using Clang and inline asm: https://gist.github.com/EgorBo/88196a218559ec93a197a7d1d5600548

ldp makes this "benchmark" ~4-5x slower. An interesting thing is that if we have ldp and stp - it's still slow (we thought that the problem reproduces only when we mix str with ldp and stp with ldp would be fine)

mqudsi commented 7 months ago

Wouldn't just moving the _version write as @EgorBo demonstrated just be the least controversial change/fix here (fixes this particular performance regression, avoids arguing about the bigger questions involved with removing version checks altogether, keeps version checks in place even for List.Add())?

The performance regressions can be tackled by monitoring macOS benchmark results (the point about Apple Silicon != macOS is well taken, but also somewhat besides the point since Linux (OS) performance is tracked separately and monitoring macOS (OS + hw) performance would have caught this for both newer Macs (OS + hw) and Linux-on-Apple-silicon — you do not need "idealized" performance monitoring where only one factor changes between each monitored configuration) and figuring out where other notable regressions occur?

You probably don't want to blanket disable ldr to ldp optimizations, as this was a pathological case where there was a false data dependency on two adjacent fields but there likely are cases where compiler optimizations can take advantage of the single copy atomicity improvements to net performance gains.

(But I must confess I am confused why this would problem would not manifest on non-Apple silicon aarch64 w/ FEAT_LSE2 present, as reported earlier in this thread, unless the cpu wasn't taking advantage of the available instruction-level parallelism before so there is no regression now?)

Developer-Ecosystem-Engineering commented 7 months ago

We are reviewing this issue thanks for raising it!

Developer-Ecosystem-Engineering commented 6 months ago

The use of LDP instructions is generally encouraged. In this specific case, the structure of the loop (that includes a memory dependence between stores and loads), coupled with the use of LDP instructions prevents certain hardware optimizations from engaging. The small and tight nature of the loop also exacerbates the impact.

Please also see section 4.6.11 of the Apple Silicon CPU optimization guide.

dotnet / runtime

List Add is way slower (almost 3 times) in `net9.0 preview 3` than with `net8.0` #101437

Benchmark

`dotnet --info`