Closed: linkdotnet closed this issue 4 months ago.
Tagging subscribers to this area: @dotnet/area-system-collections See info in area-owners.md if you want to be subscribed.
Doesn't repro for me on x64-windows:
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-preview.3.24204.13
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
.NET 8.0 : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
.NET 9.0 : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method | Job | Runtime | Mean | Error | StdDev |
|--------------------- |--------- |--------- |---------:|----------:|----------:|
| ListAdd10000 | .NET 8.0 | .NET 8.0 | 7.339 us | 0.0598 us | 0.0559 us |
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 | 5.002 us | 0.0988 us | 0.1508 us |
| ListAdd10000 | .NET 9.0 | .NET 9.0 | 7.472 us | 0.1401 us | 0.2597 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 4.992 us | 0.0987 us | 0.1730 us |
I don't have a macos-arm64 machine to test.
@EgorBo I can repro on my M1.
BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
Apple M1 Ultra, 1 CPU, 20 logical and 20 physical cores
.NET SDK 9.0.100-preview.3.24204.13
[Host]     : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD
Job-OISOJI : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD
Job-YEXJFM : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD
| Method | Runtime | Mean | Error | StdDev | Ratio |
|--------------------- |--------- |----------:|----------:|----------:|------:|
| ListAdd10000 | .NET 8.0 | 10.488 us | 0.0102 us | 0.0080 us | 1.00 |
| ListAdd10000 | .NET 9.0 | 28.429 us | 0.0501 us | 0.0469 us | 2.71 |
| ListAdd10000PreAlloc | .NET 8.0 | 7.137 us | 0.0147 us | 0.0138 us | 1.00 |
| ListAdd10000PreAlloc | .NET 9.0 | 24.054 us | 0.0877 us | 0.0732 us | 3.37 |
The same applies to 9.0.100-preview.2.24157.14 as well, so the regression might date back to preview.1 or preview.2.
Interesting! Reproduces for me on M2 Max as well (just found it 🙂):
BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100-preview.3.24204.13
[Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
.NET 8.0 : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
.NET 9.0 : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD
| Method | Job | Runtime | Mean | Error | StdDev |
|--------------------- |--------- |--------- |----------:|----------:|----------:|
| ListAdd10000 | .NET 8.0 | .NET 8.0 | 9.633 us | 0.0906 us | 0.0848 us |
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 | 6.688 us | 0.0389 us | 0.0364 us |
| ListAdd10000 | .NET 9.0 | .NET 9.0 | 24.315 us | 0.1284 us | 0.1201 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 21.207 us | 0.0854 us | 0.0757 us |
Didn't we change Arm64 to start using the newer Arm64 barrier instructions for the GC, if they existed, around that point?
IIRC, @kunalspathak did some of that work.
Not finding the barrier PR I was remembering, but there was https://github.com/dotnet/runtime/pull/97953 too
Doesn't repro on Linux-arm64
Codegen diff: https://www.diffchecker.com/6tDgYQ8w/ - the only difference is `ldp` on net9 vs two `ldr` on net8. Hard to imagine it being the culprit 🤔
Could be something more subtle has changed and is impacting the general measurements or iteration count? `ldr` and `ldp` are both supposed to be 3-cycle latency. But alternatively, perhaps there's some weird data dependency or other issue here causing the delay?

I don't think Apple or Arm64 in general has anything like Intel VTune or AMD uProf, so seeing where the stalls are happening isn't as easy, unfortunately.
> Could be something more subtle has changed and is impacting the general measurements or iteration count? `ldr` and `ldp` are both supposed to be 3-cycle latency. But alternatively, perhaps there's some weird data dependency or other issue here causing the delay? I don't think Apple or Arm64 in general has anything like Intel VTune or AMD uProf, so seeing where the stalls are happening isn't as easy, unfortunately.
I've just checked it locally - I've built a runtime that exactly matches 9.0-preview3 and removed the Ldr->Ldp optimization - it has fixed the perf 😐
Here is the asm diff for `ReplaceLdrStrWithPairInstr` being there (left) and removed (right) on the same runtime: https://www.diffchecker.com/8C38Yoor/
After some brainstorming in our Community Discord:
1) The loop alignment is the same in both cases - we even tried to put a `nop` after `ldp` so the overall codegen (and loop) size matches -- no change.
2) The issue was introduced in https://github.com/dotnet/runtime/pull/92768 in .NET 9.0. That PR allows the pre-existing `ReplaceLdrStrWithPairInstr` optimization to recognize more pairs of loads as `ldp`.
3) Linux-arm64 Ampere is not affected, it seems (if I measured correctly), so for now it's macOS arm64 specific. Perhaps someone can try it on some other non-Mac arm64 machine? The benchmark is:
```csharp
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[SimpleJob(runtimeMoniker: RuntimeMoniker.Net80)]
[SimpleJob(runtimeMoniker: RuntimeMoniker.Net90)]
public class MyClass
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(MyClass).Assembly).Run(args);
    }

    [Benchmark]
    public List<int> ListAdd10000PreAlloc()
    {
        var list = new List<int>(10_000);
        for (var i = 0; i < 10_000; i++)
            list.Add(i);
        return list;
    }
}
```

(with `<TargetFrameworks>net8.0;net9.0</TargetFrameworks>` in the csproj)
It could potentially be related to behavioral changes that exist for `LDP` when `LSE2` is supported. In particular:

> If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load or store two 64-bit registers are single-copy atomic when all of the following conditions are true:
> - The overall memory access is aligned to 16 bytes.
> - Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.
>
> If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access fewer than 16 bytes are single-copy atomic when all of the following conditions are true:
> - All bytes being accessed are within a 16-byte quantity aligned to 16 bytes.
> - Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.
The actual operation (or a snippet of it) is then described loosely as:

```
if HaveLSE2Ext() && !signed then
    bits(2*datasize) full_data;
    full_data = Mem[address, 2*dbytes, accdesc, TRUE];
    if BigEndian(accdesc.acctype) then
        data2 = full_data<(datasize-1):0>;
        data1 = full_data<(2*datasize-1):datasize>;
    else
        data1 = full_data<(datasize-1):0>;
        data2 = full_data<(2*datasize-1):datasize>;
else
    data1 = Mem[address, dbytes, accdesc];
    data2 = Mem[address+dbytes, dbytes, accdesc];
```
So it may be beneficial to test this on a Linux machine with LSE2 to see if that makes a difference or not.
Graviton 3:
| Method | Job | Runtime | Mean | Error | StdDev |
|--------------------- |--------- |--------- |---------:|---------:|---------:|
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 | 23.84 us | 0.422 us | 0.414 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 23.75 us | 0.270 us | 0.252 us |
To @tannergooding's point though, either the hardware or the kernel I am using does not support LSE2, based on dmesg output.
This change fixes the perf too. The loads we're looking at are `_size` and `_version` inside `List<T>`, so we're likely dealing with some sort of stall here, probably caused by the LSE2 behavior that Tanner mentioned.
My guess, given the above, is then that since `LSE2` makes `ldp` atomic for loads that exist within a 16-byte window (and we know these will be, since they'll be 8-byte aligned), the write to `_version` is placing a stall against the read of `_size`, which wouldn't exist for two independent `ldr`s.
I like that guess. Although, the fact that on Graviton 3 we're seeing 23 us for both cases (and I am seeing similar numbers on Ampere-linux-arm64) while x64 and Apple M1 show 7 us for the best case might be hinting at something else.

A dummy field between `_size` and `_version` in `List<>` fixes the issue as well (since it breaks the ldp optimization too).
Minimal repro:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public class MyClass
{
    static void Main()
    {
        var mc = new MyClass();
        Stopwatch sw = Stopwatch.StartNew();
        while (true)
        {
            sw.Restart();
            for (int i = 0; i < 10000; i++)
                mc.Test();
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }

    int field1;
    int field2;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Test()
    {
        for (int i = 0; i < 10000; i++)
        {
            field1++;
            field2++;
        }
    }
}
```
Codegen diff for `Test()`: https://www.diffchecker.com/72No1VJ6/
Left - Main, ~191 ms per iteration.
Right - LDP optimization disabled, ~67 ms per iteration (~3x faster).

cc @a74nh @TamarChristinaArm Perhaps you know why we could see such a terrible perf hit from `ldp` on Apple arm64 CPUs?
The fun part is that this benchmark is 3x faster under x64 emulation (Rosetta), e.g.:

```csharp
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System.Runtime.Intrinsics;

BenchmarkRunner.Run<MyClass>();

[InProcess]
public class MyClass
{
    [Benchmark]
    public List<int> ListAdd10000PreAlloc()
    {
        var list = new List<int>(10_000);
        for (var i = 0; i < 10_000; i++)
            list.Add(i);
        return list;
    }
}
```

Then:

```shell
dotnet publish --sc -f net9.0 -r osx-x64
dotnet publish --sc -f net9.0 -r osx-arm64
```
Native (arm64):
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
[Host] : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD
| Method | Mean | Error | StdDev |
|--------------------- |---------:|---------:|---------:|
| ListAdd10000PreAlloc | 22.07 us | 0.258 us | 0.229 us |
Rosetta (x64) on the same hardware:
Apple M2 Max 2.40GHz, 1 CPU, 12 logical and 12 physical cores
[Host] : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT SSE4.2
| Method | Mean | Error | StdDev |
|--------------------- |---------:|----------:|----------:|
| ListAdd10000PreAlloc | 7.724 us | 0.0862 us | 0.0764 us |
On an arm64 RPi 4B there is no regression either:

BenchmarkDotNet v0.13.12, Debian GNU/Linux 12 (bookworm)
Unknown processor
.NET SDK 9.0.100-preview.3.24204.13
[Host]   : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD
.NET 8.0 : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
.NET 9.0 : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD
| Method | Job | Runtime | Mean | Error | StdDev |
|--------------------- |--------- |--------- |---------:|---------:|---------:|
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 | 74.73 us | 0.164 us | 0.128 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 64.74 us | 0.468 us | 0.415 us |
My expectation remains that this is due to LSE2 making `ldp` atomic if it doesn't split a 16-byte window, and that's causing the write to stall the otherwise unrelated read.

I expect that if someone has an LSE2-capable Linux box, they'd see the same perf hit. The hard problem is that not many easily testable A64 CPUs have LSE2 currently, outside Apple M1 and later.
Are you saying that LSE2 makes it impossible for the hardware to implement `ldp` as efficiently as before, leading to this 3.5x slowdown simply from updating to LSE2-capable ARM64 hardware?

It feels hard to believe. I'm hopeful this is some poorly handled case in Apple Silicon that we'll be able to understand better. Regardless, it seems we'll need to figure out how exactly we want to disable this peephole.
I've checked that my linux-arm64 box has LSE2 (via `hwCap & HWCAP_USCAT`) and it doesn't have this problem.
> I'm hopeful this is some poorly handled case in Apple Silicon

... or just a nicely handled case for `ldr` there 🙂
> Are you saying that LSE2 makes it impossible for the hardware to implement ldp as efficiently as before, leading to this 3.5x slowdown simply from updating to LSE2 capable ARM64 hardware?

Yes, for some scenarios. For most cases `ldp` will be strictly better/more efficient, but for others it has the chance to introduce a new data dependency due to the LSE2 requirement that it be atomic in some scenarios.
If you have (semi pseudo-code):

```asm
label:
    ldr w0, [addr1]
    ldr w1, [addr1 + #4]
    ; do something with w1
    add w0, w0, #1
    str [addr1], w0
    br label
```
Then the code will first iterate through the loop normally, and on the second iteration `ldr w0, [addr1]` will be scheduled but will depend on the `str [addr1], w0` from the previous iteration completing first. However, `ldr w1, [addr1 + #4]` is a separate load and therefore can happen independently of the first load, so we can continue on to executing `do something with w1` even if the `str [addr1], w0` hasn't completed yet.
This all follows the documented Arm64 memory model, which is a weakly ordered model. The important part of the general semantics, under *B2.3 Definition of the Arm memory model*, can basically be summarized as: a read effect `E2` will occur after a write effect `E1` if they are to the same location and cannot be reordered with regard to each other (formally, `E2` is a *Local read successor* of `E1`).
Thus, given the above, the following program is strictly equivalent:

```asm
label:
    ldp w0, w1, [addr1]
    ; do something with w1
    add w0, w0, #1
    str [addr1], w0
    br label
```
LSE2, however, changes the semantics for `ldp` such that:

> If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load or store two 64-bit registers are single-copy atomic when all of the following conditions are true:
> - The overall memory access is aligned to 16 bytes.
> - Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.
>
> If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access fewer than 16 bytes are single-copy atomic when all of the following conditions are true:
> - All bytes being accessed are within a 16-byte quantity aligned to 16 bytes.
> - Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.
This change, which makes the load a single-copy atomic that encompasses both registers rather than two separate single-copy atomic operations (one for each register), really flips how `ldp` works (at least based on my understanding of the rules). Due to the memory ordering rules for read effects relative to write effects, and for *Local read successors*, the load of data into `w1` can no longer happen independently of the `str [addr], w0`.
Notably, beyond the LSE2 changes, there are a number of additional changes to the memory model specifically for Armv8.4+ (detailed in *B2.2.1.1 Changes to single-copy atomicity in Armv8.4*), including in how coherence works for some types of overlapping reads/writes that are single-copy atomic.
I think @EgorBo hit the nail on the head with moving the increment of `_version` last in the `Add` method. Not only does it (seemingly) avoid the CPU stall, I'd argue it's also correct.

Having the increment at the top means the version is always incremented, even if something later goes wrong, such as terminating with an OOM exception. Now you have placed the list in a conceptually inconsistent state: the list is not modified, but `_version` claims it is. Placing the increment last makes sure it only happens once everything else has completed.

Performance-wise, having that read-modify-write just before the method returns could allow the CPU to decode/execute the return (even if it may only be a conceptual return instruction, due to the `AggressiveInlining` attribute) in parallel with the RMW (assuming a superscalar CPU).

Provided that logic is valid, perhaps it would be worth running comparative perf tests for every supported ISA, with the RMW at the head vs. moved to the very end? The core issue with the JIT obviously remains, but it would at least fix a logic error in the runtime.
> Having the increment at the top means it always increments version, even if something later goes wrong, such as terminating with an OOM exception. Now you have placed the list in a conceptually inconsistent state; the List is not modified, but _version claims it is modified.

If an exception is thrown, you generally have to assume the potential for torn/corrupted state. The same general issue exists for every single one of the field mutations. You typically want to try and avoid such cases where possible, but this is one of the cases where multi-threaded access is already considered UB, and where incrementing the version when the only plausible error is an OOM, which is already considered a "catastrophic failure", is fine.
> Now you have placed the list in a conceptually inconsistent state; the List is not modified, but `_version` claims it is modified.

So if I perform two modifications that net to no changes in the list, should the version be put back?
Perhaps this could also serve as a tipping point to finally remove the version checks from `List<T>` altogether (or leave them in debug builds only)? 😄
> Perhaps this could also serve as a tipping point to finally remove the version checks from `List<T>` altogether (or leave them in debug only)? 😄

For reference: https://github.com/dotnet/runtime/issues/81523
> If an exception is thrown, you generally have to assume the potential for torn/corrupted state.

Nonsense. If you are dealing with "remote" resources (remote from the CPU's POV), e.g. graphics cards, network connections, USB devices, etc., it can be valid, but not for local memory, and certainly not for collection classes. This problem was solved decades ago in the C++ standard library.

> The same general issue exists for every single one of the field mutations.

Again, nonsense. This is a simple matter of order of mutation. You simply do not modify the object until you have performed all potentially-exception-throwing operations, collecting their results in local variable(s). Once that is done, you can mutate `this` object's fields in whatever order you like.

> You typically want to try and avoid such cases where possible, but this is one of the cases where multi-threaded access is already considered UB

I didn't mention multi-threading, since the current design breaks even single-threaded. Here you expect the iterator to still be valid, and the collection object to be unmodified. With the current implementation it is not. With the proposed change it is. I hope the vast majority of people would prefer at least a basic level of exception-safe code in the core library.
> So if I perform two modifications that net to no changes in the list, should the version be put back?

Since that's so silly, I expected to find a smiley at the end. Lacking that, I have to assume it's a serious question. No, obviously not. `_version` is a monotonic counter, intended to keep count of the number of internal-state modifications to the object.

Let us try an analogy: at this time, disregarding potential transaction fees, interest and so on, you expect X == Y. However, what's the bank's internal transaction counter for your account (the equivalent of `List._version`)? Would you expect that to roll back on the withdrawal? No, of course not. Heck, it's probably illegal in most civilized nations to do such a rollback.
> This problem was solved decades ago in C++ std library.

There are different classifications of errors/exceptions, ranging from "the application messed up" to "the system messed up". While code ideally tries to order things such that you avoid the potential for torn state, you cannot always perform updates atomically and may not be able to ensure all exceptions are raised in a way that avoids the ability to have torn state. Accordingly, it is not safe to assume that all exceptions are safe to catch, nor to assume how a given piece of code external to the module does its operations and whether or not it could safely handle torn state.

Most code is a best effort and cannot necessarily capture all possible nuance. This is accordingly why C++ explicitly documents behavior as undefined or implementation-defined in many cases. It is why the C++ spec explicitly defines *valid but unspecified state* early and refers to it regularly. Such a definition is important, as it is very different from something which is in a fully unspecified state and may therefore be invalid.

To quote a part of the spec directly, `insert`, as it appears for `std::vector<T>` in *26.3.11.5 vector modifiers*, explicitly documents that there are cases where there must be no effects, but that there is potential for effects in other scenarios, as it is impossible to definitively assert that it can be safe:

> If an exception is thrown other than by the copy constructor, move constructor, assignment operator, or move assignment operator of T or by any InputIterator operation there are no effects. If an exception is thrown while inserting a single element at the end and T is CopyInsertable or is_nothrow_move_constructible_v is true, there are no effects. Otherwise, if an exception is thrown by the move constructor of a non-CopyInsertable T, the effects are unspecified.

Given the above, there are many more cases of potential torn state for inserting into a `std::vector` than there are for a `List<T>`, which namely come about from the existence of copy/move constructor and assignment operator support. Since .NET doesn't itself have those concerns for arbitrary `T`, and since `Add` functions as "inserting a single element at the end", there are no effects other than an OOM that can occur in single-threaded code, and thus no potential for torn state.
> Here you expect the iterator to still be valid, and the collection object to be unmodified.

OOM, as thrown by the GC, is explicitly documented as a *catastrophic exception*. The docs further cover that if you were to catch the exception, you are really expected to do the bare minimum of handling and then *fail fast*. Much like other types of catastrophic exceptions (such as `StackOverflowException`), it is not expected to be something an application can reasonably recover from, nor something that you are guaranteed to be able to catch and handle.

This documentation is why the point at which `_version` is updated in `Add` does not really matter. One could argue it's better for it to happen later, since it is then done after the necessary mutations have been made. One could alternatively argue it's better for it to happen sooner, as it may help expose multi-threaded access issues sooner. But neither is more or less correct; what happens is undefined behavior for any scenario where it might be an issue.
It is important to note that there are other types of "system thrown exceptions" which are representative of catastrophic failures, such as `ExecutionEngineException` or `StackOverflowException`. There are also exception types (such as `OverflowException`, `IndexOutOfRangeException`, and `NullReferenceException`) which are not catastrophic but are thrown by the system as an implicit side effect of an operation. These all fall into the same general bucket, which is to say they often indicate that it may be unsafe to assume the state of the object is "valid".

We then have the more typical application-thrown exceptions, such as `ArgumentException`, which are only ever explicitly thrown by the developer. It is extremely typical for this type of exception to happen up front, before any kind of state mutation has occurred, and so it's often considered okay to presume the state remains "valid, but unspecified" when one is thrown.

There is, of course, code out there that violates these presumptions. There is code that will manually throw system exceptions itself (despite analyzers telling users not to do that), and there is code that will throw application exceptions after state mutations have occurred. In many cases this may be unintentional and could be caused by any of a myriad of considerations, such as behavioral or breaking changes made by a dependency. However, it just reiterates that all of this remains a best effort, that exceptions in general are meant to be exceptional, and that one needs to be considerate when catching or otherwise handling exceptions.
> Since that's so silly, I expected to find a smiley at the end. Lacking that, I have to assume it's a serious question.

I guess I should have added the smiley.
> We then have the more typical application thrown exceptions, such as `ArgumentException`, where it is only ever explicitly thrown by the developer. It is extremely typical for this type of exception to happen up front before any kinds of state mutations have happened and so it's often considered okay to presume the state remains "valid, but unspecified" when one is thrown.

Frankly, anything that hits an `ArgumentException` should likely fail fast too, since it means the programmer has done something catastrophic themselves.
> Frankly, anything that hits an ArgumentException should likely be fail-fast too, since it means the programmer has done something catastrophic themselves.

Right, it's mostly meant to tell the developer "you passed something incorrectly here" and therefore that they should go and fix their code. Things where exceptions might be expected, such as `T Parse(string)` where the string is likely user input, typically have `Try` variants (i.e. `bool TryParse(string, out T)`). This doesn't block all exceptions, since we still want to differentiate "the developer messed up" (passing in completely incompatible parse options) from "the user messed up" (typically some form of typo or not understanding the prompt).

Since it's mostly meant to tell the developer, catching the exception is typically not what you want. Catching it should really be a last-resort workaround, used because a relevant `Try` API doesn't exist or there is some quirk/bug/etc. Such scenarios exist; they should just be very rare.
> Frankly, anything that hits an `ArgumentException` should likely be fail-fast too, since it means the programmer has done something catastrophic themselves.

You would think so, though as @tannergooding mentions, there are reasons why you might be handling it. One from the comments (and my own experience): `Process.GetProcessById()` throws `ArgumentException` if the id doesn't exist.
I'd suggest people open API suggestions for cases like this:

> Process.GetProcessById() throws ArgumentException if the id doesn't exist.

The general API review process is detailed here: https://github.com/dotnet/runtime/blob/main/docs/project/api-review-process.md

The template for a new API proposal is then: https://github.com/dotnet/runtime/issues/new?assignees=&labels=api-suggestion&projects=&template=02_api_proposal.yml&title=%5BAPI+Proposal%5D%3A+

It really doesn't need to be that complex; it can simply detail that something like `Process.TryGetProcessById` would be beneficial because the current version throws `ArgumentException` if the `id` doesn't exist, and there is no safe way to deterministically know whether a process id remains valid, since the other process runs concurrently and could shut down between a check and the query.
FWIW, proposals for various `Try*` methods have been rejected by API review in the past: https://github.com/dotnet/runtime/issues/21785#issuecomment-330615069
I submitted it as #101582, for whatever it is worth. If anyone sees anything amiss with the proposal, please do feel free to correct me!
So we have a good guess (discussed internally) at what's going on here, but it's not clear how to detect the bad patterns and disable the ldr->ldp optimization for them. The PRs which added the `ldr->ldp` optimization only have a few improvement reports attached to them (except ldp for SIMD regs), so we can probably just conservatively disable it for Apple platforms (we've confirmed that this problem is Apple-specific at this point).

I think the first thing we have to do is add the macOS-arm64 platform to our dotnet/performance runs (currently Linux and Windows only) before we make any changes.
Reminder that Linux on Apple Silicon is a thing, both virtualized and bare metal, and virtual Windows ARM64 on Apple Silicon is also a thing. Please do not gate this on macOS only.
Given the LSE2 theory, I think it makes sense to treat this as a performance issue potentially affecting more CPU platforms (especially future ones), given that there is a plausible explanation for the mechanism and why the code is, for at least some logical implementations, actually sub-optimal. If the gains of this peephole opt are otherwise minimal, and it's not easy to detect the problem code sequences (the ones where it triggers extra stalls), then I think the logical thing to do is just disable it across the board. Alternatively, if you think this is rare enough and the List.Add case is an outlier, then just work around it there (e.g. with the `_version` change).
But please don't use OS platform as a gate for this, since that makes no sense given multiple OSes exist (you'd have to use CPU implementer instead, and I assume that's not practical given ahead-of-time compilation?).
It looks like part of the discussion was marked off-topic. Just wanted to reiterate that this particular regression is still solvable by removing `_version` altogether when users target `Release`, and it might be a good opportunity to finally explore that option, as described in https://github.com/dotnet/runtime/issues/81523
> It looks like part of the discussion was marked off-topic. Just wanted to re-iterate that this particular regression is still solvable by removing `_version` altogether when users target `Release`, and it might be a good opportunity to finally explore this option as described in #81523

Wouldn't that only solve the problem for List, leaving room for other potential regressions now and in the future? Don't get me wrong - one can still have this discussion, but I do feel it should be handled separately (and not only because of performance characteristics).
Minimal repro for the Apple issue using Clang and inline asm: https://gist.github.com/EgorBo/88196a218559ec93a197a7d1d5600548

`ldp` makes this "benchmark" ~4-5x slower. An interesting thing is that if we have `ldp` and `stp`, it's still slow (we thought the problem reproduces only when we mix `str` with `ldp`, and that `stp` with `ldp` would be fine).
Wouldn't just moving the `_version` write, as @EgorBo demonstrated, be the least controversial change/fix here? It fixes this particular performance regression, avoids arguing about the bigger questions involved in removing version checks altogether, and keeps the version checks in place, even for `List.Add()`.

The performance regressions can then be tackled by monitoring macOS benchmark results and figuring out where other notable regressions occur. (The point about Apple Silicon != macOS is well taken, but also somewhat beside the point, since Linux (OS) performance is tracked separately, and monitoring macOS (OS + hw) performance would have caught this for both newer Macs and Linux-on-Apple-silicon - you do not need "idealized" performance monitoring where only one factor changes between each monitored configuration.)
You probably don't want to blanket-disable ldr-to-ldp optimizations: this was a pathological case with a false data dependency on two adjacent fields, but there likely are cases where compiler optimizations can take advantage of the single-copy atomicity improvements for net performance gains.

(But I must confess I am confused why this problem would not manifest on non-Apple aarch64 silicon with FEAT_LSE2 present, as reported earlier in this thread, unless those CPUs weren't taking advantage of the available instruction-level parallelism before, so there is no regression now?)
We are reviewing this issue; thanks for raising it!
The use of LDP instructions is generally encouraged. In this specific case, the structure of the loop (that includes a memory dependence between stores and loads), coupled with the use of LDP instructions prevents certain hardware optimizations from engaging. The small and tight nature of the loop also exacerbates the impact.
Please also see section 4.6.11 of the Apple Silicon CPU optimization guide.
With the latest preview 3 (.NET 9) there seems to be a major performance regression on macOS 14.4 (M2 Pro) with lists. Given that lists are one of the most used types, this should receive special attention.
Benchmark
Results:
Interestingly even the pre-allocated list is comparably slow.
dotnet --info
Output:

```
.NET SDK:
 Version:           9.0.100-preview.3.24204.13
 Commit:            81f61d8290
 Workload version:  9.0.100-manifests.77bb7ba9
 MSBuild version:   17.11.0-preview-24178-16+7ca3c98fa

Runtime Environment:
 OS Name:     Mac OS X
 OS Version:  14.4
 OS Platform: Darwin
 RID:         osx-arm64
 Base Path:   /usr/local/share/dotnet/sdk/9.0.100-preview.3.24204.13/

.NET workloads installed:
There are no installed workloads to display.

Host:
  Version:      9.0.0-preview.3.24172.9
  Architecture: arm64
  Commit:       9e6ba1f68c

.NET SDKs installed:
  6.0.417 [/usr/local/share/dotnet/sdk]
  7.0.306 [/usr/local/share/dotnet/sdk]
  7.0.404 [/usr/local/share/dotnet/sdk]
  8.0.100 [/usr/local/share/dotnet/sdk]
  8.0.101 [/usr/local/share/dotnet/sdk]
  8.0.200 [/usr/local/share/dotnet/sdk]
  9.0.100-preview.2.24121.2 [/usr/local/share/dotnet/sdk]
  9.0.100-preview.2.24157.14 [/usr/local/share/dotnet/sdk]
  9.0.100-preview.3.24204.13 [/usr/local/share/dotnet/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 8.0.1 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 8.0.2 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 9.0.0-preview.2.24120.6 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 9.0.0-preview.2.24128.4 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 9.0.0-preview.3.24172.13 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 8.0.1 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 8.0.2 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 9.0.0-preview.2.24120.11 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 9.0.0-preview.2.24128.5 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 9.0.0-preview.3.24172.9 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]

Other architectures found:
  x64 [/usr/local/share/dotnet/x64]

Environment variables:
  Not set

global.json file:
  Not found

Learn more:
  https://aka.ms/dotnet/info

Download .NET:
  https://aka.ms/dotnet/download
```