dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

API Proposal: Add Interlocked ops w/ explicit memoryOrder #26092

Closed sdmaclea closed 3 years ago

sdmaclea commented 6 years ago

While trying to stabilize the thread pool for linux-arm64 during the 2.1 release effort, it became apparent that the safest approach was to assume that existing code expects an Interlocked operation to act as a full barrier, enforcing sequential consistency at least with respect to the operations before and after it.

While this approach likely guarantees functional correctness for most legacy code, it comes at a significant cost on weakly ordered machines. It is also rare that an Interlocked operation actually needs to guarantee sequential consistency.

This proposal adds a MemoryOrder parameter to each atomic interlocked operation.

The proposal currently does not give the MemoryOrder parameter a default of MemoryOrder.SequentiallyConsistent, because those APIs already exist and can be presumed to continue to exist in order to support .NET Standard 2.1 and earlier.

namespace System.Threading
{
    public enum MemoryOrder
    {
        SequentiallyConsistent,
        AcquireRelease,
        Release,
        Acquire,
        Consume,
        Relaxed
    }

    public static partial class Interlocked
    {
        public static int Add(ref int location1, int value, MemoryOrder memoryOrder);
        public static long Add(ref long location1, long value, MemoryOrder memoryOrder);
        public static double CompareExchange(ref double location1, double value, double comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static int CompareExchange(ref int location1, int value, int comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static long CompareExchange(ref long location1, long value, long comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static IntPtr CompareExchange(ref IntPtr location1, IntPtr value, IntPtr comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static object CompareExchange(ref object location1, object value, object comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static float CompareExchange(ref float location1, float value, float comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static T CompareExchange<T>(ref T location1, T value, T comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static int Decrement(ref int location, MemoryOrder memoryOrder);
        public static long Decrement(ref long location, MemoryOrder memoryOrder);
        public static double Exchange(ref double location1, double value, MemoryOrder memoryOrder);
        public static int Exchange(ref int location1, int value, MemoryOrder memoryOrder);
        public static long Exchange(ref long location1, long value, MemoryOrder memoryOrder);
        public static IntPtr Exchange(ref IntPtr location1, IntPtr value, MemoryOrder memoryOrder);
        public static object Exchange(ref object location1, object value, MemoryOrder memoryOrder);
        public static float Exchange(ref float location1, float value, MemoryOrder memoryOrder);
        public static T Exchange<T>(ref T location1, T value, MemoryOrder memoryOrder) where T : class;
        public static int Increment(ref int location, MemoryOrder memoryOrder);
        public static long Increment(ref long location, MemoryOrder memoryOrder);
        public static void MemoryBarrier(MemoryOrder memoryOrder);
    }
}

sdmaclea commented 6 years ago

@kouvel @stephentoub @CarolEidt @eerhardt @RussKeldorph @jkotas

MemoryOrder was copied from dotnet/runtime#17975

sdmaclea commented 6 years ago

Created based on discussion in https://github.com/dotnet/coreclr/pull/17567#issuecomment-381341865 and surrounding

kouvel commented 6 years ago

C++11 atomics seem to have separate memory order in compare_exchange for success case and failure case. I think it would be good to separate them when we're being explicit about the ordering anyway.

benaadams commented 6 years ago

Related proposal Atomic<T> which has memory order https://github.com/dotnet/corefx/issues/10481

mihailik commented 6 years ago

Please do not pollute this moderately-high-level API, create a separate one (potentially in nested class).

A new overload will come up in intellisense and lead developers astray.

Cutting edge low-level gurus who DO understand the differences between those options would do fine with typing a few more characters.

svick commented 6 years ago

Are all of the proposed values in MemoryOrder actually useful? I don't actually have an opinion, I'm just asking to make sure we're not blindly copying from C++.

sdmaclea commented 6 years ago

Please do not pollute this moderately-high-level API, create a separate one (potentially in nested class).

This overload logically belongs with interlocked operations. The arguments are already implicitly present. This allows code to be explicit. It documents the required ordering. I do not believe it is polluting.

A new overload will come up in intellisense and lead developers astray.

The fact that it comes up in intellisense is a good thing. Developers should make a deliberate decision about what is intended by each Interlocked operation in terms of memory ordering. I would argue hiding this leads them astray.

Cutting edge low-level gurus who DO understand differences between those options would do fine with typing few more characters.

If a developer is not willing to think about memory ordering, they should probably be using higher-level abstractions (locking, thread-safe queues, and so on).

sdmaclea commented 6 years ago

Are all of the proposed values in MemoryOrder actually useful? I don't actually have an opinion, I'm just asking to make sure we're not blindly copying from C++.

All the orderings are useful.

sdmaclea commented 6 years ago

I think it would be good to separate them when we're being explicit about the ordering anyway.

Done

mihailik commented 6 years ago

@sdmaclea If a developer is not willing to think about memory ordering, they should probably be using higher-level abstractions (locking, thread-safe queues, and so on)

That is: (a) demonstrably not true, (b) uncooperative to the community already using the platform, and (c) shows a narrow focus on your specific use case, to the exclusion of platform strategic goals.

(a) the existing API has been used for 18+ years — without excessive fine-grained detail.

The original PR rationale explicitly mentions that current conservative model. It was enough for 18 years; lots of people built solid software expecting a conservative model that is simple and easy to reason about.

Your suggestion that the current conservative model is somehow flawed, and that Interlocked.* must only be used with full knowledge of more subtle, error-prone models — that suggestion is incorrect.

(b) the existing C# developer community consists of much more than performance-focused tech folks. For a huge majority of C# developers, the conservative memory model of Interlocked.* is as low a level as they will ever need to go. Adding this alongside the current, already low-level API will add noticeable risk for that many people, with absolutely zero upside to them.

(c) the fact that you understand fine-grained memory model differences does harm your ability to recognise risk here. It's as if I make speaking Ukrainian mandatory for coding in C# — I mean it's sooooo easy to me, everybody should learn too.


To demonstrate my core point, show the ArrayList.SyncRoot property getter code (from .NET Framework) to a C# developer and see how many of them will be able to guess the right memory order to use instead of the conservative one.

mihailik commented 6 years ago

Suggesting API instead:

Interlocked.Ordered.Increment(ref this.count, Interlocked.Ordered.MemoryOrder.Relaxed);

Requiring people to type that extra .Ordered keeps it logically together with the existing API, but makes it impossible to stumble into hot water without looking.

sdmaclea commented 6 years ago

@mihailik No disrespect for your opinion was meant. Since you stated your opinion, it was necessary to state my opposing opinion.

kouvel commented 6 years ago

The original PR rationale explicitly mentions that current conservative model. That was enough for 18 years, lots of people built solid software expecting conservative model that is simplistic and easy to reason about.

It was probably enough because of the platforms it was running on. It seems in most cases the strongest memory order is not actually necessary.

For the API it seems to me like an overload is the natural thing to do and is also what is done elsewhere for more fine-grained control, so it's consistent. It also allows people to discover the overload and think about what they need, and there are options for those who want to keep it simple or don't need to optimize to that degree.

CarolEidt commented 6 years ago

I agree with @kouvel that an overload is the natural thing to do. The fact that ordering has not been a necessary feature of existing APIs doesn't imply that it is not required - especially as we move to support weaker memory models and systems with increasing levels of parallelism. I also think that having the overload without the ordering parameter use the conservative ordering should make it easier for developers who are unsure whether they require ordering - when in doubt, the simpler overload is probably what will be chosen.

mattwarren commented 6 years ago

FYI, I've just come across the RelaSharp project by @nicknash, which has implemented something very similar to this API; see its Generic Interface (it also does a lot more; the whole library looks very cool!).

It may be useful to take a look at how RelaSharp has done things, to see what can be learnt?

mikedn commented 6 years ago

That library seems to be a simulator/checker, not an actual implementation of interlocked operations with specific memory ordering.

You can't quite implement these without runtime support; the existing ones are intrinsics, after all. There's also the pesky case of MemoryOrder.Consume - that may deserve a separate discussion, as it requires special JIT support.

CarolEidt commented 6 years ago

There's also the pesky case of MemoryOrder.Consume - that may deserve a separate discussion as it requires special JIT support.

I think it's good to include it in MemoryOrder, for consistency and "future proofing", but I presume that we would go the route of the C++ compilers and implement it as Acquire at least for now.

mikedn commented 6 years ago

@sdmaclea Don't you need Interlocked.MemoryBarrier(MemoryOrder memoryOrder) as well? Currently it generates dmb ish, but I gather from the ARM64 documentation that there's also dmb ishld, which could be used for MemoryOrder.Acquire. There's dmb ishst too, but that doesn't seem to map to any of the existing orderings, as it serializes only stores.

mikedn commented 6 years ago
    public enum MemoryOrder
    {
        Relaxed,
        Consume,
        Acquire,
        Release,
        AcquireRelease,
        SequentiallyConsistent
    }

Maybe it would be better to reverse the order of this enum's members to better suggest that SequentiallyConsistent is the default.

mihailik commented 6 years ago

People promoting these overloads and people who would be pained by it are very disconnected. @sdmaclea while you may not disrespect those people intentionally, you eject their needs from consideration too easily.

For the majority of the existing C# code using Interlocked.*, these overloads add a noticeable maintenance burden. Pushing them into a nested class instead avoids the risk, at no extra cost to those who code low-level algorithms.

Consider ArrayList.SyncRoot that exists in its current form since .NET 1.1, and replicated verbatim in both Mono and .NET Core List:

public virtual Object SyncRoot {
    get { 
        if( _syncRoot == null) {
            System.Threading.Interlocked.CompareExchange<Object>(ref _syncRoot, new Object(), null);    
        }
        return _syncRoot; 
    }
}

What is the best MemoryOrder to use? How do you validate that? What share of C# developers can confidently pick the right one?

With your new overloads, what will happen is (as @kouvel said) that it "allows people to discover the overload": a good professional C# developer with no C++ baggage will try to improve this kind of code for ARM64.

The downside risk is enormous, as anybody with non-trivial multithreaded debugging experience knows. A race condition due to incorrect memory ordering tends to manifest rarely, under narrow environmental conditions, making it extremely hard to diagnose and debug. People will read an article, pick an option, test it thoroughly on their machine, and two months later corrupt data and sink heaps of resources tracking down the root cause.

That's the downside for an average business C# developer.

Now the upside is (potentially) faster code. In certain conditions on certain hardware.

I appreciate these memory orders may be familiar to C++ devs, and may be very cool and awesome, but they are an unbounded downside risk for the majority of C# developers.

That is why the feature can be very cool and useful, yet should not advertise ultra-low-level options to normal-level C# developers for a quick try.

I have suggested a syntax that achieves exactly that goal. Sure, it's 8 characters more to type for performance/hardware-minded folks, but that's only fair. Keep the knives away from children.

// current API
Interlocked.Increment(ref this.count);

// proposed overload
Interlocked.Increment(
  ref this.count,
  MemoryOrder.Relaxed);

// improved API
Interlocked.Ordered.Increment(
  ref this.count,
  Interlocked.Ordered.MemoryOrder.Relaxed);

Do you see how it would still participate in code completion and easy typing? Yet it is way more explicit and self-documenting in any code review? That's intentional.

For higher-level reasoning about these kinds of tough API cases, please refer to the famous Pit of Success idea. An API should guide people towards best practice, not just serve as a broad menu of options.

stephentoub commented 6 years ago

@mihailik, I hear your concerns, but I also agree with everyone who's suggested these be overloads. It is very common in .NET for overloads to provide additional arguments that allow someone who knows what they're doing to express more control over the behavior of the operation. Even something as simple as int.Parse, for example, does this. There's the basic int.Parse(string), but if you know what you're doing you can provide a NumberStyles argument to tweak how the parsing operation behaves. This is no different. The simple overload of Interlocked.Increment, for example, provides the general behavior we all know and use today, but if you know what you're doing and want more control, you can opt in to explicitly providing additional arguments to tweak the behavior. I actually think your proposal would lead to more confusion and a higher likelihood that developers who "shouldn't" be using the options would.

mihailik commented 6 years ago

@stephentoub the difference is that an incorrect NumberStyles is easy to notice, easy to debug, and trivial to test.

Incorrect MemoryOrder is extremely hard to notice, test and endless torture to debug.

BTW, no attachment to specific syntax — as long as the concern is taken into account.

mihailik commented 6 years ago

BTW would you mind using ArrayList.SyncRoot as a straw man here: suggest an improvement with new MemoryOrder and let's see how hard it is to reason about it.

I put it to you, that a general C# developer will not be able to code-review such change.

stephentoub commented 6 years ago

Incorrect MemoryOrder is extremely hard to notice, test and endless torture to debug.

It's trivial to spot in a code review or in code where it shouldn't be. And if you want to outlaw it in a codebase, it's trivial to do so with a Roslyn analyzer. It also takes very explicit action to use: if a developer doesn't know what it means, why would they explicitly type ", MemoryOrder.Acquire"? And if they believe they know what it means and are desiring to use it, then it doesn't matter how it's exposed, they're going to try to use it.

BTW would you mind using ArrayList.SyncRoot as a straw man here: suggest an improvement with new MemoryOrder and let's see how hard it is to reason about it.

I do not understand the comparison. SyncRoot is very high level as far as synchronization goes; if you're using that, you're not going to be replacing it with any usage of Interlocked, regardless of this new MemoryOrder argument. If you are replacing it, you better know what you're doing, going from lock-based code to lock-free code, again regardless of this whole proposal.

sdmaclea commented 6 years ago

Don't you need Interlocked.MemoryBarrier(MemoryOrder memoryOrder) as well?

Using the memory order enumeration for MemoryBarrier didn't make any sense to me, so I punted.

For Interlocked.MemoryBarrier, something like this makes more sense to me.

Interlocked.MemoryBarrier(bool orderLoads, bool orderStores = true)

sdmaclea commented 6 years ago

System.Threading.Volatile needs memory ordering overloads too.

In my mind there are two options

Volatile.Read(*, MemoryOrder order)
Volatile.Read<T>(*, MemoryOrder order)
Volatile.Write(*, MemoryOrder order)
Volatile.Write<T>(*, MemoryOrder order)
Volatile.ReadUnordered(*)
Volatile.ReadUnordered<T>(*)
Volatile.WriteUnordered(*)
Volatile.WriteUnordered<T>(*)

I like the flexibility and extensibility of the first set, but not the complexity.

sdmaclea commented 6 years ago

Perhaps an alternative would be to create several use-case-specific enumerations: AtomicMemoryOrder, VolatileMemoryOrder (or ReadMemoryOrder & WriteMemoryOrder), and BarrierMemoryOrder.

This would allow the different use cases to be separated and unsupported/nonsensical cases easily detected by the compiler.

kouvel commented 6 years ago

For the memory order on volatile, is it just to allow preventing compiler optimizations without an actual barrier (which I think is missing currently), or is there something else?

sdmaclea commented 6 years ago

For the memory order on volatile, is it just to allow preventing compiler optimizations without an actual barrier (which I think is missing currently), or is there something else?

That was my initial concern.

There are potentially cases where a memory barrier before/after/both is needed. I was thinking we would need to fuse them in the API to make them atomic. However, since they will eventually map to intrinsics, I think the JIT can fuse the memory barriers into the Volatile operations in Lowering just fine.

So I think we just need to add Volatile unordered support.

sdmaclea commented 6 years ago

Let me look at the arm64 and arm32 instruction sets and make sure my assessment above is correct. I suspect I am wrong at least with respect to some of the newer instruction set extensions.

kouvel commented 6 years ago

For preventing compiler optimizations (maybe this should be a different review?) how about adding acquire/release versions of compiler-only barriers separately from volatile so that it doesn't have to be tied to a load/store?

kouvel commented 6 years ago

Oh right volatile also prevents tearing, maybe memory order on volatile would suffice for now

mihailik commented 6 years ago

Incorrect MemoryOrder is extremely hard to notice, test and endless torture to debug.

@stephentoub:

It's trivial to spot in a code review or in code where it shouldn't be.

How?? Can you please be more specific?

Imagine ArrayList.SyncRoot was optimised for better performance:

public virtual Object SyncRoot {
    get { 
        if( _syncRoot == null) {
            System.Threading.Interlocked.CompareExchange<Object>(
                ref _syncRoot, new Object(), null,
                MemoryOrder.Acquire, MemoryOrder.AcquireRelease);    
        }
        return _syncRoot; 
    }
}

Are you seriously saying validity here is 'trivial to spot'???

17+ years of writing C# code on business side, I know barely 3 people equipped to get this fine detail right, and I wouldn't bet any one of them can do it in under 10 minutes of mental algebra.

Are there any other advocates for business-level developers involved in BCL API design? I'm seriously baffled why the case is argued entirely from the perf/hardware side. Please, please find someone with field experience away from the compiler/JIT/GC. Your low-level glasses are skewing the picture, and you're going to harm the platform and the community.

Again, here's a very clear example above. Anybody want to chip in and 'trivially spot' the right MemoryOrder here in ArrayList.SyncRoot? Measure the time it took you to work out this 'trivial' answer -- and have a guess how long it will take a business developer.

Please don't turn C# into another C++, consider people out there. We have our day jobs, and it's got nothing to do with squeezing an odd nanosecond out of cutting edge ARM64 instruction set.

Hope that helps!

sdmaclea commented 6 years ago

@mihailik When the above patch is reviewed, you will see

-                ref _syncRoot, new Object(), null);
+                ref _syncRoot, new Object(), null,
+                MemoryOrder.Acquire, MemoryOrder.AcquireRelease);    

You are suggesting this is better

-            System.Threading.Interlocked.CompareExchange<Object>(
-                ref _syncRoot, new Object(), null);
+            System.Threading.Interlocked.Ordering.CompareExchange<Object>(
+                ref _syncRoot, new Object(), null,
+                MemoryOrder.Acquire, MemoryOrder.AcquireRelease);    

It is not apparent to me how that changes the reviewers job.

In fact, it seems misleading: it implies the original was not ordered and we are adding a more restrictive ordering, when in fact the opposite is true.

stephentoub commented 6 years ago

How?? Can you please be more specific?

The code changes to include MemoryOrder. That's easy to see. I'm not suggesting it's easy to validate that the right ordering is being used, but it's trivial to see that an ordering is being specified, and the moment one is, that should set off alarm bells.

17+ years of writing C# code on business side, I know barely 3 people equipped to get this fine detail right, and I wouldn't bet any one of them can do it in under 10 minutes of mental algebra.

And I would be skeptical of any Interlocked-based code written by such developers; it's not their area of expertise, nor should it be, and they shouldn't need to use this. I thought you were suggesting replacing usage of SyncRoot (which would mean replacing locks) with Interlocked. If you're talking about implementing SyncRoot, if such a developer were to need to do so (keeping in mind that it's a legacy pattern and folks really shouldn't be implementing it now anyway), I'd want to see them write:

public virtual Object SyncRoot => this;

following the common-case guidance on MSDN for returning the current instance. If that's not appropriate, I'd want to see them just return an initialized instance without any threading impact:

public virtual Object SyncRoot { get; } = new object();

If that allocation really needs to be delayed to avoid costs associated with this instance being constructed, I'd want to see them use LazyInitializer, e.g.

private object _syncRoot;
public virtual Object SyncRoot => LazyInitializer.EnsureInitialized(ref _syncRoot);

and let the framework take care of all threading optimizations for them. Nowhere here do they need to be concerned with Interlocked. And if they really did, I'd want to see them use the simple overloads, that don't take MemoryOrder. If they specified a MemoryOrder, that they specified a MemoryOrder would be trivial to see in a code review, and I'd flag it and say "don't do that".

Please please find someone with field experience away from the compiler/JIT/GC. Your low-level glasses are skewing the picture to you, and you're going to harm the platform, and the community.

Please do not cast aspersions or denigrate others in this community. Such behavior will not be tolerated. (And for what it's worth, I have years of field experience away from compiler/JIT/GC.)

sdmaclea commented 6 years ago

maybe this should be a different review?

That was my original intention. I will open a API review for Volatile separately.

sdmaclea commented 6 years ago

Don't you need Interlocked.MemoryBarrier(MemoryOrder memoryOrder) as well?

Using the memory order enumeration for MemoryBarrier didn't make any sense to me, so I punted. For Interlocked.MemoryBarrier, something like this makes more sense to me. Interlocked.MemoryBarrier(bool orderLoads, bool orderStores = true)

Looking at x86/x64, I see three fences mentioned: MFENCE, LFENCE, and SFENCE. They look like they correspond to the arm/arm64 dmb ish, dmb ishld, and dmb ishst respectively.

Looking at C11 & C++11, atomic_thread_fence looks like it accepts all memory order options, but it is noted that implementations may be more restrictive than the requested ordering. That is consistent with my survey above.

Therefore Interlocked.MemoryBarrier(MemoryOrder memoryOrder) seems reasonable. I'll add it above.

mihailik commented 6 years ago

@stephentoub for the history of this property, and why it is implemented this way — please reach out to Brad Abrams or @KrzysztofCwalina

@sdmaclea with all the strategic approach @stephentoub gave, there is no merit in this feature. Please refer to this paragraph:

And if they really did, I'd want to see them use the simple overloads, that don't take MemoryOrder. If they specified a MemoryOrder, that they specified a MemoryOrder would be trivial to see in a code review, and I'd flag it and say "don't do that".

1) Your reasons for introducing the feature are based on the claim that the existing API has a significant cost on weakly ordered machines — a claim not justified with any evidence; nor do you provide any use case to verify it.

2) Coming from a thorough review of a specific use case in the very core of the BCL itself, @stephentoub suggests MemoryOrder generally has no place there: "don't do that" is an exact quote, and so is "use simple overloads".

3) We've already established it's "not [...] easy to validate that the right ordering is being used".

Considering that the upside is based completely on gut feeling (see 1), the downside is very real (see 3), and the strategic advice is not to use this even at the BCL level (see 2) — the facts are strongly against it.

Also, @sdmaclea, you didn't show any code examples where MemoryOrder would add value. Perhaps if you did, doubts would be resolved. The fact that you don't suggests the niche for it is extremely narrow.

stephentoub commented 6 years ago

with all the strategic approach @stephentoub gave, there is no merit in this feature

Huh? That's not at all what I said!

"don't do that" is an exact quote,

For that specific developer audience you mentioned and that I was responding about. Not in general. Please do not twist my words.

and the strategic advice is not to use this code even at BCL level

Not true. We would definitely use this in the BCL / runtime in specific places, in particular to help with performance on the very platforms @sdmaclea is helping with.

sdmaclea commented 6 years ago

Your reasons for introducing the feature are based on the claim that the existing API has a significant cost on weakly ordered machines — a claim not justified with any evidence; nor do you provide any use case to verify it.

There is a decade of experience that says this is critical to performance. That is why it was incorporated into the C++11 standard.

This is a low level feature being added to allow platform/framework developers to carefully optimize functionality. It will be useful for many .NET Core internals including the ThreadPool, Concurrent Queues, Locking. It is also anticipated it will be required by other frameworks such as ASP.NET Core.

Coming from a thorough review of a specific use case in the very core of the BCL itself, @stephentoub suggests MemoryOrder generally has no place there: "don't do that" is an exact quote, and so is "use simple overloads".

You are taking the statement out of context. The key phrase you dropped was "by such developers". This feature is not intended for those developers. Even the BCL developers should not use this feature unless warranted. It needs to be used in very specific cases to create safer abstractions for general use.

You'll also note he said roughly "Use LazyInitializer.EnsureInitialized(ref _syncRoot); and let the framework take care of all threading optimizations for them."

The implementation of LazyInitializer.EnsureInitialized(ref _syncRoot); is a place where this may be useful.

As you note, this is difficult to reason about. Programmers should stay away from the 'Pit of Despair'. The 'Pit of Success' is created by people fighting their way out of the 'Pit of Despair' and creating abstractions so other people no longer have to. C# allows unsafe; it is generally discouraged, but it is used in internals to create safer, higher-level abstractions. This is the same thing. Simple rule: "Don't deliberately jump into the Pit of Despair."

sdmaclea commented 6 years ago

@mihailik

This is no longer a productive exchange. I believe our opinions are all perfectly clear.

This is a critical feature that is needed by framework experts.

Weakly ordered systems, largely represented by ARM, are a very significant part of the worldwide market. They are not going away. C# needs to be competitive in those markets too.

BCL API review board can certainly see and take into account your opinion. They are the decision makers.

stephentoub commented 6 years ago

@sdmaclea, I do have one JIT-related question: presumably there's no problem here in having the MemoryOrder parameters on the overloads and still having them be recognized as JIT intrinsics with direct mappings to the corresponding asm, right? Certainly there would be some additional overhead if the MemoryOrder arguments were supplied in a way where the JIT couldn't see their values statically, in which case the implementation would need to branch to the appropriate implementation, but for cases where the JIT could see their values, I would hope these would all simply compile down to the appropriate asm instruction(s) and nothing more.

Also, from a perf perspective and to help the API review, I do think it would be helpful @sdmaclea if you could share a simple benchmark on ARM highlighting just how impactful these choices can be.

sdmaclea commented 6 years ago

I do have one JIT-related question: presumably there's no problem here in having the MemoryOrder parameters on the overloads and still having them be recognized as JIT intrinsics, right?

I haven't really figured it out yet, but I think that would be the intent. I think it should be doable.

CarolEidt commented 6 years ago

I do have one JIT-related question: presumably there's no problem here in having the MemoryOrder parameters on the overloads and still having them be recognized as JIT intrinsics, right?

I don't see any reason that this would be different from the hw intrinsics that take an immediate operand. Similar to that case, we'd presumably fall back to an IL implementation with a switch over the ordering parameter in the case that it was not called with an immediate (e.g. in the reflection case).

sdmaclea commented 6 years ago

@caroleidt One of my thoughts was that this could be implemented with HW intrinsics.

sdmaclea commented 6 years ago

Also, from a perf perspective and to help the API review, I do think it would be helpful @sdmaclea if you could share a simple benchmark on ARM highlighting just how impactful these choices can be.

Getting definitive absolute perf impact numbers is difficult; there are several complicating factors.

We are collaborating with ARM to create public benchmarks, but the benchmark suite is incomplete: https://github.com/ARM-software/synchronization-benchmarks.

We have internally studied the Linux kernel, NGINX, MySQL, MariaDB, Java... Anecdotally, we found that barriers and atomics are very important; they have a very significant impact on workload scalability.

We published the MariaDB optimization at a high level in this paper: https://mariadb.org/wp-content/uploads/2018/02/M18-Conference-2018_final-002.pdf

That paper shows MariaDB (a MySQL fork) being 2x faster than MySQL in some cases, due to more theoretically correct ordering and atomics. Memory ordering changes were part of that gain.

This JIRA issue, "InnoDB rw-locks: optimize memory barriers" (https://jira.mariadb.org/browse/MDEV-14529), notes a performance improvement specifically from relaxing barriers from sequentially consistent to acquire: "I see 3-5% gain in mysqlslap benchmark for doing complex queries and 1-2% benefit in sysbench oltp workloads."

This is a real world database example. Other cases could be much more important.

I would expect greater benefit from change from "Sequentially Consistent" to "Relaxed" for use cases like AtomicAdd for shared performance statistics.

sdmaclea commented 6 years ago

I would expect greater benefit from change from "Sequentially Consistent" to "Relaxed" for use cases like AtomicAdd for shared performance statistics.

We drafted a quick microbenchmark to look at this. In this worst case, the atomic add with SequentiallyConsistent took more than 2x as long as with Relaxed on the one platform we tested.

mihailik commented 6 years ago

@sdmaclea @stephentoub sorry I've dropped out of the conversation for a while.

Your suggestions for a specific use case (a very trivial, well-known, well-defined and clear use case too) would not pass code review. Your usage of Lazy allocates too much to be used in such a commonly used class as List or ArrayList. A cursory glance by an experienced C# developer would show that.

The fact that you went ahead and suggested those kinds of changes to the SyncRoot getter shows that you have not spent any effort considering this use case. Lacking any other use cases to start with, no effort whatsoever was spent on C# use case considerations.

Neither when the PR was suggested, nor when asked on the PR, nor when one such use case (a relevant, simple, and clear one too!) was given to you.



Unfortunately, there was no good faith discussion of the issues I have raised. It's been unhelpful in fact: all concerns are brushed off with no actual consideration. An attitude from the get-go such as "if a developer is not willing to think about memory ordering, they should probably..." is unnecessarily hostile to external feedback.

I feel discouraged from giving further feedback to the CLR team. If the BCL API review board, or any of you, or any of your managers wish to mend it, I'll be glad to participate. In the meantime, I'm unsubscribing from the thread. Thanks!

stephentoub commented 6 years ago

Your usage of Lazy allocates too much

I didn't suggest using Lazy. I suggested using LazyInitializer.EnsureInitialized. It does not allocate too much. You might want to review the implementation to confirm for yourself: https://github.com/dotnet/coreclr/blob/b12c344020ba4cc5bccff377c8922f5434aa293e/src/System.Private.CoreLib/shared/System/Threading/LazyInitializer.cs#L50-L51

you have not spent any effort considering this use case

I'm sorry you feel that way. I disagree. I believe I've spent a lot of time considering these use cases and your arguments. Just because I don't agree with you doesn't mean I haven't considered your position.

there was no good faith discussion of the issues I have raised

Again, I'm sorry you feel that way. I very much disagree.

terrajobst commented 6 years ago

Quick notes.

We should add this to our discussion about intrinsics, even though these are, strictly speaking, abstractions.