Open benaadams opened 7 years ago
@clrjunkie @whoisj here is an example of using tagged pointers from a recent Mono change http://www.mono-project.com/news/2016/08/16/lock-free-gc-handles/
@benaadams, (old news) MSVCRT as of VS2015 Update 3 has also implemented full set of C++11, 85%+ of C++14 and ~30% of C++17, 95% of C99 and 18% of C11 (all ISO standards). So std::atomic is available there as well: https://msdn.microsoft.com/en-us/library/hh567368.aspx#atomics :wink:
@whoisj:
The Windows operating system exposes these constructs, which means it's very possible to implement them on Intel and ARM. https://msdn.microsoft.com/en-us/library/windows/desktop/ms686360(v=vs.85).aspx Looks like they support quite a number of "interlocked" operations, up to 128 bits.
It looks like InterlockedCompareExchange128 is only available on x64 on Windows: https://msdn.microsoft.com/en-us/library/26td21ds.aspx https://msdn.microsoft.com/en-us/library/bb514094.aspx
Anything involving mutating a "reference structure" (ie any thing that is or contains a reference) cannot be treated as flat memory, therefore there are severe limiting factors here.
What is the issue with mutating a struct containing a reference? Why can it not be treated as flat memory?
I like the proposal, but I think it would be good to provide only a value type atomic (have Atomic<T>
be a value type) and let the user determine how to store and use that data correctly so that multiple threads can share it. That would simplify it a bit and eliminate the interface, which I feel mostly wraps boxing, only marginally improves the usability, and may even add some confusion.
It looks like we would need the following:
@benaadams, are you aware of any real-world scenarios that would greatly benefit from this feature? For instance, are there scenarios that are blocked from a performance perspective due to the inability of being able to do interlocked operations on structs?
CC @jkotas @gkhanna79 @cmckinsey @vancem
@kouvel there are a number of lock free algorithms that are blocked from not having it.
So for example the lock-free Concurrent collections and the threadpool queues can't be fully non-allocating and use pooled nodes as they would suffer from the ABA problem. With the 128 bit CAS then you can add a state/tag/counter to the pointer to resolve this.
A 128 bit CAS by itself https://github.com/dotnet/corefx/issues/10478 will only work on certain CPU classes; so the Atomic is better as it provides a lock fallback so can work on all platforms; though obviously less than ideal on some. For example it may be preferable to allocate and still be lock free if locking is involved - hence the static IsLockFree
bool; which allows you to choose the algorithm if that is important.
There is a fair bit of detail here, and I have not really given it too much thought, but I have some observations / concerns.
1) It is VERY easy to get low-lock code correct, so ideally uses of Atomic would be RARE. Thus the 'benefit' part of a cost-benefit analysis is not going to be high (since relatively few users benefit). This argues AGAINST doing anything tricky (e.g. special JIT support etc). It also argues to keep the implementation 'light' (simple/small). This argues against the IAtomic interface unless we can clearly articulate the extra value it adds.
2) It is unclear how the alignment restrictions of a CAS128 would be handled. To guarantee alignment, we need the GC heap to be as aligned (because objects move around and they have to continue to be aligned) The more aligned the GC heap is, the more 'waste' you have, so we really don't want to align the GC heap more than we have to (e.g. pointer sized). It is unclear how to resolve this tradeoff.
3) Given that I am arguing that simplicity is king we have to compare this against the 'trivial' solution of defining an Int128 and defining
Interlocked.InterlockedCompareExchange(ref Int128, Int128, Int128)
Note that even if we did decide to do implement Atomic
Having said all this, I am aware that most other languages/systems do implement something very much like Atomic
@vancem https://github.com/dotnet/corefx/issues/10478 is for 128 bit InterlockedCompareExchange
; but then you still need alignment for the Int128s which can't be guaranteed for arbitrary structs and has no fallback for when there is either no cpu support (x86, some arm) or platform support (not implemented in the clr yet)
Yes, it seems that magic is needed to solve the alignment issues for CAS128, which makes its cost higher and thus its cost-benefit proposition worse. Right now we don't really have a thought-through design for CAS128, and it 'feels' like it is probably too expensive for its value (so far at least...).
If we want to proceed with this, we need to clearly articulate the cost (how we would solve the alignment issue and how expensive that would be), as well as the value (how much faster would these lock-free algorithms be from what we have today.
P.S: (slightly off topic, but is relevant in deciding how much better CAS128 algorithms are from Cas64 algorithms.
For what it is worth in garbage collected systems when you are playing with GC references (which you typically are) the ABA problem is sometimes not as problematic, because, unlike native code, a reference can't change identity (can't be deleted and that same memory morphed into a different object), as long as there is an object reference pointing at it (which you typically naturally have in the code doing the update).
For example normally implementing a linked list stack with PUSH and POP operations would suffer from the ABA problem if POP was implemented in the 'obvious' way with CAS. However that implementation DOES work in a GC environment if the list never tries to reuse link cells because the linked list element being POPed can't be recycled (and thus mutated) while the POP operation has a reference to it.
@vancem yes, I mean you can't do object pooling + CAS on x64 in .NET currently; you can only do GC + CAS on x64.
You could in theory do object pooling + CAS on x86 in .NET though I haven't investigated that particularly as its not relevant to my domain (need Vector ops too much)
Even javascript is getting Atomic 😱 https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Atomics
What's the current status on this? If it's unlikely to be done in the near future I'm going to reopen the proposal for the extra bitwise APIs on Interlocked
given that there are immediate use cases for it and we're already making changes to that type.
It seems unlikely to happen in the near future. Interlocked
should probably have the equivalent bare operations despite this type anyway.
A year has passed but nothing has moved )
An array backed concurrent stack such as the Stroustrup one would need the CAS128 instruction to avoid ABA. Our problem with the LinkedList based concurrent stack implementation is that we have several million long lived items, and they get popped and pushed over time but the ones lower in the stack generally don't change too much. These items don't get cleaned up until Gen2 runs and the few millions that haven't changed still need its pointers traced by the GC. Stroustrup's implementation creates a lot of short lived objects if you have contention but these will never get past ephemeral gens. If CAS128 is available, Stroustrup's algo could be a good replacement for the current ConcurrentStack implementation,
@zachsawcs Do you have reference for that impl?
http://www.stroustrup.com/lock-free-vector.pdf
I implemented it from the paper above, but ran into all sorts of trouble with C# - Interlocked.CompareExchange was the first problem preventing me from creating a generic typed ConcurrentStack
But the biggest problem was C# just doesn't support 128-bit compare exchange, preventing me from solving the ABA problem for 64-bit data types. There's currently no way to implement such an algo in C#.
This would help ManualResetValueTaskSourceCore where it is theoretically possible for a race condition to cause the wrong state object to be used (even though that would be user error anyway). And the extra null check could be removed.
[Edit] Or I guess #31911 would solve the same problem.
Any update on this? Lack of a threadsafe lockfree wrapper type (e.g. over bool), that is intuitive to work with, seems to me like a significant defect in the language, considering even c++ has had std::atomic since c++11
Atomic
Background
x64 has been able to do 128bit/16 byte Interlocked CompareExchanges for a while and its a cpu requirement to be able install Window 8.1 x64 and Windows 10 x64. Its also available on MacOS/OSX and Linux. Its availability would allow for more interesting lock-less algorithms.
Was trying to think if there was an easy fallback for https://github.com/dotnet/corefx/issues/10478 but since its struct-based there is nothing to lock.
So additionally it would be good to have a type Atomic which mirrors the C++ 11 atomic type, adds some .NET magic and can fallback to locks when the data width is not supported on the platform.
Proposed Api
Interface, struct, class, extenstions (and some reference manipulation types)
Atomic struct
Atomic class (struct wrapper)
Numeric Extensions
Bool Extensions
Atomic Flagged References
Atomic Versioned References
Atomic Tagged References
Atomic Double Reference
Transforms
The transforms give the flexibility to compose more complex atomic data structures; for example the
int
Add
,Increment
andIncrement
to limit can be implemented asOr an entirely new type that hadn't previously supported atomic updates, with new operations
When the struct was 16 bytes
_data
would need to be 16 byte aligned to allow lock free use withLOCK CMPXCHG16b
where available. Might be easier to enforce alignment than with https://github.com/dotnet/corefx/issues/10478?VersionedReference
andFlaggedReference
should be 16 byte aligned inAtomic
(don't need to be outside), as shouldTaggedReference
when the struct is <= 16 bytes.