Closed: theodorzoulias closed this issue 2 years ago.
Tagging subscribers to this area: @mangod9. See info in area-owners.md if you want to be subscribed.

| | |
|---|---|
| Author | theodorzoulias |
| Assignees | - |
| Labels | `area-System.Threading`, `untriaged` |
| Milestone | - |
I think although volatile reads are not guaranteed to get the latest value, they should still get a "reasonably fresh" value. Similarly, although volatile writes are not guaranteed to be visible to other processors immediately, they should be visible to other processors within a "reasonably short" time.
> I think although volatile reads are not guaranteed to get the latest value, they should still get a "reasonably fresh" value. Similarly, although volatile writes are not guaranteed to be visible to other processors immediately, they should be visible to other processors within a "reasonably short" time.

@tfenise this is my assumption too. But the documentation reads like it is trying to convince me otherwise: that `volatile` is only about preventing reordering of instructions, and that I should have no expectations about how stale the retrieved data might be (on multiprocessor systems, at least).
The documentation surrounding `volatile`, the `Volatile`/`Interlocked` classes, the `volatile.` CIL opcode, etc. has always been remarkably bad.

The TL;DR here is that all methods on the `Volatile` class are atomic and impose acquire/release barriers.
> The documentation surrounding `volatile`, the `Volatile`/`Interlocked` classes, the `volatile.` CIL opcode, etc. has always been remarkably bad. The TL;DR here is that all methods on the `Volatile` class are atomic and impose acquire/release barriers.
@alexrp according to the C# specification:

> Reads and writes of other types, including `long`, `ulong`, `double`, and `decimal`, as well as user-defined types, need not be atomic.
So, as far as I understand, `Volatile.Read(Double)` (among other methods) is probably not atomic. A thread using `Volatile.Read(Double)` is not protected from reading a torn `double` value, if another thread uses `Volatile.Write(Double)` at the same time without synchronization.
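To make the concern concrete, here is a minimal torn-read hunt (a sketch of my own; `TornReadHunt` and the rest of the names are made up). Whether `Volatile.Read`/`Volatile.Write` on a `double` can ever actually tear is precisely the point the documentation leaves open; with plain unsynchronized accesses to a 64-bit value in a 32-bit process, this kind of loop demonstrably can fire:

```csharp
using System;
using System.Threading;

class TornReadHunt
{
    private static double _value;
    private static readonly double A = BitConverter.Int64BitsToDouble(0L);  // all bits clear
    private static readonly double B = BitConverter.Int64BitsToDouble(-1L); // all bits set

    static void Main()
    {
        // Writer: alternates between the two 64-bit patterns forever.
        new Thread(() =>
        {
            while (true)
            {
                Volatile.Write(ref _value, A);
                Volatile.Write(ref _value, B);
            }
        }) { IsBackground = true }.Start();

        // Reader: a torn read would surface as a half-and-half bit pattern.
        while (true)
        {
            long bits = BitConverter.DoubleToInt64Bits(Volatile.Read(ref _value));
            if (bits != 0L && bits != -1L)
            {
                Console.WriteLine($"Torn read observed: 0x{bits:X16}");
                return;
            }
        }
    }
}
```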
I would agree that the volatile-related documentation is far from perfect, but on the other hand it's probably not an easy concept to document and explain. I don't think that I would have done a better job if I were tasked to write/improve these docs, even if I were an expert on this subject (apparently I am not).
The `volatile` keyword is not the same as the `Volatile` class. They have different semantics.

The `volatile` keyword compiles to the `volatile.` CIL prefix opcode, which imposes acquire/release barriers. That's it. If the size of the data being read/written happens to be less than or equal to the native word size, then the operation is also atomic; otherwise, it's just the barriers. (The atomicity guarantee, in this case, comes from a completely unrelated memory model guarantee in Ecma 335 - it's not explicitly part of the definition of `volatile`/`volatile.`.)

The `Volatile` class, on the other hand, always performs atomic operations along with the barriers, regardless of data size.
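For illustration, here is the acquire/release pairing those methods provide, in a minimal publish/consume sketch of my own (not from any docs), on a 64-bit field where the C# `volatile` keyword is not even allowed:

```csharp
using System.Threading;

class Publisher
{
    private long _payload; // note: cannot be declared 'volatile long'
    private bool _ready;

    public void Publish(long value)
    {
        _payload = value;                 // ordinary store
        Volatile.Write(ref _ready, true); // release: the _payload store cannot move below this
    }

    public bool TryConsume(out long value)
    {
        if (Volatile.Read(ref _ready))    // acquire: the _payload load cannot move above this
        {
            value = _payload;             // sees the fully published value
            return true;
        }
        value = 0;
        return false;
    }
}
```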
The `Volatile` class is not documented as being always atomic, although the current implementation is indeed atomic. `Interlocked.Read(Int64)` is documented to be atomic, though.
> The `volatile` keyword is not the same as the `Volatile` class. They have different semantics.
@alexrp hmm, my impression is that the `volatile` keyword is nothing more than a shortcut for the `Volatile.Read`/`Volatile.Write` methods. This impression comes from this paragraph in the `volatile` documentation:

> Some languages, such as Visual Basic, do not recognize the concept of volatile memory operations. The `Volatile` class provides that functionality in such languages.

But I might be wrong.
> @alexrp hmm, my impression is that the `volatile` keyword is nothing more than a shortcut for the `Volatile.Read`/`Volatile.Write` methods. This impression comes from this paragraph in the `volatile` documentation:
That, too, is wrong. Like I said, the documentation here is remarkably bad and misleading.
cc @VSadov
The C# keyword `volatile` does guarantee atomicity, simply because it can only be applied to those data types (`int`, `float`, `IntPtr`, etc.) that guarantee atomic access even without `volatile`. It cannot be applied to `long` or `double`.
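A quick illustration of that restriction (hypothetical field names; the error is the compiler's CS0677):

```csharp
using System;

class VolatileFields
{
    private volatile int _a;       // OK: int access is atomic even without volatile
    private volatile float _b;     // OK
    private volatile IntPtr _c;    // OK: native word size
    // private volatile long _d;   // error CS0677: a volatile field cannot be of the type 'long'
    // private volatile double _e; // error CS0677: a volatile field cannot be of the type 'double'
}
```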
> The `Volatile` class is not documented as being always atomic, although the current implementation is indeed atomic. `Interlocked.Read(Int64)` is documented to be atomic, though.
@tfenise the current `Volatile.Read(Int64)` implementation is marked as `Intrinsic`, which means that it can potentially be replaced/optimized by the JITter. So the source code might be a strong indicator that atomicity is intended, but it's not definitive proof of actual atomic behavior.
FWIW, some time ago, I wrote these notes which sum up my understanding of atomics/barriers in .NET land based on my experience implementing/fixing atomic intrinsics in Mono. I think they're still a fairly accurate picture of reality, so you might find them useful.
Thanks @alexrp for the notes. I read them, they are quite valuable. My problem is that I can't write code based on them, I must write code based on the documentation. Because if my code breaks in some future .NET version, and my program starts producing incorrect results, everyone will throw the blame on me for following blog posts and defying the official documentation...
For example take a look at this issue. As much as I would like to assume that the `ConcurrentDictionary` cannot emit the same key twice during a single enumeration, and although the current implementation provides this guarantee, the documentation does not provide it. So I am obliged to code defensively against a currently impossible scenario.
> My problem is that I can't write code based on them, I must write code based on the documentation. Because if my code breaks in some future .NET version, and my program starts producing incorrect results, everyone will throw the blame on me for following blog posts and defying the official documentation...
Yep, I completely understand.
As far as I'm concerned, the documentation for atomics and memory model issues in .NET should just be tossed out and rewritten using C++11 terms (since that is what the programming world has largely standardized on at this point).
The best I can do is give you the assurance that what you see in my notes is basically what the official documentation should be - and hopefully also will be at some point.
Indeed - "the latest value written to that memory location by any processor" is a very vague term. What does it even mean?
Guarantees of `volatile` are better understood in terms of effects made and observed by read and write operations and the ordering of those effects, not in terms of absolute "lateness".
The documentation is too vague and will need to be updated.
For the example given - the effect of a volatile read cannot be elided. The `while (!_shouldStop)` loop will exit when another thread changes the value.

There is no such guarantee when `_shouldStop` is a regular, non-volatile field. The compiler can, and often does, move ordinary reads outside of the loop if the current thread cannot possibly change the value inside the loop body.
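A minimal sketch of that difference (names are mine; the hoisting is something the JIT is allowed to do, not something it is guaranteed to do):

```csharp
class StopFlagDemo
{
    private bool _shouldStopPlain;             // ordinary field
    private volatile bool _shouldStopVolatile;

    public void SpinPlain()
    {
        // The JIT may read _shouldStopPlain once, cache it in a register,
        // and turn this into an infinite loop even after another thread
        // sets the field to true.
        while (!_shouldStopPlain) { }
    }

    public void SpinVolatile()
    {
        // Each iteration performs a real (acquire) load, so the loop exits
        // once another thread's write becomes visible.
        while (!_shouldStopVolatile) { }
    }
}
```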
> And by "fresh" I mean as fresh as a read inside a `lock`, or a read performed with `Interlocked.CompareExchange(ref x, 0, 0)`.
Yes, execution on a uniprocessor is sequentially-consistent since all threads run on the same core, which will preserve the illusion of every CPU instruction running one at a time, in program order. (Whatever that is, after any potential compile-time reordering.)
On a uniprocessor system, `num++` might still compile to multiple instructions (inevitably on RISC-like ISAs), in which case a context switch between the load and store could still mean that a later write of an old value happens after other stuff. But that's a new write, not an old write delayed by the CPU. If, on a CISC like x86, it does happen to compile to `add dword [rdi], 1` or something, that whole RMW is atomic wrt. interrupts and therefore context switches, without needing a `lock` prefix.
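For completeness, the portable .NET way to make that whole read-modify-write atomic on any ISA is `Interlocked.Increment` (a trivial sketch of my own):

```csharp
using System.Threading;

class Counter
{
    private int _num;

    // _num++ is a separate load, add, and store; a context switch (or another
    // core) in the middle can cause an update to be lost.
    public void UnsafeIncrement() => _num++;

    // Interlocked.Increment performs the whole read-modify-write atomically.
    public void SafeIncrement() => Interlocked.Increment(ref _num);
}
```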
On a multi-core system, it's obviously not simple. Most of the details about how things become visible depend on the CPU you're running on, not the language. The language only has control over the ordering of visibility (and of loads wrt. stores), and obviously over not itself keeping values in registers to be stored much later.
There's not much a language spec can definitively say without being wrong on some current or future ISA, or including a tutorial on cache coherency, store buffers, and how data propagates between cores on current CPU architectures. That would seem way out of place in a formal language spec. That leaves the details up to the hardware you run on. In real life, all CPUs that we run threads across have cache-coherent shared memory with something like MESI to maintain coherency, so the guarantees on actual CPU hardware are pretty good. But usually this is just a performance thing, not a correctness issue; correctness just comes from ordering.
It's impossible for data to become "immediately" visible, unless you start measuring from the time the core doing the store actually commits to L1d cache from its store buffer. In that case, yes, that is the point of global visibility for a store. But that happens well after execution, and other cores may have been doing loads early anyway. (Various causes of memory-reordering, like store buffers, mean it's not meaningful to think of what each thread does in terms of executing one instruction at a time, either load, store, or RMW. Memory reordering can't be explained purely in terms of reordering operations in the source.)
It barely makes sense to think about execution in two separate threads being in lockstep with each other, or for time across threads to really be meaningful. (A bit like relativity in physics; it's hard to talk about time somewhere else.) I took a stab at it for C++ executing on real hardware in an SO answer a couple years ago.
ISO C++, for example, chooses to say:

> An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

And that:

> Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
Note that these are both "should", not normative requirements. But C++ is famously weak in terms of leaving open the possibility of DeathStation 9000 implementations that are technically conforming but unusable. Perhaps they wanted to leave open the possibility of cooperative multi-tasking, in which case it could be possible for a store to be infinitely delayed if things go wrong?
As discussed in comments on Stack Overflow, it's partly a quality-of-implementation thing.
If C# doesn't have anything even hinting at that, leaving it all up to people knowing "of course CPUs work this way", then it's probably a good idea to say something about a reasonable or finite amount of time, which would guarantee that setting a volatile `exit_now` flag will eventually be seen by a thread spinning on a volatile load from it.
I don't really know C# very well myself, just kind of got dragged into this by Stack Overflow comments, but hopefully that helps.
As far as what one should expect in practice, see https://stackoverflow.com/questions/61591287/does-hardware-memory-barrier-make-visibility-of-atomic-operations-faster-in-addi - once a store executes, the CPU will be doing its best to commit the store to L1d cache, making it visible to other cores (and thus threads), partly because that will free up the store-buffer entry which the CPU needs to recycle to keep executing later code.
> And by "fresh" I mean as fresh as a read inside a `lock`, or a read performed with `Interlocked.CompareExchange(ref x, 0, 0)`.

> Yes, execution on a uniprocessor is sequentially-consistent since all threads run on the same core, which will preserve the illusion of every CPU instruction running one at a time, in program order.

@pcordes thanks for your comment. This "Yes" is the vital information that I am searching for! I wish it were possible to derive it from the documentation somehow.
> For the example given - the effect of a volatile read cannot be elided.
>
> The `while (!_shouldStop)` loop will exit when another thread changes the value.

@VSadov this is reassuring. So the "not guaranteed" in the documentation does not mean "you can't have any expectations whatsoever when you use the `volatile` keyword". There are indeed some reasonable expectations that I am allowed to have: the `workerThread` will eventually see the updated value of `_shouldStop`, and will stop spinning.
Please note that I am not looking for a guarantee about an absolute maximum latency in nanoseconds. I am just looking for a confirmation that by using the `volatile` keyword or the `Volatile.Read` method I can transfer a signal to another thread as fast as I would have done if I had used `lock`, or `Interlocked`, or some other threading mechanism instead. In other words, I want to know that by choosing to use the volatile-related APIs I am not handicapping my application by making it less responsive to changes occurring in its internal state, compared to using any other threading mechanism that is available in .NET. This is "fast enough" for my needs.

If I could derive this confirmation by reading the documentation, I could make informed decisions about the code that I write. I would be confident that I am using the correct API for the job, and I would be able to respond effectively to criticism like: "Volatile doesn't really make any guarantees about how stale the data is", or "it works only because of undocumented side-effects", or "the use of volatile is, in most cases, a mistake".
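To put the comparison in code (a sketch of my own; all names are made up), these are the three signaling mechanisms in question, each with its matching read:

```csharp
using System.Threading;

class SignalPublisher
{
    private readonly object _gate = new object();
    private volatile bool _volatileFlag;
    private int _interlockedFlag;
    private bool _lockedFlag;

    public void SignalWithVolatile() => _volatileFlag = true;   // volatile (release) store

    public void SignalWithInterlocked() =>
        Interlocked.Exchange(ref _interlockedFlag, 1);          // full-barrier RMW

    public void SignalWithLock() { lock (_gate) { _lockedFlag = true; } }

    public bool ReadWithVolatile() => _volatileFlag;            // volatile (acquire) load

    public bool ReadWithInterlocked() =>
        Interlocked.CompareExchange(ref _interlockedFlag, 0, 0) == 1;

    public bool ReadWithLock() { lock (_gate) { return _lockedFlag; } }
}
```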
@theodorzoulias wrote:

> Please note that I am not looking for a guarantee about an absolute maximum latency in nanoseconds. I am just looking for a confirmation that by using the `volatile` keyword or the `Volatile.Read` method I can transfer a signal to another thread as fast as I would have done if I had used `lock`, or `Interlocked`, or some other threading mechanism instead. In other words, I want to know that by choosing to use the volatile-related APIs I am not handicapping my application by making it less responsive to changes occurring in its internal state, compared to using any other threading mechanism that is available in .NET. This is "fast enough" for my needs.
Ok, that makes sense, and is an interesting question separate from whether there's a guarantee of eventually being visible.
If `Interlocked` is visible faster on any real HW, it's not by a lot, and due to secondary effects. I haven't measured, but as a conservative guess there might sometimes be 5 or 10% more latency without full barriers that stop this core from doing anything else after a store (e.g. later loads might contend for off-core traffic and result in a minor delay in getting ownership of the cache line to allow the store). Some of my linked SO answers mentioned this possibility, which BeeOnRope (Travis Downs) pointed out.
That depends on the machine and the workload, and sometimes there'd be no delay. IDK what the absolute worst case would be on a modern Xeon, or a many-core AArch64 for example; maybe even more than 10% in artificial code designed to do that, rather than be part of a useful program. More importantly, IDK what kind of distribution the penalty would follow, like whether it's under 1% in more than 90% of the cases or something.
So a language standard isn't going to want to give a hard guarantee of them being the same.
I'm pretty sure the benefit is smallish, and that the cost of fully stalling other memory operations on the writing core isn't worth it. Most parallel algorithms can hopefully move on to doing useful work after writing, and don't spend lots of their time spin-waiting for stores from other cores.
If you did have an application like that, you'd want to benchmark on the HW you care about, if you can't redesign it to be less sensitive to contention and make better use of the CPU time it spends on each core. Most applications aren't like that, except for corner cases of fine-grained locking; releasing a lock can use just a plain release store, with no barrier after it. But some lock implementations do use an x86 `lock` prefix to make it a full barrier on release. That may help in that case, at the expense of throughput for the thread doing the unlocking.
So it's a tradeoff; you're almost never wrong to use `volatile` if that's all you need, but it's not because it's necessarily exactly the same inter-core latency.
I'd be interested if anyone has benchmarks showing a speedup in some artificial microbenchmark from using `Interlocked.Exchange` instead of a release store (or even a relaxed store on AArch64).
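For what it's worth, here is a rough sketch of such a microbenchmark (entirely my own; no warm-up, no thread pinning, and the delegate-call overhead applies equally to both variants), ping-ponging a value between two threads and timing the handoffs:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class PingPongLatency
{
    private static int _slot;

    // Measures the average one-way handoff time through _slot, using the
    // supplied write primitive (Volatile.Write vs Interlocked.Exchange).
    static double MeasureNsPerHandoff(Action<int> write, int rounds)
    {
        _slot = 0;
        var pong = new Thread(() =>
        {
            for (int i = 1; i <= rounds; i += 2)
            {
                while (Volatile.Read(ref _slot) != i) { } // wait for the ping
                write(i + 1);                             // reply with the pong
            }
        });
        pong.Start();

        var sw = Stopwatch.StartNew();
        for (int i = 1; i <= rounds; i += 2)
        {
            write(i);                                     // ping
            while (Volatile.Read(ref _slot) != i + 1) { } // wait for the pong
        }
        sw.Stop();
        pong.Join();
        return sw.Elapsed.TotalMilliseconds * 1_000_000 / rounds;
    }

    static void Main()
    {
        const int rounds = 1_000_000; // must be even: each iteration is two handoffs
        Console.WriteLine($"Volatile.Write:       {MeasureNsPerHandoff(v => Volatile.Write(ref _slot, v), rounds):F1} ns/handoff");
        Console.WriteLine($"Interlocked.Exchange: {MeasureNsPerHandoff(v => Interlocked.Exchange(ref _slot, v), rounds):F1} ns/handoff");
    }
}
```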
> You're almost never wrong to use `volatile` if that's all you need.

@pcordes that's a brave thing to say, when influential people have said things like:

> Frankly, I discourage you from ever making a `volatile` field. Volatile fields are a sign that you are doing something downright crazy: you’re attempting to read and write the same value on two different threads without putting a lock in place.
I am keeping a link to your comment handy, to use as an antidote. 😃
> > You're almost never wrong to use `volatile` if that's all you need.
>
> @pcordes that's a brave thing to say, when influential people have said things like:
>
> > Frankly, I discourage you from ever making a `volatile` field. Volatile fields are a sign that you are doing something downright crazy: you’re attempting to read and write the same value on two different threads without putting a lock in place.
Clearly Eric is talking about not writing lock-free code at all. Using `Interlocked.Exchange` to do the same thing doesn't make it better; you're still not using a mutex. He's talking about actual locking, not x86's `lock` prefix. He's right, you definitely have to know what you're doing to use this correctly, and for it to perform better than locking. (A trivial case like an `exit_now` flag is kind of an exception; trivial to get right, and simple + easy to do with volatile if there's a natural place to check it each iteration.)
My comment was assuming a context of having already decided to write lockless code with atomic operations, with all the extra challenges that creates for writing correct code. Once you've decided that you want an atomic release-store, something equivalent to C++ `std::atomic<T>` with `.store(val, memory_order_release)` is usually what you want, not store + full barrier.

(C++ defaults to `memory_order_seq_cst` for all operations instead of `acq_rel`, e.g. on x86 using `xchg` for stores instead of just `mov`, but that makes stores more expensive. Very few algorithms actually need to stop StoreLoad reordering, or need a total order of all loads+stores across all threads. Even with lock-free atomics, some like Herb Sutter caution against using anything weaker than `seq_cst`.)
Eric isn't suggesting `Interlocked.Exchange` instead of `volatile` for code that does decide to use lock-free algorithms and data structures, at least not in that quote. He's talking about avoiding the whole idea entirely.
Hi! I am in a state of confusion regarding volatility in .NET, and I would appreciate a couple of clarifications. I am reading in the documentation of the `Volatile` class that:

> On a multiprocessor system, a volatile read operation does not guarantee to obtain the latest value written to that memory location by any processor. Similarly, a volatile write operation does not guarantee that the value written would be immediately visible to other processors.

So based on the documentation, on a multiprocessor system there is no guarantee about how stale a volatile read can be. On the other hand, on a uniprocessor system:

> On a uniprocessor system, volatile reads and writes ensure that a value is read or written to memory and not cached (for example, the processor's register).

My understanding of this paragraph is that on a system with a single CPU having one or multiple cores, a volatile read is guaranteed to be fresh. And by "fresh" I mean as fresh as a read inside a `lock`, or a read performed with `Interlocked.CompareExchange(ref x, 0, 0)`. This is my first question: is my understanding about volatility on uniprocessor systems correct?

My second question involves the example in the docs of the `volatile` keyword. I am copy-pasting it here for convenience:
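(The listing below is reconstructed from the `volatile` keyword documentation; minor details may differ from the exact published sample.)

```csharp
using System;
using System.Threading;

public class Worker
{
    // Keyword volatile is used as a hint to the compiler that this data
    // member is accessed by multiple threads.
    private volatile bool _shouldStop;

    // This method is called when the thread is started.
    public void DoWork()
    {
        bool work = false;
        while (!_shouldStop)
        {
            work = !work; // Simulate some work.
        }
        Console.WriteLine("Worker thread: terminating gracefully.");
    }

    public void RequestStop()
    {
        _shouldStop = true;
    }
}

public class WorkerThreadExample
{
    public static void Main()
    {
        // Create the worker thread object. This does not start the thread.
        Worker workerObject = new Worker();
        Thread workerThread = new Thread(workerObject.DoWork);

        // Start the worker thread.
        workerThread.Start();
        Console.WriteLine("Main thread: starting worker thread...");

        // Loop until the worker thread activates.
        while (!workerThread.IsAlive) ;

        // Put the main thread to sleep for 500 milliseconds to
        // allow the worker thread to do some work.
        Thread.Sleep(500);

        // Request that the worker thread stop itself.
        workerObject.RequestStop();

        // Use the Thread.Join method to block the current thread
        // until the object's thread terminates.
        workerThread.Join();
        Console.WriteLine("Main thread: worker thread has terminated.");
    }
}
```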
Based on the lack of guarantees about the freshness of volatile reads on multiprocessor systems, if the above program is executed on a multiprocessor system, it's not guaranteed to ever complete. The volatile `_shouldStop` field, updated to `true` by the main thread, might be seen as `false` by the `workerThread` minutes, hours, days or years later. And as long as the `workerThread` sees it as `false`, it will continue spinning. And as long as the `workerThread` is spinning, the program will not terminate. Is my understanding about this example correct? Does the correctness of the above program depend on whether it runs on a system with one or many processors?

There are no similar remarks about uniprocessor/multiprocessor systems in the documentation of other threading mechanisms, like `Interlocked` or `lock` for example, which makes me think that the volatile operations are somewhat inferior to `lock`/`Interlocked` operations at guaranteeing freshness. Are my concerns justified, or does the lack of guarantees about freshness on multiprocessor systems apply equally to all the other threading mechanisms as well?