dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Confusion regarding Volatile.Read guarantees, and example in the volatile docs #67330

Closed theodorzoulias closed 2 years ago

theodorzoulias commented 2 years ago

Hi! I am in a state of confusion regarding the volatility in .NET, and I would appreciate a couple of clarifications. I am reading in the documentation of the Volatile class that:

On a multiprocessor system, a volatile read operation is not guaranteed to obtain the latest value written to that memory location by any processor. Similarly, a volatile write operation does not guarantee that the value written would be immediately visible to other processors.

So based on the documentation, on a multiprocessor system there is no guarantee about how stale a volatile read can be. On the other hand on a uniprocessor system:

On a uniprocessor system, volatile reads and writes ensure that a value is read or written to memory and not cached (for example, in a processor register). Thus, you can use these operations to synchronize access to a field that can be updated by another thread or by hardware.

My understanding of this paragraph is that on a system with a single CPU having one or multiple cores, a volatile read is guaranteed to be fresh. And by "fresh" I mean as fresh as a read inside a lock would be, or a read performed with Interlocked.CompareExchange(ref x, 0, 0). This is my first question. Is my understanding of volatility on uniprocessor systems correct?
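For reference, the Interlocked.CompareExchange(ref x, 0, 0) read idiom mentioned above looks like this. This is a minimal sketch (the field name x and the value 42 are just for illustration): the compare-exchange passes 0 as both the comparand and the new value, so it never changes x to anything other than what it already holds, but the read carries the full-fence semantics of an interlocked operation.

```csharp
using System;
using System.Threading;

class FreshReadDemo
{
    private static int x = 42;

    static void Main()
    {
        // The compare fails unless x == 0, so nothing is effectively written,
        // but the read has the full-fence semantics of an interlocked operation.
        int value = Interlocked.CompareExchange(ref x, 0, 0);
        Console.WriteLine(value); // prints 42
    }
}
```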

My second question involves the example in the docs of the volatile keyword. I am copy-pasting it here for convenience:

public class Worker
{
    // This method is called when the thread is started.
    public void DoWork()
    {
        bool work = false;
        while (!_shouldStop)
        {
            work = !work; // simulate some work
        }
        Console.WriteLine("Worker thread: terminating gracefully.");
    }
    public void RequestStop()
    {
        _shouldStop = true;
    }
    // Keyword volatile is used as a hint to the compiler that this data
    // member is accessed by multiple threads.
    private volatile bool _shouldStop;
}

public class WorkerThreadExample
{
    public static void Main()
    {
        // Create the worker thread object. This does not start the thread.
        Worker workerObject = new Worker();
        Thread workerThread = new Thread(workerObject.DoWork);

        // Start the worker thread.
        workerThread.Start();
        Console.WriteLine("Main thread: starting worker thread...");

        // Loop until the worker thread activates.
        while (!workerThread.IsAlive)
            ;

        // Put the main thread to sleep for 500 milliseconds to
        // allow the worker thread to do some work.
        Thread.Sleep(500);

        // Request that the worker thread stop itself.
        workerObject.RequestStop();

        // Use the Thread.Join method to block the current thread
        // until the object's thread terminates.
        workerThread.Join();
        Console.WriteLine("Main thread: worker thread has terminated.");
    }
    // Sample output:
    // Main thread: starting worker thread...
    // Worker thread: terminating gracefully.
    // Main thread: worker thread has terminated.
}

Based on the lack of guarantees about the freshness of volatile reads on multiprocessor systems, if the above program is executed on a multiprocessor system it's not guaranteed to ever complete. The volatile _shouldStop field, updated to true by the main thread, might still be seen as false by the workerThread minutes, hours, days, or even years later. And as long as the workerThread sees it as false, it will continue spinning. And as long as the workerThread is spinning, the program will not terminate. Is my understanding of this example correct? Does the correctness of the above program depend on whether it runs on a system with one or many processors?

There are no similar remarks about uniprocessor/multiprocessor systems in the documentation of other threading mechanisms, like Interlocked or the lock statement for example, which makes me think that volatile operations are somewhat inferior to lock/Interlocked operations at guaranteeing freshness. Are my concerns justified, or does the lack of guarantees about freshness on multiprocessor systems apply equally to all other threading mechanisms as well?

ghost commented 2 years ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

Issue Details
Author: theodorzoulias
Assignees: -
Labels: `area-System.Threading`, `untriaged`
Milestone: -

tfenise commented 2 years ago

I think that although volatile reads are not guaranteed to get the latest value, they should still get a "reasonably fresh" value. Similarly, although volatile writes are not guaranteed to be visible to other processors immediately, they should become visible to other processors within a "reasonably short" time.

theodorzoulias commented 2 years ago

I think that although volatile reads are not guaranteed to get the latest value, they should still get a "reasonably fresh" value. Similarly, although volatile writes are not guaranteed to be visible to other processors immediately, they should become visible to other processors within a "reasonably short" time.

@tfenise this is my assumption too. But the documentation seems to be trying to convince me otherwise: that volatile is just about preventing reordering of instructions, and that I should have no expectations about how stale the retrieved data might be (on multiprocessor systems at least).

alexrp commented 2 years ago

The documentation surrounding the volatile keyword, the Volatile/Interlocked classes, the volatile. CIL opcode, etc. has always been remarkably bad.

The TL;DR here is that all methods on the Volatile class are atomic and impose acquire/release barriers.

theodorzoulias commented 2 years ago

The documentation surrounding the volatile keyword, the Volatile/Interlocked classes, the volatile. CIL opcode, etc. has always been remarkably bad.

The TL;DR here is that all methods on the Volatile class are atomic and impose acquire/release barriers.

@alexrp according to the C# specification:

Reads and writes of other types, including long, ulong, double, and decimal, as well as user-defined types, need not be atomic.

So, as far as I understand, Volatile.Read(Double) (among other methods) is probably not atomic. A thread using Volatile.Read(Double) is not protected from reading a torn double value if another thread uses Volatile.Write(Double) at the same time without synchronization.
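For 64-bit values, the documented way to guarantee an untorn read (even on 32-bit platforms, where a plain long access may be split into two 32-bit accesses) is Interlocked. Here's a minimal sketch; the field name and the sentinel value are just for illustration:

```csharp
using System;
using System.Threading;

class TornReadSketch
{
    // 64-bit field: plain reads/writes of this may tear on 32-bit platforms.
    private static long _value;

    static void Main()
    {
        // Interlocked.Exchange performs an atomic 64-bit store.
        Interlocked.Exchange(ref _value, 0x1111_2222_3333_4444);

        // Interlocked.Read is documented to be atomic even on 32-bit systems,
        // so the observed value is never half-old/half-new.
        long observed = Interlocked.Read(ref _value);
        Console.WriteLine(observed == 0x1111_2222_3333_4444); // True
    }
}
```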

I would agree that the volatile-related documentation is far from perfect, but on the other hand it's probably not an easy concept to document and explain. I don't think that I would have done a better job had I been tasked with writing/improving these docs, even if I were an expert on this subject (apparently I am not).

alexrp commented 2 years ago

The volatile keyword is not the same as the Volatile class. They have different semantics.

The volatile keyword compiles to the volatile. CIL prefix opcode, which imposes acquire/release barriers. That's it. If the size of the data being read/written happens to be less than or equal to the native word size, then the operation is also atomic; otherwise, it's just the barriers. (The atomicity guarantee, in this case, comes from a completely unrelated memory model guarantee in Ecma 335; it's not explicitly part of the definition of volatile/volatile..)

The Volatile class, on the other hand, always performs atomic operations along with the barriers, regardless of data size.

tfenise commented 2 years ago

The Volatile class is not documented as always being atomic, although the current implementation is indeed atomic. Interlocked.Read(Int64) is documented to be atomic, though.

theodorzoulias commented 2 years ago

The volatile keyword is not the same as the Volatile class. They have different semantics.

@alexrp hmm, my impression is that the volatile keyword is nothing more than a shortcut for the Volatile.Read/Volatile.Write methods. This impression comes from this paragraph in the volatile documentation:

Some languages, such as Visual Basic, do not recognize the concept of volatile memory operations. The Volatile class provides that functionality in such languages.

But I might be wrong.

alexrp commented 2 years ago

@alexrp hmm, my impression is that the volatile keyword is nothing more than a shortcut for the Volatile.Read/Volatile.Write methods. This impression comes from this paragraph in the volatile documentation:

That, too, is wrong. Like I said, the documentation here is remarkably bad and misleading.

jkotas commented 2 years ago

cc @VSadov

tfenise commented 2 years ago

The C# keyword volatile does guarantee atomicity, simply because it can only be applied to those data types (int, float, IntPtr, etc.) that guarantee atomic access even without volatile. It cannot be applied to long or double.
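tfenise's point is visible directly at compile time: the C# compiler rejects volatile on types wider than those with guaranteed atomic access, so for a long you must fall back to Volatile.Read/Volatile.Write (or Interlocked). A minimal sketch (the field names and accessor methods are hypothetical, for illustration only):

```csharp
using System;
using System.Threading;

class VolatileTypes
{
    private volatile int _flag;         // OK: int access is always atomic
    private volatile bool _stop;        // OK: bool access is always atomic
    // private volatile long _counter;  // error CS0677: a volatile field cannot be of type 'long'

    private long _counter;              // use Volatile.Read/Write (or Interlocked) instead

    public long ReadCounter() => Volatile.Read(ref _counter);
    public void WriteCounter(long value) => Volatile.Write(ref _counter, value);

    static void Main()
    {
        var v = new VolatileTypes();
        v.WriteCounter(5);
        Console.WriteLine(v.ReadCounter()); // prints 5
    }
}
```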

theodorzoulias commented 2 years ago

The Volatile class is not documented as always being atomic, although the current implementation is indeed atomic. Interlocked.Read(Int64) is documented to be atomic, though.

@tfenise the current Volatile.Read(Int64) implementation is marked as Intrinsic, which means that it can potentially be replaced/optimized by the JIT. So the source code might be a strong indicator that atomicity is intended, but it's not definitive proof of actual atomic behavior.

alexrp commented 2 years ago

FWIW, some time ago, I wrote these notes which sum up my understanding of atomics/barriers in .NET land based on my experience implementing/fixing atomic intrinsics in Mono. I think they're still a fairly accurate picture of reality, so you might find them useful.

theodorzoulias commented 2 years ago

Thanks @alexrp for the notes. I read them, they are quite valuable. My problem is that I can't write code based on them, I must write code based on the documentation. Because if my code breaks in some future .NET version, and my program starts producing incorrect results, everyone will throw the blame on me for following blog posts and defying the official documentation...

For example take a look at this issue. As much as I would like to assume that the ConcurrentDictionary cannot emit the same key twice during a single enumeration, and although the current implementation provides this guarantee, the documentation does not provide it. So I am obliged to code defensively against a currently impossible scenario.

alexrp commented 2 years ago

My problem is that I can't write code based on them, I must write code based on the documentation. Because if my code breaks in some future .NET version, and my program starts producing incorrect results, everyone will throw the blame on me for following blog posts and defying the official documentation...

Yep, I completely understand.

As far as I'm concerned, the documentation for atomics and memory model issues in .NET should just be tossed out and rewritten using C++11 terms (since that is what the programming world has largely standardized on at this point).

The best I can do is give you the assurance that what you see in my notes is basically what the official documentation should be - and hopefully also will be at some point.

VSadov commented 2 years ago

Indeed - "the latest value written to that memory location by any processor" is a very vague term. What does it even mean?

Guarantees of volatile are better understood in terms of effects made and observed by read and write operations and the ordering of the effects, not in terms of absolute "lateness".

The documentation is too vague and will need to be updated.

VSadov commented 2 years ago

For the example given - the effect of a volatile read cannot be elided.

The while (!_shouldStop) will exit when another thread changes the value.

There is no such guarantee when _shouldStop is a regular, not volatile, field. Compiler can, and often does, move ordinary reads outside of the loop if the current thread cannot possibly change the value inside the loop body.
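VSadov's guarantee, condensed into a runnable sketch (this is an illustrative demo, not code from the docs): because _shouldStop is volatile, the read inside the loop cannot be hoisted out, so the worker is guaranteed to eventually observe the store and exit. With a non-volatile field, the JIT would be free to read the flag once, outside the loop, and spin forever.

```csharp
using System;
using System.Threading;

class StopFlagDemo
{
    // volatile: the read inside the loop below cannot be elided or hoisted.
    private static volatile bool _shouldStop;

    static void Main()
    {
        var worker = new Thread(() =>
        {
            while (!_shouldStop) { /* spin */ }
            Console.WriteLine("stopped");
        });
        worker.Start();

        Thread.Sleep(100);       // let the worker spin for a bit
        _shouldStop = true;      // volatile write: eventually visible to the worker
        worker.Join();           // returns promptly in practice
        Console.WriteLine("joined");
    }
}
```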

pcordes commented 2 years ago

And by "fresh" I mean as fresh as it would be a read inside a lock, or a read performed with Interlocked.CompareExchange(ref x, 0, 0).

Yes, execution on a uniprocessor is sequentially-consistent since all threads run on the same core, which will preserve the illusion of every CPU instruction running one at a time, in program order. (Whatever that is, after any potential compile-time reordering.)

On a uniprocessor system, num++ might still compile to multiple instructions (inevitably on RISC-like ISAs), in which case a context switch between the load and store could still mean a later write of an old value happens after other stuff. But that's a new write, not an old write delayed by the CPU. If, on a CISC like x86, it happens to compile to add dword [rdi], 1 or something, then that whole RMW is atomic wrt. interrupts and therefore context switches, without needing a lock prefix.


On a multi-core system, it's obviously not that simple. Most of the details about how things become visible depend on the CPU you're running on, not the language. The language only has control over the ordering of visibility (and of loads wrt. stores), and over obvious things like not keeping values in registers to be written back much later.

There's not much a language spec can definitively say without being wrong on some current or future ISA, or including a tutorial on cache coherency, store buffers, and how data propagates between cores on current CPU architectures. That would seem way out of place in a formal language spec. That leaves the details up to the hardware you run on. In real life, all CPUs that we run threads across have cache-coherent shared memory with something like MESI to maintain coherency, so the guarantees on actual CPU hardware are pretty good. But usually this is just a performance thing, not a correctness issue; correctness just comes from ordering.

It's impossible for data to become "immediately" visible, unless you start measuring from the time the core doing the store actually commits to L1d cache from its store buffer. In that case, yes, that is the point of global visibility for a store. But that happens well after execution, and other cores may have been doing loads early anyway. (Various causes of memory-reordering, like store buffers, mean it's not meaningful to think of what each thread does in terms of executing one instruction at a time, either load, store, or RMW. Memory reordering can't be explained purely in terms of reordering operations in the source.)

It barely makes sense to think about execution in two separate threads being in lockstep with each other, or for time across threads to really be meaningful. (A bit like relativity in physics; it's hard to talk about time somewhere else.) I took a stab at it for C++ executing on real hardware in an SO answer a couple years ago.

So what can a language usefully say without talking about CPU architecture?

ISO C++ for example chooses to say

An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

And that

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

Note that these are both "should", not normative requirements. But C++ is famously weak in terms of leaving open the possibility of DeathStation 9000 implementations that are technically conforming but unusable. Perhaps they wanted to leave open the possibility of cooperative multi-tasking, in which case it could be possible for a store to be infinitely delayed if things go wrong?

As discussed in comments on Stack Overflow, it's partly a quality-of-implementation question.

If C# doesn't have anything even hinting at that, leaving it all up to people knowing "of course CPUs work this way", then it's probably a good idea to say something about a reasonable or finite amount of time, which would guarantee that setting a volatile exit_now flag will eventually be seen by a thread spinning on a volatile load from it.

I don't really know C# very well myself, just kind of got dragged into this by Stack Overflow comments, but hopefully that helps.


As far as what one should expect in practice, see https://stackoverflow.com/questions/61591287/does-hardware-memory-barrier-make-visibility-of-atomic-operations-faster-in-addi - once a store executes, the CPU will be doing its best to commit the store to L1d cache, making it visible to other cores (and thus threads), partly because that will free up the store-buffer entry which the CPU needs to recycle to keep executing later code.

theodorzoulias commented 2 years ago

And by "fresh" I mean as fresh as it would be a read inside a lock, or a read performed with Interlocked.CompareExchange(ref x, 0, 0).

Yes, execution on a uniprocessor is sequentially-consistent since all threads run on the same core, which will preserve the illusion of every CPU instruction running one at a time, in program order.

@pcordes thanks for your comment. This "Yes" is the vital information that I was searching for! I wish it were possible to derive it from the documentation somehow.

For the example given - an effect of volatile read cannot be elided.

The while (!_shouldStop) will exit when another thread changes the value.

@VSadov this is reassuring. So the "not guaranteed" in the documentation does not mean "you can't have any expectations whatsoever when you use the volatile keyword". There are indeed some reasonable expectations that I am allowed to have: the workerThread will eventually see the updated value of _shouldStop, and will stop spinning.

Please note that I am not looking for a guarantee about an absolute maximum latency in nanoseconds. I am just looking for a confirmation that by using the volatile keyword or the Volatile.Read method I can transfer a signal to another thread as fast as I would have if I had used lock, or Interlocked, or some other threading mechanism instead. In other words I want to know that by choosing to use the volatile-related APIs I am not handicapping my application by making it less responsive to changes occurring in its internal state, compared to using any other threading mechanism that is available in .NET. This is "fast enough" for my needs.

If I could derive this confirmation by reading the documentation, I could make informed decisions about the code that I write. I would be confident that I am using the correct API for the job, and I would be able to respond effectively to criticism like: "Volatile doesn't really make any guarantees about how stale the data is" or "it works only because of undocumented side-effects" or "the use of volatile is, in most cases, a mistake".

pcordes commented 2 years ago

@theodorzoulias wrote:

Please note that I am not looking for a guarantee about an absolute maximum latency in nanoseconds. I am just looking for a confirmation that by using the volatile keyword or the Volatile.Read method I can transfer a signal to another thread as fast as I would have if I had used lock, or Interlocked, or some other threading mechanism instead. In other words I want to know that by choosing to use the volatile-related APIs I am not handicapping my application by making it less responsive to changes occurring in its internal state, compared to using any other threading mechanism that is available in .NET. This is "fast enough" for my needs.

Ok, that makes sense, and is an interesting question separate from whether there's a guarantee of eventually being visible.

If Interlocked is visible faster on any real HW, it's not by a lot, and due to secondary effects. I haven't measured, but as a conservative guess there might sometimes be 5 or 10% more latency without full barriers that stop this core from doing anything else after a store. (e.g. later loads might contend for off-core traffic and result in a minor delay in getting ownership of the cache line to allow the store). Some of my linked SO answers mentioned this possibility, which BeeOnRope (Travis Downs) pointed out.

That depends on the machine and the workload, and sometimes there'd be no delay. IDK what the absolute worst case would be on a modern Xeon, or a many-core AArch64 for example; maybe even more than 10% in artificial code designed to do that, rather than be part of a useful program. More importantly, IDK what kind of distribution the penalty would follow, like whether it's under 1% in more than 90% of the cases or something.

So a language standard isn't going to want to give a hard guarantee of them being the same.

I'm pretty sure the benefit is smallish, and that the cost of fully stalling other memory operations on the writing core isn't worth it. Most parallel algorithms can hopefully move on to doing useful work after writing, and don't spend lots of their time spin-waiting for stores from other cores.

If you did have an application like that, you'd want to benchmark on the HW you care about, if you can't redesign it to be less sensitive to contention and make better use of the CPU time it spends on each core. Most applications aren't like that, except for corner cases of fine-grained locking; releasing a lock can use just a plain release store, with no barrier after it. But some lock implementations do use an x86 lock prefix to make it a full barrier on release. That may help in that case at the expense of throughput for the thread doing the unlocking.

So it's a tradeoff; you're almost never wrong to use volatile if that's all you need, but it's not because it's necessarily exactly the same inter-core latency.

I'd be interested if anyone has benchmarks of getting a speedup in some artificial microbenchmark by using Interlocked.exchange instead of a release-store (or even a relaxed store on AArch64).

theodorzoulias commented 2 years ago

You're almost never wrong to use volatile if that's all you need.

@pcordes that's a brave thing to say, when influential people have said things like:

Frankly, I discourage you from ever making a volatile field. Volatile fields are a sign that you are doing something downright crazy: you’re attempting to read and write the same value on two different threads without putting a lock in place.

I am keeping a link to your comment handy, to use as an antidote. 😃

pcordes commented 2 years ago

You're almost never wrong to use volatile if that's all you need.

@pcordes that's a brave thing to say, when influential people have said things like:

Frankly, I discourage you from ever making a volatile field. Volatile fields are a sign that you are doing something downright crazy: you’re attempting to read and write the same value on two different threads without putting a lock in place.

Clearly Eric is talking about writing lock-free code at all. Using Interlocked.Exchange to do the same thing doesn't make it better; you're still not using a mutex. He's talking about actual locking, not x86's lock prefix. He's right: you definitely have to know what you're doing to use this correctly, and for it to perform better than locking. (A trivial case like an exit_now flag is kind of an exception; it's trivial to get right, and simple + easy to do with volatile if there's a natural place to check it each iteration.)

My comment was assuming a context of having already decided to write lockless code with atomic operations, with all the extra challenges that creates for writing correct code. Once you've decided that you want an atomic release-store, something equivalent to C++ std::atomic<T> with .store(val, memory_order_release) is usually what you want, not store + full barrier.

(C++ defaults to memory_order_seq_cst for all operations instead of acq_rel, e.g. on x86 using xchg for stores instead of just mov, but that makes stores more expensive. Very few algorithms actually need to stop StoreLoad reordering, or need a total order of all loads+stores across all threads. Even with lock-free atomics, some like Herb Sutter caution against using anything weaker than seq_cst.)
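In .NET terms, the two flavors of store pcordes contrasts can be sketched as follows. This is an illustrative mapping (the class and method names are hypothetical), assuming the usual correspondence: Volatile.Write is a release store, while Interlocked.Exchange is a full-barrier (sequentially consistent) RMW:

```csharp
using System;
using System.Threading;

class StoreFlavors
{
    private static int _state;

    // Release store: earlier loads/stores can't move after it,
    // but it does not by itself prevent StoreLoad reordering.
    static void ReleaseStore(int value) => Volatile.Write(ref _state, value);

    // Full-barrier store: an interlocked RMW (on x86, an xchg or
    // lock-prefixed instruction), which is more expensive.
    static void SeqCstStore(int value) => Interlocked.Exchange(ref _state, value);

    static void Main()
    {
        ReleaseStore(1);
        SeqCstStore(2);
        Console.WriteLine(Volatile.Read(ref _state)); // prints 2
    }
}
```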

Eric isn't suggesting Interlocked.Exchange instead of volatile for code that does decide to use lock-free algorithms and data structures, at least not in that quote. He's talking about avoiding the whole idea entirely.