dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: Volatile barrier APIs #98837

Open hamarb123 opened 4 months ago

hamarb123 commented 4 months ago

Background and motivation

This API proposal exposes methods to perform non-atomic volatile memory operations. Our volatile semantics are explained in our memory model, but I will outline the tl;dr of the relevant parts here:

- Usually memory accesses can be re-ordered and combined by the C# compiler, JIT, and CPU
- Volatile writes disallow the write to occur earlier than specified (i.e., prior memory accesses must complete before this executes)
- Volatile reads disallow the read to occur later than specified (i.e., subsequent memory accesses can only begin after this completes)
- Memory accesses to primitive types are always atomic if they're properly aligned (unless `unaligned.` is used), and either 1) the size of the type is at most the size of the pointer, or 2) a method on `Volatile` or `Interlocked` such as `Volatile.Write(double&, double)` has been called

Currently, we expose APIs on `Volatile` for the atomic memory accesses, but there is no way to perform the equivalent operations for non-atomic types. With Volatile barrier APIs, the non-atomic equivalents will be easy to write, and it should be clear which memory operations can move past a barrier in which ways.

API Proposal

```csharp
namespace System.Threading;

public static class Volatile
{
    public static void ReadBarrier();
    public static void WriteBarrier();
}
```

Desired semantics:

`Volatile.ReadBarrier()` provides a Read-ReadWrite barrier: all reads preceding the barrier must complete before any subsequent memory operation. It matches the semantics of `Volatile.Read` in terms of ordering reads relative to all subsequent operations in program order. The important difference from `Volatile.Read(ref x)` is that `Volatile.ReadBarrier()` affects all preceding reads, not just a particular single read of `x`.

`Volatile.WriteBarrier()` provides a ReadWrite-Write barrier: all memory operations preceding the barrier must complete before any subsequent write. It matches the semantics of `Volatile.Write` in terms of ordering writes relative to all preceding operations in program order. The important difference from `Volatile.Write(ref x)` is that `Volatile.WriteBarrier()` affects all subsequent writes, not just a particular single write of `x`.

The actual implementation will depend on the underlying platform.

API Usage

The runtime uses an internal API, Interlocked.ReadMemoryBarrier(), in 2 places (here and here) to batch multiple reads on both CoreCLR and NativeAOT; it is supported on all platforms. This ability is also useful to third-party developers (such as me, in my example below), but is currently not possible to express efficiently.

An example where non-atomic volatile operations would be useful is as follows. Consider a game which wants to save its state, ideally while continuing to run; these are the most obvious options:

- Save on the main thread (can cause lag spikes if there is lots of data to save)
- Create a copy of the data and then save on a separate thread (less time spent on the main thread, but potentially still non-trivial, and most likely a large number of allocations)
- Use locks around access to the memory (massive overhead)
- Just save it on another thread without any sync (saves an inconsistent state)

But there is actually another option, which utilises non-atomic volatile semantics:

- Has low overhead, especially on x86 and x64
- Saves a consistent state

```csharp
//main thread sets IsSaving to true and increments SavingVersion before starting the saving thread, and sets it to false once it's definitely done (e.g., on the next frame)
//saving thread performs a full memory barrier before starting (when required, since starting a brand new thread every time isn't ideal), to ensure that _value is up-to-date
//memory synchronisation works because _value is always read before any saving information, and it's always written after the saving information
//if the version we read on the saving thread is not the current version, then our read from _value is correct, otherwise our read from _savingValue will be correct
//in the rare case that we loop around to saving version == 0, we can manually write all _savingVersion values to 0, skip to version == 1, and go from there (excluded here for clarity)

static class SavingState
{
    public static bool IsSaving { get; set; }
    public static nuint SavingVersion { get; set; }
}

struct SaveableHolder<T>
{
    nuint _savingVersion;
    T _value;
    T _savingValue;

    //Called only from main thread
    public T Value
    {
        get => _value;
        set
        {
            if (SavingState.IsSaving)
            {
                if (_savingVersion != SavingState.SavingVersion)
                {
                    _savingValue = _value;

                    //ensure the saving value is written before the saving version, so that we read them in the correct order
                    Volatile.Write(ref _savingVersion, SavingState.SavingVersion);
                }

                //_value can only become torn or incorrect after we have written our saving value and version
                Volatile.WriteBarrier();
                _value = value; //write must occur after prior writes
            }
            else
            {
                _value = value;
            }
        }
    }

    //Called only from saving thread while SavingState.IsSaving with a higher SavingState.SavingVersion than last time
    public T SavingValue
    {
        get
        {
            var value = Value; //read must occur before subsequent reads
            Volatile.ReadBarrier();

            //_savingVersion must be read after _value is, so if it's visibly changed/changing then we will catch it here
            if (Volatile.Read(in _savingVersion) != SavingState.SavingVersion) return value;

            //volatile read on _savingVersion ensures we get an up-to-date _savingValue since it's written first
            return _savingValue;
        }
    }
}
```

Alternative Designs

```csharp
public static class Volatile
{
    public static T ReadNonAtomic<T>(ref readonly T location) where T : allows ref struct
    {
        //ldarg.0
        //volatile.
        //ldobj !!T
    }

    public static void WriteNonAtomic<T>(ref T location, T value) where T : allows ref struct
    {
        //ldarg.0
        //ldarg.1
        //volatile.
        //stobj !!T
    }
}
```


We do have IL instructions, but they're currently broken and not exposed (see https://github.com/dotnet/runtime/issues/91530) - the proposal here was originally to expose APIs for `volatile. ldobj` and `volatile. stobj` plus the unaligned variants (as seen above), and to fix the instructions (or implement these without the instructions and have the instructions call these APIs - not much of a difference really). It was changed based on feedback to expose barrier APIs instead, which can provide equivalent semantics but also allow additional scenarios. It is also clearer which memory operations can be reordered across the barrier APIs.

- We could expose APIs on `Interlocked` instead:
```csharp
public static class Interlocked
{
    // Existing API
    public static void MemoryBarrier();
    // New APIs
    public static void MemoryBarrierAcquire(); //volatile read semantics
    public static void MemoryBarrierRelease(); //volatile write semantics
}
```

- We could expose the APIs on `Unsafe` instead:
```csharp
public static class Unsafe
{
    public static T ReadVolatile<T>(ref readonly T location) where T : allows ref struct;
    public static void WriteVolatile<T>(ref T location, T value) where T : allows ref struct;
}
```

- We could add unaligned overloads:
```csharp
namespace System.Runtime.CompilerServices;

public static class Unsafe
{
    public static T ReadVolatileUnaligned<T>(ref readonly byte location) where T : allows ref struct;
    public static void WriteVolatileUnaligned<T>(ref byte location, T value) where T : allows ref struct;
}
```

- We could also expose volatile block operations (`cpblk`/`initblk`):
```csharp
public static class Unsafe
{
    public static void CopyBlockVolatile(ref byte destination, ref readonly byte source, uint byteCount);
    public static void CopyBlockVolatileUnaligned(ref byte destination, ref readonly byte source, uint byteCount);
    public static void InitBlockVolatile(ref byte startAddress, byte value, uint byteCount);
    public static void InitBlockVolatileUnaligned(ref byte startAddress, byte value, uint byteCount);
}
```


- We could expose APIs similar to what C++ has: https://en.cppreference.com/w/cpp/atomic/memory_order

### Open Questions

There is a question as to whether we should have `Read-ReadWrite`/`ReadWrite-Write` barriers or `Read-Read`/`Write-Write` barriers. I was initially in favour of the former (as it matches our current memory model), but now think the latter is probably better, since there are many scenarios (including in my example API usage, and the runtime's uses too) where the additional guarantees provided by the former are unnecessary, and thus may cause unnecessary overhead. We could also just provide both if we think they're both useful.

### Risks

No more than other volatile/interlocked APIs really, other than potential misunderstanding of what they do.

ghost commented 4 months ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

Issue Details
### Background and motivation

This API proposal exposes methods to perform non-atomic volatile memory operations. Our volatile semantics are explained in our [memory model](https://github.com/dotnet/runtime/blob/main/docs/design/specs/Memory-model.md#order-of-memory-operations), but I will outline the tl;dr of the relevant parts here:

- Usually memory accesses can be re-ordered and combined by the C# compiler, JIT, and CPU
- Volatile writes disallow the write to occur earlier than specified (i.e., prior memory accesses must complete before this executes)
- Volatile reads disallow the read to occur later than specified (i.e., subsequent memory accesses can only begin after this completes)
- Memory accesses to primitive types are always atomic if they're properly aligned (unless `unaligned.` is used), and either 1) the size of the type is at most the size of the pointer, or 2) a method on `Volatile` or `Interlocked` such as `Volatile.Write(double&, double)` has been called

Currently, we expose APIs on `Volatile` for the atomic memory accesses, but there is no way to perform the equivalent operations for non-atomic types. We do have IL instructions, but they're currently broken and not exposed, see https://github.com/dotnet/runtime/issues/91530 - the proposal here is to expose APIs for `volatile. ldobj` and `volatile. stobj` + the unaligned variants, and fix the instructions (or implement these without the instructions and have the instructions call these APIs - not much of a difference really).

### API Proposal

```csharp
namespace System.Threading;

public static class Volatile
{
    public static T ReadNonAtomic<T>(ref readonly T location)
    {
        //ldarg.0
        //volatile.
        //ldobj !!T
    }

    public static void WriteNonAtomic<T>(ref T location, T value)
    {
        //ldarg.0
        //ldarg.1
        //volatile.
        //stobj !!T
    }

    public static T ReadUnalignedNonAtomic<T>(ref readonly T location)
    {
        //ldarg.0
        //unaligned. 1
        //volatile.
        //ldobj !!T
    }

    public static void WriteUnalignedNonAtomic<T>(ref T location, T value)
    {
        //ldarg.0
        //ldarg.1
        //unaligned. 1
        //volatile.
        //stobj !!T
    }
}
```

### API Usage

An example where non-atomic volatile operations would be useful is as follows. Consider a game which wants to save its state, ideally while continuing to run; these are the most obvious options:

- Save on the main thread (can cause lag spikes if lots of data to save)
- Create a copy of the data and then save on a separate thread (less time spent on main thread, but potentially still non-trivial, and a large number of allocations most likely)
- Use locks around access to the memory (massive overhead)
- Just save it on another thread without any sync (saves an inconsistent state)

But there is actually another option which utilises non-atomic volatile semantics:

- Has low overhead, especially on x86 and x64
- Saves a consistent state

```csharp
//main threads sets IsSaving to true and increments SavingVersion before starting the saving thread, and to false once it's definitely done (e.g., on next frame)
//saving thread performs a full memory barrier before starting (when required, since starting a brand new thread every time isn't ideal), to ensure that _value is up-to-date
//memory synchronisation works because _value is always read before any saving information, and it's always written after the saving information
//if the version we read on the saving thread is not the current version, then our read from _value is correct, otherwise our read from _savingValue will be correct
//in the rare case that we loop to saving version == 0, then we can manually write all _savingVersion values to 0, skip to version == 1, and go from there (excluded from here though for clarity)

static class SavingState
{
    public static bool IsSaving { get; set; }
    public static nuint SavingVersion { get; set; }
}

struct SaveableHolder<T>
{
    nuint _savingVersion;
    T _value;
    T _savingValue;

    //Called only from main thread
    public T Value
    {
        get => _value;
        set
        {
            if (SavingState.IsSaving)
            {
                if (SavingVersion != SavingState.SavingVersion)
                {
                    _savingVersion = SavingState.SavingVersion;
                    _savingValue = _value;
                }

                //_value can only become torn or incorrect after we have written our saving value and version
                Volatile.WriteNonAtomic(ref _value, value); //write must occur after prior operations
            }
            else
            {
                _value = value;
            }
        }
    }

    //Called only from saving thread while SavingState.IsSaving with a higher SavingState.SavingVersion than last time
    public T SavingValue
    {
        get
        {
            var value = Volatile.ReadNonAtomic(in Value); //read must occur before subsequent code

            //_savingVersion must be read after _value is, so if it's visibly changed/changing then we will either catch it here
            if (_savingVersion != SavingState.SavingVersion) return value;

            return _savingValue;
        }
    }
}
```

### Alternative Designs

- We could expose barrier APIs instead:
```csharp
namespace System.Threading;

public static class Volatile
{
    public static void ReadBarrier();
    public static void WriteBarrier();
}
```

- We could expose the APIs on `Unsafe` instead.
- We could also expose APIs for other operations which allow `volatile.` - `initblk` and `cpblk`, people may have use for these also

### Risks

No more than other volatile APIs really, other than the lack of atomicity (which is in the name).
Author: hamarb123
Assignees: -
Labels: `api-suggestion`, `area-System.Threading`, `untriaged`
Milestone: -
jkotas commented 4 months ago

ReadUnalignedNonAtomic(ref readonly T location)

`ref T` is expected to be aligned. An API for manipulation of unaligned memory should not take `ref T`. I do not think it belongs on `System.Threading.Volatile`.

I think it is very rare to actually need volatile and misaligned memory together. It would make more sense to cover that scenario by Volatile.Read/WriteBarrier.

jkotas commented 4 months ago

Another alternative design:

```csharp
class Interlocked
{
    // Existing API
    void MemoryBarrier();
    // New APIs
    void ReadMemoryBarrier();
    void WriteMemoryBarrier();
}
```
alexrp commented 4 months ago

@jkotas to clarify, do you mean something like MemoryBarrierAcquire()/MemoryBarrierRelease()? I ask because read/write barriers aren't typically defined to have the same semantics as acquire/release barriers, and I think it would be a mistake to introduce read/write barriers as an additional memory model concept that users have to understand.

alexrp commented 4 months ago

It's not great that .NET's atomic APIs are split across Volatile and Interlocked. On the one hand, MemoryBarrierAcquire()/MemoryBarrierRelease() seem like they should sit next to MemoryBarrier() on Interlocked... but then on the other hand, Interlocked currently only deals in sequential consistency semantics, so conceptual consistency for the API would be hurt. So maybe putting them on Volatile is better. But then again, there's a discoverability argument for putting them on Interlocked...

Unfortunate situation all around. Really makes me wish we could slap [EB(Never)] on Interlocked/Volatile and unify them in an Atomic class modelled mostly after C/C++11 atomics. But I digress.

I guess I lean 51/49 in favor of putting them on Volatile. 🤷

jkotas commented 4 months ago

do you mean something like MemoryBarrierAcquire()/MemoryBarrierRelease()?

Right, I do not have an opinion about the exact names. The naming is all over the place, as you have pointed out.

I assume that the only difference between Volatile.Read/WriteBarrier and Interlocked.MemoryBarrierAcquire/Release alternatives are the names. There is no functional difference. Correct?

hamarb123 commented 4 months ago

I assume that the only difference between Volatile.Read/WriteBarrier and Interlocked.MemoryBarrierAcquire/Release alternatives are the names. There is no functional difference. Correct?

That is the idea, yes. They just emit the barrier that Volatile.Read/Write emits, without necessarily doing the operation (by "without necessarily doing the operation", I mean that you could also do the non-volatile operation and the JIT could combine it).

hamarb123 commented 4 months ago

@mangod9 @kouvel (or anyone else who can mark it), what needs to be done to get this https://github.com/dotnet/runtime/labels/api-ready-for-review?

jkotas commented 4 months ago

I have looked at your API usage example - it appears to be either incomplete or buggy. Consider this interleaving of foreground and saving thread:

                   _savingVersion = SavingState.SavingVersion;
---
            var value = Volatile.ReadNonAtomic(in Value); //read must occur before subsequent code

            //_savingVersion must be read after _value is, so if it's visibly changed/changing then we will either catch it here
            if (_savingVersion != SavingState.SavingVersion) return value; <- this will be false - the foreground just assigned the _savingVersion
            return _savingValue; <- _savingValue has not been assigned by the foreground thread yet, this will return stale or uninitialized _savingValue

---
                   _savingValue = _value;
hamarb123 commented 4 months ago

I have looked at your API usage example - it appears to be either incomplete or buggy. Consider this interleaving of foreground and saving thread:

You are correct - I needed the read/write to version to be volatile also, and the write to version/saving-value to be swapped. I'll update it shortly. I'm pretty sure the volatile operations on value are still needed though.

Thanks for noticing that :)

hamarb123 commented 3 months ago

@jkotas was there anything else I need to fix or change?

jkotas commented 3 months ago

The area owners should take it from here.

@kouvel @VSadov @mangod9 Thoughts about this API proposal?

VSadov commented 3 months ago

I’d prefer read/write barriers.

Ordering accesses that are themselves nonatomic would have complicated semantics. Reasoning about barriers is easier and they can be used to construct “nonatomic” volatile accesses.

hamarb123 commented 3 months ago
  • The write barrier on arm64 would need to be emitted as full-fence (dmb). The dmb st is weaker than a volatile/release fence as it only waits for writes, while volatile write waits for reads too.

My only concern is that we shouldn't need to emit a dmb ish for the following code:

Volatile.WriteBarrier(); //write barrier would be used on stlur for the first alignof(T) write of the following line
location = value;

And I would expect similar combining for read barrier where possible too.

And on X86 it'd be even better assembly code, obviously.

Other than that, I'd be 100% happy with the barriers solution. I changed it to the ReadNonAtomic and WriteNonAtomic APIs when writing the proposal since that made it easier to guarantee that they get combined when possible (since they're always combined in those APIs, obviously).

kouvel commented 3 months ago

I also prefer a barrier approach. I assume the barriers would be no-op on x86/x64.

The write barrier on arm64 would need to be emitted as full-fence (dmb). The dmb st is weaker than a volatile/release fence as it only waits for writes, while volatile write waits for reads too.

That doesn't seem to be the case from what I can tell. The docs on "Barrier-ordered-before" say the following:

A read or a write RW1 is Barrier-ordered-before a read or a write RW2 from the same Observer if and only if RW1 appears in program order before RW2 and any of the following cases apply:

  • ...
  • RW1 is a write W1 generated by an instruction with Release semantics and RW2 is a read R2 generated by an instruction with Acquire semantics.
  • RW1 is a read R1 and either:
    • R1 appears in program order before a DMB LD that appears in program order before RW2.
    • R1 is generated by an instruction with Acquire or AcquirePC semantics.
  • RW2 is a write W2 and either:
    • RW1 is a write W1 appearing in program order before a DMB ST that appears in program order before W2.
    • W2 is generated by an instruction with Release semantics.

Which seems to suggest that stlr would have the same ordering guarantees as dmb st (or dmb ishst, which I assume is what we'd want) followed by a write.

In any case, from a spec perspective we probably shouldn't specify any stronger ordering guarantees for the volatile barriers than are necessary.

kouvel commented 3 months ago

As for this part:

RW1 is a write W1 generated by an instruction with Release semantics and RW2 is a read R2 generated by an instruction with Acquire semantics.

Is that something specific to arm/arm64? I'm not sure at the moment if other archs offer the same guarantee with acquire/release loads/stores. My inclination is to not specify that, unless it's a typical thing to be expected.

Anyway it seems the barriers don't guarantee that behavior, only the acquire/release load/store instructions, so may not be relevant here.

VSadov commented 3 months ago

My only concern is that we shouldn't need to emit a dmb ish for the following code:

Volatile.WriteBarrier(); //write barrier would be used on stlur for the first alignof(T) write of the following line
location = value;

And I would expect similar combining for read barrier where possible too.

Optimization that fuses a barrier with the following/preceding ordinary access into an acquiring/releasing access does not work in general. That is why barriers are useful even when releasing/acquiring accesses are also available. Barriers order multiple accesses at once, while an acq/rel access orders just one access against all preceding/following accesses (in program order).

Example:

R1, R2, ReadBarrier R3, R4

Here R4 is "ordered after" R2 and R1. Now if we apply the "fusing" optimization, we have

R1, R2, VolatileRead(R3), R4

Now R4 is not ordered with R2 and R1. Thus the optimization is invalid.

R1, VolatileRead(R2), R3, R4

Still R4 is not ordered with R1.

Same goes for writes.

kouvel commented 3 months ago

R1, R2, VolatileRead(R3), R4

Now R4 is not ordered with R2 and R1. Thus the optimization is invalid.

R1, VolatileRead(R2), R3, R4

Still R4 is not ordered with R1. Same goes for writes.

Why not? I thought that ordering guarantee was exactly what the VolatileRead was supposed to offer.

VSadov commented 3 months ago

I also prefer a barrier approach. I assume the barriers would be no-op on x86/x64.

There is an effect on x86/x64. The barriers must act as a compiler fence even when the hardware is strongly ordered enough not to require any instructions. While the hardware would not reorder the accesses, the compiler could, so that must be prevented.

In the current implementation of Interlocked.ReadMemoryBarrier(), the JIT needs to treat the barrier as present in the code until the very end, where on x86/x64 it does not emit anything into the instruction stream. That is because a fence has optimization-suppressing effects.

VSadov commented 3 months ago

Which seems to suggest that stlr would have the same ordering guarantees as dmb st (or dmb ishst, which I assume is what we'd want) followed by a write.

The spec is terse and should be read very carefully to notice that it is not symmetrical. DMB ST has no effect on preceding reads. This is just a quirk of DMB ST, that it is a Write-Write barrier, not a ReadWrite-Write barrier.

Arm64 does not have a barrier instruction that is ReadWrite-Write. I think RISC-V has.

As for this part:

RW1 is a write W1 generated by an instruction with Release semantics and RW2 is a read R2 generated by an instruction with Acquire semantics.

Is that something specific to arm/arm64? I'm not sure at the moment if other archs offer the same guarantee with acquire/release loads/stores. My inclination is to not specify that, unless it's a typical thing to be expected.

I think it is another implementation quirk of arm64 ISA that can be ignored here.

What arm64 calls "Acquire" is slightly stronger than what dotnet memory model specifies for volatile reads. "AcquirePC" is a closer match. (this is why we use ldapr instead of ldar, when available)

kouvel commented 3 months ago

The barriers must act as a compiler fence

Agreed, I meant there wouldn't be any special instructions for these on x86/x64.

VSadov commented 3 months ago

To summarise:

The barriers match semantics of acquire/release/volatile in terms of ordering reads or writes, correspondingly, relative to all operations.

The barriers are stronger than acquire/release/volatile in terms of ordering multiple reads or writes, correspondingly, not just a particular single read or write.

The actual implementation will depend on underlying platform.

kouvel commented 3 months ago

The spec is terse and should be read very carefully to notice that it is not symmetrical. DMB ST has no effect on preceding reads. This is just a quirk of DMB ST, that it is a Write-Write barrier, not a ReadWrite-Write barrier.

Unless I'm misreading it, stlr also appears to only guarantee Write-Write ordering. [Edit] Looks like I misread it, stlr appears to offer ReadWrite-Write ordering.

Do ordinary memory operations on x86/x64 guarantee Read-ReadWrite and ReadWrite-Write ordering for acquire/release semantics? I believe stores are not reordered there, but I thought loads could be reordered after an ordinary store (which has release semantics).

Wonder if these should just be Read-Read and Write-Write barriers? I see the benefits of the stronger guarantees, but it seems the implementation would not be much more efficient.

kouvel commented 3 months ago

Is the main benefit of these APIs for unaligned or larger-than-atomic reads/writes?

VSadov commented 3 months ago

One example of existing code using a read barrier could be found here: https://github.com/dotnet/runtime/blob/4cf19ee05d3efc5372b8f308240688bacce09208/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastCache.cs#L181

In that scenario we do the following reads: read version, [ read source, read targetResult ], read version again.

The order between read source and read targetResult does not matter, but they both must be read after we read the version for the first time and before we read it the second time.

To ensure that the first read happens before the source/targetResult tuple, we just need to Volatile.Read(version). To ensure that source and targetResult are both read before we read version for the second time, both must be volatile reads, since an ordinary read can be arbitrarily delayed even in the presence of acquiring reads.

So with only acquires we would have Volatile.Read(version), [ Volatile.Read(source), Volatile.Read(targetResult) ], read version again

That is 3 volatile reads.

If we can have a read barrier, we can do:

Volatile.Read(version), [ read source, read targetResult ], [ReadMemoryBarrier] read version again

That is one barrier replacing 2 volatile reads. Benchmarks showed that even that is profitable.

However there is also a generic version of the same cache. https://github.com/dotnet/runtime/blob/4cf19ee05d3efc5372b8f308240688bacce09208/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/GenericCache.cs#L165

In the generic version, the entry (basically what goes into [ ... ] and needs to be "sandwiched" between two version reads) contains {_key, _value}, which are generic, so reading the entry takes an unknown number of reads. We definitely want a read barrier in the generic implementation.

Volatile.Read(version), [ read TKey _key, read TValue _value ], [ReadMemoryBarrier] read version again

kouvel commented 3 months ago

The spec is terse and should be read very carefully to notice that it is not symmetrical. DMB ST has no effect on preceding reads.

Ah ok, I see what you mean now, I misread it.

VSadov commented 3 months ago

Is the main benefit of these APIs for unaligned or larger-than-atomic reads/writes?

Or to order multiple accesses at once.

kouvel commented 3 months ago

So with only acquires we would have Volatile.Read(version), [ Volatile.Read(source), Volatile.Read(targetResult) ], read version again

That is 3 volatile reads.

If we can have a read barrier, we can do:

Volatile.Read(version), [ read source, read targetResult ], [ReadMemoryBarrier] read version again

Why not the following? Volatile.Read(version), [ read source, Volatile.Read(targetResult) ], read version again

VSadov commented 3 months ago

Do ordinary memory operations on x86/x64 guarantee Read-ReadWrite and ReadWrite-Write ordering for acquire/release semantics?

Yes. We do not emit anything special for volatile accesses on xarch. Everything is effectively in program order, except reads after writes: reads can move ahead of writes on x86/x64, and volatile/half-fences do not address such reordering. You'd need a full fence to prevent that, on any platform, including xarch.

kouvel commented 3 months ago

Why not the following? Volatile.Read(version), [ read source, Volatile.Read(targetResult) ], read version again

I see you explained here:

To ensure that source and targetResult are both read before we read version for the second time, both must be volatile read, since an ordinary read can be arbitrary delayed even in the presence of acquiring reads.

But in that case, the following:

If we can have a read barrier, we can do: Volatile.Read(version), [ read source, read targetResult ], [ReadMemoryBarrier] read version again

I imagine would also not work because the ordinary read of source could similarly be ordered after the ReadMemoryBarrier, is that not the case?

kouvel commented 3 months ago

Everything is effectively in program order, except Reads after Writes. Reads can move ahead of Writes on x86/x64.

So if I understand correctly, that means Volatile.Write currently on x86/x64 is a Write-Write barrier since reads can be reordered after it, is that correct?

VSadov commented 3 months ago

Assuming that we agree in general on https://github.com/dotnet/runtime/issues/98837#issuecomment-1994903129 , I think we should update the proposal and the example. And probably rename the proposal - to be about barriers.

The GenericCache could work as another motivating example - for the read barrier.

Then we can move it to ready-for-review to settle on details like naming, etc.. I do not think we want a new proposal though. We should keep this one. There is a lot of history here.

VSadov commented 3 months ago

Everything is effectively in program order, except Reads after Writes. Reads can move ahead of Writes on x86/x64.

So if I understand correctly, that means Volatile.Write currently on x86/x64 is a Write-Write barrier since reads can be reordered after it, is that correct?

It is a ReadWrite-Write barrier - for ordinary writes as well (at the hardware level, ignoring compiler optimizations).

Reads will not be delayed until after a following write on xarch, but reading before a preceding write has completed (prefetching) is allowed.

VSadov commented 3 months ago

If we can have a read barrier, we can do: Volatile.Read(version), [ read source, read targetResult ], [ReadMemoryBarrier] read version again

I imagine would also not work because the ordinary read of source could similarly be ordered after the ReadMemoryBarrier, is that not the case?

No. That is the key difference of a barrier. It does not say "this read must happen before any access that follows" - because it can't, as a barrier does not designate "this read". Unlike an acquiring read, the barrier must say "all the reads before me must happen before any access that follows"

VSadov commented 3 months ago

Wonder if these should just be Read-Read and Write-Write barriers? I see the benefits of the stronger guarantees, but it seems the implementation would not be much more efficient.

I do not think it is a good idea to differ from acq/rel semantics, which is Read-ReadWrite and ReadWrite-Write. The subtle differences would be a source of subtle bugs.

For example releasing a lock could be a Volatile.Write(ref locked, 0);.

Now if that is replaced with WriteBarrier(); locked = 0;

and if that WriteBarrier is a Write-Write barrier (like dmb ishst), you would have a subtle issue: the reads inside the lock can now "leak" to after the release sequence, i.e. be delayed and happen after the lock is no longer yours.

alexrp commented 3 months ago

Sorry if I missed it somewhere in the discussion, but is there a compelling reason to use the "read/write barrier" terminology in the API naming vs "acquire/release barrier"? As I alluded to way above, I see several cons to this and no apparent pros.

VSadov commented 3 months ago

I think we carefully sidestepped the question of naming so far :-) Acquire/Release in the name could make sense.

Also if this goes onto Volatile, then, perhaps it would be obvious that Volatile.ReadBarrier actually matches the acquire/volatile semantics.
I guess it could depend on where the methods live.

kouvel commented 3 months ago

I do not think it is a good idea to differ from acq/rel semantics, which is Read-ReadWrite and ReadWrite-Write. The subtle differences would be a source of subtle bugs.

Yea I understand the benefits of the stronger ordering as I said before, it would just be a bit unfortunate to spec it that way. A lock may require stronger guarantees (or not, depending on its spec), but other cases may not require such strong ordering.

From a brief read, x86/x64's sfence also does not seem to be strongly ordered with loads, only stores.

VSadov commented 3 months ago

Actually I kind of like having these under Volatile

The full-fence Interlocked.MemoryBarrier could stay under Interlocked where everything is a full fence. But Read/Write barriers could be under Volatile where everything is either acquire or release.

kouvel commented 3 months ago

No. That is the key difference of a barrier. It does not say "this read must happen before any access that follows" - because it can't, as a barrier does not designate "this read". Unlike an acquiring read, the barrier must say "all the reads before me must happen before any access that follows"

This so happens to work on x86/arm. I wouldn't bet on this working everywhere though, and I don't think it should be specified to work as such.

hamarb123 commented 3 months ago

Just woke up, going to clarify a bunch of stuff:

We are naming them: Volatile.ReadBarrier and Volatile.WriteBarrier?

Today, we emit volatile read / write as a simple mov on x86/x64: link, and we emit volatile read / write as ldapur / stlur on arm64: link; therefore, when there is no preceding read (for ReadBarrier) or subsequent write (for WriteBarrier) to combine with, we would have to fall back to something stronger, but when there is an adjacent access we should combine them. Obviously this is in codegen; we still need to disable certain optimisations in the JIT.

We have the following barriers in .NET currently (from my original issue comment, tl;dr version of memory model):

We want to keep these volatile semantics for this API, I would assume (and that is what I proposed) - if we add options for more memory barriers, we should expose an API like C++'s that allows it to be done in general, probably with the same names C++ uses.

Therefore, we need to guarantee that memory accesses do not pass the barrier in the forbidden direction. But we can make the following optimisations:

Volatile.WriteBarrier();
location = value;

-> (x86)

mov ...

-> (arm64)

stlur ...

and

value = location;
Volatile.ReadBarrier();

-> (x86)

mov ...

-> (arm64)

ldapur ...

Notably, these optimisations are not always possible on either platform:

Volatile.WriteBarrier();
Volatile.ReadBarrier();
//the above is effectively a full barrier, which cannot just be omitted on either platform
//the cause here is the lack of adjacent memory access to combine with
Volatile.WriteBarrier(); //can't elide here on x86/etc assuming we don't know what the method does
NotInlinedMethod();

In terms of more generic fallbacks, such as when we call a method instead of performing a memory operation, we may have some half-barrier instructions on some platforms (I haven't looked into it in general), but on arm64 it seems there is no store-only barrier that does what we want, so we'd need to use dmb ish or similar instead.

If we're all happy with the above, please react with 👍 so I know, and I'll update the description to use the new proposed APIs; otherwise we can continue discussing any points of disagreement / misunderstanding 😀.

VSadov commented 3 months ago

This so happens to work on x86/arm. I wouldn't bet on this working everywhere though, and I don't think it should be specified to work as such.

Interlocked.ReadMemoryBarrier() works correctly as an acquire fence on all supported platforms.

I think matching acquire/release is important. That is the way of "least surprise", which is nice in API design. We should document barriers as such and make it work to the spec everywhere.

Platforms that need fences but do not have appropriate instructions will have to do full fences - as usual. This is what we do on ARM32: Interlocked.ReadMemoryBarrier() emits dmb on ARM32, which is the closest thing that is available there.

kouvel commented 3 months ago

I think matching acquire/release is important. That is the way of "least surprise", which is nice in API design. We should document barriers as such and make it work to the spec everywhere.

Just to be clear, we're only discussing acquire/release semantics, there are variants on how they can be specified that apparently we're differing on.

hamarb123 commented 3 months ago

Just to be clear, we're only discussing acquire/release semantics, there are variants on how they can be specified that apparently we're differing on.

We should match it with the rest of .NET semantics imo. If we want a more general solution we could do something like Interlocked.Barrier and expose the following enum to choose from: https://en.cppreference.com/w/cpp/atomic/memory_order; otherwise we should just match the rest of .NET.

hamarb123 commented 3 months ago

I have updated the issue description - if we're all happy with it, someone can mark it as https://github.com/dotnet/runtime/labels/api-ready-for-review :) thanks

kouvel commented 3 months ago

We should match it with the rest of .NET semantics imo. If we want a more general solution we could do something like Interlocked.Barrier and expose the following enum to chose from: https://en.cppreference.com/w/cpp/atomic/memory_order; otherwise we should just match the rest of .NET.

The .NET memory model for volatile accesses is clear. These are new APIs and I'd like to consider options. This is also an interesting read: https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering. It only makes very specific guarantees and, unless I misread it, does not make guarantees equivalent to those offered by the .NET memory model for volatile accesses. Interestingly, the summary descriptions of memory_order_acquire and memory_order_release (link) offer stronger guarantees than the detailed descriptions; of course, I might have missed something.

kouvel commented 3 months ago

I can appreciate the value in keeping these barriers aligned with the .NET memory model for volatile accesses, I just wanted to point out some things. They would be more expensive (perhaps unnecessarily in some cases), but I can understand that it would fit better.

hamarb123 commented 3 months ago

We could expose one like this:

[Flags]
public enum MemoryOrderKind
{
    ReadEarlier = 1,
    WriteEarlier = 2,
    ReadLater = 4,
    WriteLater = 8,
}

void MemoryBarrier(MemoryOrderKind blockedOps);

There's not really a way to expose whether it should be per memory location, unless we also expose the relevant read / write APIs.

My current proposed APIs would just be MemoryBarrier(MemoryOrderKind.ReadEarlier | MemoryOrderKind.WriteEarlier) and MemoryBarrier(MemoryOrderKind.ReadLater | MemoryOrderKind.WriteLater) with these.

kouvel commented 3 months ago

public enum MemoryOrderKind

It would be nice but maybe not necessary yet? We could always add an overload.

VSadov commented 3 months ago

Regarding exposing all possible memory ordering semantics - I think "never say never", but the need for that would be hard to demonstrate, while API costs are considerable.

It would entail introducing an enum for various memory ordering kinds and adding APIs to barriers/reads/writes/interlocked that would take extra parameters for ordering. Considering that things like CAS may have different ordering on the pass and fail branches, that is a lot of combinations. Most of them would make a difference only in extremely niche scenarios, on platforms that have direct support for them.

Adding two volatile barriers would not close the door to going all the way of std::memory_order, if that will ever be needed enough.