[API Proposal]: Volatile barrier APIs

hamarb123 commented 5 months ago

Background and motivation

This API proposal exposes methods to perform non-atomic volatile memory operations. Our volatile semantics are explained in our memory model, but I will outline the tl;dr of the relevant parts here:

Usually memory accesses can be re-ordered and combined by the C# compiler, JIT, and CPU
Volatile writes disallow the write to occur earlier than specified (i.e., prior memory accesses must complete before this executes)
Volatile reads disallow the read to occur later than specified (i.e., subsequent memory accesses can only begin after this completes)
Memory accesses to primitive types are always atomic if they're properly aligned (unless unaligned. is used), and either 1) the size of the type is at most the size of the pointer, or 2) a method on Volatile or Interlocked such as Volatile.Write(double&, double) has been called

Currently, we expose APIs on Volatile. for the atomic memory accesses, but there is no way to perform the equivalent operations for non-atomic types. If we have Volatile barrier APIs, they will be easy to write, and it should make it clear which memory operations can move past the barrier in which ways.

API Proposal

namespace System.Threading;

public static class Volatile
{
    public static void ReadBarrier();
    public static void WriteBarrier();
}

=== Desired semantics:

Volatile.ReadBarrier()

Provides a Read-ReadWrite barrier. All reads preceding the barrier will need to complete before any subsequent memory operation.

Volatile.ReadBarrier() matches the semantics of Volatile.Read in terms of ordering reads, relative to all subsequent, in program order, operations.

The important difference from Volatile.Read(ref x) is that Volatile.ReadBarrier() has effect on all preceeding reads and not just a particular single read of x.

Volatile.WriteBarrier()

Provides a ReadWrite-Write barrier. All memory operations preceding the barrier will need to complete before any subsequent write.

Volatile.WriteBarrier() matches the semantics of Volatile.Write in terms of ordering writes, relative to all preceeding, in program order, operations.

The important difference from Volatile.Write(ref x) is that Volatile.WriteBarrier() has effect on all subsequent writes and not just a particular single write of x.

The actual implementation will depend on underlying platform.

On TSO architectures (x86/x64) it would have only compiler optimization-preventing effects.
On weaker architectures some ordering instructions will be emitted in addition - as appropriate and available in the given ISA.

API Usage

The runtime uses an internal API Interlocked.ReadMemoryBarrier() in 2 places (here and here) to batch multiple reads on both CoreCLR and NativeAOT, and is supported on all platforms. This ability is also useful to third-party developers (such as me, in my example below), but is currently not possible to write efficiently.

An example where non-atomic volatile operations would be useful is as follows. Consider a game which wants to save its state, ideally while continuing to run; these are the most obvious options:

Save on the main thread (can cause lag spikes if lots of data to save)
Create a copy of the data and then save on a separate thread (less time spent on main thread, but potentially still non-trivial, and a large number of allocations most likely)
Use locks around access to the memory (massive overhead)
Just save it on another thread without any sync (saves an inconsistent state)

But there is actually another option which utilises non-atomic volatile semantics:

Has low overhead, especially on x86 and x64
Saves a consistent state

//main threads sets IsSaving to true and increments SavingVersion before starting the saving thread, and to false once it's definitely done (e.g., on next frame)
//saving thread performs a full memory barrier before starting (when required, since starting a brand new thread every time isn't ideal), to ensure that _value is up-to-date
//memory synchronisation works because _value is always read before any saving information, and it's always written after the saving information
//if the version we read on the saving thread is not the current version, then our read from _value is correct, otherwise our read from _savingValue will be correct
//in the rare case that we loop to saving version == 0, then we can manually write all _savingVersion values to 0, skip to version == 1, and go from there (excluded from here though for clarity)

static class SavingState
{
    public static bool IsSaving { get; set; }
    public static nuint SavingVersion { get; set; }
}

struct SaveableHolder<T>
{
    nuint _savingVersion;
    T _value;
    T _savingValue;

    //Called only from main thread
    public T Value
    {
        get => _value;
        set
        {
            if (SavingState.IsSaving)
            {
                if (SavingVersion != SavingState.SavingVersion)
                {
                    _savingValue = _value;

                    //ensure the saving value is written before the saving version, so that we read it in the correct order
                    Volatile.Write(ref _savingVersion, SavingState.SavingVersion);
                }

                //_value can only become torn or incorrect after we have written our saving value and version
                Volatile.WriteBarrier();
                _value = value; //write must occur after prior writes
            }
            else
            {
                _value = value;
            }
        }
    }

    //Called only from saving thread while SavingState.IsSaving with a higher SavingState.SavingVersion than last time
    public T SavingValue
    {
        get
        {
            var value = Value; //read must occur before reads
            Volatile.ReadBarrier();

            //_savingVersion must be read after _value is, so if it's visibly changed/changing then we will either catch it here
            if (Volatile.Read(in _savingVersion) != SavingState.SavingVersion) return value;

            //volatile read on _savingVersion ensures we get an up-to-date _savingValue since it's written first
            return _savingValue;
        }
    }
}

Alternative Designs

We could expose read/write APIs instead:
```
namespace System.Threading;
```

public static class Volatile { public static T ReadNonAtomic(ref readonly T location) where T : allows ref struct { //ldarg.0 //volatile. //ldobj !!T }

public static void WriteNonAtomic<T>(ref T location, T value) where T : allows ref struct
{
    //ldarg.0
    //ldarg.1
    //volatile.
    //stobj !!T
}

}


We do have IL instructions, but they're currently broken and not exposed, see https://github.com/dotnet/runtime/issues/91530 - the proposal here was originally to expose APIs for `volatile. ldobj` and `volatile. stobj` + the unaligned variants (as seen aobe), and fix the instructions (or implement these without the instructions and have the instructions call these APIs - not much of a difference really). It was changed based on feedback to expose barrier APIs, which can provide equivalent semantics, but also allow additional scenarios. It is also clearer which memory operations can be reordered with the barrier APIs.

- We could expose APIs on Interlocked instead:
```csharp
public static class Interlocked
{
    // Existing API
    public static void MemoryBarrier();
    // New APIs
    public static void MemoryBarrierAcquire(); //volatile read semantics
    public static void MemoryBarrierRelease(); //volatile write semantics
}

We could expose the APIs on Unsafe instead:


namespace System.Runtime.CompilerServices;

public static class Unsafe { public static T ReadVolatile(ref readonly T location) where T : allows ref struct; public static void WriteVolatile(ref T location, T value) where T : allows ref struct; }

- We could add unaligned overloads:
```csharp
namespace System.Runtime.CompilerServices;

public static class Unsafe
{
    public static T ReadVolatileUnaligned<T>(ref readonly byte location) where T : allows ref struct;
    public static void WriteVolatileUnaligned<T>(ref byte location, T value) where T : allows ref struct;
}

We could also expose APIs for other operations which allow volatile. - initblk and cpblk, people may have use for these also:
```
namespace System.Runtime.CompilerServices;
```

public static class Unsafe { public static void CopyBlockVolatile(ref byte destination, ref readonly byte source, uint byteCount); public static void CopyBlockVolatileUnaligned(ref byte destination, ref readonly byte source, uint byteCount); public static void InitBlockVolatile(ref byte startAddress, byte value, uint byteCount); public static void InitBlockVolatileUnaligned(ref byte startAddress, byte value, uint byteCount); }


- We could expose APIs similar to what C++ has: https://en.cppreference.com/w/cpp/atomic/memory_order

### Open Questions

There is a question as to whether we should have `Read-ReadWrite`/`ReadWrite-Write` barriers or `Read-Read`/`Write-Write` barriers. I was initially in favour of the former (as it matches our current memory model), but now think the latter is probably better, since there are many scenarios (including in my example API usage, and the runtime's uses too) where the additional guarantees provided by the former are unnecessary, and thus may cause unnecessary overhead. We could also just provide both if we think they're both useful.

### Risks

No more than other volatile/interlocked APIs really, other than potential misunderstanding of what they do.

kouvel commented 4 months ago

I don't think it would be a big deal to add different ordering mechanics, it's not there already because there hasn't been sufficient interest. That said, we shouldn't confuse ease of development with design.

kouvel commented 4 months ago

And that would be a way forward for those cases where the .NET memory model is too strict.

hamarb123 commented 4 months ago

What about:

void ReadBarrier();
void WriteBarrier();
void PartialReadBarrier(); // or Full based on which we want to be the default / HalfReadBarrier / etc.
void PartialWriteBarrier(); // same as above

? I don't have an issue on providing the ones with less guarantees that only stop the same operation, not all operations (they would suffice in my code example for example).

What I have an issue with is not also providing one's that match our current memory model, whether we only provide these, or provide these in addition to others.

VSadov commented 4 months ago

@hamarb123 - Regarding the API proposal. A few suggestions:

I am not sure it is worth mentioning about possible optimizations. I'd leave that to implementors to figure whether and when which optimization is applicable. (I think "fusing" optimizations would not be valid in general, for example, but maybe there are cases where they would work).
By definition optimizations should not be observable via program effects, so we should just specify the desired semantics and JIT developers will do whatever is needed/possible.
One more use of the barriers is to order a batch of multiple reads or writes. That also includes ordering non-atomic reads/writes like with structs. It is worth mentioning. This is the reason why Interlocked.ReadMemoryBarrier() exists. The fact that we have it in two runtimes (CoreCLR and NativeAOT) and on all supported platforms is a good evidence that someone else might use the barriers for similar purposes.
Not sure we need to go in details about what we are no longer proposing - like volatile. ldobj and cpblk and initblk. Not in the beginning of the proposal anyways. This is a complicated topic - imagine the reader is losing focus a little with every word they read. We do not want to spend that attention on cpblk. :-) Maybe worth mentioning in alternative designs, but not in the actual proposal.

kouvel commented 4 months ago

Marked as ready-for-review, we can discuss further there.

hamarb123 commented 4 months ago

@VSadov I will update it when I'm able later today :) Thanks

kouvel commented 4 months ago

I do think the semantics of the operations need to be clearly defined. For instance, though it would be unfortunate, the difference indicated here would need to be clearly specified.

kouvel commented 4 months ago

It may be a matter of documentation, but we should have a clear understanding of what the aim is for now from the OP.

VSadov commented 4 months ago

For the desired semantics. I think we should start with:

Volatile.ReadBarrier() Provides a Read-ReadWrite barrier. All reads preceding the barrier will need to complete before any subsequent memory operation.

Volatile.ReadBarrier() matches the semantics of Volatile.Read in terms of ordering reads, relative to all subsequent, in program order, operations.

The important difference from Volatile.Read(ref x) is that Volatile.ReadBarrier() has effect on all preceeding reads and not just a particular single read of x.

Volatile.WriteBarrier() Provides a ReadWrite-Write barrier. All memory operations preceding the barrier will need to complete before any subsequent write.

Volatile.WriteBarrier() matches the semantics of Volatile.Write in terms of ordering writes, relative to all preceeding, in program order, operations.

The important difference from Volatile.Write(ref x) is that Volatile.WriteBarrier() has effect on all subsequent writes and not just a particular single write of x.

The actual implementation will depend on underlying platform.

On TSO architectures (x86/x64) it would have only compiler optimization-preventing effects.
On weaker architectures some ordering instructions will be emitted in addition - as appropriate and available in the given ISA.

hamarb123 commented 4 months ago

It may be a matter of documentation, but we should have a clear understanding of what the aim is for now from the OP.

My main aim is to enable the API usage I have as an example. It would also be nice if we could fix volatile. prefixes, but this can be done separately if desired.

Notably, I wouldn't actually need ReadWrite-Write or Read-ReadWrite barriers I think, I believe Write-Write and Read-Read should be enough for this.

hamarb123 commented 4 months ago

@VSadov I've updated it, can you double check that it's fine?

VSadov commented 4 months ago

@hamarb123 Looks very good! Thanks!

I will add the expected semantics to the proposed entry points. But we will see where we will land with those after reviews.

hamarb123 commented 4 months ago

Btw @VSadov, both my example and the runtime's usages only seem to need Read-Read/Write-Write, so I think it'd be good to get overloads for those if we also keep the Read-ReadWrite/ReadWrite-Write ones that match our current memory model, since they should have lower overhead and seem to be all that would be required most of the time. It's in the open questions section, but just thought I'd mention it so you're aware if you hadn't seen it.

kouvel commented 4 months ago

An alternative may be to overload Interlocked.MemoryBarrier with a MemoryConstraint enum or something like that, somewhat like in C++. The enum values could be something like ReadWrite (similar to full), Acquire, Release, and perhaps in the future if needed, Read and Write, which would be Read-Read and Write-Write respectively. Another enum value that may be useful for CAS operations could be None (similar to relaxed), if we were to expand those APIs with similar overloads. The APIs being on the Volatile class may imply that they have volatile semantics, which are very specific, and overloading them with options of different semantics may appear odd.

kouvel commented 4 months ago

For instance, there are already use cases in Lock that could benefit from acquire/release/relaxed semantics for CAS operations. Enabling more granular barriers has also been proposed before.

VSadov commented 4 months ago

Btw @VSadov, both my example and the runtime's usages only seem to need Read-Read/Write-Write, so I think it'd be good to get overloads for those if we also keep the Read-ReadWrite/ReadWrite-Write ones that match our current memory model, since they should have lower overhead and seem to be all that would be required most of the time. It's in the open questions section, but just thought I'd mention it so you're aware if you hadn't seen it.

Yes, I noticed. It is a common thing with volatile. While volatile orders relatively to all accesses, some cases, typically involving a chain of several volatile accesses when you have just writes or just reads in a row, could use a weaker fence. This is a case in both scenarios that you mention.

The main impact of a fence is forbidding optimizations at hardware level. They would not necessarily make the memory accesses to cost more. The level of cache that is being used is likely a lot more impactful than forcing a particular order of accesses. Intuitively, with everything else the same, a weaker barrier would be cheaper, but I am not sure by how much in reality - 10%? 1%?

Figuring the minimum strength required would be even more difficult and error-prone task than figuring when Volatile is needed. Honestly - sometimes people just put volatile on everything accessed from different threads - because it is not that expensive, compared to bugs that could happen once a week and a year after something shipped, just because there is a new chip on the market and it does something different from where the code was originally tested.

I think going all the way of std::memory_order is possible, but being possible might not be enough reason to do it.

VSadov commented 4 months ago

I think one datapoint that could be useful for the ReadWrite-Write vs. Write-Write discussion, could be the performance difference of dmb ish vs. dmb ishst on a few arm64 implementations - just to have a practical perspective on potential wins.

kouvel commented 4 months ago

The perf differences may be more apparent in memory-intensive situations where the extra ordering constraints would disable some optimizations and impose extra work on the processor / cache. It may be difficult to measure the difference in typical microbenchmarks, though perhaps it would become more apparent by somehow folding in some memory pressure and measuring maybe not just the operation in question but also latency of other memory operations.

jkotas commented 4 months ago

I think going all the way of std::memory_order is possible, but being possible might not be enough reason to do it.

I agree. I think we should start with simple barriers that are aligned with .NET memory model, and wait for evidence that we need more.

It is a non-goal for .NET programs to express everything that is possible. We strike a balance between simplicity and what may be possible in theory.

hamarb123 commented 4 months ago

I think going all the way of std::memory_order is possible, but being possible might not be enough reason to do it.

I agree. I think we should start with simple barriers that are aligned with .NET memory model, and wait for evidence that we need more.

It is worth mentioning, out of my use case, and the 2 uses of ReadBarrier in the runtime, both only require Read-Read/Write-Write barriers, whereas the ones matching our memory model would be Read-ReadWrite/ReadWrite-Write. This would result in throwing away some performance on arm for no reason other than lack of APIs (although I do not know precisely how much). I do think this is evidence that the full Read-ReadWrite/ReadWrite-Write barriers are probably less commonly needed than just the Read-Read/Write-Write barriers.

Edit: I'd still be happy if we just ended up with the ones that matched our memory model, but I'd obviously be more happy if we got the Read-Read/Write-Write ones, since they'd perform slightly better and be all I require.

bartonjs commented 1 month ago

Video

Looks good as proposed.

There was a very long discussion about memory models, what the barrier semantics are, and whether we want to do something more generalized in this release. In the end, we accepted the original proposal.

namespace System.Threading;

public static class Volatile
{
    public static void ReadBarrier();
    public static void WriteBarrier();
}

dotnet / runtime