`volatile.` prefix doesn't work on `ldobj` or `stobj`

hamarb123 commented 1 year ago

Description

I'm probably the first person insane enough to try this :)

In IL code you can use the volatile. prefix for any of ldind, stind, ldfld, stfld, ldobj, stobj, initblk, cpblk, ldsfld, and stsfld (III.2.6). Notably, nowhere does it place any form of restrictions on what the type can be for APIs like ldobj and stobj; in fact it even specifies behaviour for atomicity, which would not make sense if it wasn't supposed to work on non-builtin types They do not provide atomicity, other than that guaranteed by the specification of §I.12.6.6. (I.12.6.7).

I.12.6.6 Atomic reads and writes specifies A conforming CLI shall guarantee that read and write access to properly aligned memory locations no larger than the native word size (the size of type native int) is atomic (see §I.12.6.2) when all the write accesses to a location are the same size. which would mean that for custom types which are structs with multiple fields, it should just do the applicable volatile operation for each field (in an unspecified order).

The point of the above is that something like:

struct A
{
    int a, b;
}

should work with:

volatile. ldobj !!0
//and:
volatile. stobj !!0

when A or int or any other type is the generic parameter !!0; and for A it should generate something similar to what it would as if I'd done it directly with each field instead.

To be clear, I ran into this as a result of determining that this operation would actually be useful to something I was trying to implement lock-free.

Reproduction Steps

struct A
{
    public int a, b;
}

static T VolatileRead<T>(ref T reference)
{
    //ldarg.0
    //volatile.
    //ldobj !!0
    //ret
}

static void VolatileWrite<T>(ref T reference, T value)
{
    //ldarg.0
    //ldarg.1
    //volatile.
    //stobj !!0
    //ret
}

static A ReadVolatile1(ref A value) => VolatileRead(ref value);
static int ReadVolatile1(ref int value) => VolatileRead(ref value);
static A ReadVolatile2(ref A value)
{
    A result;
    result.a = Volatile.Read(ref value.a);
    result.b = Volatile.Read(ref value.b);
    return result;
}
static int ReadVolatile2(ref int value) => Volatile.Read(ref value);

static void WriteVolatile1(ref A value, A value2) => VolatileWrite(ref value, value2);
static void WriteVolatile1(ref int value, int value2) => VolatileWrite(ref value, value2);
static void WriteVolatile2(ref A value, A value2)
{
    Volatile.Write(ref value2.a, value.a);
    Volatile.Write(ref value2.b, value.b);
}
static void WriteVolatile2(ref int value, int value2) => Volatile.Write(ref value, value2);

Expected behavior

ReadVolatile1(int&) and ReadVolatile2(int&) have the same (or very similar) codegen, and similarly with A&. And same with Write.

Actual behavior

I get the following codegen from crossgen2

```Asm ; Assembly listing for method Program:ReadVolatile1(byref):A ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M15271_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M15271_IG02: ;; offset=0008H 9000000B adrp x11, [HIGH RELOC #0x40000000004292b8] // function address 9100016B add x11, x11, [LOW RELOC #0x40000000004292b8] F9400161 ldr x1, [x11] ;; size=12 bbWeight=1 PerfScore 4.00 G_M15271_IG03: ;; offset=0014H A8C17BFD ldp fp, lr, [sp],#0x10 D61F0020 br x1 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=d3c5c458) for method Program:ReadVolatile1(byref):A ; ============================================================ ; Assembly listing for method Program:ReadVolatile2(byref):A ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 4, 4 ) byref -> x0 single-def ; V01 loc0 [V01 ] ( 3, 3 ) struct ( 8) [fp+18H] do-not-enreg[S] ld-addr-op ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ;* V03 tmp1 [V03 ] ( 0, 0 ) byref -> zero-ref "Inlining Arg" ;* V04 tmp2 [V04 ] ( 0, 0 ) byref -> zero-ref "Inlining Arg" ; V05 tmp3 [V05,T01] ( 2, 2 ) int -> [fp+18H] do-not-enreg[] V01.a(offs=0x00) P-DEP "field V01.a (fldOffset=0x0)" ; V06 tmp4 [V06,T02] ( 2, 2 ) int -> [fp+1CH] do-not-enreg[] V01.b(offs=0x04) P-DEP "field V01.b (fldOffset=0x4)" ; ; Lcl frame size = 16 G_M36164_IG01: ;; offset=0000H A9BE7BFD stp fp, lr, [sp,#-0x20]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M36164_IG02: ;; offset=0008H 88DFFC01 ldar w1, [x0] B9001BA1 str w1, [fp,#0x18] // [V05 tmp3] B9400400 ldr w0, [x0,#0x04] D50339BF dmb ishld B9001FA0 str w0, [fp,#0x1C] // [V06 tmp4] F9400FA0 ldr x0, [fp,#0x18] // [V01 loc0] ;; size=24 bbWeight=1 PerfScore 20.00 G_M36164_IG03: ;; offset=0020H A8C27BFD ldp fp, lr, [sp],#0x20 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 40, prolog size 8, PerfScore 27.50, instruction count 10, allocated bytes for code 40 (MethodHash=85ab72bb) for method Program:ReadVolatile2(byref):A ; ============================================================ ; Assembly listing for method Program:ReadVolatile1(byref):int ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M27221_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M27221_IG02: ;; offset=0008H 9000000B adrp x11, [HIGH RELOC #0x40000000004292e8] // function address 9100016B add x11, x11, [LOW RELOC #0x40000000004292e8] F9400161 ldr x1, [x11] ;; size=12 bbWeight=1 PerfScore 4.00 G_M27221_IG03: ;; offset=0014H A8C17BFD ldp fp, lr, [sp],#0x10 D61F0020 br x1 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=dc4895aa) for method Program:ReadVolatile1(byref):int ; ============================================================ ; Assembly listing for method Program:ReadVolatile2(byref):int ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M42934_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M42934_IG02: ;; offset=0008H 88DFFC00 ldar w0, [x0] ;; size=4 bbWeight=1 PerfScore 3.00 G_M42934_IG03: ;; offset=000CH A8C17BFD ldp fp, lr, [sp],#0x10 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 20, prolog size 8, PerfScore 8.50, instruction count 5, allocated bytes for code 20 (MethodHash=9e525849) for method Program:ReadVolatile2(byref):int ; ============================================================ ; Assembly listing for method Program:WriteVolatile1(byref,A) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ; V01 arg1 [V01,T01] ( 3, 3 ) struct ( 8) [fp+18H] do-not-enreg[S] single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 16 G_M5886_IG01: ;; offset=0000H A9BE7BFD stp fp, lr, [sp,#-0x20]! 910003FD mov fp, sp F9000FA1 str x1, [fp,#0x18] ;; size=12 bbWeight=1 PerfScore 2.50 G_M5886_IG02: ;; offset=000CH F9400FA1 ldr x1, [fp,#0x18] 9000000B adrp x11, [HIGH RELOC #0x4000000000429358] // function address 9100016B add x11, x11, [LOW RELOC #0x4000000000429358] F9400162 ldr x2, [x11] ;; size=16 bbWeight=1 PerfScore 6.00 G_M5886_IG03: ;; offset=001CH A8C27BFD ldp fp, lr, [sp],#0x20 D61F0040 br x2 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 36, prolog size 12, PerfScore 14.10, instruction count 9, allocated bytes for code 36 (MethodHash=7866e901) for method Program:WriteVolatile1(byref,A) ; ============================================================ ; Assembly listing for method Program:WriteVolatile2(byref,A) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 4, 4 ) byref -> x0 single-def ; V01 arg1 [V01 ] ( 4, 4 ) struct ( 8) [fp+18H] do-not-enreg[XS] addr-exposed ld-addr-op single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; V03 tmp1 [V03,T01] ( 2, 4 ) int -> x1 "Inlining Arg" ; V04 tmp2 [V04,T02] ( 2, 4 ) int -> x0 "Inlining Arg" ; ; Lcl frame size = 16 G_M11869_IG01: ;; offset=0000H A9BE7BFD stp fp, lr, [sp,#-0x20]! 910003FD mov fp, sp F9000FA1 str x1, [fp,#0x18] ;; size=12 bbWeight=1 PerfScore 2.50 G_M11869_IG02: ;; offset=000CH B9400001 ldr w1, [x0] D5033BBF dmb ish B9001BA1 str w1, [fp,#0x18] B9400400 ldr w0, [x0,#0x04] D5033BBF dmb ish B9001FA0 str w0, [fp,#0x1C] ;; size=24 bbWeight=1 PerfScore 28.00 G_M11869_IG03: ;; offset=0024H A8C27BFD ldp fp, lr, [sp],#0x20 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 44, prolog size 8, PerfScore 36.90, instruction count 11, allocated bytes for code 44 (MethodHash=597cd1a2) for method Program:WriteVolatile2(byref,A) ; ============================================================ ; Assembly listing for method Program:WriteVolatile1(byref,int) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ; V01 arg1 [V01,T01] ( 3, 3 ) int -> x1 single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M16972_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M16972_IG02: ;; offset=0008H 9000000B adrp x11, [HIGH RELOC #0x4000000000429388] // function address 9100016B add x11, x11, [LOW RELOC #0x4000000000429388] F9400162 ldr x2, [x11] ;; size=12 bbWeight=1 PerfScore 4.00 G_M16972_IG03: ;; offset=0014H A8C17BFD ldp fp, lr, [sp],#0x10 D61F0040 br x2 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=2db1bdb3) for method Program:WriteVolatile1(byref,int) ; ============================================================ ; Assembly listing for method Program:WriteVolatile2(byref,int) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ; V01 arg1 [V01,T01] ( 3, 3 ) int -> x1 single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M63279_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M63279_IG02: ;; offset=0008H 889FFC01 stlr w1, [x0] ;; size=4 bbWeight=1 PerfScore 1.00 G_M63279_IG03: ;; offset=000CH A8C17BFD ldp fp, lr, [sp],#0x10 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 20, prolog size 8, PerfScore 6.50, instruction count 5, allocated bytes for code 20 (MethodHash=abe008d0) for method Program:WriteVolatile2(byref,int) ; ============================================================ ```

as can be seen above, the 1 variants don't appear to be volatile at all. But they should all have extremely similar codegen to the matching 2 variants.

Regression?

Probably not :)

Known Workarounds

For specific cases, it's obviously easy enough to work around if you have access to all the fields, or if it's simply a class type or unmanaged, but it's basically unsolvable in the general case since gc-refs could be anywhere in it, so you can't just loop over the memory.

Configuration

I got the codegen using disasmo. It's for .NET 7 on arm64 using crossgen2. I haven't got a locally built .NET 8, so I haven't checked it, but I doubt it's much better :) (it would be good if someone else was able to test this in .NET 8). I was planning to use this on a .NET 6 project I have, so much for that idea lol. It probably won't be fixed until at least 9 at this point would be my guess (it would be great if we could fix it on 6, 7, and 8 too though 😄).

Other information

I haven't tested all of the other instructions, but it could easily be an issue specific to ldobj and stobj and the rest could be fine, this should be checked.

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details

### Description I'm probably the first person insane enough to try this :) In IL code you can use the `volatile.` prefix for any of `ldind`, `stind`, `ldfld`, `stfld`, `ldobj`, `stobj`, `initblk`, `cpblk`, `ldsfld`, and `stsfld` (III.2.6). Notably, nowhere does it place any form of restrictions on what the type can be for APIs like `ldobj` and `stobj`; in fact it even specifies behaviour for atomicity, which would not make sense if it wasn't supposed to work on non-builtin types `They do not provide atomicity, other than that guaranteed by the specification of §I.12.6.6.` (I.12.6.7). `I.12.6.6 Atomic reads and writes` specifies `A conforming CLI shall guarantee that read and write access to properly aligned memory locations no larger than the native word size (the size of type native int) is atomic (see §I.12.6.2) when all the write accesses to a location are the same size.` which would mean that for custom types which are structs with multiple fields, it should just do the applicable volatile operation for each field (in an unspecified order). The point of the above is that something like: ```csharp struct A { int a, b; } ``` should work with: ```msil volatile. ldobj !!0 //and: volatile. stobj !!0 ``` when `A` is the generic parameter `!!0`; and it should generate something similar to what it would as if I'd done it directly with each field instead. ### Reproduction Steps ```csharp struct A { public int a, b; } static T VolatileRead(ref T reference) { //ldarg.0 //volatile. //ldobj !!0 //ret } static void VolatileWrite(ref T reference, T value) { //ldarg.0 //ldarg.1 //volatile. //stobj !!0 //ret } static A ReadVolatile1(ref A value) => VolatileRead(ref value); static int ReadVolatile1(ref int value) => VolatileRead(ref value); static A ReadVolatile2(ref A value) { A result; result.a = Volatile.Read(ref value.a); result.b = Volatile.Read(ref value.b); return result; } static int ReadVolatile2(ref int value) => Volatile.Read(ref value); static void WriteVolatile1(ref A value, A value2) => VolatileWrite(ref value, value2); static void WriteVolatile1(ref int value, int value2) => VolatileWrite(ref value, value2); static void WriteVolatile2(ref A value, A value2) { Volatile.Write(ref value2.a, value.a); Volatile.Write(ref value2.b, value.b); } static void WriteVolatile2(ref int value, int value2) => Volatile.Write(ref value, value2); ``` ### Expected behavior `ReadVolatile1(int&)` and `ReadVolatile2(int&)` have the same (or very similar) codegen, and similarly with `A&`. And same with Write. ### Actual behavior I get the following codegen from crossgen2: ```Asm ; Assembly listing for method Program:ReadVolatile1(byref):A ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M15271_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M15271_IG02: ;; offset=0008H 9000000B adrp x11, [HIGH RELOC #0x40000000004292b8] // function address 9100016B add x11, x11, [LOW RELOC #0x40000000004292b8] F9400161 ldr x1, [x11] ;; size=12 bbWeight=1 PerfScore 4.00 G_M15271_IG03: ;; offset=0014H A8C17BFD ldp fp, lr, [sp],#0x10 D61F0020 br x1 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=d3c5c458) for method Program:ReadVolatile1(byref):A ; ============================================================ ; Assembly listing for method Program:ReadVolatile2(byref):A ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 4, 4 ) byref -> x0 single-def ; V01 loc0 [V01 ] ( 3, 3 ) struct ( 8) [fp+18H] do-not-enreg[S] ld-addr-op ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ;* V03 tmp1 [V03 ] ( 0, 0 ) byref -> zero-ref "Inlining Arg" ;* V04 tmp2 [V04 ] ( 0, 0 ) byref -> zero-ref "Inlining Arg" ; V05 tmp3 [V05,T01] ( 2, 2 ) int -> [fp+18H] do-not-enreg[] V01.a(offs=0x00) P-DEP "field V01.a (fldOffset=0x0)" ; V06 tmp4 [V06,T02] ( 2, 2 ) int -> [fp+1CH] do-not-enreg[] V01.b(offs=0x04) P-DEP "field V01.b (fldOffset=0x4)" ; ; Lcl frame size = 16 G_M36164_IG01: ;; offset=0000H A9BE7BFD stp fp, lr, [sp,#-0x20]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M36164_IG02: ;; offset=0008H 88DFFC01 ldar w1, [x0] B9001BA1 str w1, [fp,#0x18] // [V05 tmp3] B9400400 ldr w0, [x0,#0x04] D50339BF dmb ishld B9001FA0 str w0, [fp,#0x1C] // [V06 tmp4] F9400FA0 ldr x0, [fp,#0x18] // [V01 loc0] ;; size=24 bbWeight=1 PerfScore 20.00 G_M36164_IG03: ;; offset=0020H A8C27BFD ldp fp, lr, [sp],#0x20 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 40, prolog size 8, PerfScore 27.50, instruction count 10, allocated bytes for code 40 (MethodHash=85ab72bb) for method Program:ReadVolatile2(byref):A ; ============================================================ ; Assembly listing for method Program:ReadVolatile1(byref):int ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M27221_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M27221_IG02: ;; offset=0008H 9000000B adrp x11, [HIGH RELOC #0x40000000004292e8] // function address 9100016B add x11, x11, [LOW RELOC #0x40000000004292e8] F9400161 ldr x1, [x11] ;; size=12 bbWeight=1 PerfScore 4.00 G_M27221_IG03: ;; offset=0014H A8C17BFD ldp fp, lr, [sp],#0x10 D61F0020 br x1 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=dc4895aa) for method Program:ReadVolatile1(byref):int ; ============================================================ ; Assembly listing for method Program:ReadVolatile2(byref):int ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M42934_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M42934_IG02: ;; offset=0008H 88DFFC00 ldar w0, [x0] ;; size=4 bbWeight=1 PerfScore 3.00 G_M42934_IG03: ;; offset=000CH A8C17BFD ldp fp, lr, [sp],#0x10 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 20, prolog size 8, PerfScore 8.50, instruction count 5, allocated bytes for code 20 (MethodHash=9e525849) for method Program:ReadVolatile2(byref):int ; ============================================================ ; Assembly listing for method Program:WriteVolatile1(byref,A) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ; V01 arg1 [V01,T01] ( 3, 3 ) struct ( 8) [fp+18H] do-not-enreg[S] single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 16 G_M5886_IG01: ;; offset=0000H A9BE7BFD stp fp, lr, [sp,#-0x20]! 910003FD mov fp, sp F9000FA1 str x1, [fp,#0x18] ;; size=12 bbWeight=1 PerfScore 2.50 G_M5886_IG02: ;; offset=000CH F9400FA1 ldr x1, [fp,#0x18] 9000000B adrp x11, [HIGH RELOC #0x4000000000429358] // function address 9100016B add x11, x11, [LOW RELOC #0x4000000000429358] F9400162 ldr x2, [x11] ;; size=16 bbWeight=1 PerfScore 6.00 G_M5886_IG03: ;; offset=001CH A8C27BFD ldp fp, lr, [sp],#0x20 D61F0040 br x2 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 36, prolog size 12, PerfScore 14.10, instruction count 9, allocated bytes for code 36 (MethodHash=7866e901) for method Program:WriteVolatile1(byref,A) ; ============================================================ ; Assembly listing for method Program:WriteVolatile2(byref,A) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 4, 4 ) byref -> x0 single-def ; V01 arg1 [V01 ] ( 4, 4 ) struct ( 8) [fp+18H] do-not-enreg[XS] addr-exposed ld-addr-op single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; V03 tmp1 [V03,T01] ( 2, 4 ) int -> x1 "Inlining Arg" ; V04 tmp2 [V04,T02] ( 2, 4 ) int -> x0 "Inlining Arg" ; ; Lcl frame size = 16 G_M11869_IG01: ;; offset=0000H A9BE7BFD stp fp, lr, [sp,#-0x20]! 910003FD mov fp, sp F9000FA1 str x1, [fp,#0x18] ;; size=12 bbWeight=1 PerfScore 2.50 G_M11869_IG02: ;; offset=000CH B9400001 ldr w1, [x0] D5033BBF dmb ish B9001BA1 str w1, [fp,#0x18] B9400400 ldr w0, [x0,#0x04] D5033BBF dmb ish B9001FA0 str w0, [fp,#0x1C] ;; size=24 bbWeight=1 PerfScore 28.00 G_M11869_IG03: ;; offset=0024H A8C27BFD ldp fp, lr, [sp],#0x20 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 44, prolog size 8, PerfScore 36.90, instruction count 11, allocated bytes for code 44 (MethodHash=597cd1a2) for method Program:WriteVolatile2(byref,A) ; ============================================================ ; Assembly listing for method Program:WriteVolatile1(byref,int) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ; V01 arg1 [V01,T01] ( 3, 3 ) int -> x1 single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M16972_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M16972_IG02: ;; offset=0008H 9000000B adrp x11, [HIGH RELOC #0x4000000000429388] // function address 9100016B add x11, x11, [LOW RELOC #0x4000000000429388] F9400162 ldr x2, [x11] ;; size=12 bbWeight=1 PerfScore 4.00 G_M16972_IG03: ;; offset=0014H A8C17BFD ldp fp, lr, [sp],#0x10 D61F0040 br x2 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=2db1bdb3) for method Program:WriteVolatile1(byref,int) ; ============================================================ ; Assembly listing for method Program:WriteVolatile2(byref,int) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ; V01 arg1 [V01,T01] ( 3, 3 ) int -> x1 single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M63279_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M63279_IG02: ;; offset=0008H 889FFC01 stlr w1, [x0] ;; size=4 bbWeight=1 PerfScore 1.00 G_M63279_IG03: ;; offset=000CH A8C17BFD ldp fp, lr, [sp],#0x10 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 20, prolog size 8, PerfScore 6.50, instruction count 5, allocated bytes for code 20 (MethodHash=abe008d0) for method Program:WriteVolatile2(byref,int) ; ============================================================ ``` as can be seen above, the `1` variants don't appear to be volatile at all. But they should all have extremely similar codegen to the matching `2` variants. ### Regression? Probably not :) ### Known Workarounds For specific cases, it's obviously easy enough to work around if you have access to all the fields, or if it's simply a class type or `unmanaged`, but it's basically unsolvable in the general case since gc-refs could be anywhere in it, so you can't just loop over the memory. ### Configuration I got the codegen using disasmo. It's for .NET 7 on arm64 using crossgen2. I haven't got a locally built .NET 8, so I haven't checked it, but I doubt it's much better :) (it would be good if someone else was able to test this in .NET 8). I was planning to use this on a .NET 6 project I have, so much for that idea lol. It probably won't be fixed until at least 9 at this point would be my guess. ### Other information I haven't tested all of the other instructions, but it could easily be an issue specific to `ldobj` and `stobj` and the rest could be fine, this should be checked.

Author:	hamarb123
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

EgorBo commented 1 year ago

It's probably doable with some efforts accross all runtimes, but since it's not exposed in C# (or any other .NET language) it's probably better to leave it as is and refer to this ECMA's update: https://github.com/dotnet/runtime/blob/main/docs/design/specs/Ecma-335-Augments.md#atomic-reads-and-writes

hamarb123 commented 1 year ago

probably better to leave it as is

Not sure what you're referring to here - I don't expect atomicity guarantees (except for the builtin primitive types & pointers obviously), the issue is that the volatile guarantees are not met.

EgorBo commented 1 year ago

probably better to leave it as is

Not sure what you're referring to here - I don't expect atomicity guarantees (except for the builtin primitive types & pointers obviously), the issue is that the volatile guarantees are not met.

~~Does it make any sense without atomicity guarantees?~~

Also, I don't see from your codegen what's wrong - your *1 variants have calls which may or may not provide volatile guarantees - how did you know?

hamarb123 commented 1 year ago

Does it make any sense without atomicity guarantees?

Yes, I had the example on the discord earlier (am happy to post it here if you would like).

Also, I don't see from your codegen what's wrong - your *1 variants have calls which may or may not provide volatile guarantees - how did you know?

I must have missed that, do you know what function it's calling (all mine here are aggressiveinlining)? Is there a way I could test whether it's working?

public static int Method1(void* address, int a) => VolatileRead(ref *(int*)address);
public static int Method1(void* address, short a) => *(int*)address;
public static int Method1(void* address, byte a) => System.Threading.Volatile.Read(ref *(int*)address);

gives

; Assembly listing for method Program:Method1(ulong,int):int
G_M25106_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
                        ;; size=8 bbWeight=1    PerfScore 1.50
G_M25106_IG02:              ;; offset=0008H
        9000000B          adrp    x11, [HIGH RELOC #0x40000000004292a8]      // function address
        9100016B          add     x11, x11, [LOW RELOC #0x40000000004292a8]
        F9400161          ldr     x1, [x11]
                        ;; size=12 bbWeight=1    PerfScore 4.00
G_M25106_IG03:              ;; offset=0014H
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D61F0020          br      x1
                        ;; size=8 bbWeight=1    PerfScore 2.00

; Assembly listing for method Program:Method1(ulong,short):int
G_M46963_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
                        ;; size=8 bbWeight=1    PerfScore 1.50
G_M46963_IG02:              ;; offset=0008H
        B9400000          ldr     w0, [x0]
                        ;; size=4 bbWeight=1    PerfScore 3.00
G_M46963_IG03:              ;; offset=000CH
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D65F03C0          ret     lr
                        ;; size=8 bbWeight=1    PerfScore 2.00

; Assembly listing for method Program:Method1(ulong,ubyte):int
G_M27038_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
                        ;; size=8 bbWeight=1    PerfScore 1.50
G_M27038_IG02:              ;; offset=0008H
        88DFFC00          ldar    w0, [x0]
                        ;; size=4 bbWeight=1    PerfScore 3.00
G_M27038_IG03:              ;; offset=000CH
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D65F03C0          ret     lr
                        ;; size=8 bbWeight=1    PerfScore 2.00

why is the one which calls VolatileRead seemingly so much worse? Shouldn't it just use ldar too?

EgorBo commented 1 year ago

why is the one which calls VolatileRead seemingly so much worse?

Presumably, it's a cross-assembly inlining in crossgen limitation (I assume your VolatileRead is written in pure IL in a separate module).

hamarb123 commented 1 year ago

Presumably, it's a cross-assembly inlining in crossgen limitation

I think you're right, it seems to generate the correct code when I use clrjit_universal_arm64_x64 instead for int.

1 other thing to check, the expected thing for VolatileCopyBlock would be dmb ish before the call to bl CORINFO_HELP_MEMCPY & the rest is the same as without volatile. & similar for InitBlock?

It does seem to be not right for this random type I came up with though:

; Method Program:Method1(ulong,int):System.ValueTuple`3[int,ubyte,long]
G_M62410_IG01:              ;; offset=0000H
        A9BE7BFD          stp     fp, lr, [sp,#-0x20]!
        910003FD          mov     fp, sp
                        ;; size=8 bbWeight=1    PerfScore 1.50

G_M62410_IG02:              ;; offset=0008H
        3DC00010          ldr     q16, [x0]
        3D8007B0          str     q16, [fp,#0x10]
        F9400BA0          ldr     x0, [fp,#0x10]
        F9400FA1          ldr     x1, [fp,#0x18]
                        ;; size=16 bbWeight=1    PerfScore 8.00

G_M62410_IG03:              ;; offset=0018H
        A8C27BFD          ldp     fp, lr, [sp],#0x20
        D65F03C0          ret     lr
                        ;; size=8 bbWeight=1    PerfScore 2.00
; Total bytes of code: 32

; Method Program:Method1(ulong,short):System.ValueTuple`3[int,ubyte,long]
G_M11179_IG01:              ;; offset=0000H
        A9BE7BFD          stp     fp, lr, [sp,#-0x20]!
        910003FD          mov     fp, sp
                        ;; size=8 bbWeight=1    PerfScore 1.50

G_M11179_IG02:              ;; offset=0008H
        3DC00010          ldr     q16, [x0]
        3D8007B0          str     q16, [fp,#0x10]
        F9400BA0          ldr     x0, [fp,#0x10]
        F9400FA1          ldr     x1, [fp,#0x18]
                        ;; size=16 bbWeight=1    PerfScore 8.00

G_M11179_IG03:              ;; offset=0018H
        A8C27BFD          ldp     fp, lr, [sp],#0x20
        D65F03C0          ret     lr
                        ;; size=8 bbWeight=1    PerfScore 2.00
; Total bytes of code: 32

for

public static (int, byte, long) Method1(void* address, int a) => VolatileRead(ref *((int, byte, long)*)address);
public static (int, byte, long) Method1(void* address, short a) => *((int, byte, long)*)address;

(feel free to inform me that I'm wrong if I am)

hamarb123 commented 1 year ago

Just so it's clear to anyone reading, specifically what I would expect to happen with the volatile operations on non-primitives/pointers, or when it's unaligned (since volatile can also be combined with unaligned.), is:

The ordering guarantees (these are the most important)
The read/write operation can be considered as one whole operation, therefore the fields can be read/written in any order whichever is most practical
It provides the normal volatile guarantees: e.g. lack of elimination, etc.
It doesn't need to be atomic since that wouldn't be guaranteed for a normal read/write either for these types

Implementing the above (for cases which aren't already handled like primitive types & pointers) isn't too difficult, as you need only put the applicable half barrier before/after the standard read/write code - or use the applicable store/write instruction if there's only 1 of them (or the first for write, last for read) and there's a volatile version available of that operation.

hamarb123 commented 1 year ago

For volatile. cpblk/initblk, on arm we are currently using dmb ish when we could be using dmb ishst.

Additional notes from when I investigated it last: also volatile. initblk fails when it gets inlined, and volatile. cpblk seems to provide only a store barrier, which doesn't seem right.

dotnet / runtime