Open hamarb123 opened 1 year ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Author: | hamarb123 |
---|---|
Assignees: | - |
Labels: | `area-CodeGen-coreclr` |
Milestone: | - |
It's probably doable with some efforts accross all runtimes, but since it's not exposed in C# (or any other .NET language) it's probably better to leave it as is and refer to this ECMA's update: https://github.com/dotnet/runtime/blob/main/docs/design/specs/Ecma-335-Augments.md#atomic-reads-and-writes
probably better to leave it as is
Not sure what you're referring to here - I don't expect atomicity guarantees (except for the builtin primitive types & pointers obviously), the issue is that the volatile guarantees are not met.
probably better to leave it as is
Not sure what you're referring to here - I don't expect atomicity guarantees (except for the builtin primitive types & pointers obviously), the issue is that the volatile guarantees are not met.
Does it make any sense without atomicity guarantees?
Also, I don't see from your codegen what's wrong - your *1 variants have calls which may or may not provide volatile guarantees - how did you know?
Does it make any sense without atomicity guarantees?
Yes, I had the example on the discord earlier (am happy to post it here if you would like).
Also, I don't see from your codegen what's wrong - your *1 variants have calls which may or may not provide volatile guarantees - how did you know?
I must have missed that, do you know what function it's calling (all mine here are aggressiveinlining
)? Is there a way I could test whether it's working?
public static int Method1(void* address, int a) => VolatileRead(ref *(int*)address);
public static int Method1(void* address, short a) => *(int*)address;
public static int Method1(void* address, byte a) => System.Threading.Volatile.Read(ref *(int*)address);
gives
; Assembly listing for method Program:Method1(ulong,int):int
G_M25106_IG01: ;; offset=0000H
A9BF7BFD stp fp, lr, [sp,#-0x10]!
910003FD mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M25106_IG02: ;; offset=0008H
9000000B adrp x11, [HIGH RELOC #0x40000000004292a8] // function address
9100016B add x11, x11, [LOW RELOC #0x40000000004292a8]
F9400161 ldr x1, [x11]
;; size=12 bbWeight=1 PerfScore 4.00
G_M25106_IG03: ;; offset=0014H
A8C17BFD ldp fp, lr, [sp],#0x10
D61F0020 br x1
;; size=8 bbWeight=1 PerfScore 2.00
; Assembly listing for method Program:Method1(ulong,short):int
G_M46963_IG01: ;; offset=0000H
A9BF7BFD stp fp, lr, [sp,#-0x10]!
910003FD mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M46963_IG02: ;; offset=0008H
B9400000 ldr w0, [x0]
;; size=4 bbWeight=1 PerfScore 3.00
G_M46963_IG03: ;; offset=000CH
A8C17BFD ldp fp, lr, [sp],#0x10
D65F03C0 ret lr
;; size=8 bbWeight=1 PerfScore 2.00
; Assembly listing for method Program:Method1(ulong,ubyte):int
G_M27038_IG01: ;; offset=0000H
A9BF7BFD stp fp, lr, [sp,#-0x10]!
910003FD mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M27038_IG02: ;; offset=0008H
88DFFC00 ldar w0, [x0]
;; size=4 bbWeight=1 PerfScore 3.00
G_M27038_IG03: ;; offset=000CH
A8C17BFD ldp fp, lr, [sp],#0x10
D65F03C0 ret lr
;; size=8 bbWeight=1 PerfScore 2.00
why is the one which calls VolatileRead
seemingly so much worse? Shouldn't it just use ldar
too?
why is the one which calls VolatileRead seemingly so much worse?
Presumably, it's a cross-assembly inlining in crossgen limitation (I assume your VolatileRead is written in pure IL in a separate module).
Presumably, it's a cross-assembly inlining in crossgen limitation
I think you're right, it seems to generate the correct code when I use clrjit_universal_arm64_x64
instead for int
.
1 other thing to check, the expected thing for VolatileCopyBlock would be dmb ish
before the call to bl CORINFO_HELP_MEMCPY
& the rest is the same as without volatile.
& similar for InitBlock?
It does seem to be not right for this random type I came up with though:
; Method Program:Method1(ulong,int):System.ValueTuple`3[int,ubyte,long]
G_M62410_IG01: ;; offset=0000H
A9BE7BFD stp fp, lr, [sp,#-0x20]!
910003FD mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M62410_IG02: ;; offset=0008H
3DC00010 ldr q16, [x0]
3D8007B0 str q16, [fp,#0x10]
F9400BA0 ldr x0, [fp,#0x10]
F9400FA1 ldr x1, [fp,#0x18]
;; size=16 bbWeight=1 PerfScore 8.00
G_M62410_IG03: ;; offset=0018H
A8C27BFD ldp fp, lr, [sp],#0x20
D65F03C0 ret lr
;; size=8 bbWeight=1 PerfScore 2.00
; Total bytes of code: 32
; Method Program:Method1(ulong,short):System.ValueTuple`3[int,ubyte,long]
G_M11179_IG01: ;; offset=0000H
A9BE7BFD stp fp, lr, [sp,#-0x20]!
910003FD mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M11179_IG02: ;; offset=0008H
3DC00010 ldr q16, [x0]
3D8007B0 str q16, [fp,#0x10]
F9400BA0 ldr x0, [fp,#0x10]
F9400FA1 ldr x1, [fp,#0x18]
;; size=16 bbWeight=1 PerfScore 8.00
G_M11179_IG03: ;; offset=0018H
A8C27BFD ldp fp, lr, [sp],#0x20
D65F03C0 ret lr
;; size=8 bbWeight=1 PerfScore 2.00
; Total bytes of code: 32
for
public static (int, byte, long) Method1(void* address, int a) => VolatileRead(ref *((int, byte, long)*)address);
public static (int, byte, long) Method1(void* address, short a) => *((int, byte, long)*)address;
(feel free to inform me that I'm wrong if I am)
Just so it's clear to anyone reading, specifically what I would expect to happen with the volatile operations on non-primitives/pointers, or when it's unaligned (since volatile can also be combined with unaligned.
), is:
Implementing the above (for cases which aren't already handled like primitive types & pointers) isn't too difficult, as you need only put the applicable half barrier before/after the standard read/write code - or use the applicable store/write instruction if there's only 1 of them (or the first for write, last for read) and there's a volatile version available of that operation.
For volatile. cpblk/initblk
, on arm we are currently using dmb ish
when we could be using dmb ishst
.
Additional notes from when I investigated it last: also volatile. initblk fails when it gets inlined, and volatile. cpblk
seems to provide only a store barrier, which doesn't seem right.
Description
I'm probably the first person insane enough to try this :)
In IL code you can use the
volatile.
prefix for any ofldind
,stind
,ldfld
,stfld
,ldobj
,stobj
,initblk
,cpblk
,ldsfld
, andstsfld
(III.2.6). Notably, nowhere does it place any form of restrictions on what the type can be for APIs likeldobj
andstobj
; in fact it even specifies behaviour for atomicity, which would not make sense if it wasn't supposed to work on non-builtin typesThey do not provide atomicity, other than that guaranteed by the specification of §I.12.6.6.
(I.12.6.7).I.12.6.6 Atomic reads and writes
specifiesA conforming CLI shall guarantee that read and write access to properly aligned memory locations no larger than the native word size (the size of type native int) is atomic (see §I.12.6.2) when all the write accesses to a location are the same size.
which would mean that for custom types which are structs with multiple fields, it should just do the applicable volatile operation for each field (in an unspecified order).The point of the above is that something like:
should work with:
when
A
orint
or any other type is the generic parameter!!0
; and forA
it should generate something similar to what it would as if I'd done it directly with each field instead.To be clear, I ran into this as a result of determining that this operation would actually be useful to something I was trying to implement lock-free.
Reproduction Steps
Expected behavior
ReadVolatile1(int&)
andReadVolatile2(int&)
have the same (or very similar) codegen, and similarly withA&
. And same with Write.Actual behavior
I get the following codegen from crossgen2
```Asm ; Assembly listing for method Program:ReadVolatile1(byref):A ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M15271_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M15271_IG02: ;; offset=0008H 9000000B adrp x11, [HIGH RELOC #0x40000000004292b8] // function address 9100016B add x11, x11, [LOW RELOC #0x40000000004292b8] F9400161 ldr x1, [x11] ;; size=12 bbWeight=1 PerfScore 4.00 G_M15271_IG03: ;; offset=0014H A8C17BFD ldp fp, lr, [sp],#0x10 D61F0020 br x1 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=d3c5c458) for method Program:ReadVolatile1(byref):A ; ============================================================ ; Assembly listing for method Program:ReadVolatile2(byref):A ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 4, 4 ) byref -> x0 single-def ; V01 loc0 [V01 ] ( 3, 3 ) struct ( 8) [fp+18H] do-not-enreg[S] ld-addr-op ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ;* V03 tmp1 [V03 ] ( 0, 0 ) byref -> zero-ref "Inlining Arg" ;* V04 tmp2 [V04 ] ( 0, 0 ) byref -> zero-ref "Inlining Arg" ; V05 tmp3 [V05,T01] ( 2, 2 ) int -> [fp+18H] do-not-enreg[] V01.a(offs=0x00) P-DEP "field V01.a (fldOffset=0x0)" ; V06 tmp4 [V06,T02] ( 2, 2 ) int -> [fp+1CH] do-not-enreg[] V01.b(offs=0x04) P-DEP "field V01.b (fldOffset=0x4)" ; ; Lcl frame size = 16 G_M36164_IG01: ;; offset=0000H A9BE7BFD stp fp, lr, [sp,#-0x20]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M36164_IG02: ;; offset=0008H 88DFFC01 ldar w1, [x0] B9001BA1 str w1, [fp,#0x18] // [V05 tmp3] B9400400 ldr w0, [x0,#0x04] D50339BF dmb ishld B9001FA0 str w0, [fp,#0x1C] // [V06 tmp4] F9400FA0 ldr x0, [fp,#0x18] // [V01 loc0] ;; size=24 bbWeight=1 PerfScore 20.00 G_M36164_IG03: ;; offset=0020H A8C27BFD ldp fp, lr, [sp],#0x20 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 40, prolog size 8, PerfScore 27.50, instruction count 10, allocated bytes for code 40 (MethodHash=85ab72bb) for method Program:ReadVolatile2(byref):A ; ============================================================ ; Assembly listing for method Program:ReadVolatile1(byref):int ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M27221_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M27221_IG02: ;; offset=0008H 9000000B adrp x11, [HIGH RELOC #0x40000000004292e8] // function address 9100016B add x11, x11, [LOW RELOC #0x40000000004292e8] F9400161 ldr x1, [x11] ;; size=12 bbWeight=1 PerfScore 4.00 G_M27221_IG03: ;; offset=0014H A8C17BFD ldp fp, lr, [sp],#0x10 D61F0020 br x1 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=dc4895aa) for method Program:ReadVolatile1(byref):int ; ============================================================ ; Assembly listing for method Program:ReadVolatile2(byref):int ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M42934_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M42934_IG02: ;; offset=0008H 88DFFC00 ldar w0, [x0] ;; size=4 bbWeight=1 PerfScore 3.00 G_M42934_IG03: ;; offset=000CH A8C17BFD ldp fp, lr, [sp],#0x10 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 20, prolog size 8, PerfScore 8.50, instruction count 5, allocated bytes for code 20 (MethodHash=9e525849) for method Program:ReadVolatile2(byref):int ; ============================================================ ; Assembly listing for method Program:WriteVolatile1(byref,A) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ; V01 arg1 [V01,T01] ( 3, 3 ) struct ( 8) [fp+18H] do-not-enreg[S] single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 16 G_M5886_IG01: ;; offset=0000H A9BE7BFD stp fp, lr, [sp,#-0x20]! 910003FD mov fp, sp F9000FA1 str x1, [fp,#0x18] ;; size=12 bbWeight=1 PerfScore 2.50 G_M5886_IG02: ;; offset=000CH F9400FA1 ldr x1, [fp,#0x18] 9000000B adrp x11, [HIGH RELOC #0x4000000000429358] // function address 9100016B add x11, x11, [LOW RELOC #0x4000000000429358] F9400162 ldr x2, [x11] ;; size=16 bbWeight=1 PerfScore 6.00 G_M5886_IG03: ;; offset=001CH A8C27BFD ldp fp, lr, [sp],#0x20 D61F0040 br x2 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 36, prolog size 12, PerfScore 14.10, instruction count 9, allocated bytes for code 36 (MethodHash=7866e901) for method Program:WriteVolatile1(byref,A) ; ============================================================ ; Assembly listing for method Program:WriteVolatile2(byref,A) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 4, 4 ) byref -> x0 single-def ; V01 arg1 [V01 ] ( 4, 4 ) struct ( 8) [fp+18H] do-not-enreg[XS] addr-exposed ld-addr-op single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; V03 tmp1 [V03,T01] ( 2, 4 ) int -> x1 "Inlining Arg" ; V04 tmp2 [V04,T02] ( 2, 4 ) int -> x0 "Inlining Arg" ; ; Lcl frame size = 16 G_M11869_IG01: ;; offset=0000H A9BE7BFD stp fp, lr, [sp,#-0x20]! 910003FD mov fp, sp F9000FA1 str x1, [fp,#0x18] ;; size=12 bbWeight=1 PerfScore 2.50 G_M11869_IG02: ;; offset=000CH B9400001 ldr w1, [x0] D5033BBF dmb ish B9001BA1 str w1, [fp,#0x18] B9400400 ldr w0, [x0,#0x04] D5033BBF dmb ish B9001FA0 str w0, [fp,#0x1C] ;; size=24 bbWeight=1 PerfScore 28.00 G_M11869_IG03: ;; offset=0024H A8C27BFD ldp fp, lr, [sp],#0x20 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 44, prolog size 8, PerfScore 36.90, instruction count 11, allocated bytes for code 44 (MethodHash=597cd1a2) for method Program:WriteVolatile2(byref,A) ; ============================================================ ; Assembly listing for method Program:WriteVolatile1(byref,int) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; fully interruptible ; No PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ; V01 arg1 [V01,T01] ( 3, 3 ) int -> x1 single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M16972_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M16972_IG02: ;; offset=0008H 9000000B adrp x11, [HIGH RELOC #0x4000000000429388] // function address 9100016B add x11, x11, [LOW RELOC #0x4000000000429388] F9400162 ldr x2, [x11] ;; size=12 bbWeight=1 PerfScore 4.00 G_M16972_IG03: ;; offset=0014H A8C17BFD ldp fp, lr, [sp],#0x10 D61F0040 br x2 ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=2db1bdb3) for method Program:WriteVolatile1(byref,int) ; ============================================================ ; Assembly listing for method Program:WriteVolatile2(byref,int) ; Emitting BLENDED_CODE for generic ARM64 CPU - Windows ; ReadyToRun compilation ; optimized code ; fp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 3, 3 ) byref -> x0 single-def ; V01 arg1 [V01,T01] ( 3, 3 ) int -> x1 single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M63279_IG01: ;; offset=0000H A9BF7BFD stp fp, lr, [sp,#-0x10]! 910003FD mov fp, sp ;; size=8 bbWeight=1 PerfScore 1.50 G_M63279_IG02: ;; offset=0008H 889FFC01 stlr w1, [x0] ;; size=4 bbWeight=1 PerfScore 1.00 G_M63279_IG03: ;; offset=000CH A8C17BFD ldp fp, lr, [sp],#0x10 D65F03C0 ret lr ;; size=8 bbWeight=1 PerfScore 2.00 ; Total bytes of code 20, prolog size 8, PerfScore 6.50, instruction count 5, allocated bytes for code 20 (MethodHash=abe008d0) for method Program:WriteVolatile2(byref,int) ; ============================================================ ```as can be seen above, the
1
variants don't appear to be volatile at all. But they should all have extremely similar codegen to the matching2
variants.Regression?
Probably not :)
Known Workarounds
For specific cases, it's obviously easy enough to work around if you have access to all the fields, or if it's simply a class type or
unmanaged
, but it's basically unsolvable in the general case since gc-refs could be anywhere in it, so you can't just loop over the memory.Configuration
I got the codegen using disasmo. It's for .NET 7 on arm64 using crossgen2. I haven't got a locally built .NET 8, so I haven't checked it, but I doubt it's much better :) (it would be good if someone else was able to test this in .NET 8). I was planning to use this on a .NET 6 project I have, so much for that idea lol. It probably won't be fixed until at least 9 at this point would be my guess (it would be great if we could fix it on 6, 7, and 8 too though 😄).
Other information
I haven't tested all of the other instructions, but it could easily be an issue specific to
ldobj
andstobj
and the rest could be fine, this should be checked.