Closed: alyssarosenzweig closed this 1 year ago
Looks like the extension bits for AtomicDestinationDesc
were at 40:41 (0001-Atomic-op-register-extensions.patch). Were there any other extension bits you were missing?
Also, atomic threadgroup destinations were breaking because they didn't have a mask register. I did an if/else on `self.is_optional` (since it's also used only by the atomic version), but you might want to add a separate flag for it.
For floats, at least on M1 they compile to a cmpxchg loop. Maybe they have instructions for it on M2. Really makes you wonder why float atomics aren't supported on <Apple7 GPUs if they were going to polyfill it anyway (or whatever the term is for doing that to an instruction)... Edit: It's probably the `simd_fadd` that they use when you target a constant address.
For 64-bit integers, there's only one operation currently supported by Metal's frontend, `atomic_max_explicit` (or `_min`) on `device atomic_ulong*`s, and it ICEs the M1 backend compiler.
```metal
kernel void yay(uint u [[thread_position_in_grid]], device atomic_ulong* a) {
    atomic_max_explicit(a + u, 10, memory_order_relaxed);
}
```
→ Compiler encountered an internal error
Edit: MSL spec section 6.15.2.6 lists these functions and says "See the Metal Feature Set Tables to determine which GPUs support this feature." I checked the feature set tables and couldn't find any listing for the feature. Filed as FB11989038...
Also, while testing, it looks like there's some sort of lsl-like field in the `threadgroup_(load|store)` instructions, but they use 1, 2, and 4 instead of 0, 1, 2... Maybe someone should try sticking non-powers-of-two in that field and see what happens.
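One guess (mine, not confirmed anywhere in the thread) is that the field is a byte scale rather than a shift amount, which would explain seeing 1/2/4 where an lsl encoding would use 0/1/2:

```python
# Hypothetical decodes of the threadgroup_(load|store) offset field.
# If the field were a shift amount (lsl), 8/16/32-bit accesses would
# use 0, 1, 2; the observed 1, 2, 4 instead match a byte multiplier.
def offset_as_shift(base, index, field):
    return base + (index << field)   # lsl-style: field = log2(element size)

def offset_as_scale(base, index, field):
    return base + index * field      # scale-style: field = size in bytes

# For a 32-bit element at index 3 from base 0x100:
assert offset_as_shift(0x100, 3, 2) == 0x10C   # lsl encoding would store 2
assert offset_as_scale(0x100, 3, 4) == 0x10C   # scale encoding stores 4
```

A non-power-of-two value in the field would distinguish the two: a scale of 3 gives `base + 3*index`, while a shift of 3 gives `base + 8*index`.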
> Looks like the extension bits for `AtomicDestinationDesc` were at 40:41 (0001-Atomic-op-register-extensions.patch). Were there any other extension bits you were missing?
Awesome -- confirmed with Mesa. I was wondering if maybe the atomic source was split into 6-2 like the destination, but seemingly the contiguous 8 bits works (as in, tests are passing with it contiguous and high bits used, so 🤷‍♀️). Thank you!
> For floats, at least on M1 they compile to a cmpxchg loop.

Delightful 🤣 Fair enough, thanks. Makes you wonder why they bothered putting it in Metal.
> → Compiler encountered an internal error

Truly, delightful 🤣
I think Apple added some extra atomic instructions to M2, to support Nanite and float atomics. I got an unconfirmed DM that Apple added it to M2 and not other Apple 8 GPUs. More info: https://forums.unrealengine.com/t/lumen-nanite-on-macos/508411/54?u=philipturner. Not a single volunteer owned an M2 GPU and could confirm whether it worked.
> For floats, at least on M1 they compile to a cmpxchg loop. Maybe they have instructions for it on M2. ~~Really makes you wonder why float atomics aren't supported on <Apple7 GPUs, if they were going to polyfill it anyways (or whatever the term is for doing that to an instruction)...~~ Edit: It's probably the `simd_fadd` that they use when you target a constant address.

```
 0: 72051004         get_sr        r1, sr80 (thread_position_in_grid.x)
 4: 0519200e00c01200 device_load   0, i32, x, r3, u0_u1, r1, unsigned
 c: 3800             wait          0
 e: 62000000         mov_imm       r0l, 0
12: 2a8946020001     fadd32        r2, r3, 2.0
18: d528208e00d00400 atomic        cmpxchg, 0, r5, u0_u1, r1, unsigned, r2, 1, 0, 1, 20
20: 3800             wait          0
22: 92098a622c010190 icmpsel       seq, r2.cache, r5.cache, r3.discard, 1, 0
2a: 7e0dca0a8000     mov           r3, r5.discard
30: 5294c4000000     while_icmp    r0l, seq, r2l.discard, 0, 2
36: 00c0dcffffff     jmp_exec_any  0x12
3c: 521600000000     pop_exec      r0l, 2
42: 8800             stop
```
Could someone explain to me how this code works? I have trouble wrapping my head around it. So at `4` the value from `[a+u]` is being loaded into `r3`, incremented by 2 (with the result stored in `r2`) at `12`, and then we have a compare+exchange at `18`, followed by a footer where presumably a new (in case of conflict) value is loaded into `r3`, and then we go back to the addition.

But I am confused about the `cmpxchg` line. I see the address (`u0_u1, r1`) and what to write on success (`r2`), but where is the value to be compared with? I'd expect there to be a mention of `r3` somewhere, as it's the value we want to ensure is still in memory when we do the atomic swap. What am I missing?
Yeah, you're correct. The disassembly isn't quite correct for `cmpxchg`. There's a comment in the patch:

```python
# unusual, uses register pair for old/new value
3: 'cmpxchg',
```

`r2` should be disassembled as `r2_r3` (i.e. the "what to compare" register is always the next register after the "what to write on success" register).
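A minimal Python model of that register-pair convention (my own sketch, assuming the named register holds the new value and the next register up holds the compare value):

```python
# Model of the atomic cmpxchg operand convention described above: the
# instruction names one register (e.g. r2) for the value to write, and
# implicitly uses the next register (r3) as the value to compare.
def cmpxchg(memory, addr, regs, write_reg):
    compare_reg = write_reg + 1          # implicit pair: r2 -> r2_r3
    old = memory[addr]
    if old == regs[compare_reg]:
        memory[addr] = regs[write_reg]   # swap succeeds
    return old                           # lands in the dest register (r5)

mem = {0x0: 7}
regs = {2: 9, 3: 7}                      # r2 = new value, r3 = expected
assert cmpxchg(mem, 0x0, regs, 2) == 7   # matched: returns the old value
assert mem[0x0] == 9                     # ...and wrote r2
```

So `r3` does appear in the `cmpxchg` after all, just implicitly, as the second half of the `r2_r3` pair.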
Ah, I was wondering what that comment meant. Got it, thanks for explanation!
Looks like it is indeed supported on M2. Can't actually run these, but I used `metal-nt` to compile for `applegpu_g14g` and it seems to work:
```metal
#include <metal_stdlib>
using namespace metal;

kernel void yay(uint u [[thread_position_in_grid]], device atomic_ulong* a, device atomic_uint* b) {
    atomic_max_explicit(a + u, 10, memory_order_relaxed);
    atomic_max_explicit(a, 10, memory_order_relaxed);
    atomic_max_explicit(a + 1, 10, memory_order_relaxed);
    atomic_max_explicit(reinterpret_cast<device atomic_ulong*>(b + u), 1, memory_order_relaxed);
    atomic_min_explicit(reinterpret_cast<device atomic_ulong*>(b + u), 1, memory_order_relaxed);
    uint expected = 0;
    atomic_compare_exchange_weak_explicit(b + u, &expected, expected + 1, memory_order_relaxed, memory_order_relaxed);
}
```
```
 0: f2191004         get_sr   r6.cache, sr80 (thread_position_in_grid.x)
 4: fe0180098000     mov      r0.cache, u0
 a: fe0582098000     mov      r1.cache, u1
10: 620d00000000     mov_imm  r3, 0
16: 0e13c0c3a4001000 iadd     r4_r5, r0_r1.discard, r6, lsl 3
1e: 620500000000     mov_imm  r1, 0
24: 62090a000000     mov_imm  r2, 10
2a: 620101000000     mov_imm  r0, 1
30: d500088500700400 atomic   cmpxchg, 0, None, r4_r5, 0, signed, r2, 1, 0, 1, 112
38: d500008d00700400 atomic   cmpxchg, 0, None, u0_u1, 0, signed, r2, 1, 0, 1, 112
40: d500208d00700400 atomic   cmpxchg, 0, None, u0_u1, 2, signed, r2, 1, 0, 1, 112
48: d500c48e00700000 atomic   cmpxchg, 0, None, u2_u3, r6, unsigned, r0, 1, 0, 1, 112
50: d500c48e00300000 atomic   cmpxchg, 0, None, u2_u3, r6, unsigned, r0, 1, 0, 1, 48
58: d500c48e00500000 atomic   cmpxchg, 0, None, u2_u3, r6, unsigned, r0, 1, 0, 1, 80
60: 8800             stop
```
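One pattern in the listing above (my reading of the dumped values, not a confirmed decode): all four `atomic_max_explicit` calls end with a trailing immediate of 112, the `atomic_min_explicit` gets 48, and the real compare-exchange gets 80, which would fit that field selecting the 64-bit operation rather than everything being a literal cmpxchg:

```python
# Trailing immediates of the six "atomic cmpxchg" lines above, paired
# with the MSL call that produced each. The consistent mapping
# (max -> 112, min -> 48, real cmpxchg -> 80) is what suggests the
# field selects the operation; the interpretation is only a guess.
listing = [("max", 112), ("max", 112), ("max", 112),
           ("max", 112), ("min", 48), ("cmpxchg", 80)]

op_field = {}
for call, field in listing:
    op_field.setdefault(call, field)
    assert op_field[call] == field   # each op always gets the same field

assert op_field == {"max": 112, "min": 48, "cmpxchg": 80}
```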
I don't think our signed/unsigned bit is correct here; it seems to be getting used for something else.

Kind of curious what you would get if you enabled the 32-bit output register on one of these, but I don't have an M2 to try it...
We got someone with an M2 Max who might test it out soon. I'll hyperlink this thread, which will have the latest info.
> I don't think our signed/unsigned bit is correct here, it seems to be getting used for something else
Is it just that my choice of name/syntax is terrible? `signed` and `unsigned` come from the `Ou` (offset unsigned) bit, and indicate whether the preceding operand (the address offset) is sign- or zero-extended from 32-bit to 64-bit before being added to the address. Arm writes this as `LDR X0, [X0, W1, UXTW]` vs `LDR X0, [X0, W1, SXTW]`, which doesn't feel super readable, but grouping it into an address expression (unfortunately not a trivial change as currently designed) might clarify this? Edit: or maybe just `signed_offset` vs `unsigned_offset`? At least that'd be an easy change.
For the first three the offset is an immediate (the value of `Ou` doesn't matter), and for the last three the offset is unsigned (`uint u`) and correctly marked as unsigned, or am I misunderstanding?
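The distinction the `Ou` bit encodes can be illustrated with a quick sketch (my own illustration of sign vs. zero extension, not the hardware's actual datapath):

```python
# Extend a 32-bit offset to 64 bits before adding it to a 64-bit base:
# the choice the Ou (offset unsigned) bit selects.
def zero_extend32(x):                    # Ou set: UXTW-style
    return x & 0xFFFFFFFF

def sign_extend32(x):                    # Ou clear: SXTW-style
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & (1 << 31) else x

offset = 0xFFFFFFFF                      # -1 as a signed 32-bit value
base = 0x1000
assert base + sign_extend32(offset) == 0xFFF         # base - 1
assert base + zero_extend32(offset) == 0x100000FFF   # base + 4GiB - 1
```

For small non-negative offsets like `uint u` here, the two agree, which is why the immediate cases don't care about the bit.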
(@alyssarosenzweig Are you still interested in working on this, or should we just merge and others can PR fixes and improvements? It's already way better than nothing.)
> Edit: or maybe just `signed_offset` vs `unsigned_offset`? At least that'd be an easy change.
Yes, that would help. (I read it as affecting the operation itself, and was like "that doesn't make sense here")
> Kind of curious what you would get if you enabled the 32-bit output register on one of these, but I don't have an M2 to try it...
I did get someone on the UE5 forums, who has an M2-family GPU, to confirm it's supported. A long shot, but maybe we can ask them if you need to test the output register. The instruction likely doesn't output anything because it's only intended for running the Nanite algorithm.

Nanite is an algorithm for software rasterization, and it needs UInt64 atomics to implement an artificial Z-buffer. For that purpose, there's no need to see the return value. More info here.
> A long shot, but maybe we can ask them if you need to test the output register.
You'd need to be able to write your own GPU code to get the output register, either by using Asahi's driver or through this repo's hwtest setup. Metal wouldn't ever generate it, and it's probably useless since it's a 32-bit output from a 64-bit operation. I was just kind of curious, that's all.
> (@alyssarosenzweig Are you still interested in working on this, or should we just merge and others can PR fixes and improvements? It's already way better than nothing.)
I guess we can merge
The one we've all been waiting for~
Things that are missing/wrong here:
I'd appreciate it if someone else could fill in those gaps.
Closes #2