Closed: alyssarosenzweig closed this 1 year ago
Looks like the extension bits for AtomicDestinationDesc
were at 40:41 (0001-Atomic-op-register-extensions.patch). Were there any other extension bits you were missing?
Also, atomic threadgroup destinations were breaking because they didn't have a mask register. I did an if/else on `self.is_optional` (since it's also used only by the atomic version), but you might want to add a separate flag for it.
For floats, at least on M1 they compile to a cmpxchg loop. Maybe they have instructions for it on M2. Really makes you wonder why float atomics aren't supported on <Apple7 GPUs if they were going to polyfill it anyway (or whatever the term is for doing that to an instruction)... Edit: It's probably the `simd_fadd` that they use when you target a constant address.
For 64-bit integers, there's only one operation currently supported by Metal's frontend, `atomic_max_explicit` (or `_min`) on `device atomic_ulong*`s, and it ICEs the M1 backend compiler.
```metal
kernel void yay(uint u [[thread_position_in_grid]], device atomic_ulong* a) {
    atomic_max_explicit(a + u, 10, memory_order_relaxed);
}
```
→ Compiler encountered an internal error
Edit: MSL spec section 6.15.2.6 lists these functions and says "See the Metal Feature Set Tables to determine which GPUs support this feature." I checked the feature set tables and couldn't find any listing for the feature. Filed as FB11989038...
Also, while testing, it looks like there's some sort of lsl-like field in the `threadgroup_(load|store)` instructions, but they use 1, 2, and 4 instead of 0, 1, 2... Maybe someone should try sticking non-powers-of-two in that field and see what happens.
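One guess (mine, not confirmed anywhere in the thread) is that the field is a byte scale rather than a shift amount, which would explain seeing 1/2/4 where an lsl encoding would use 0/1/2:

```python
# Hypothetical decodes of the threadgroup_(load|store) offset field.
# If the field were a shift amount (lsl), 8/16/32-bit accesses would
# use 0, 1, 2; the observed 1, 2, 4 instead match a byte multiplier.
def offset_as_shift(base, index, field):
    return base + (index << field)   # lsl-style: field = log2(element size)

def offset_as_scale(base, index, field):
    return base + index * field      # scale-style: field = size in bytes

# For a 32-bit element at index 3 from base 0x100:
assert offset_as_shift(0x100, 3, 2) == 0x10C   # lsl encoding would store 2
assert offset_as_scale(0x100, 3, 4) == 0x10C   # scale encoding stores 4
```

A non-power-of-two value in the field would distinguish the two: a scale of 3 gives `base + 3*index`, while a shift of 3 gives `base + 8*index`.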
> Looks like the extension bits for `AtomicDestinationDesc` were at 40:41 (0001-Atomic-op-register-extensions.patch). Were there any other extension bits you were missing?
Awesome -- confirmed with Mesa. I was wondering if maybe the atomic source was split into 6-2 like the destination, but seemingly the contiguous 8 bits works (as in, tests are passing with it contiguous and high bits used, so 🤷‍♀️). Thank you!
> For floats, at least on M1 they compile to a cmpxchg loop.

Delightful 🤣 Fair enough, thanks. Makes you wonder why they bothered putting it in Metal.
> → Compiler encountered an internal error

Truly, delightful 🤣
I think Apple added some extra atomic instructions to M2, to support Nanite and float atomics. I got an unconfirmed DM that Apple added it to M2 and not other Apple 8 GPUs. More info: https://forums.unrealengine.com/t/lumen-nanite-on-macos/508411/54?u=philipturner. Not a single volunteer owned an M2 GPU and could confirm whether it worked.
> For floats, at least on M1 they compile to a cmpxchg loop. Maybe they have instructions for it on M2. ~~Really makes you wonder why float atomics aren't supported on <Apple7 GPUs, if they were going to polyfill it anyways (or whatever the term is for doing that to an instruction)...~~ Edit: It's probably the `simd_fadd` that they use when you target a constant address.

```
 0: 72051004         get_sr        r1, sr80 (thread_position_in_grid.x)
 4: 0519200e00c01200 device_load   0, i32, x, r3, u0_u1, r1, unsigned
 c: 3800             wait          0
 e: 62000000         mov_imm       r0l, 0
12: 2a8946020001     fadd32        r2, r3, 2.0
18: d528208e00d00400 atomic        cmpxchg, 0, r5, u0_u1, r1, unsigned, r2, 1, 0, 1, 20
20: 3800             wait          0
22: 92098a622c010190 icmpsel       seq, r2.cache, r5.cache, r3.discard, 1, 0
2a: 7e0dca0a8000     mov           r3, r5.discard
30: 5294c4000000     while_icmp    r0l, seq, r2l.discard, 0, 2
36: 00c0dcffffff     jmp_exec_any  0x12
3c: 521600000000     pop_exec      r0l, 2
42: 8800             stop
```
Could someone explain to me how this code works? I have trouble wrapping my head around it. So at `4` the value from `[a+u]` is being loaded into `r3`, incremented by 2 (with the result stored in `r2`) at `12`, and then we have a compare+exchange at `18`, followed by a footer where presumably a new (in case of conflict) value is loaded into `r3`, and then we go back to the addition.

But I am confused about the `cmpxchg` line. I see the address (`u0_u1, r1`) and what to write on success (`r2`), but where is the value to be compared with? I'd expect there to be a mention of `r3` somewhere, as it's the value we want to ensure is still in memory when we do the atomic swap. What am I missing?
Yeah, you're correct. The disassembly isn't quite correct for `cmpxchg`. There's a comment in the patch:

```python
# unusual, uses register pair for old/new value
3: 'cmpxchg',
```

`r2` should be disassembled as `r2_r3` (i.e. the "what to compare" register is always the next register after the "what to write on success" register).
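A minimal Python model of that register-pair convention (my own sketch, assuming the named register holds the new value and the next register up holds the compare value):

```python
# Model of the atomic cmpxchg operand convention described above: the
# instruction names one register (e.g. r2) for the value to write, and
# implicitly uses the next register (r3) as the value to compare.
def cmpxchg(memory, addr, regs, write_reg):
    compare_reg = write_reg + 1          # implicit pair: r2 -> r2_r3
    old = memory[addr]
    if old == regs[compare_reg]:
        memory[addr] = regs[write_reg]   # swap succeeds
    return old                           # lands in the dest register (r5)

mem = {0x0: 7}
regs = {2: 9, 3: 7}                      # r2 = new value, r3 = expected
assert cmpxchg(mem, 0x0, regs, 2) == 7   # matched: returns the old value
assert mem[0x0] == 9                     # ...and wrote r2
```

So `r3` does appear in the `cmpxchg` after all, just implicitly, as the second half of the `r2_r3` pair.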
Ah, I was wondering what that comment meant. Got it, thanks for explanation!
Looks like it is indeed supported on M2. Can't actually run these, but I used `metal-nt` to compile for `applegpu_g14g` and it seems to work:
```metal
#include <metal_stdlib>
using namespace metal;

kernel void yay(uint u [[thread_position_in_grid]], device atomic_ulong* a, device atomic_uint* b) {
    atomic_max_explicit(a + u, 10, memory_order_relaxed);
    atomic_max_explicit(a, 10, memory_order_relaxed);
    atomic_max_explicit(a + 1, 10, memory_order_relaxed);
    atomic_max_explicit(reinterpret_cast<device atomic_ulong*>(b + u), 1, memory_order_relaxed);
    atomic_min_explicit(reinterpret_cast<device atomic_ulong*>(b + u), 1, memory_order_relaxed);
    uint expected = 0;
    atomic_compare_exchange_weak_explicit(b + u, &expected, expected + 1, memory_order_relaxed, memory_order_relaxed);
}
```
```
 0: f2191004         get_sr   r6.cache, sr80 (thread_position_in_grid.x)
 4: fe0180098000     mov      r0.cache, u0
 a: fe0582098000     mov      r1.cache, u1
10: 620d00000000     mov_imm  r3, 0
16: 0e13c0c3a4001000 iadd     r4_r5, r0_r1.discard, r6, lsl 3
1e: 620500000000     mov_imm  r1, 0
24: 62090a000000     mov_imm  r2, 10
2a: 620101000000     mov_imm  r0, 1
30: d500088500700400 atomic   cmpxchg, 0, None, r4_r5, 0, signed, r2, 1, 0, 1, 112
38: d500008d00700400 atomic   cmpxchg, 0, None, u0_u1, 0, signed, r2, 1, 0, 1, 112
40: d500208d00700400 atomic   cmpxchg, 0, None, u0_u1, 2, signed, r2, 1, 0, 1, 112
48: d500c48e00700000 atomic   cmpxchg, 0, None, u2_u3, r6, unsigned, r0, 1, 0, 1, 112
50: d500c48e00300000 atomic   cmpxchg, 0, None, u2_u3, r6, unsigned, r0, 1, 0, 1, 48
58: d500c48e00500000 atomic   cmpxchg, 0, None, u2_u3, r6, unsigned, r0, 1, 0, 1, 80
60: 8800             stop
```
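One pattern in the listing above (my reading of the dumped values, not a confirmed decode): all four `atomic_max_explicit` calls end with a trailing immediate of 112, the `atomic_min_explicit` gets 48, and the real compare-exchange gets 80, which would fit that field selecting the 64-bit operation rather than everything being a literal cmpxchg:

```python
# Trailing immediates of the six "atomic cmpxchg" lines above, paired
# with the MSL call that produced each. The consistent mapping
# (max -> 112, min -> 48, real cmpxchg -> 80) is what suggests the
# field selects the operation; the interpretation is only a guess.
listing = [("max", 112), ("max", 112), ("max", 112),
           ("max", 112), ("min", 48), ("cmpxchg", 80)]

op_field = {}
for call, field in listing:
    op_field.setdefault(call, field)
    assert op_field[call] == field   # each op always gets the same field

assert op_field == {"max": 112, "min": 48, "cmpxchg": 80}
```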
I don't think our signed/unsigned bit is correct here; it seems to be getting used for something else.

Kind of curious what you would get if you enabled the 32-bit output register on one of these, but I don't have an M2 to try it...
We got someone with an M2 Max who might test it out soon. I'll hyperlink this thread, which will have the latest info.
> I don't think our signed/unsigned bit is correct here, it seems to be getting used for something else
Is it just that my choice of name/syntax is terrible? `signed` and `unsigned` come from the `Ou` (offset unsigned) bit, and indicate whether the preceding operand (the address offset) is sign- or zero-extended from 32-bit to 64-bit before being added to the address. Arm writes this as `LDR X0, [X0, W1, UXTW]` vs `LDR X0, [X0, W1, SXTW]`, which doesn't feel super readable, but grouping it into an address expression (unfortunately not a trivial change as currently designed) might clarify this? Edit: or maybe just `signed_offset` vs `unsigned_offset`? At least that'd be an easy change.
For the first three the offset is an immediate (the value of `Ou` doesn't matter), and for the last three the offset is unsigned (`uint u`) and correctly marked as unsigned, or am I misunderstanding?
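The distinction the `Ou` bit encodes can be illustrated with a quick sketch (my own illustration of sign vs. zero extension, not the hardware's actual datapath):

```python
# Extend a 32-bit offset to 64 bits before adding it to a 64-bit base:
# the choice the Ou (offset unsigned) bit selects.
def zero_extend32(x):                    # Ou set: UXTW-style
    return x & 0xFFFFFFFF

def sign_extend32(x):                    # Ou clear: SXTW-style
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & (1 << 31) else x

offset = 0xFFFFFFFF                      # -1 as a signed 32-bit value
base = 0x1000
assert base + sign_extend32(offset) == 0xFFF         # base - 1
assert base + zero_extend32(offset) == 0x100000FFF   # base + 4GiB - 1
```

For small non-negative offsets like `uint u` here, the two agree, which is why the immediate cases don't care about the bit.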
(@alyssarosenzweig Are you still interested in working on this, or should we just merge and others can PR fixes and improvements? It's already way better than nothing.)
> Edit: or maybe just `signed_offset` vs `unsigned_offset`? At least that'd be an easy change.
Yes, that would help. (I read it as affecting the operation itself, and was like "that doesn't make sense here")
> Kind of curious what you would get if you enabled the 32-bit output register on one of these, but I don't have an M2 to try it...
I did get someone on the UE5 forums, who has an M2-family GPU, to confirm it's supported. A long shot, but maybe we can ask them if you need to test the output register. The instruction likely doesn't output anything because it's only intended for running the Nanite algorithm.

Nanite is an algorithm for software rasterization, and it needs UInt64 atomics to implement an artificial Z-buffer. For that purpose, there's no need to see the return value. More info here.
> A long shot, but maybe we can ask them if you need to test the output register.
You'd need to be able to write your own GPU code to get the output register, either by using Asahi's driver or through this repo's hwtest setup. Metal wouldn't ever generate it, and it's probably useless since it's a 32-bit output from a 64-bit operation. I was just kind of curious, that's all.
> (@alyssarosenzweig Are you still interested in working on this, or should we just merge and others can PR fixes and improvements? It's already way better than nothing.)
I guess we can merge
The one we've all been waiting for~
Things that are missing/wrong here:
I'd appreciate it if someone else could fill in those gaps.
Closes #2