ROCm / triton

Development repository for the Triton language and compiler
MIT License

Replace inline assembly in commonShflSync with intrinsics #418

Closed. binarman closed this 8 months ago

binarman commented 9 months ago

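Inline assembly is opaque to the compiler: it does not take the surrounding instructions into account and so, in general, cannot avoid data hazards. Replacing the inline asm with intrinsics solves this problem, because the compiler can then see the instruction and insert the required wait states. This particular code behaved incorrectly in one of the MFMA dot tests: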

Code generated with inline assembly (ds_swizzle_b32 reads v4 immediately after the MFMA writes v[4:7]):

  v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
  ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)

Correct code generated with intrinsics (an s_nop 4 separates the MFMA from the swizzle):

  v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
  s_nop 4
  ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
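
To make the hazard in the two listings concrete, here is a minimal sketch of a toy checker in Python. It is not how the LLVM AMDGPU backend works and is not part of this PR. The helper names (`regs`, `parse`, `insert_nops`) are hypothetical, the parser assumes the first operand is the destination, and the fixed `s_nop 4` is simply copied from the listing above. Real wait-state requirements depend on the instruction pair and the target.

```python
import re


def regs(operand):
    """Expand 'v3' -> {3} and 'v[4:7]' -> {4, 5, 6, 7}; other operands -> empty set."""
    m = re.fullmatch(r"v\[(\d+):(\d+)\]", operand)
    if m:
        return set(range(int(m.group(1)), int(m.group(2)) + 1))
    m = re.fullmatch(r"v(\d+)", operand)
    return {int(m.group(1))} if m else set()


def parse(line):
    """Return (opcode, written, read). Assumption: the first operand is the destination."""
    opcode, _, rest = line.strip().partition(" ")
    operands = [op.strip() for op in rest.split(",")]
    written = regs(operands[0])
    read = set().union(*(regs(op) for op in operands[1:]))
    return opcode, written, read


def insert_nops(lines, nop="s_nop 4"):
    """Insert `nop` when an instruction reads a VGPR written by the MFMA directly before it."""
    out = []
    pending = set()  # VGPRs written by the immediately preceding MFMA, if any
    for line in lines:
        opcode, written, read = parse(line)
        if pending & read:
            out.append(nop)  # toy stand-in for the compiler's hazard handling
        out.append(line)
        pending = written if opcode.startswith("v_mfma") else set()
    return out


hazardous = [
    "v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0",
    "ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)",
]
fixed = insert_nops(hazardous)
print("\n".join(fixed))
```

Run on the first listing, the sketch reproduces the second one. With inline assembly the compiler has no comparable view of the registers the swizzle reads, which is why the nop was missing.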