JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

WMMA test failure #1700

Closed maleadt closed 1 year ago

maleadt commented 1 year ago

Latest GPUCompiler enables IR verification on CI, and it immediately spotted the following bug:

device/intrinsics/wmma: Error During Test at /var/lib/buildkite-agent/builds/gpuci-8/julialang/cuda-dot-jl/test/device/intrinsics/wmma.jl:32
  Got exception outside of a @test
  LLVM error: Intrinsic has incorrect return type!
  { i32 } (i8*, i32)* @llvm.nvvm.wmma.m32n8k16.load.b.row.stride.u8.p0i8

maleadt commented 1 year ago

So let's look up the intrinsic and see what LLVM thinks:

julia> intr = Intrinsic("llvm.nvvm.wmma.m32n8k16.load.b.row.stride.u8.p0i8")
Intrinsic(6150): overloaded intrinsic

julia> ctx = Context()
Context(Ptr{LLVM.API.LLVMOpaqueContext} @0x00000000014b8160)

julia> name(intr, [LLVM.PointerType(LLVM.Int8Type(ctx), 0)])
"llvm.nvvm.wmma.m32n8k16.load.b.row.stride.u8.p0i8"

julia> LLVM.FunctionType(intr, [LLVM.PointerType(LLVM.Int8Type(ctx), 0)])
i32 (i8*, i32)

It's the struct wrapping that LLVM doesn't like: we declare the return type as { i32 }, whereas LLVM expects a bare i32. The layout should be compatible, which is probably why it hasn't resulted in any errors yet.

cc @thomasfaingnaert; thoughts on why the fragtypes are always wrapped in a struct?

thomasfaingnaert commented 1 year ago

> cc @thomasfaingnaert; thoughts on why the fragtypes are always wrapped in a struct?

Because they (normally) should be: e.g. for Volta-generation WMMA with FP16 (fragment element type = <2 x half>, fragment size = 8), the signature of the intrinsic is

{ <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half> } @llvm.nvvm.wmma.m16n16k16.load.b.col.stride.f16.p1i8(i8 addrspace(1)* %ptr, i32 %stride)

However, the u8 and i8 variants of WMMA are a bit special, as the fragment element type is not UInt8 or Int8 as one would expect, but Int32 instead. That means that for some WMMA shapes (m32n8k16 for a load.b, and m8n32k16 for a load.a), the fragment size is actually 1. I assume that LLVM doesn't wrap the return value in a struct in that case.
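To make the "fragment size is actually 1" claim concrete, here is a rough sketch of the arithmetic for the 8-bit integer variants only. It assumes that the matrix elements are spread evenly over the 32 threads of a warp and that four 8-bit values are packed into each Int32 register. The helper name `int8_fragment_size` is hypothetical and not part of CUDA.jl, and the FP16 case from the example above does not follow this simple rule.

```julia
# Hypothetical helper (not part of CUDA.jl): estimate how many Int32 registers
# each thread holds for a u8/i8 WMMA fragment of shape m x n x k.
# Assumptions: 32 threads per warp, elements evenly distributed over the warp,
# and four 8-bit values packed into one 32-bit register.
const WARP_SIZE = 32

function int8_fragment_size(matrix::Symbol, m::Integer, n::Integer, k::Integer)
    # A is m x k, B is k x n.
    rows, cols = if matrix === :a
        (m, k)
    elseif matrix === :b
        (k, n)
    else
        throw(ArgumentError("matrix must be :a or :b"))
    end
    bytes_per_thread = div(rows * cols, WARP_SIZE)
    return div(bytes_per_thread, 4)
end

int8_fragment_size(:b, 32, 8, 16)   # load.b for m32n8k16
int8_fragment_size(:a, 8, 32, 16)   # load.a for m8n32k16
int8_fragment_size(:a, 32, 8, 16)   # load.a for m32n8k16, for comparison
```

Under these assumptions, the first two calls give a single register per thread, which are exactly the two shapes mentioned above, while the third gives several registers and so would still use a struct.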

Should be relatively easy to fix, though. In https://github.com/JuliaGPU/CUDA.jl/blob/c0ba21de288a47d61cb501e525d3b7e4de53c39a/src/device/intrinsics/wmma.jl#L199, we'll need to use $frag_ty instead of $struct_ty{$frag_ty} if sz == 1, and ensure the convert(...) doesn't fail when converting T to NTuple{1, T}.
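A minimal sketch of that idea, written as standalone Julia rather than as the actual code in wmma.jl: the helper names `llvm_return_type` and `as_fragment` are hypothetical, and the real generator works with interpolated types rather than strings. It only illustrates the two pieces of the proposed fix, namely skipping the struct when sz == 1 and normalizing a bare value to a 1-tuple.

```julia
# Hypothetical helper (not part of CUDA.jl): build the LLVM return type
# for a WMMA load that yields `sz` fragment elements of LLVM type `frag_ty`.
function llvm_return_type(frag_ty::AbstractString, sz::Integer)
    sz >= 1 || throw(ArgumentError("fragment size must be positive"))
    # A single-element fragment is returned bare, not wrapped in a struct.
    sz == 1 && return String(frag_ty)
    return "{ " * join(fill(frag_ty, sz), ", ") * " }"
end

# Hypothetical helper: normalize the intrinsic's result so that callers
# always see a tuple, whether or not LLVM wrapped the value in a struct.
as_fragment(x::Tuple) = x
as_fragment(x) = (x,)

llvm_return_type("i32", 1)          # bare type for the u8 m32n8k16 load.b case
llvm_return_type("<2 x half>", 8)   # struct of eight elements for the FP16 case
as_fragment(Int32(7))               # a bare value becomes a 1-tuple
```

Normalizing at the boundary keeps the rest of the fragment-handling code unchanged, since it can keep assuming an NTuple regardless of the fragment size.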