Latest GPUCompiler enables IR verification on CI, and it immediately spotted a bug: the WMMA wrappers call one of the load intrinsics with a mismatched return type. So let's look up the intrinsic and see what LLVM thinks:
```julia
julia> intr = Intrinsic("llvm.nvvm.wmma.m32n8k16.load.b.row.stride.u8.p0i8")
Intrinsic(6150): overloaded intrinsic

julia> ctx = Context()
Context(Ptr{LLVM.API.LLVMOpaqueContext} @0x00000000014b8160)

julia> name(intr, [LLVM.PointerType(LLVM.Int8Type(ctx), 0)])
"llvm.nvvm.wmma.m32n8k16.load.b.row.stride.u8.p0i8"

julia> LLVM.FunctionType(intr, [LLVM.PointerType(LLVM.Int8Type(ctx), 0)])
i32 (i8*, i32)
```
It's the struct-wrapping that LLVM doesn't like (but the layout should be compatible, which is probably why it hasn't resulted in any errors so far).
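As a quick illustration of that layout point (my own sketch, not from the thread): a single-field wrapper has the same size and alignment as its bare field, so the bits line up even though the IR verifier rightly rejects the type mismatch.

```julia
# Hypothetical stand-in for a single-element fragment wrapper; the struct
# actually used by the WMMA wrappers is named differently.
struct WrappedFrag
    x::Int32
end

# Identical size and alignment to the bare element type.
@assert sizeof(WrappedFrag) == sizeof(Int32)
@assert Base.datatype_alignment(WrappedFrag) == Base.datatype_alignment(Int32)
```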
cc @thomasfaingnaert; thoughts on why the fragtypes are always wrapped in a struct?
> cc @thomasfaingnaert; thoughts on why the fragtypes are always wrapped in a struct?
Because they (normally) should be: e.g. for Volta-generation WMMA with FP16 (fragment element type = `<2 x half>`, fragment size = 8), the signature of the intrinsic is

```llvm
{ <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half> } @llvm.nvvm.wmma.m16n16k16.load.b.col.stride.f16.p1i8(i8 addrspace(1)* %ptr, i32 %stride)
```
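As a sanity check, the same LLVM.jl queries as in the session above can be pointed at this FP16 variant; this is an untested sketch, but per the signature quoted above it should report the eight-element struct as the return type.

```julia
using LLVM

ctx  = Context()
intr = Intrinsic("llvm.nvvm.wmma.m16n16k16.load.b.col.stride.f16.p1i8")

# The overload parameter is the i8 pointer in address space 1 (the `.p1i8` suffix),
# so the resulting function type should be
#   { <2 x half>, ... } (i8 addrspace(1)*, i32)
LLVM.FunctionType(intr, [LLVM.PointerType(LLVM.Int8Type(ctx), 1)])
```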
However, the `u8` and `i8` variants of WMMA are a bit special, as the fragment element type is not `UInt8` or `Int8` as one would expect, but `Int32` instead. That means that for some WMMA shapes (`m32n8k16` for a `load.b`, and `m8n32k16` for a `load.a`), the fragment size is actually 1. I assume that LLVM doesn't wrap the return value in a struct in that case.
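For what it's worth, the shape arithmetic bears that out (my own back-of-the-envelope check, not from the thread):

```julia
# For the m32n8k16 shape, the B operand is a k×n = 16×8 tile of u8 values,
# spread over one 32-thread warp and packed four to an Int32 register,
# which leaves a per-thread fragment size of 1.
k, n     = 16, 8
warpsize = 32
bytes_per_lane = (k * n * sizeof(UInt8)) ÷ warpsize   # 4 bytes per thread
frag_size      = bytes_per_lane ÷ sizeof(Int32)       # 1 Int32 register
@assert frag_size == 1
```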
Should be relatively easy to fix, though. In https://github.com/JuliaGPU/CUDA.jl/blob/c0ba21de288a47d61cb501e525d3b7e4de53c39a/src/device/intrinsics/wmma.jl#L199, we'll need to use `$frag_ty` instead of `$struct_ty{$frag_ty}` if `sz == 1`, and ensure the `convert(...)` doesn't fail when converting `T` to `NTuple{1, T}`.
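A rough, self-contained sketch of what that could look like; the helper names here are hypothetical, and the actual code generation in wmma.jl is more involved:

```julia
# Hypothetical sketch of the fix described above; `struct_ty`, `frag_ty` and `sz`
# mirror the variable names used in wmma.jl, but this is not the real code.
function fragment_return_type(struct_ty, frag_ty, sz)
    # single-element fragments are returned as a bare value, so don't wrap them
    sz == 1 ? frag_ty : :($struct_ty{$frag_ty})
end

# ... and a conversion that tolerates a bare value as well as a tuple,
# so converting T to NTuple{1, T} keeps working:
to_fragment(x::NTuple{N, T}) where {N, T} = x   # already a tuple: pass through
to_fragment(x) = (x,)                           # bare value: T -> NTuple{1, T}
```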