The original example that I provided gave me no errors. In that thread, we discuss all kinds of changes that are possible, including swapping out the MMA, changing the SMEM layouts, and modifying the copy patterns.
You can inspect the Tensors/Layouts pre-partitioning and post-partitioning with `print`, `print_tensor`, and `print_latex` to sanity-check any concerns. Yes, `tile_to_shape` should be working fine (and you can verify that by inspecting the output), but we would also need to see the configuration of the `TiledMMA`s, the `TiledCopy`s, etc. There are many static assertions in the original example code to catch common incompatibilities.
Despite our best efforts to statically check everything that can be checked, out-of-bounds accesses and run-time failures can still occur. This is often due to problem sizes that are not a multiple of the tile sizes -- the example provided does not perform any predication (see the sketch below).
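The usual CuTe remedy (see the predication section of the CuTe tutorial) is to partition an identity tensor of coordinates exactly like the data and guard the copies with it. A rough device-side fragment, where `sA`, `tAgA`, `tAsA`, `ThrLayoutA`, and the `m_residue`/`k_residue` bounds are assumed names standing in for the example's partitioned tiles and residual problem extents:

```cpp
// Coordinate tensor over the SMEM tile, partitioned like the data tiles
Tensor cA   = make_identity_tensor(shape(sA));
Tensor tAcA = local_partition(cA, ThrLayoutA{}, threadIdx.x);

// Predicate tensor: true only for coordinates inside the real problem extent
Tensor tApA = make_tensor<bool>(shape(tAcA));
CUTE_UNROLL
for (int i = 0; i < size(tApA); ++i) {
  tApA(i) = elem_less(tAcA(i), make_coord(m_residue, k_residue));
}

copy_if(tApA, tAgA, tAsA);  // copy only the in-bounds elements
```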
Swizzle layouts are primarily a bank-access-pattern optimization, so if you believe the Swizzle is the problem, test for correctness without a Swizzle first. Only then optimize any SMEM bank conflicts with a swizzle pattern (and verify that the post-partitioned layouts still make sense).
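Concretely, that two-step check could look like this (a sketch; `bM`/`bK` as in the example):

```cpp
// Step 1: plain, unswizzled atom -- establish correctness with this first
auto atom_plain = Layout<Shape<_8,_64>, Stride<_64,_1>>{};
auto sA_plain   = tile_to_shape(atom_plain, make_shape(bM, bK));

// Step 2: reintroduce the swizzle, re-verify the post-partitioned layouts
// (print/print_latex), and only then measure bank conflicts
auto atom_swz = composition(Swizzle<3,3,3>{}, atom_plain);
auto sA_swz   = tile_to_shape(atom_swz, make_shape(bM, bK));
```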
When I was running the [code example](https://github.com/user-attachments/files/17388059/sgemm_sm80_tmp.txt) provided by @ccecka in another issue, I got

`cudaDeviceSynchronize(): cudaErrorLaunchFailure: unspecified launch failure`

when running under the "TN" setup. With a bit of investigation, I found that the problem seems to be caused by the `Swizzle` layout used by the code. By changing `swizzle_atom` to `composition(Swizzle<3,3,3>{}, Layout<Shape<_8,_32>, Stride<_32,_1>>{})` and `bK` to `_32`, the code runs without the previously mentioned error. However, I have not been able to find any other setup that also works. What seems strange to me is that the value of `bK` also affects the correctness of the code: when I change `bK` to `_64` while keeping `swizzle_atom` as `composition(Swizzle<3,3,3>{}, Layout<Shape<_8,_32>, Stride<_32,_1>>{})`, the code also fails. Shouldn't `tile_to_shape` handle this kind of situation automatically?
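For reference, the only configuration I found that runs cleanly looks roughly like this (identifiers as in the linked example):

```cpp
// Works: Swizzle<3,3,3> over an 8x32 row-major atom with bK = _32
auto swizzle_atom = composition(Swizzle<3,3,3>{},
                                Layout<Shape<_8,_32>, Stride<_32,_1>>{});
auto bK = Int<32>{};  // changing this to _64 brings the launch failure back
auto sA = tile_to_shape(swizzle_atom, make_shape(bM, bK));  // bM as in the example
```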