The original example that I provided gave me no errors. In that thread, we discuss all kinds of changes that are possible, including swapping out the MMA, changing the SMEM layouts, and modifying the copy patterns.
You can inspect the Tensors/Layouts pre-partitioning and post-partitioning with `print`, `print_tensor`, and `print_latex` to sanity-check any concerns. Yes, `tile_to_shape` should be working fine (and you can verify that by inspecting the output), but we would also need to see the configuration of the `TiledMMA`s, the `TiledCopy`s, etc. There are many static assertions in the original example code to catch common incompatibilities.
Despite our best efforts to statically check everything that can be checked, out-of-bounds accesses and run-time failures can still occur. This is often due to problem sizes that are not a multiple of the tile sizes -- the example provided does not perform any predication (see the sketch below).
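The usual CuTe remedy (see the predication section of the CuTe tutorial) is to partition an identity tensor of coordinates exactly like the data and guard the copies with it. A rough device-side fragment, where `sA`, `tAgA`, `tAsA`, `ThrLayoutA`, and the `m_residue`/`k_residue` bounds are assumed names standing in for the example's partitioned tiles and residual problem extents:

```cpp
// Coordinate tensor over the SMEM tile, partitioned like the data tiles
Tensor cA   = make_identity_tensor(shape(sA));
Tensor tAcA = local_partition(cA, ThrLayoutA{}, threadIdx.x);

// Predicate tensor: true only for coordinates inside the real problem extent
Tensor tApA = make_tensor<bool>(shape(tAcA));
CUTE_UNROLL
for (int i = 0; i < size(tApA); ++i) {
  tApA(i) = elem_less(tAcA(i), make_coord(m_residue, k_residue));
}

copy_if(tApA, tAgA, tAsA);  // copy only the in-bounds elements
```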
Swizzle layouts are primarily a bank-access-pattern optimization, so if you believe the Swizzle is the problem, test for correctness without a Swizzle first. Only then optimize any SMEM bank conflicts with a swizzle pattern (and verify that the post-partitioned layouts still make sense).
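Concretely, that two-step check could look like this (a sketch; `bM`/`bK` as in the example):

```cpp
// Step 1: plain, unswizzled atom -- establish correctness with this first
auto atom_plain = Layout<Shape<_8,_64>, Stride<_64,_1>>{};
auto sA_plain   = tile_to_shape(atom_plain, make_shape(bM, bK));

// Step 2: reintroduce the swizzle, re-verify the post-partitioned layouts
// (print/print_latex), and only then measure bank conflicts
auto atom_swz = composition(Swizzle<3,3,3>{}, atom_plain);
auto sA_swz   = tile_to_shape(atom_swz, make_shape(bM, bK));
```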
When I was running the [code example](https://github.com/user-attachments/files/17388059/sgemm_sm80_tmp.txt) provided by @ccecka in another issue, I got

`cudaDeviceSynchronize(): cudaErrorLaunchFailure: unspecified launch failure`

when running under the "TN" setup. With a bit of investigation, I found that the problem seems to be caused by the `Swizzle` layout used by the code. By changing `swizzle_atom` to `composition(Swizzle<3,3,3>{}, Layout<Shape<_8,_32>, Stride<_32,_1>>{})` and `bK` to `_32`, the code runs without the previously mentioned error. However, I have not been able to find any other setup that also works. What seems strange to me is that the value of `bK` also affects the correctness of the code: when I change `bK` to `_64` while keeping `swizzle_atom` as `composition(Swizzle<3,3,3>{}, Layout<Shape<_8,_32>, Stride<_32,_1>>{})`, the code also fails. Shouldn't `tile_to_shape` handle this kind of situation automatically?
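For reference, the only configuration I found that runs cleanly looks roughly like this (identifiers as in the linked example):

```cpp
// Works: Swizzle<3,3,3> over an 8x32 row-major atom with bK = _32
auto swizzle_atom = composition(Swizzle<3,3,3>{},
                                Layout<Shape<_8,_32>, Stride<_32,_1>>{});
auto bK = Int<32>{};  // changing this to _64 brings the launch failure back
auto sA = tile_to_shape(swizzle_atom, make_shape(bM, bK));  // bM as in the example
```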