The `cp.async` feature is currently disabled in the implementation:
https://github.com/TiledTensor/TiledCUDA/blob/8ad3974e2b62b58f5009f78622d7b9adcf60df68/include/cell/traits/gemm.hpp#L60

This is because CuTe's `TiledCopy` fails to compile when the `Layout` is created with runtime values:
https://github.com/NVIDIA/cutlass/blob/033d9efd2db0bbbcf3b3b0650acde6c472f3948e/include/cute/atom/copy_traits.hpp#L122-L125

The static check at that location fails with `"Copy_Traits: src failed to vectorize into registers. Layout is incompatible with this CopyOp."`

However, I am wondering whether this is a case of CuTe's check being overly strict. I have commented out this compile-time static check, and doing so does not affect the correctness of the implementation.
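To illustrate why a compile-time check can reject a layout whose strides are only known at run time, here is a minimal, self-contained C++ sketch. It does not use CuTe: `Int`, `copy16_static`, `copy16_dynamic`, and `kElems` are hypothetical names, a 16-byte `std::memcpy` stands in for a single 16-byte `cp.async`, and it assumes that vectorization requires a unit stride. With a compile-time stride, contiguity can be enforced by `static_assert`; with a runtime stride the compiler cannot prove it, so the sketch checks it at run time and falls back to element-wise copies.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical compile-time integer, similar in spirit to cute::Int<N>.
template <int N>
using Int = std::integral_constant<int, N>;

// 8 x 16-bit elements = 16 bytes, i.e. one 128-bit chunk.
constexpr int kElems = 8;

// Static stride: contiguity is checked at compile time, so a non-unit
// stride is rejected before the program is built.
template <int Stride>
void copy16_static(const std::uint16_t* src, std::uint16_t* dst, Int<Stride>) {
  static_assert(Stride == 1,
                "src failed to vectorize: stride must be 1 at compile time");
  // Stands in for one 16-byte cp.async (assumption, not real cp.async).
  std::memcpy(dst, src, kElems * sizeof(std::uint16_t));
}

// Runtime stride: the compiler cannot prove contiguity, so the check moves
// to run time. Returns true if the vectorized (16-byte) path was taken.
bool copy16_dynamic(const std::uint16_t* src, std::uint16_t* dst, int stride) {
  if (stride == 1) {
    std::memcpy(dst, src, kElems * sizeof(std::uint16_t));
    return true;
  }
  // Scalar fallback for non-contiguous source elements.
  for (int i = 0; i < kElems; ++i) {
    dst[i] = src[i * stride];
  }
  return false;
}
```

Called with `Int<1>{}`, `copy16_static` compiles, while `Int<2>{}` is rejected at compile time. `copy16_dynamic` accepts any stride and takes the 16-byte path only when the runtime stride turns out to be 1. Keeping a runtime check like this, rather than removing the static one entirely, makes the contiguity assumption explicit.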