Closed naoyam closed 2 years ago
We assumed for `ParallelType::Vectorization` that the inner-most dimension is evenly divisible by the vector size. It should fail at this runtime check. `ParallelType::MisalignedVectorization` should handle the remaining elements.
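For readers unfamiliar with the pattern, here is a minimal host-side sketch of what "handling the remaining elements" means: full vector-width groups are copied first, then the leftover elements are copied one at a time. The helper `copyWithTail` is hypothetical and only illustrates the idea; it is not the code that `ParallelType::MisalignedVectorization` actually generates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not nvfuser code: copies `src` into `dst` in groups of
// `vec_size` elements, then finishes the leftover elements one at a time.
// Returns the number of elements written.
inline std::size_t copyWithTail(const std::vector<float>& src,
                                std::vector<float>& dst,
                                std::size_t vec_size) {
  assert(vec_size > 0);
  assert(dst.size() >= src.size());
  const std::size_t n = src.size();
  // Index one past the last element that belongs to a full group.
  const std::size_t vec_end = (n / vec_size) * vec_size;

  // Main body: each outer iteration stands in for one vector-wide load/store.
  for (std::size_t i = 0; i < vec_end; i += vec_size) {
    for (std::size_t j = 0; j < vec_size; ++j) {
      dst[i + j] = src[i + j];
    }
  }
  // Tail: scalar copies for the n % vec_size elements that do not fill a vector.
  for (std::size_t i = vec_end; i < n; ++i) {
    dst[i] = src[i];
  }
  return n;
}
```

With 7 elements and a vector size of 4, the main loop copies indices 0 to 3 and the tail loop copies indices 4 to 6.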
Updated the test kernel with its output. As far as I can see, the validation does not detect it. I'll look into it.
The runtime check only looks at the input and output tensors. If the tensor is an intermediate or lives in shared memory, it isn't checked.
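To make the gap concrete, here is a rough sketch of a divisibility check that can be restricted to fusion inputs and outputs. Everything in it (`TensorDesc`, `findNonDivisible`, the `io_only` flag) is a made-up stand-in for illustration and does not mirror the real validation code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of a tensor taking part in a vectorized access.
struct TensorDesc {
  std::string name;
  std::int64_t inner_extent;  // size of the inner-most dimension
  bool is_fusion_io;          // true for fusion inputs/outputs only
  bool is_vectorized;         // true if accessed with vector loads/stores
};

// Returns the names of vectorized tensors whose inner-most extent is not a
// multiple of `vec_size`. With `io_only` set, intermediates (for example
// shared-memory buffers) are skipped, which models the gap described above.
inline std::vector<std::string> findNonDivisible(
    const std::vector<TensorDesc>& tensors,
    std::int64_t vec_size,
    bool io_only) {
  assert(vec_size > 0);
  std::vector<std::string> bad;
  for (const TensorDesc& t : tensors) {
    if (!t.is_vectorized) {
      continue;
    }
    if (io_only && !t.is_fusion_io) {
      continue;  // intermediate tensor: never inspected in this mode
    }
    if (t.inner_extent % vec_size != 0) {
      bad.push_back(t.name);
    }
  }
  return bad;
}
```

In this sketch, a vectorized shared-memory tensor with an extent of 7 passes when `io_only` is true and is only reported when `io_only` is false.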
Vector load from SMEM seems to have a problem.
The generated kernel:
Test output:
The vectorized load from `T1` to `T2` works fine when `T0.size[0]` is divisible by 4, but otherwise the load is not done for the last remaining elements.

This doesn't seem like a new problem, but I don't remember exactly whether it was supposed to have been fixed or is still a known problem.
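As a quick way to reason about which elements are affected, the arithmetic can be sketched as below. `unloadedRange` is a hypothetical helper, and it assumes the vectorized loop covers only full groups of `vec_size` elements, which matches the symptom reported here but is not taken from the generated kernel.

```cpp
#include <cassert>
#include <cstddef>

// Half-open index range [begin, end).
struct IndexRange {
  std::size_t begin;
  std::size_t end;
};

// Hypothetical helper: given the inner-most extent and the vector width,
// returns the indices that a full-groups-only vectorized loop never touches.
// The range is empty (begin == end) when the extent is evenly divisible.
inline IndexRange unloadedRange(std::size_t extent, std::size_t vec_size) {
  assert(vec_size > 0);
  const std::size_t remainder = extent % vec_size;
  return IndexRange{extent - remainder, extent};
}
```

For an extent of 7 and a vector width of 4, the range is [4, 7), so the last three elements are the ones left unloaded.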