This PR fixes two issues:
1. To support storing a tensor core register tile in a swizzled shared memory layout, the implementation must use the layout to compute pointer offsets inside a BaseTile instead of computing the offsets manually.
2. Add a straightforward implementation that uses vectorized instructions to access shared memory.
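The two points can be illustrated with a minimal host-side C++ sketch. It is not the code in this PR: the names `RowMajorLayout`, `SwizzledLayout`, `store_tile`, and `load_tile`, the tile shape, and the XOR swizzle pattern are all assumptions chosen for illustration, and `std::memcpy` of one 16-byte chunk stands in for a 128-bit vectorized shared memory instruction. The point is that the copy routines only ever ask the layout for an offset, so switching to a swizzled layout needs no change to the access code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// All names and sizes below are hypothetical, not taken from the PR.
constexpr int kRows = 8;    // rows in the tile
constexpr int kCols = 64;   // 16-bit elements per row (128 bytes)
constexpr int kVecLen = 8;  // elements moved by one 128-bit access

// Plain row-major layout: offset = row * kCols + col.
struct RowMajorLayout {
    int operator()(int row, int col) const { return row * kCols + col; }
};

// Assumed swizzle: XOR the 128-bit chunk index with the row index so that
// the same chunk column of different rows lands in different positions.
struct SwizzledLayout {
    int operator()(int row, int col) const {
        const int chunk = col / kVecLen;   // which 128-bit chunk of the row
        const int within = col % kVecLen;  // element inside that chunk
        const int swizzled = chunk ^ (row % (kCols / kVecLen));
        return row * kCols + swizzled * kVecLen + within;
    }
};

// Store a row-major source tile through `layout`, one vector at a time.
template <typename Layout>
void store_tile(const Layout& layout, const std::int16_t* src,
                std::int16_t* dst) {
    for (int row = 0; row < kRows; ++row) {
        for (int col = 0; col < kCols; col += kVecLen) {
            const int offset = layout(row, col);
            assert(offset % kVecLen == 0);  // vector accesses stay aligned
            std::memcpy(dst + offset, src + row * kCols + col,
                        kVecLen * sizeof(std::int16_t));
        }
    }
}

// Load the tile back into row-major order through the same layout.
template <typename Layout>
void load_tile(const Layout& layout, const std::int16_t* src,
               std::int16_t* dst) {
    for (int row = 0; row < kRows; ++row) {
        for (int col = 0; col < kCols; col += kVecLen) {
            const int offset = layout(row, col);
            std::memcpy(dst + row * kCols + col, src + offset,
                        kVecLen * sizeof(std::int16_t));
        }
    }
}
```

Calling `store_tile(SwizzledLayout{}, src, smem)` followed by `load_tile(SwizzledLayout{}, smem, out)` should reproduce `src` in `out`, and replacing `SwizzledLayout` with `RowMajorLayout` changes only the offsets, not the copy loops.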