Prior to this commit, the matmul and conv2d schedules required padding of the inputs to some multiple of vscale and a final "unpadding" stage.
Instead, we can leverage predicated operations to avoid the requirement for padding. Both the transpose interleave and outer product fp32 intrinsics are updated to use predication. The `get_active_lane_mask` intrinsic is used to generate a variably sized mask of active lanes, depending on the global position the tensor intrinsic is operating on.
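For reference, the lane-mask semantics follow LLVM's `llvm.get.active.lane.mask` (Arm SVE's `whilelo`): lane `i` is active iff `base + i < limit`. A minimal Python sketch of this semantics (not TVM code):

```python
def active_lane_mask(base, limit, vl):
    """Lane i of a vl-lane vector is active iff base + i < limit,
    mirroring llvm.get.active.lane.mask / SVE whilelo."""
    return [base + i < limit for i in range(vl)]

# Example: 10 rows processed in tiles of 4 lanes. The final tile
# starts at row 8, so only the first 2 lanes are active.
print(active_lane_mask(8, 10, 4))  # [True, True, False, False]
```

This is what lets the final, partially filled tile be computed without padding the input up to a multiple of vscale.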
For now this relies on using `offset_of` and stride information from the tensor whose access we're predicating. We will likely want to build on this in the future with a more intuitive API for determining the current tile location.
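To illustrate the idea (names here are hypothetical, not the actual TVM API): given a buffer's flat element offset and its strides, the current tile's multi-dimensional position can be recovered by repeated division, which is roughly what the `offset_of`/stride-based approach provides:

```python
def tile_position(elem_offset, strides):
    """Recover multi-dimensional coordinates from a flat element offset
    and row-major strides. Hypothetical helper sketching how
    offset_of/stride information can locate the current tile."""
    coords = []
    for stride in strides:
        coords.append(elem_offset // stride)
        elem_offset %= stride
    return coords

# A (6, 8) fp32 buffer has strides (8, 1).
# Element offset 34 corresponds to row 4, column 2.
print(tile_position(34, [8, 1]))  # [4, 2]
```

The recovered coordinates are what feed the `base`/`limit` arguments of the lane mask for each tile.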
Support for batched conv2d was removed since it causes numerical issues, suspected to be due to how the current tile is determined (see the paragraph above).
~Note: this should not be merged until after https://github.com/apache/tvm/pull/17048~
cc @ekalda @Anndrey24