Prior to this commit, the matmul and conv2d schedules required padding of the inputs to some multiple of vscale and a final "unpadding" stage.
Instead, we can leverage predicated operations to avoid the requirement for padding. Both the transpose interleave and outer product fp32 intrinsics are updated to use predication. The `get_active_lane_mask` intrinsic is used to generate a variably sized mask of active lanes, depending on the global position the tensor intrinsic is operating on.
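For reference, the lane-mask semantics follow LLVM's `llvm.get.active.lane.mask` (Arm SVE's `whilelo`): lane `i` is active iff `base + i < limit`. A minimal Python sketch of this semantics (not TVM code):

```python
def active_lane_mask(base, limit, vl):
    """Lane i of a vl-lane vector is active iff base + i < limit,
    mirroring llvm.get.active.lane.mask / SVE whilelo."""
    return [base + i < limit for i in range(vl)]

# Example: 10 rows processed in tiles of 4 lanes. The final tile
# starts at row 8, so only the first 2 lanes are active.
print(active_lane_mask(8, 10, 4))  # [True, True, False, False]
```

This is what lets the final, partially filled tile be computed without padding the input up to a multiple of vscale.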
For now this relies on using `offset_of` and stride information from the tensor whose access we're predicating. We will likely want to build on this in the future with a more intuitive API for determining the current tile location.
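To illustrate the idea (names here are hypothetical, not the actual TVM API): given a buffer's flat element offset and its strides, the current tile's multi-dimensional position can be recovered by repeated division, which is roughly what the `offset_of`/stride-based approach provides:

```python
def tile_position(elem_offset, strides):
    """Recover multi-dimensional coordinates from a flat element offset
    and row-major strides. Hypothetical helper sketching how
    offset_of/stride information can locate the current tile."""
    coords = []
    for stride in strides:
        coords.append(elem_offset // stride)
        elem_offset %= stride
    return coords

# A (6, 8) fp32 buffer has strides (8, 1).
# Element offset 34 corresponds to row 4, column 2.
print(tile_position(34, [8, 1]))  # [4, 2]
```

The recovered coordinates are what feed the `base`/`limit` arguments of the lane mask for each tile.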
Support for batched conv2d was removed since it causes numerical issues, suspected to be due to how the current tile is determined (see the paragraph above).
~Note: this should not be merged until after https://github.com/apache/tvm/pull/17048~
cc @ekalda @Anndrey24