Previously, sliced ops were placed next to the sliced dot operation. This can change the execution frequency of the sliced ops when they are not in the same block as the dot op. For example, a flash attention kernel loads the `Q` tensor outside the dot loop; dot slicing should not move the sliced loads inside the loop. This change fixes that by placing the sliced ops next to the original op for such out-of-place ops. While we may lose the benefit of instruction reordering for ops in different blocks of the same loop, a later reordering pass should be able to recover it.
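
For context, a minimal Triton-style sketch of the loop structure in question (kernel name, shapes, and layout are illustrative, not from this PR; masks and the online-softmax logic of real flash attention are omitted for brevity):

```python
import triton
import triton.language as tl

@triton.jit
def _attn_like_kernel(q_ptr, k_ptr, out_ptr, N,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                      BLOCK_D: tl.constexpr):
    offs_m = tl.program_id(0) * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_d = tl.arange(0, BLOCK_D)
    # The Q tile is loaded once, OUTSIDE the loop over K blocks.
    q = tl.load(q_ptr + offs_m[:, None] * BLOCK_D + offs_d[None, :])
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for start_n in range(0, N, BLOCK_N):
        offs_n = start_n + tl.arange(0, BLOCK_N)
        k = tl.load(k_ptr + offs_d[:, None] * N + offs_n[None, :])
        # This dot is the op being sliced. Placing the sliced pieces of
        # the q load next to the sliced dots would sink the load into
        # the loop, re-loading Q on every iteration instead of once.
        acc += tl.dot(q, k)
    offs_out = offs_m[:, None] * BLOCK_N + tl.arange(0, BLOCK_N)[None, :]
    tl.store(out_ptr + offs_out, acc)
```

With this change, the sliced pieces of the `q` load stay anchored to the original load before the loop, so its execution frequency is preserved.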