jacobhinkle opened 6 days ago
After this, we can actually generate a proper kernel and run it. I will rebase #3406 onto this and modify the test to compile and run in that PR so we can inspect the generated kernel there. We can keep this PR for discussing the inlining changes only.
Does this only apply to broadcast IDs added by `TensorView::broadcast()`?
Yes, that's the intention. I am using `tv->domain()->additionalIDs()`, which I think contains only those broadcasts?
Yes. @zasdfgbnm, when you added this, were you thinking about having non-broadcast IDs in `additional_ids_`?
To be safe I'll check the IterType when skipping.
No, I added it primarily for storing these new broadcasts.
In the latest pushed changes, I do a BFS from the producer's logical domain to its allocation domain and from the consumer's root domain to its loop domain. This collects the IDs that are used for indexing (assuming no shorter paths are discovered later). I then restrict the `strictAreMapped` check to the case where at least one of the producer or consumer IDs is on that path. That covers loop broadcasts automatically, since they are not used for indexing, and it lets us inline around them when they appear at the same position as another ID that is not used in indexing that particular producer, as is the case for the MMA use case I have in mind.
Stacked on #3414
This PR enables us to inline an MmaOp properly when its inputs are missing broadcast dimensions. We do this by always allowing inlining past loop broadcasts or their transforms. For example
As long as the operation `foo` properly maps its arguments despite the missing logical dimensions (as `MmaOp` does as of #3391), we should be able to fully inline this case, because the loop broadcasts `bS5` and `bS6` are imaginary in the sense that they don't impact indexing.