Open naoyam opened 9 months ago
Tagging: @zasdfgbnm, @jacobhinkle
@zasdfgbnm, @jacobhinkle, please let me know if it's clear to you what the problem is. My take for now: it's an interesting problem to address, but we would need a clear need, as it would require non-trivial work.
Also tagging for viz: @wujingyue
Should the blue lines from IDs 9->11->13 be yellow instead or does that matter? I.e. does the broadcast need to be concretized in `t6` and `t7` for us to encounter this?
Also just for clarification: we think that if the scheduler manages to inline t1 at position 2 and t2 at position 1, then the rest of our machinery will work and produce the ideal loop nest you showed first, right? Stated another way: do we think that currently this is only a limitation in how we set inlining in `inlineMost`, or is it more insidious?
Since this seems to be an issue with missing broadcast domains due to this squeeze+broadcast pattern, would it be possible to work around this by doing a flood fill of consumers of squeeze to find possible broadcasts that re-introduce the squeeze, and bypassing the squeeze for those branches (in case we need a quick fix)?
> Should the blue lines from IDs 9->11->13 be yellow instead or does that matter? I.e. does the broadcast need to be concretized in `t6` and `t7` for us to encounter this?
It should be a concretized domain. If not, we can just reorder the broadcast domains to the inner position and ignore them. That would allow `t1` to inline the non-broadcast domain, which should be sufficient.
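A minimal sketch of that reordering idea, assuming the public nvFuser scheduling API (`reorder`/`inlineAt`); the two-domain layout of `t1` here is an illustration, not the issue's actual tensors:

```cpp
// Suppose t1 is [bS1{1}, iS0{N}], where bS1 is the unconcretized broadcast.
// Move the broadcast domain to the inner position so it can be ignored:
t1->reorder({{0, 1}, {1, 0}}); // t1 becomes [iS0, bS1]
// Inline only the non-broadcast domain:
t1->inlineAt(1);
```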
> Also just for clarification: we think that if the scheduler manages to inline `t1` at position 2 and `t2` at position 1, then the rest of our machinery will work and produce the ideal loop nest you showed first, right? Stated another way: do we think that currently this is only a limitation in how we set inlining in `inlineMost`, or is it more insidious?
I believe you're right, but since we have never managed to inline tensors like `t1` at position 2 yet, there may be other things failing as well.
> Since this seems to be an issue with missing broadcast domains due to this squeeze+broadcast pattern, would it be possible to work around this by doing a flood fill of consumers of squeeze to find possible broadcasts that re-introduce the squeeze, and bypassing the squeeze for those branches (in case we need a quick fix)?
I suppose we don't need a quick fix, but I've started liking the approach of removing broadcast domains as advocated by @zasdfgbnm. I think in this case we could first just remove the broadcast domains and schedule all the tensors without them; `MaxPosCalculator` should then be able to figure out the ideal inlining positions. We would then add back the necessary broadcast domains when we actually inline tensors.

I feel this approach would be better than extending `MaxPosCalculator`.
I think I understand the problem, but I am not sure if I understand the proposed solution. If instead of having

```cpp
t2 = squeeze(t1);
t3 = broadcast(t2);
```

we had

```cpp
t2 = squeeze(t1);
t2_1 = set(t2);
t2_2 = set(t2_1);
t2_3 = set(t2_2);
t2_4 = set(t2_3);
t3 = broadcast(t2_4);
```

will this solution correctly handle this?
> Should the blue lines from IDs 9->11->13 be yellow instead or does that matter?
@jacobhinkle I think the colors here are just tags for different disjoint sets; they don't indicate whether a domain is a broadcast or not.
> I think in this case we could first just remove the broadcast domains and schedule all the tensors without them; `MaxPosCalculator` should then be able to figure out the ideal inlining positions.
This approach will also fix https://github.com/NVIDIA/Fuser/issues/1759, because forwarding would not exist at all!
If we needed to urgently fix this, the best workaround that I can think of is to detect this very specific pattern and reject it in `canSchedule` to force segmentation.
If not urgent, I still think the best approach is to make my statement

> What is important is the concretization of broadcasting. All the history of broadcasting IterDomains (IDs) before concretization is not relevant, and I am not interested in inspecting it. In particular, I don't care which broadcasting ID is mapped to which broadcasting ID in the root domain map.

truly happen.
I am not sure how others feel about this direction, but I think this is at least something worth exploring if we decide to solve this problem instead of working around it.
My question for now is: does it make sense to have another round of discussion on "how to design broadcasting"? Discussion on difficult topics like this can be time consuming, so before putting real effort into this topic we should know: 1. Should we discuss? 2. If yes, when? This month? In 2 months? 3 months?
I largely agree with @zasdfgbnm, but in terms of timeline, since these issues are either not critical or already have a WAR, I don't have any particular timeline in mind. I personally would look forward to having extra time to dig into this, but I'm not very optimistic.
In the approach of "extending `MaxPosCalculator`", is this sufficient? Should we also modify our permissive mapping?
(Originally found in #1728)
tldr: Inlining position can depend on consumer inlining
## Problem
Squeeze followed by broadcast can cause inlining to fail. Consider the following example. Blue indicates broadcast domains.
Inputs: `t0` and `t5`. Output: `t7`.
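The original example is given as a figure (not reproduced here); as a rough, hedged reconstruction of the pattern in nvFuser's C++ test style (the shapes, the `add`/`set` ops, and the exact `broadcast`/`squeeze` overloads are assumptions, not the issue's actual fusion):

```cpp
Fusion fusion;
FusionGuard fg(&fusion);

auto t0 = makeSymbolicTensor(1); // [i0]
auto t5 = makeSymbolicTensor(2); // [i0, i1]
fusion.addInput(t0);
fusion.addInput(t5);

auto t1 = broadcast(t0, {false, true});       // [i0, b]: broadcast introduced
auto t2 = squeeze(t1, std::vector<int64_t>{1}); // [i0]: ...and squeezed away
auto t3 = broadcast(t2, {false, true});       // [i0, b]: re-broadcast
auto t6 = add(t3, t5);                        // b concretized against i1
auto t7 = set(t6);
fusion.addOutput(t7);
```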
We should be able to fully inline all tensors like:
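The target is shown as a figure in the issue; roughly, in loop-nest form, it would look like the following (an illustrative C-style sketch based on the assumed fusion above, not generated code):

```cpp
// Ideal: every tensor shares one loop nest.
void ideal(const float* t0, const float* t5, float* t7, int I0, int I1) {
  for (int i0 = 0; i0 < I0; ++i0) {
    float t1 = t0[i0]; // t1's broadcast domain is never materialized
    float t2 = t1;     // squeeze
    for (int i1 = 0; i1 < I1; ++i1) {
      float t3 = t2;                    // re-broadcast, concretized to i1
      float t6 = t3 + t5[i0 * I1 + i1]; // add with t5
      t7[i0 * I1 + i1] = t6;            // output
    }
  }
}
```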
However, due to the producer-consumer pair of `t1` and `t2`, where `t2` has no corresponding domain for the `t1` broadcast domain, the inlining position of `t1` is set to 0, meaning it doesn't get inlined at all. So, what we would get is:
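Again a figure in the original; roughly, `t1` ends up in its own loop nest (same illustrative style and assumptions as above):

```cpp
#include <vector>

// With t1 at inlining position 0, it is computed in a separate loop nest
// and materialized in full before its consumers run.
void actual(const float* t0, const float* t5, float* t7, int I0, int I1) {
  std::vector<float> t1(I0); // materialized instead of inlined
  for (int i0 = 0; i0 < I0; ++i0) {
    t1[i0] = t0[i0];
  }
  for (int i0 = 0; i0 < I0; ++i0) {
    float t2 = t1[i0]; // squeeze
    for (int i1 = 0; i1 < I1; ++i1) {
      float t3 = t2;
      float t6 = t3 + t5[i0 * I1 + i1];
      t7[i0 * I1 + i1] = t6;
    }
  }
}
```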
This can further make consistent scheduling difficult, as the broadcast domain of `t1` is scheduled without concretization, whereas the corresponding domains of the other tensors are scheduled as concretized domains. As a result, it could mean inconsistent parallelizations, i.e., requiring a grid sync.

The problem here is that it should be valid to inline `t1`, as the actual loop structure of `t2` does have the corresponding concrete domain for the squeezed broadcast domain. What this means is that when we calculate the max inlining position, we also need to take the inlining of the consumer tensor into consideration. Here's what we currently do: https://github.com/NVIDIA/Fuser/blob/main/csrc/inlining.cpp#L126.
## Impact

I'm not sure how critical it is to support this pattern. Currently, we would fail explicitly, e.g., with a missing-sync error. I don't think this should result in a silent error.
## Approaches to consider
I think we would somehow need to extend `MaxPosCalculator` to understand the actual loop structure of a consumer tensor. However, this would mean that the inlining position of a producer depends on the inlining position of its consumer. Furthermore, inlining a producer could change the loop structure of its consumer tensor (e.g., the `Indexing19` test of the IdMap PR). So, that seems to suggest there might be a circular dependency between a producer and a consumer.
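One generic way to handle such a cycle (purely a hypothetical sketch; `inlineToFixedPoint` and `computeMaxPos` are illustrative names, not anything in nvFuser) is to iterate the position computation to a fixed point:

```cpp
#include <functional>
#include <unordered_map>
#include <vector>

struct Tv {}; // stand-in for TensorView

// Repeatedly recompute each tensor's max inlining position, feeding the
// current positions of all tensors (including consumers) back into the
// calculation, until nothing changes. computeMaxPos stands in for a
// hypothetical consumer-aware extension of MaxPosCalculator.
void inlineToFixedPoint(
    const std::vector<Tv*>& tvs,
    const std::function<int(Tv*, const std::unordered_map<Tv*, int>&)>&
        computeMaxPos) {
  std::unordered_map<Tv*, int> pos;
  for (auto* tv : tvs) {
    pos[tv] = 0;
  }
  bool changed = true;
  while (changed) { // terminates if positions can only grow monotonically
    changed = false;
    for (auto* tv : tvs) {
      const int p = computeMaxPos(tv, pos);
      if (p != pos[tv]) {
        pos[tv] = p;
        changed = true;
      }
    }
  }
}
```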
## Estimated time scale

Unclear yet, but a month at minimum.