NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Scheduling failure with squeeze+broadcast #1757

Open naoyam opened 8 months ago

naoyam commented 8 months ago

(Originally found in #1728)

tldr: Inlining position can depend on consumer inlining

Problem

Squeeze followed by broadcast can cause inlining to fail. Consider the following example. Blue indicates broadcast domains.

Inputs: t0 and t5. Output: t7.

[Figure: diagram of the fusion from t0 through t7; blue indicates broadcast domains]
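In nvFuser's C++ test style, the pattern could look roughly like the sketch below. The shapes, the squeeze/broadcast flags, and the use of the test helpers makeConcreteTensor/makeSymbolicTensor are assumptions inferred from the dataflow in the loop nest that follows, not taken from the original repro.

// Hypothetical reconstruction of the pattern; shapes and flags are assumptions.
Fusion fusion;
FusionGuard fg(&fusion);

auto t0 = makeConcreteTensor({-1, 1}); // [i0, b1]: size-1 domain becomes a broadcast
auto t5 = makeSymbolicTensor(2);       // concrete 2D tensor
fusion.addInput(t0);
fusion.addInput(t5);

auto t1 = set(t0);                      // [i0, b1]
auto t2 = squeeze(t1, std::vector<bool>{false, true}); // [i0]: drops the broadcast
auto t3 = broadcast(t2, {false, true}); // [i0, b]: re-introduces a broadcast
auto t4 = set(t1);                      // [i0, b1]
auto t6 = add(t5, t3);                  // concretizes the broadcast against t5
auto t7 = add(t6, t4);
fusion.addOutput(t7);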

We should be able to fully inline all tensors like:

for 
   for 
       t1 = t0
       t2 = t1
       t3 = t2
       t6 = t5 + t3
       t4 = t1
       t7 = t6 + t4

However, due to the producer-consumer pair of t1 and t2, where t2 has no corresponding domain for the broadcast domain of t1, the inlining position of t1 is set to 0, meaning it doesn't get inlined at all. So, what we would get is:

for
   t1 = t0
for 
   for 
       t2 = t1
       t3 = t2
       t6 = t5 + t3
       t4 = t1
       t7 = t6 + t4

This can further make consistent scheduling difficult, as the broadcast domain of t1 is scheduled without concretization, whereas the corresponding domains of the other tensors are scheduled as concretized domains. As a result, the parallelizations can end up inconsistent, i.e., requiring a grid sync.

The problem here is that it should be valid to inline t1, as the actual loop structure of t2 does have the corresponding concrete domain for the squeezed broadcast domain. This means that when we calculate the max inlining position, we also need to take the inlining of the consumer tensor into consideration. Here's what we currently do: https://github.com/NVIDIA/Fuser/blob/main/csrc/inlining.cpp#L126.
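Roughly, the current calculation only consults producer-to-consumer domain mappings and never asks how far the consumer itself is inlined. The following is a simplified sketch of the shape of that logic, not the verbatim source:

// Simplified sketch, not the actual inlining.cpp code: the producer's max
// inlining position is the min over all consumers, derived purely from the
// producer->consumer domain mapping. t1's broadcast domain has no counterpart
// in t2, so the walk stops immediately and returns 0.
size_t maxPosAll(TensorView* producer) {
  size_t max_pos = producer->nDims();
  for (auto consumer : ir_utils::consumerTvsOf(producer)) {
    // Stops at the first producer domain that has no mapped consumer domain;
    // the consumer's own inlining position is never consulted.
    max_pos = std::min(max_pos, getMaxProducerPosFromConsumer(producer, consumer));
  }
  return max_pos;
}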

Impact

I'm not sure how critical it is to support this pattern. Currently, we fail explicitly, e.g., with a missing-sync error. I don't think this should result in a silent error.

Approaches to consider

I think we would somehow need to extend MaxPosCalculator to understand the actual loop structure of a consumer tensor. However, this would mean that a producer's inlining position depends on its consumer's inlining position.

Furthermore, inlining of a producer could change the loop structure of its consumer tensor (e.g., the Indexing19 test of the IdMap PR). So, there might be a circular dependency between a producer and a consumer: the producer's inlining position depends on the consumer's loop structure, while the consumer's loop structure can in turn change with the producer's inlining.

Estimated time scale

Unclear yet, but at least a month.

naoyam commented 8 months ago

Tagging: @zasdfgbnm, @jacobhinkle

naoyam commented 8 months ago

@zasdfgbnm, @jacobhinkle, please let me know if it's clear to you what the problem is. My take for now is that it's an interesting problem to address, but we would need a clear use case, as it would require non-trivial work.

naoyam commented 8 months ago

Also tagging for viz: @wujingyue

jacobhinkle commented 8 months ago

Should the blue lines from IDs 9->11->13 be yellow instead or does that matter? I.e. does the broadcast need to be concretized in t6 and t7 for us to encounter this?

Also just for clarification: we think that if the scheduler manages to inline t1 at position 2 and t2 at position 1, then the rest of our machinery will work and produce the ideal loop nest you showed first, right? Stated another way: do we think that currently this is only a limitation in how we set inlining in inlineMost, or is it more insidious?
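For concreteness, a sketch of that experiment, assuming the usual TensorView::inlineAt API and the positions named above:

// Hypothetical experiment: set the positions by hand instead of inlineMost().
t1->inlineAt(2); // inline past the (squeezed) broadcast domain
t2->inlineAt(1); // t2 has no counterpart for t1's broadcast domain
// ...then inline the remaining tensors as usual and lower, checking whether
// the ideal loop nest shown above comes out without sync errors.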

Since this seems to be an issue with missing broadcast domains due to this squeeze+broadcast pattern, would it be possible to work around this by doing a flood fill of consumers of squeeze to find possible broadcasts that re-introduce the squeeze, and bypassing the squeeze for those branches (in case we need a quick fix)?

naoyam commented 8 months ago

Should the blue lines from IDs 9->11->13 be yellow instead or does that matter? I.e. does the broadcast need to be concretized in t6 and t7 for us to encounter this?

It should be a concretized domain. If not, we can just reorder the broadcast domains to the inner position and ignore them. That would allow t1 to inline the non-broadcast domain, which should be sufficient.
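A minimal sketch of that fallback, assuming (hypothetically) that t1's domains are laid out as [b, i0]:

// Non-concretized case: move the broadcast domain innermost so it doesn't
// block inlining, then inline only the non-broadcast domain.
t1->reorder({{0, 1}, {1, 0}}); // {old_pos -> new_pos}: [b, i0] -> [i0, b]
t1->inlineAt(1);               // inline i0; the trailing broadcast is ignored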

Also just for clarification: we think that if the scheduler manages to inline t1 at position 2 and t2 at position 1, then the rest of our machinery will work and produce the ideal loop nest you showed first, right? Stated another way: do we think that currently this is only a limitation in how we set inlining in inlineMost, or is it more insidious?

I believe you're right, but since we have never managed to inline tensors like t1 at position 2 so far, there may be other things failing as well.

Since this seems to be an issue with missing broadcast domains due to this squeeze+broadcast pattern, would it be possible to work around this by doing a flood fill of consumers of squeeze to find possible broadcasts that re-introduce the squeeze, and bypassing the squeeze for those branches (in case we need a quick fix)?

I suppose we don't need a quick fix, but I've started liking the approach of removing broadcast domains as advocated by @zasdfgbnm. I think in this case we could first just remove the broadcast domains and schedule all the tensors without them, then MaxPosCalculator should just be able to figure out the ideal inlining positions. Then, we would add back the necessary broadcast domains when we actually inline the tensors.
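A rough outline of that pipeline (removeBroadcastDomains, scheduleTensors, and restoreBroadcastDomains are made-up names that just label the steps; only inlineMost() is an existing entry point):

// Hypothetical outline of the proposed direction; step names are placeholders.
auto stripped = removeBroadcastDomains(fusion); // 1. drop all broadcast domains
scheduleTensors(stripped);                      // 2. schedule concrete domains only
inlineMost();                                   //    no broadcast can block inlining now
restoreBroadcastDomains(fusion);                // 3. re-insert the needed broadcast
                                                //    domains at actual inlining time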

I feel this approach would be better than extending MaxPosCalculator.

zasdfgbnm commented 8 months ago

I think I understand the problem, but I am not sure if I understand the proposed solution. If instead of having

t2 = squeeze(t1);
t3 = broadcast(t2);

we had

t2 = squeeze(t1);
t2_1 = set(t2);
t2_2 = set(t2_1);
t2_3 = set(t2_2);
t2_4 = set(t2_3);
t3 = broadcast(t2_4);

will this solution correctly handle this?

Should the blue lines from IDs 9->11->13 be yellow instead or does that matter?

@jacobhinkle I think the colors here are just tags for different disjoint sets; they say nothing about whether a domain is a broadcast or not.

I think in this case we could first just remove the broadcast domains and schedule all the tensors without them, then MaxPosCalculator should just be able to figure out the ideal inlining positions.

This approach will also fix https://github.com/NVIDIA/Fuser/issues/1759, because forwarding would not exist at all!

zasdfgbnm commented 8 months ago

If we needed to fix this urgently, the best workaround that I can think of is to detect this very specific pattern and reject it in canSchedule to force segmentation.
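Such a check could look roughly like the sketch below. It is deliberately over-broad (it flags any SqueezeOp with a downstream BroadcastOp), and the exact canSchedule hook it would plug into differs per scheduler:

// Rough sketch of a compile-time rejection for this pattern.
bool hasSqueezeThenBroadcast(Fusion* fusion) {
  auto exprs = fusion->exprs();
  for (auto squeeze : ir_utils::filterByType<SqueezeOp>(exprs)) {
    for (auto bcast : ir_utils::filterByType<BroadcastOp>(exprs)) {
      // Is this broadcast downstream of the squeeze's output?
      if (DependencyCheck::isDependencyOf(squeeze->out(), bcast->out())) {
        return true; // reject in canSchedule to force segmentation
      }
    }
  }
  return false;
}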

If it's not urgent, I still think the best approach is to make my statement

What is important is the concretization of broadcasting. All the history of broadcasting IterDomains (IDs) before concretization is not relevant, and I am not interested in inspecting it. In particular, I don't care which broadcasting ID is mapped to which broadcasting ID in the root domain map.

truly happen.

I am not sure how others feel about this direction, but I think it is at least something worth exploring if we decide to solve this problem instead of working around it.

My question for now is: does it make sense to have another round of discussion on "how to design broadcasting"? Discussions on difficult topics like this can be time consuming, so before putting real effort into this topic, I think we should know: 1. should we discuss it? 2. if yes, when? This month? In 2 months? In 3 months?

naoyam commented 8 months ago

I largely agree with @zasdfgbnm, but in terms of timeline, since these issues are either not critical or we have already found a WAR, I don't have any particular timeline in mind. I would personally look forward to having some extra time to dig into this, but I'm not very optimistic.

zasdfgbnm commented 8 months ago

In the approach of "extending MaxPosCalculator", is that alone sufficient? Should we also modify our permissive mapping?