NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Inlining RoPE-like patterns requires an analysis akin to ComputeAtLogicalDomainMap #3068

Open naoyam opened 1 month ago

naoyam commented 1 month ago

In RoPE-like fusions, where a domain is sliced and then padded back to the original domain, inlining seems to need to consider a constraint that is similar to the persistent constraint in normalization fusions.

Simplified example:

```
t0: [i0]
t1 = t0 // [i1]
t2 = t1 // [i2]
t3 = t2[0:2] // [i3]
t4 = pad(t3, {0, i0 - 2}) // [i4]
t5 = t1 + t4 // [i5]
```
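The slice-and-pad pattern can be mirrored in plain NumPy (a stand-in for the actual fusion IR, only to make the extents concrete; `i0 = 8` is an arbitrary choice):

```python
import numpy as np

i0 = 8                         # extent of the original domain
t0 = np.arange(i0)             # t0: [i0]
t1 = t0.copy()                 # t1: [i1], same extent as i0
t2 = t1.copy()                 # t2: [i2]
t3 = t2[0:2]                   # t3: [i3], sliced down to extent 2
t4 = np.pad(t3, (0, i0 - 2))   # t4: [i4], zero-padded back to extent i0
t5 = t1 + t4                   # t5: [i5]

assert t3.shape == (2,)        # i3 has a smaller extent than i2
assert t4.shape == (i0,)       # i4 is back to the original extent
```

The extent mismatch between `t2` (extent `i0`) and `t3` (extent 2) is exactly where the inlining constraint below arises.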


Here, i2 cannot be inlined into i3 since the extent of i2 is larger than that of i3. That also means i1 cannot be inlined into i2: if that were done, i5 would also be pulled into the same inlined loop, which would in turn force i3 and i4 to be pulled together. Since i2 and i3 cannot be inlined together, that inlining pattern is invalid.

This is quite similar to the inlining constraint due to the reduction-broadcast pattern in normalization. I think ideally we should generalize the analysis to consider constraints like this case.
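One way to phrase the constraint: inlining a domain transitively pulls its consumers into the same loop nest, and the attempt fails if any pulled-in domain has a smaller extent than the candidate. A toy Python sketch of that idea (this is not nvFuser's actual analysis; the extents and consumer edges below hand-encode the simplified example, with `i0 = 8` chosen arbitrarily):

```python
# Hand-encoded extents for the simplified RoPE-like example (i0 = 8).
extent = {"i1": 8, "i2": 8, "i3": 2, "i4": 8, "i5": 8}
# Producer -> consumer edges between the loop domains.
consumers = {"i1": ["i2", "i5"], "i2": ["i3"], "i3": ["i4"], "i4": ["i5"]}

def inline_group(start):
    """All domains pulled into one loop nest if `start` is inlined
    into its consumers transitively."""
    group, stack = set(), [start]
    while stack:
        d = stack.pop()
        if d not in group:
            group.add(d)
            stack.extend(consumers.get(d, []))
    return group

def can_inline(start):
    # Invalid if the group contains a domain smaller than the candidate,
    # since the candidate's loop cannot fit inside that domain's loop.
    return all(extent[d] >= extent[start] for d in inline_group(start))

print(can_inline("i2"))  # False: the group pulls in i3, whose extent is 2
print(can_inline("i1"))  # False: i5 drags in i3/i4 transitively
print(can_inline("i3"))  # True: every consumer of i3 is at least as large
```

The point of the sketch is only that the check has to walk the whole consumer graph, not just the immediate producer-consumer pair, which is why a local analysis misses cases like this.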

zasdfgbnm commented 1 month ago

Should we also handle something like this:

```
t0: [i0]
t1 = t0 // [i1]
t2 = t1 // [i2]
t3 = t2 // [i3/2, 2]
t4 = t3 // [i4]
t5 = t1 + t4 // [i5]
```

We cannot inline t1 at position 1, although i2 is mapped with i5.
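This split-then-merge pattern can also be mirrored in NumPy (again a hypothetical stand-in for the fusion IR, just to show the reshapes; `i0 = 8` is arbitrary):

```python
import numpy as np

i0 = 8
t0 = np.arange(i0)         # t0: [i0]
t1 = t0.copy()             # t1: [i1]
t2 = t1.copy()             # t2: [i2]
t3 = t2.reshape(-1, 2)     # t3: [i3/2, 2], i3 split by a factor of 2
t4 = t3.reshape(-1)        # t4: [i4], merged back to a single domain
t5 = t1 + t4               # t5: [i5]

assert t3.shape == (i0 // 2, 2)
assert t5.tolist() == (2 * t0).tolist()
```

Here the extents all match, but t3's loop structure is split while t1's is not, which is what blocks inlining t1 at position 1 under the current system.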

naoyam commented 1 month ago

I think that's true in our current inlining system. In the original design of computeAt, however, the split of t3 would have been propagated across the fusion to make the tensors inlinable.

In general, I think that the analysis of inlinability needs to be a global analysis. ComputeAtLogicalDomainMap does that for the reduction-broadcast pattern, but that's not the only case that affects inlinability.