NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

[Proposal] Support root->logical transforms in Fusion inputs #3366

Closed jacobhinkle closed 2 weeks ago

jacobhinkle commented 2 weeks ago

NOTICE: See #3372

This is a proposal to fully support Fusion input TensorViews that contain non-trivial root domains. The ATen tensor passed in should then match the root domain of the fusion input, not the logical domain.

Motivation

The primary motivation for this proposal is essentially #1628. For Hopper matmul we usually want to load both operands to smem using TMA, then call the mma instruction directly on those smem operands. If the Fusion inputs are [M, K] and [N, K], they must be broadcast to [M, 1, K] and [1, N, K] before they can pass through the MmaOp, which we currently do with a BroadcastOp in mma_utils::MatmulPattern::translateToMmaOp(). This introduces an intermediate tensor that we cannot get rid of in our current system.
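For concreteness, in the notation of the examples below, the translated pattern looks roughly like this (tensor names are illustrative):

Inputs:
  a [ iM, iK ]
  b [ iN, iK ]

a_b [ iM, bN, iK ] = broadcast(a)   // [M, 1, K]
b_b [ bM, iN, iK ] = broadcast(b)   // [1, N, K]
mm  [ iM, iN, rK ] = mma(a_b, b_b)

Under this proposal, a_b and b_b would instead become the Fusion inputs themselves, with root domains [ iM, iK ] and [ iN, iK ] and logical domains that already contain the broadcast IDs, so no BroadcastOp is needed.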

Approach

I propose that we do the following:

I believe this is all that is needed: we do not actually use the root domain for input tensors, and broadcasts do not affect the actual memory layout, so it is not a problem that the allocation domain matches the logical domain rather than what is in the ATen tensor.
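One concrete piece of this is the ExpressionEvaluator change (quoted later in the thread): bind tv->getMaybeRootDomain() rather than tv->getLogicalDomain() to the received input shapes, since the ATen tensor would now match the root domain. A minimal sketch of that binding, with an assumed helper name and header paths (ExpressionEvaluator::bind and TensorView::getMaybeRootDomain are existing APIs):

// Sketch only, not the actual patch.
#include <exceptions.h>      // assumed nvFuser header, for NVF_ERROR
#include <expr_evaluator.h>  // assumed nvFuser header
#include <ir/all_nodes.h>    // assumed nvFuser header

#include <cstdint>
#include <vector>

using namespace nvfuser;

// Hypothetical helper: bind the extents of a Fusion input's root domain to
// the sizes of the ATen tensor that was passed in.
void bindInputShape(
    ExpressionEvaluator& ev,
    TensorView* tv,
    const std::vector<int64_t>& aten_sizes) {
  // With this proposal the ATen tensor matches the root domain, so its rank
  // equals the number of root IterDomains; broadcast IDs introduced by the
  // root->logical transforms have no corresponding ATen dimension.
  const auto& ids = tv->getMaybeRootDomain();
  NVF_ERROR(ids.size() == aten_sizes.size());
  for (size_t i = 0; i < ids.size(); ++i) {
    ev.bind(ids[i]->extent(), aten_sizes[i]);
  }
}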

Details

Suppose we have

Inputs:
  tv0 [ i0 ]
  tv1 [ i0, i1 ]

tv2 [ i0 ] = neg(tv0)
tv3 [ i0, b1 ] = broadcast(tv2)
tv4 [ i0, i1 ] = mul(tv3, tv1)

We can translate this to the following:

Inputs:
  tv7 [ i0, b1 ] (root = [ i0 ])
  tv1 [ i0, i1 ]

tv6 [ i0, b1 ] = neg(tv7)
tv5 [ i0, i1 ] = mul(tv6, tv1)

Specifically, what was done:

Possible challenges

Allreduce

One challenge is "allreduce", a pattern we detect at lowering/codegen in which we reduce a dimension and then immediately broadcast a new dimension in its place.

tv0 [ i0, i1 ]
tv1 [ i0, r1 ] = sum(tv0)
tv2 [ i0, b1 ] = broadcast(tv1)

If we ignore this pattern while zipping up BroadcastOps, we might translate this to

tv0 [ i0, i1, b2 ]
tv1 [ i0, r1, b2 ] = sum(tv0)

I think patterns like this are easy to detect and we can leave the BroadcastOp in place in these cases, but we should be careful.

I think this is the only way we could actually have a BroadcastOp in the fusion if we implement this proposal as a preseg pass. In that case, we could also go ahead and be done with BroadcastOp once and for all if we did something like introduce IterType::AllReduce to replace the reduced+broadcasted axis.
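A rough sketch of such a detection check, assuming the pass visits each BroadcastOp before zipping it into an input (the helper name and header path are assumptions; BroadcastOp::getBroadcastDimFlags is an existing API):

// Sketch only, not actual nvFuser code: detect the reduce-then-broadcast
// ("allreduce") pattern so the pass can leave the BroadcastOp in place.
#include <ir/all_nodes.h>  // assumed nvFuser header

#include <algorithm>
#include <vector>

using namespace nvfuser;

bool looksLikeAllreduce(BroadcastOp* bop) {
  auto* in_tv = bop->in()->as<TensorView>();
  // The pattern requires the broadcast input to be produced by a reduction.
  Expr* def = in_tv->definition();
  if (def == nullptr || !def->isA<ReductionOp>()) {
    return false;
  }
  // Conservatively treat any new broadcast axis on a reduction output as a
  // potential allreduce. A real check would also verify that the new axis
  // sits exactly where the reduced axis was.
  const std::vector<bool>& flags = bop->getBroadcastDimFlags();
  return std::any_of(flags.begin(), flags.end(), [](bool f) { return f; });
}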

Aliasing

If an input tensor has a root domain and is aliased with an output tensor, should that be allowed? I think so, but I haven't thought very deeply about it, so I would probably refuse to do such aliasing until it is needed.

Summary

Initially we can make light use of this and only apply it to the prologue of translated matmuls. However, if it works well, it might be a nice simplifying step that we could run as a preseg pass.

Related:

naoyam commented 2 weeks ago

My general concern is that these approaches would not retain the same information that BroadcastOp has; specifically, BroadcastOp::getBroadcastDimFlags would be lost. More generally, I think anything represented with an Expr can be moved around without losing information. Reordering and adding broadcast IDs, however, are really TensorDomain ops, and they are not recorded like split and merge, so they may not be replayable as precisely as split and merge.

I'd feel more comfortable if this kind of scheduling were only done by a scheduler rather than more globally as a preseg pass. I'm doing something similar for slice and concat.

jjsjann123 commented 2 weeks ago

> Update ExpressionEvaluator and bind tv->getMaybeRootDomain() instead of tv->getLogicalDomain() to the received shapes of input tensors.

FYI: Instead of root, I think @wujingyue was thinking about binding allocation domain instead for distributed support: https://github.com/NVIDIA/Fuser/issues/3282

wujingyue commented 2 weeks ago

Thanks for tagging me! I think we are trying to overload this poor at::Tensor with too many meanings :) I was thinking of letting at::Tensor match allocation because it has limitations representing more "abstract" tensor domains like logical. I suspect allocation would also work for this case as long as transforms don't have to go in one direction (today, they typically flow from logical to allocation). Wdyt?

jacobhinkle commented 2 weeks ago

We can revisit this later if needed. For now, because of simplicity and smaller scope, I'm going to pursue #3372 instead.

jacobhinkle commented 2 weeks ago

> I was thinking of letting at::Tensor match allocation because it has limitations representing more "abstract" tensor domains like logical. I suspect allocation would also work for this case as long as transforms don't have to go in one direction (today, they typically flow from logical to allocation). Wdyt?

Yeah, I like that. The allocation domain really tells us how the input should look in memory, which is all we need. Once the fusion is defined, I think the only reason we care at all about the logical sizes of input at::Tensors is that they let us bind some values in the ExpressionEvaluator.