NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

[Feature Request] better memory format decision for outputs #1756

Open jjsjann123 opened 7 months ago

jjsjann123 commented 7 months ago

Background

nvfuser codegen should make a smart decision about the memory format of outputs, rather than naively assuming a canonical contiguous tensor with descending strides.

An example, as demonstrated in #1567: codegen was given inputs with a specific stride order:

    ### Inputs
    inputs = [
      torch.randn(32, 1024, 25, 25, device="cuda").as_strided((32, 1024, 25, 25), (640000, 1, 25600, 1024)).half().requires_grad_(),
      torch.randn(1, 1024, 1, 1, device="cuda").as_strided((1, 1024, 1, 1), (1024, 1, 1024, 1024)).half(),
      torch.randn(1, 1024, 1, 1, device="cuda").as_strided((1, 1024, 1, 1), (1024, 1, 1024, 1024)).half(),
    ]

A practitioner who defines the fusion shouldn't have to worry about the how, and would likely define their fusion math without specifying a stride_order for the output tensors. In this example, the computation would be defined like below:

    with nvfuser.FusionDefinition() as fd:
        x = partially_contig_tensor(fd, tensors[0])
        s = partially_contig_tensor(fd, tensors[1])
        b = partially_contig_tensor(fd, tensors[2])
        z = fd.define_scalar(0)

        T0 = fd.ops.mul(x, s)
        T1 = fd.ops.add(T0, b)
        T2 = fd.ops.relu(T1)
        T3 = fd.ops.cast(T2, dtype=nvfuser.DataType.Half)
        T4 = fd.ops.gt(T1, z)

        fd.add_output(T3)
        fd.add_output(T4)
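
For reference, partially_contig_tensor is the helper used in the example from #1567; I don't reproduce it here. A plausible (possibly oversimplified) stand-in is sketched below, assuming fd.from_pytorch captures the dtype and layout of the actual eager tensor; the real helper may define the tensor differently:

    # Hypothetical stand-in for the partially_contig_tensor helper from #1567.
    # Assumes fd.from_pytorch defines a fusion input whose dtype and layout
    # follow the given eager tensor.
    def partially_contig_tensor(fd, t):
        return fd.from_pytorch(t)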

Note the difference between the code above and what we have in #1567, which effectively has to specify the output format as below; otherwise we're leaving performance on the table by going through the transpose scheduler instead of the pointwise scheduler (since all I/O tensors are in a consistent memory format):

    fd.add_output(T3, (3, 0, 2, 1))
    fd.add_output(T4, (3, 0, 2, 1))
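
As I read the stride_order convention, the tuple passed to add_output ranks each dimension by its stride (N-1 = largest / outermost, 0 = smallest / innermost), so (3, 0, 2, 1) matches the channels-last-like strides of the inputs above. A quick way to check this from an eager tensor (stride_order_of is just an illustrative helper, not an nvfuser API):

    import torch

    def stride_order_of(t: torch.Tensor):
        # Rank dimensions by stride magnitude: the innermost dimension
        # (smallest stride) gets 0, the outermost (largest) gets t.dim() - 1.
        by_stride = sorted(range(t.dim()), key=lambda d: t.stride(d))
        rank = [0] * t.dim()
        for r, d in enumerate(by_stride):
            rank[d] = r
        return tuple(rank)

    x = torch.randn(32, 1024, 25, 25, device="cuda").as_strided(
        (32, 1024, 25, 25), (640000, 1, 25600, 1024))
    print(stride_order_of(x))  # (3, 0, 2, 1), matching the tuple passed to add_output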

Feature Request

nvfuser should have a mechanism to decide on the memory format of output tensors that have no explicit allocation domain specified, in order to optimize kernel performance!

Pitch

The prototype is to have a layout_optimization pass that runs as part of the pre-segmentation passes. The pass works as follows (a rough sketch follows the list below):

  1. It looks up the permutation from rfactor_dom to allocation_dom on input TensorViews and records that permutation as the MemoryFormat of those tensors;
  2. It traverses the fusion to propagate MemoryFormat, using a set of propagation rules that compute & record the MemoryFormat of each op's outputs from the recorded MemoryFormat of its inputs;
  3. Lastly, it iterates through all output tensors and tries to set their allocation domain according to the recorded MemoryFormat.
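
To make the three steps concrete, here is a minimal Python sketch of the idea on a toy IR. The Tensor and Op classes and their attribute names are purely illustrative stand-ins for nvFuser's TensorView and Expr; the real pass is a C++ pre-segmentation pass, and only the simplest pointwise rule is shown:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(eq=False)
    class Tensor:
        name: str
        # Permutation from logical (rfactor) domain to allocation domain,
        # e.g. (0, 2, 3, 1) for a channels-last 4D tensor; None if unspecified.
        alloc_perm: Optional[tuple] = None

    @dataclass
    class Op:
        inputs: list
        outputs: list

    def propagate_memory_format(inputs, ops, outputs):
        fmt = {}
        # 1. Seed: record the rfactor->allocation permutation of each fusion input.
        for tv in inputs:
            if tv.alloc_perm is not None:
                fmt[tv] = tv.alloc_perm
        # 2. Propagate in topological order. The only rule sketched here is the
        #    pointwise one: an output inherits the format of the first input
        #    that already has a recorded format.
        for op in ops:
            known = [fmt[i] for i in op.inputs if i in fmt]
            if known:
                for out in op.outputs:
                    fmt.setdefault(out, known[0])
        # 3. Stamp the inferred format onto outputs with no explicit allocation domain.
        for tv in outputs:
            if tv.alloc_perm is None and tv in fmt:
                tv.alloc_perm = fmt[tv]

    # Mirroring the fusion above: x and s are channels-last, T3 has no allocation domain.
    x, s, T3 = Tensor("x", (0, 2, 3, 1)), Tensor("s", (0, 2, 3, 1)), Tensor("T3")
    propagate_memory_format([x, s], [Op([x, s], [T3])], [T3])
    print(T3.alloc_perm)  # (0, 2, 3, 1)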

I'm pushing strongly for the above solution because it's lightweight. I intentionally leave some tricky design questions out of scope, because I think they require more effort and should be resolved at a higher level in the system (segmenter / alias analysis / schedulers), and their scope would be too great, with not enough benefit, to justify the effort in the short term.

Future Directions

I think we need something more comprehensive to optimize the allocation domain of tensors in our fusion. Currently, to keep it simple, we run the propagation pass at the whole-fusion IR level, without considering segmentation. This doesn't really make sense, since layout optimization needs to be done at the same granularity as what a scheduler looks at. This shares some of the same challenges as the alias analysis work that @wujingyue is working on, but I believe format propagation is a bit trickier. A simple example: for a given fusion definition, both the pointwise scheduler and the transpose scheduler might be capable of handling the fusion, depending on how we assign the allocation domain of the outputs. How we break the tie might not always be straightforward.

Another question: should the propagation rules be defined universally per operation, or should each scheduler maintain its own set of rules? Currently, the matmul scheduler manipulates the allocation domain, which could be considered a special case of propagation.

Progress

I think it's important that we work towards the long-term goal of having a reasonable system working alongside each scheduler to optimize the allocation domain of each fusion segment.

Meanwhile, enabling a global propagation pass for the entire fusion is low-hanging fruit that could serve our short-term goal of supporting channels-last kernels without a large amount of refactoring. Progress is tracked here:

~#1744 adding basic layout propagation pass and rule~ <- closed and broken into #1788 #1790 #1792

Expand the propagation rules.

wujingyue commented 7 months ago

Meanwhile, enabling a global propagation pass for the entire fusion is low-hanging fruit

Yeah -- that sounds pretty reasonable to me!

A simple example: for a given fusion definition, both the pointwise scheduler and the transpose scheduler might be capable of handling the fusion, depending on how we assign the allocation domain of the outputs. How we break the tie might not always be straightforward.

I don't think I understand the problem. How bad is it to break the tie arbitrarily?

This shares some of the same challenges as the alias analysis work

For me, changing allocation domains in scheduling introduced bugs that I didn't have time to understand deeply enough to fix. IIRC, it has something to do with getVectorizationFactor on a TensorView between a meta-op-only segment and a pointwise segment. (Sorry, I wish I had written down what I found at the time.)

wujingyue commented 7 months ago

I saw that in your WIP, LayoutInference runs after MarkAliasesPrepare: https://github.com/NVIDIA/Fuser/pull/1755/files#diff-80e6f09fab6c65cd77015043ddf1e2a7c8c8d49746c96e7498c047f208725bce. Would it be beneficial to combine them (or some of them)? For example, for

    y = transpose(x);
    z = pointwise(y);
    w = transpose(z);

MarkAliasesPrepare stops at pointwise because it's not a meta operation. However, a combined layout inference should be able to propagate the layout of x all the way through. I guess seeing more use cases would help us decide what to do?

jjsjann123 commented 7 months ago

How bad is it to break the tie arbitrarily?

Well, in an ideal world it shouldn't matter. But practically, one scheduler might generate a better kernel than the other. In the example at the beginning of the issue, without the propagation the kernel is still scheduled as a transpose kernel, and I see about 1000 GB/s achieved bandwidth on an A100. Meanwhile, if we set a consistent memory format on the outputs, the fusion is picked up by the pointwise scheduler and we get a 1500 GB/s kernel.

changing allocation domains in scheduling introduced bugs that I didn't have time to understand deeply enough to fix.

I did patch something in the vectorization factor computation a few weeks ago regarding allocation domains; that analysis is probably still fragile, though. I can help take a look if you can throw me a repro.

Would it be beneficial to combine them (or some of them)?

I backed out from that idea. :yum: The two share similar logic; I almost feel that alias analysis could use a full-featured memory format propagation as a tool to match the stride order of outputs to inputs.

There are still differences, though, like the consideration of contiguity, and alias analysis only needs to cover a much smaller set of meta operations. I feel these differences might be enough to justify keeping it as a separate pass. But yeah, I'm open to revisiting this as we see more use cases down the road.

zasdfgbnm commented 7 months ago

I haven't read your PR yet, but the solution proposed in this issue makes sense to me.

Another question: should the propagation rules be defined universally per operation, or should each scheduler maintain its own set of rules? Currently, the matmul scheduler manipulates the allocation domain, which could be considered a special case of propagation.

Shouldn't the contract be: schedulers must respect the allocation domain specified on input/output tensors, but ignore the allocation domain of all intermediate tensors and be free to set it to whatever makes sense for that scheduler? I think as long as this contract is followed, the matmul scheduler should be happy, and there is no need to special-case anything.

jjsjann123 commented 7 months ago

schedulers must respect the allocation domain specified on input/output tensors, but ignore the allocation domain of all intermediate tensors and be free to set it to whatever makes sense for that scheduler?

This protocol still holds, i.e. a scheduler can make any change that's not visible from the outside, including updating the allocation domain of its intermediates.

I think as long as this contract is followed, the matmul scheduler should be happy, and there is no need to special-case anything.

I'm thinking more along the lines of: would it make sense for other schedulers to use this utility to figure out how to modify the allocation domain of intermediates, or would that be better kept independent per scheduler? This feels like the old conversation about vectorization factor analysis.

jjsjann123 commented 5 months ago

bookkeeping some offline discussion.

@csarofeen @naoyam have suggested that we switch to propagation through iter domain mapping, instead of per-operation propagation rules.

We discussed walking through the inputs and electing a candidate whose allocation domain transformation we propagate to the output tensors. This is a more general approach, since we no longer need to define per-operator propagation rules. (I'm also hoping it would work better for things like reshape, though admittedly our current propagation doesn't handle that well either :laughing:.) A rough sketch of the election step is below.
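
For illustration only, here is one way the election step could look. The rule ("prefer the input with the highest-rank, non-trivial allocation permutation") and the function name elect_reference are my assumptions, and the actual propagation to outputs would go through nvFuser's iter domain mapping rather than anything shown here:

    from typing import Optional

    def elect_reference(input_perms: dict) -> Optional[str]:
        """Pick the fusion input whose allocation permutation gets propagated.

        input_perms maps an input name to its rfactor->allocation permutation
        (or None if the input has no explicit allocation domain).
        """
        nontrivial = {name: perm for name, perm in input_perms.items()
                      if perm is not None and perm != tuple(range(len(perm)))}
        if not nontrivial:
            return None
        # Prefer the highest-rank candidate; ties are broken arbitrarily here,
        # which is exactly the open question discussed earlier in this issue.
        return max(nontrivial, key=lambda name: len(nontrivial[name]))

    # x is channels-last, s is contiguous, b has no explicit allocation domain.
    print(elect_reference({"x": (0, 2, 3, 1), "s": (0, 1, 2, 3), "b": None}))  # "x"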

I need permutation support for my vision model investigation, so I'll use this opportunity to refactor the work.

jjsjann123 commented 3 months ago

linking #2425