NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Bind sharded input/output tensors with DID-parallelized allocation domains. #3282

Open wujingyue opened 1 month ago

wujingyue commented 1 month ago

The shape of an input/output at::Tensor matches the logical domain, even when the corresponding TensorView has a non-trivial allocation domain. See https://github.com/NVIDIA/Fuser/blob/2e4ed54ebd09946adef8d38b0580f5f062ffc0ec/csrc/expr_evaluator.cpp#L165 for inputs and https://github.com/NVIDIA/Fuser/blob/85c22a2af3fc756d9b301c93ac29b2bba8abf2af/csrc/runtime/allocations.cpp#L769-L770 for outputs.

This will be problematic for multi-GPU when we start to parallelize the loop domain. Say the input TV has logical=[D*S] and allocation=[DID{D},S], where D is the number of devices and S is the local tensor size. We can't allocate the input with the entire logical shape.

After discussing this with @jjsjann123, I think the most natural solution is to let FusionExecutorCache expect input tensors to match allocation domains and allocate output tensors according to allocation domains. For example, in the case above, the input tensor will have shape [1,S] because D in the allocation domain is parallelized.
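
For concreteness, here's a rough sketch of that setup in the C++ API. It mirrors the usual multi-GPU test pattern; the exact signatures (DeviceMesh::createForNumDevices, the inner_split flag, setAllocationDomain) may differ slightly.

const int64_t D = 2; // number of devices; S denotes the symbolic local size
TensorView* in = makeContigTensor(1); // logical = [i0{D*S}]
fusion.addInput(in);

in->setDeviceMesh(DeviceMesh::createForNumDevices(D));
in->split(/*axis=*/0, /*factor=*/D, /*inner_split=*/false); // [D, S]
in->axis(0)->parallelize(ParallelType::DIDx);               // [DID{D}, S]
in->setAllocationDomain(in->getLoopDomain(), /*new_contiguity=*/true);

With the proposed binding, each rank would pass an at::Tensor of shape [1, S] for this input (the DID axis is parallelized, so its local extent is 1) rather than a tensor of the full logical shape [D*S].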

One challenge is to not break the stride_order support needed by Thunder. I think we'll need extra permutations in addition to translating stride order to allocation domain.

wujingyue commented 4 weeks ago

Here's my initial idea and a list of proposed changes:

  1. The input tensor fed to FusionExecutorCache::runFusionWithInputs needs to match the corresponding input TensorView's allocation domain instead of the logical domain. This means we'll change all C++ tests that set allocation domains of fusion inputs/outputs.
  2. When binding a TensorView to a tensor, transform the tensor's sizes/strides from matching the allocation domain to matching the logical domain, and then bind the logical domain. Part of this can be done using this helper: https://github.com/NVIDIA/Fuser/blob/f097e58ea1ae9d2c70115ce6dfabc7b27763fcb3/csrc/runtime/allocations.cpp#L688. Binding the logical domain is necessary because the logical domain usually dominates the allocation domain -- binding the allocation domain alone would leave some extent Val*s unbound.
  3. When allocating an output tensor, allocate it according to the allocation domain instead of the logical domain. For example, a TensorView with a logical domain of [i0,i1,i2] and an allocation domain of [i2,i1,i0] will be allocated with shape [i2,i1,i0]. inferShapeOfOutput already does this; however, it currently converts the shape back to the logical domain afterwards: https://github.com/NVIDIA/Fuser/blob/f097e58ea1ae9d2c70115ce6dfabc7b27763fcb3/csrc/runtime/allocations.cpp#L770.
  4. FusionDefinition::execute permutes input/output by stride order if the stride order is specified (see the sketch after this list). This way, we hide this change from Python users, who customize the allocation domain only via stride order. It's unclear to me whether FusionDefinition knows which inputs/outputs are stride-ordered. We may need to collect that from recording_state_. Note: intermediate fd.ops.stride_orders will remain the same -- they get translated to a set with an allocation domain attached to the output. If they sit on segment boundaries, they'll be allocated differently (again logical => allocation), but the outside of the fusion definition will see the same behavior.
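
A rough sketch of the output half of the item-4 permutation. allocationToLogical is a hypothetical helper, and the stride_order convention assumed here is the Python frontend's, where value N-1 marks the dimension with the largest stride.

at::Tensor allocationToLogical(
    const at::Tensor& alloc_ordered,
    const std::vector<int64_t>& stride_order) {
  // stride_order[d] == rank - 1 means logical dim d has the largest stride,
  // i.e., it sits at allocation position 0 (outermost).
  const int64_t rank = alloc_ordered.dim();
  std::vector<int64_t> perm(rank);
  for (int64_t d = 0; d < rank; ++d) {
    perm[d] = rank - 1 - stride_order[d];
  }
  return alloc_ordered.permute(perm);
}

For a channels-last output with stride_order=[3,0,2,1], for example, the allocated tensor has shape [N,H,W,C], and permute([0,3,1,2]) restores the logical [N,C,H,W] view with channels-last strides.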

Update on 11/04: I'll try to change 1 and 4 so that, instead, we add another API to FusionExecutorCache that takes tensors matching allocation domains. If that works, it will avoid having to change all the C++ tests that set allocation domains.

cc @naoyam and @jjsjann123

It feels like a lot of work. But if it is what it takes to fix this problem, I'm willing to give that a shot.

naoyam commented 3 weeks ago

Makes sense to me.

jjsjann123 commented 3 weeks ago

SGTM as well.

Regarding point 4 here

FusionDefinition::execute permutes input/output by stride order if the stride order is specified. This way, we hide this change from Python users, who customize the allocation domain only via stride order. It's unclear to me whether FusionDefinition knows which inputs/outputs are stride-ordered. We may need to collect that from recording_state_. Note: intermediate fd.ops.stride_orders will remain the same -- they get translated to a set with an allocation domain attached to the output. If they sit on segment boundaries, they'll be allocated differently (again logical => allocation), but the outside of the fusion definition will see the same behavior.

A heads up here that there might be some obsolete permutation logic in our runtime left over from the TorchScript days. I'll try to remove that.

jjsjann123 commented 3 weeks ago

FusionDefinition::execute permutes input/output by stride order if the stride order is specified.

Nitpicking on the implementation details: should this part be on the C++ side instead? i.e., we still have C++ users/tests where we need to ensure that the returned results have the correct semantics.

wujingyue commented 3 weeks ago

Nitpicking on the implementation details: should this part be on the C++ side instead? i.e., we still have C++ users/tests where we need to ensure that the returned results have the correct semantics.

I like not having to change C++ users/tests, but I haven't found an easy way to do this all on the C++ side. Where this breaks down is that the allocation domain is the only construct on the C++ side for representing memory format. We don't know whether a particular allocation domain is dictated by stride order, a device split, or anything else in the future.

There may be a fragile way of doing so by pattern-matching the logical=>allocation transforms. If logical=>allocation is a permutation, assume it's done by stride order and convert the allocated output tensor back to logical; otherwise, assume device split and keep it consistent with allocation. Would that be a good intermediate step?
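
A sketch of that pattern matching, assuming the TensorView's logical and allocation IterDomain* vectors are at hand (isPurePermutation is illustrative, not an existing helper):

bool isPurePermutation(
    const std::vector<IterDomain*>& logical,
    const std::vector<IterDomain*>& allocation) {
  // A split/merge (e.g., a device split) creates new IterDomains, so the
  // allocation domain is a pure reordering of logical iff both contain
  // exactly the same IterDomain pointers.
  return logical.size() == allocation.size() &&
      std::is_permutation(logical.begin(), logical.end(), allocation.begin());
}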

wujingyue commented 3 weeks ago

(My braindump)

I think the root cause is that at::Tensor, as used in the FusionExecutorCache::runFusionWithInputs interface, can't always represent both the logical domain and the allocation domain of a TensorView.

Currently, with stride order being the only use case, at::Tensor can just barely represent the logical domain in its sizes and the allocation domain in its strides, which is fine. However, with the new multi-GPU use case mentioned in the OP, the logical=>allocation transforms become more complicated than what strides can represent.

This is not only a problem for multi-GPU. Consider binding an arange(6) to the following TensorView:

logical = [i0{6}]
i1{2}, i2{3} = split(i0{6})
i3{6} = merge(i2{3}, i1{2})
allocation = [i3{6}]

The storage will look like [0,3,1,4,2,5], and the logical view has to remain [0,1,2,3,4,5] to match the semantics of arange. Using at::Tensor, there's no way to set the strides to give us that logical view.
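
To make the layout concrete, here's a standalone (non-nvFuser) illustration of how that storage order falls out of the split/merge:

int storage[6];
for (int p = 0; p < 6; ++p) {
  int i1 = p / 3; // outer of the split, extent 2
  int i2 = p % 3; // inner of the split, extent 3
  storage[i2 * 2 + i1] = p; // arange: the value equals the logical position
}
// storage is now {0, 3, 1, 4, 2, 5}

No per-dimension stride can map logical position p to storage offset (p % 3) * 2 + p / 3, so sizes/strides alone can't express this layout.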

There are three general directions that we can consider:

  1. Stick with using at::Tensor in the interface and accept its limitations. So far, we've been talking about letting at::Tensor match either the logical domain or the allocation domain. Matching the logical domain means it wouldn't be able to represent a sharded tensor, as pointed out in the OP. Matching the allocation domain can work but leads to a bad user experience: the user of FusionExecutorCache would have to reconstruct the logical view for the framework.
  2. Let runFusionWithInputs consume/produce a "rich" tensor that can represent both logical and allocation. I'm afraid it's hard to cover all cases; we would have to reimplement the same kinds of transforms for concrete tensors, e.g., split and merge.
  3. Similar to 2, but make the tensor just rich enough to support certain use cases (see the sketch after this list). For example, we could build something similar to DTensor that only allows device splits, not arbitrary transforms. We would probably have to fork the API to consume different tensor representations, e.g., DTensor vs. a local tensor with stride orders, etc.
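
For direction 3, a minimal sketch of what such a restricted "rich" tensor could look like (names are illustrative, not an existing or proposed API):

struct ShardedTensor {
  at::Tensor local;                    // matches the allocation domain
  std::vector<int64_t> logical_sizes;  // global, unsharded sizes
  std::vector<int64_t> sharded_dims;   // logical dims that are device-split
};
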
jjsjann123 commented 3 weeks ago

Stick with using at::Tensor in the interface and accept its limitations.

QQ: What does "accept its limitations" mean? That we'll just not be able to support loop DID parallelization?

wujingyue commented 3 weeks ago

QQ: What does "accept its limitations" mean? That we'll just not be able to support loop DID parallelization?

It means binding at::Tensor to the allocation domain as proposed here and accepting the UX limitation.

samnordmann commented 2 weeks ago

Thanks for the write-up. It makes sense to me in general, but I'll need more time to process all the details. While I'm reading, I have a couple of questions:

1) Currently, at::Tensor already doesn't match the logical domain, since a sharded dimension is of size 1 on the at::Tensor and not on the logical domain. It needs to be like that; otherwise, we couldn't accept sharded I/O. Do you agree with this, and is this what you have in mind in your item "1"?

2) What do you mean by:

Matching the allocation domain can work but leads to a bad user experience: the user of FusionExecutorCache would have to reconstruct the logical view for the framework

Do you mean we need to apply transforms to retrieve the logical domain back from the ATen shape? With that, I agree.

3) I don't understand what you mean by "rich tensor". Could you make this idea more precise? I don't see how nor why we would use something other than at::Tensor.

wujingyue commented 2 weeks ago

since a sharded dimension is of size 1 on the at::Tensor and not on the logical domain.

I should have defined "match" more clearly. A logical domain (or any tensor domain) is a list of IterDomains, each of which can be parallelized or not. A parallelized IterDomain and a tensor dimension of size 1 are considered to match. The extent of that IterDomain, however, will be bound to the size of the mesh along the corresponding dimension. Under this definition, at::Tensor today matches logical but not allocation.

Do you mean we need to apply transforms to retrieve the logical domain back from the ATen shape? With that, I agree.

Yes. @naoyam reminded me today that this is not even always possible with uneven splits. For that reason, I'm now leaning towards always taking the logical tensor (however it can be on device meta) and optionally the allocation tensor. Will run some experiments to verify the ideas...
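
Roughly, the idea would look like this (purely illustrative; D, S, and the pairing itself are placeholders, and no such executor API exists yet):

const int64_t D = 2, S = 1024;
// The logical tensor only carries sizes/strides, so it can live on the meta
// device; the allocation tensor is the actual sharded buffer on this rank.
at::Tensor logical = at::empty({D * S}, at::device(at::kMeta));
at::Tensor local = at::randn({1, S}, at::device(at::kCUDA));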

naoyam commented 2 weeks ago

Yes. @naoyam reminded me today that this is not even always possible with uneven splits. For that reason, I'm now leaning towards always taking the logical tensor (however it can be on device meta) and optionally the allocation tensor. Will run some experiments to verify the ideas...

It may also be possible to work around this by not having logical sizes. One primary use of logical sizes is predicating each expression. However, for pointwise ops, we can just eliminate predicates. Some threads may use data that's out of logical boundaries, but that should be fine as long as it isn't propagated to final outputs. Non-pointwise ops do need more careful processing. For example, sum reductions should be fine as long as the buffer is initialized to zero.

Overall, however, I'm not sure how this WAR would work generally. It may easily hit limitations.

samnordmann commented 2 weeks ago

since a sharded dimension is of size 1 on the at::Tensor and not on the logical domain.

I should have defined "match" more clearly. A logical domain (or any tensor domain) is a list of IterDomains, each of which can be parallelized or not. A parallelized IterDomain and a tensor dimension of size 1 are considered to match. The extent of that IterDomain, however, will be bound to the size of the mesh along the corresponding dimension.

Ok, I understand now what you mean by "match", thanks for explaining. However, I am not sure I understand:

Under this definition, at::Tensor today matches logical but not allocation.

Can you point to a concrete example? [Edit] Maybe my question boils down to understanding in which cases, currently, the logical domain doesn't match the allocation domain.

Do you mean we need to apply transforms to retrieve the logical domain back from the ATen shape? With that, I agree.

Yes. @naoyam reminded me today that this is not even always possible with uneven splits. For that reason, I'm now leaning towards always taking the logical tensor (however it can be on device meta) and optionally the allocation tensor. Will run some experiments to verify the ideas...

I would be in favor of not considering (or even forbidding) indivisible splits for now, just to keep things concrete. That sounds like a reasonable assumption.

But anyway, it should always be possible to play the transforms backward, even for indivisible splits. ceilDiv is not a one-to-one mapping, so it is not invertible, yet that shouldn't matter, as we can always choose the "canonical" preimage that makes the division even. Let me illustrate with an example:

TensorView* tv0 = makeContigTensor(1); // of symbolic extent [i1]
fusion.addInput(tv0);

tv0->split(/*axis=*/0, /*factor=*/8); // now of symbolic extents [ceilDiv(i1, 8), 8]

// [...]

at::Tensor concrete_input = at::rand({4, 8});

Then, binding the concrete input to infer i1 is not a one-to-one map, in the sense that all integers in the range [8*3+1, 8*4] are valid choices for i1; however we can choose the "canonical" value i1=8*4 which avoids indivisible splits.
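
In code, the backward inference for this example is just the following (canonicalPreimage is a hypothetical helper, not existing nvFuser code):

int64_t canonicalPreimage(int64_t outer_extent, int64_t factor) {
  // Any value in ((outer_extent - 1) * factor, outer_extent * factor] is a
  // valid preimage of ceilDiv; pick the one that makes the split divisible.
  return outer_extent * factor;
}
// canonicalPreimage(4, 8) == 32, so i1 is bound to 32 for the {4, 8} input.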

Of course, this approach wouldn't support wild corner-case scenarios, like involving a symbolic extent in two different indivisible splits, but that sounds like a reasonable restriction.

Is there a problem with this approach? Wdyt?

naoyam commented 2 weeks ago

Could you elaborate on this?

however we can choose the "canonical" value i1=8*4 which avoids indivisible splits.

Did you mean we just assume i1 is 32 even when it's actually, for example, 31?

samnordmann commented 2 weeks ago

Could you elaborate on this?

however we can choose the "canonical" value i1=8*4 which avoids indivisible splits.

Did you mean we just assume i1 is 32 even when it's actually, for example, 31?

I guess this is what I mean. However, I am not sure what you mean by "it is actually 31". Since the extent is symbolic, how can we say what it "actually is", and what does it mean?