wujingyue opened 1 month ago
Here's my initial idea and a list of proposed changes:
Val* unbound.

FusionDefinition::execute permutes input/output by stride order if the stride order is specified. This way, we hide this change from Python users, who customize the allocation domain only via stride order. It's unclear to me whether FusionDefinition knows which inputs/outputs are stride-ordered; we may need to collect that from recording_state_. Note: intermediate fd.ops.stride_order calls will remain the same -- they get translated to a set op with an allocation domain attached to the output. If they sit on segment boundaries, they'll be allocated differently (again logical => allocation), but the outside of the fusion definition will see the same behavior.

Update on 11/04: I'll try to change 1 and 4 to instead add another API to FusionExecutorCache that takes tensors matching allocation domains. If that works, it will avoid having to change all the C++ tests that set allocation domains.
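To make item 4 a bit more concrete, here is a rough sketch, not an actual implementation, of how an output allocated in allocation order could be permuted back to logical order. It assumes the Python frontend's stride_order convention, where stride_order[i] is the rank of logical axis i and the largest rank marks the outermost (largest) stride; the helper name is made up.

#include <ATen/ATen.h>
#include <vector>

// Hypothetical helper: given the stride_order recorded for an output, permute a
// tensor whose dimensions follow the allocation order back to logical order.
at::Tensor restoreLogicalOrder(
    const at::Tensor& alloc_ordered,
    const std::vector<int64_t>& stride_order) {
  const int64_t rank = static_cast<int64_t>(stride_order.size());
  // Allocation order lists logical axes from the outermost to the innermost stride.
  std::vector<int64_t> alloc_to_logical(rank);
  for (int64_t logical_axis = 0; logical_axis < rank; ++logical_axis) {
    alloc_to_logical[rank - 1 - stride_order[logical_axis]] = logical_axis;
  }
  // permute(dims) makes output dim i come from input dim dims[i], so invert the map.
  std::vector<int64_t> perm(rank);
  for (int64_t j = 0; j < rank; ++j) {
    perm[alloc_to_logical[j]] = j;
  }
  return alloc_ordered.permute(perm);
}

Inputs would presumably go through the inverse permutation before being handed to FusionExecutorCache.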
cc @naoyam and @jjsjann123
It feels like a lot of work. But if it is what it takes to fix this problem, I'm willing to give that a shot.
Makes sense to me.
SGTM as well.
Regarding point 4 here:

FusionDefinition::execute permutes input/output by stride order if the stride order is specified. This way, we hide this change from Python users, who customize the allocation domain only via stride order. It's unclear to me whether FusionDefinition knows which inputs/outputs are stride-ordered. We may need to collect that from recording_state_. Note: intermediate fd.ops.stride_order calls will remain the same -- they get translated to a set op with an allocation domain attached to the output. If they sit on segment boundaries, they'll be allocated differently (again logical => allocation), but the outside of the fusion definition will see the same behavior.
A heads-up that there might be some obsolete permutation logic in our runtime left over from the TorchScript days. I'll try to remove that.
FusionDefinition::execute permutes input/output by stride order if the stride order is specified.
Nitpicking on the implementation details: should this part be on the C++ side instead? I.e., we still have C++ users/tests where we need to ensure correct semantics of the returned results.
Nitpicking on the implementation details: should this part be on the C++ side instead? I.e., we still have C++ users/tests where we need to ensure correct semantics of the returned results.
I like not having to change C++ users/tests, but I haven't found an easy way to do this all on the C++ side. Where it breaks down is that the allocation domain is the only construct on the C++ side to represent memory format. We don't know whether a particular allocation domain is dictated by stride order, a device split, or anything else in the future.
There may be a fragile way of doing so by pattern-matching the logical=>allocation transforms. If logical=>allocation is a permutation, assume it's done by stride order and convert the allocated output tensor back to logical; otherwise, assume device split and keep it consistent with allocation. Would that be a good intermediate step?
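A rough sketch of that pattern-matching check, assuming nvFuser's TensorView API (getLogicalDomain / getMaybeAllocationDomain); this is not existing code:

#include <algorithm>

// Heuristic: treat logical=>allocation as stride_order only if it is a pure
// permutation, i.e. every allocation IterDomain is literally one of the logical
// IterDomains with no split/merge in between.
bool allocationIsPermutationOfLogical(const TensorView* tv) {
  const auto& logical = tv->getLogicalDomain();
  const auto& alloc = tv->getMaybeAllocationDomain();
  if (logical.size() != alloc.size()) {
    return false; // e.g. a device split changes the number of IterDomains
  }
  return std::is_permutation(alloc.begin(), alloc.end(), logical.begin());
}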
(My braindump)
I think the root cause is that at::Tensor, being used in the FusionExecutorCache::runFusionWithInputs interface, can't always represent both the logical domain and the allocation domain of a TensorView.
Currently, with stride order being the only use case, at::Tensor just barely represents the logical domain in its sizes and the allocation in its strides, which is fine. However, with the new multi-GPU use case mentioned in the OP, the logical=>allocation transforms become more complicated than what strides can represent.
This is not only a problem for multi-GPU. Consider binding an arange(6) to the following TensorView:
logical = [i0{6}]
i1{2}, i2{3} = split(i0{6})
i3{6} = merge(i2{3}, i1{2})
allocation = [i3{6}]
The storage will look like [0,3,1,4,2,5], and the logical view has to remain [0,1,2,3,4,5] to match the semantics of arange. Using at::Tensor, there's no way to set the strides to give us that logical view.
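Spelling out the index math of that example in plain C++ (nothing nvFuser-specific; a and b stand for the split outputs i1 and i2):

#include <cstdio>

int main() {
  // logical index l = 3*a + b after split(i0{6}) -> i1{2}=a, i2{3}=b;
  // allocation index p = 2*b + a after merge(i2{3}, i1{2}) -> i3{6}.
  int storage[6];
  for (int a = 0; a < 2; ++a) {
    for (int b = 0; b < 3; ++b) {
      storage[2 * b + a] = 3 * a + b;
    }
  }
  for (int p = 0; p < 6; ++p) {
    std::printf("%d ", storage[p]); // prints: 0 3 1 4 2 5
  }
  return 0;
}

No single stride on a 1-D at::Tensor can turn that storage back into the logical view [0,1,2,3,4,5], which is the point of the example.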
There are three general directions that we can consider:
Stick with using at::Tensor in the interface and accept its limitations.
QQ: What does "accept its limitations" mean? That we'll just not be able to support loop DID parallelization?
QQ: What does "accept its limitations" mean? That we'll just not be able to support loop DID parallelization?
It means to bind at::Tensor to allocation domain as proposed here and accept the UX limitation.
Thanks for the write-up; it makes sense to me in general, but I will need more time to process all the details. While I'm reading, I have a couple of questions:
1) Currently, at::Tensor already doesn't match the logical domain, since a sharded dimension is of size 1 on the at::Tensor and not on the logical domain. It has to be like that, otherwise we couldn't accept sharded I/O. Do you agree with this, and is this what you have in mind in your item "1"?
2) What do you mean by:
Matching the allocation domain can work but leads to a bad user experience: the user of FusionExecutorCache would have to reconstruct the logical view for the framework
Do you mean we need to apply transforms to retrieve the logical domain back from the ATen shape? With that, I agree.
3) I don't understand what you mean by "rich tensor". Could you make this idea more precise? I don't see how or why we would use something other than at::Tensor.
since a sharded dimension is of size 1 on the at::Tensor and not on the logical domain.
I should have defined "match" clearly. A logical domain (or any tensor domain) is a list of IterDomains, each of which can be parallelized or not. A parallelized IterDomain and a tensor dimension size of 1 are considered to match. The extent of that IterDomain, however, will be bound to the size of the mesh along the corresponding dimension. Under this definition, at::Tensor today matches logical but not allocation.
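In code, the rule could look roughly like this (a simplified, hypothetical sketch; isDeviceDim and extent binding are meant in the usual nvFuser sense, but this is not existing code):

// Bind the extent of one logical IterDomain given the corresponding at::Tensor
// dimension. A DID-parallelized IterDomain matches a dimension of size 1, and
// its extent is bound to the mesh size rather than to the tensor dimension.
void bindLogicalDim(
    ExpressionEvaluator& ee,
    IterDomain* id,
    int64_t tensor_dim_size,
    int64_t mesh_size_along_dim) {
  if (id->isDeviceDim()) {
    // tensor_dim_size is expected to be 1 here (only the local shard is stored).
    ee.bind(id->extent(), mesh_size_along_dim);
  } else {
    ee.bind(id->extent(), tensor_dim_size);
  }
}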
Do you mean we need to apply transforms to retrieve the logical domain back from the ATen shape? With that, I agree.
Yes. @naoyam reminded me today that this is not even always possible with uneven splits. For that reason, I'm now leaning towards always taking the logical tensor (which can, however, be on the meta device) and optionally the allocation tensor. Will run some experiments to verify the ideas...
Yes. @naoyam reminded me today that this is not even always possible with uneven splits. For that reason, I'm now leaning towards always taking the logical tensor (which can, however, be on the meta device) and optionally the allocation tensor. Will run some experiments to verify the ideas...
It may also be possible to work around this by not having logical sizes. One primary use of logical sizes is predicating each expression. However, for pointwise ops, we can just eliminate predicates. Some threads may use data that's out of logical boundaries, but that should be fine as long as it isn't propagated to final outputs. Non-pointwise ops do need more careful processing. For example, sum reductions should be fine as long as the buffer is initialized to zero.
Overall, however, I'm not sure how this WAR would work generally. It may easily hit limitations.
since a sharded dimension is of size 1 on the at::Tensor and not on the logical domain.
I should have defined "match" clearly. A logical domain (or any tensor domain) is a list of IterDomains, each of which can be parallelized or not. A parallelized IterDomain and a tensor dimension size of 1 are considered to match. The extent of that IterDomain, however, will be bound to the size of the mesh along the corresponding dimension.
Ok, I understand now what you mean by "match"; thanks for explaining. However, I am not sure I understand:
Under this definition, at::Tensor today matches logical but not allocation.
Can you point to a concrete example? [Edit] maybe my question boils down to understanding in what case, currently, the logical domain doesn't match the allocation domain.
Do you mean we need to apply transforms to retrieve the logical domain back from the ATen shape? With that, I agree.
Yes. @naoyam reminded me today that this is not even always possible with uneven splits. For that reason, I'm now leaning towards always taking the logical tensor (which can, however, be on the meta device) and optionally the allocation tensor. Will run some experiments to verify the ideas...
I would be in favor of not considering (or even forbidding) indivisible splits for now, for the sake of concreteness. That sounds like a reasonable assumption.
But anyway, it should always be possible to replay the transforms backward even for indivisible splits. ceilDiv is not a one-to-one mapping, so it is not invertible, but that shouldn't matter: we can always choose the "canonical" preimage that makes the division even. Let me illustrate with an example:
TensorView* tv0 = makeContigTensor(1); // of symbolic extent [i1]
fusion.addInput(tv0);
tv0->split(/*axis=*/0, /*factor=*/8); // now of symbolic extents [ceilDiv(i1, 8), 8]
// [...]
at::Tensor concrete_input = at::rand({4, 8});
Then, binding the concrete input to infer i1 is not a one-to-one map, in the sense that all integers in the range [8*3+1, 8*4] are valid choices for i1; however, we can choose the "canonical" value i1=8*4, which avoids indivisible splits.
Of course, this approach wouldn't support wild corner-case scenarios, like involving a symbolic extent in two different indivisible splits, but that sounds like a reasonable restriction.
Is there a problem with this approach? Wdyt?
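The arithmetic of that canonical choice is trivial; a minimal sketch (hypothetical helper, not existing code):

// Any i1 in (factor*(outer-1), factor*outer] satisfies ceilDiv(i1, factor) == outer;
// pick the one that makes the split divisible.
int64_t canonicalPreimage(int64_t outer_extent, int64_t factor) {
  return outer_extent * factor; // for the example above: 4 * 8 = 32
}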
Could you elaborate on this?
however we can choose the "canonical" value i1=8*4 which avoids indivisible splits.
Did you mean we just assume i1 is 32 even when it's actually, for example, 31?
Could you elaborate on this?
however we can choose the "canonical" value i1=8*4 which avoids indivisible splits.
Did you mean we just assume i1 is 32 even when it's actually, for example, 31?
I guess this is what I mean. However, I am not sure what you mean by "it is actually 31". Since the extent is symbolic, how can we say what it "actually is", and what does it mean?
The shape of an input/output at::Tensor matches logical, even when the corresponding TensorView has a non-trivial allocation domain. See https://github.com/NVIDIA/Fuser/blob/2e4ed54ebd09946adef8d38b0580f5f062ffc0ec/csrc/expr_evaluator.cpp#L165 for inputs and https://github.com/NVIDIA/Fuser/blob/85c22a2af3fc756d9b301c93ac29b2bba8abf2af/csrc/runtime/allocations.cpp#L769-L770 for outputs.
This will be problematic for multi-GPU when we start to parallelize the loop domain. Say the input TV has logical=[D*S] and allocation=[DID{D}, S], where D = number of devices and S = local tensor size. We can't allocate the input to have the entire logical shape.

After discussing this with @jjsjann123, the most natural solution is to let FusionExecutorCache expect input tensors to match allocation domains and allocate output tensors according to allocation domains. For example, in the case above, the input tensor will have shape [1, S] because D in the allocation domain is parallelized.

One challenge is to not break the stride_order support needed by Thunder. I think we'll need extra permutations in addition to translating stride order to allocation domain.
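For concreteness, a rough sketch of the sharded-input setup described above (API names approximate; num_devices and the surrounding Fusion are assumed context):

TensorView* in = makeContigTensor(1);                // logical = [i{D*S}]
fusion.addInput(in);
in->split(0, num_devices, /*inner_split=*/false);    // loop = [D, S]
in->axis(0)->parallelize(ParallelType::DIDx);        // loop = [DID{D}, S]
in->setAllocationDomain(in->getLoopDomain(), true);  // allocation = [DID{D}, S]
// With the proposed FusionExecutorCache behavior, the at::Tensor passed for this
// input would have shape [1, S]: only the local shard, with the DID dimension
// collapsed to 1.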