Is this a new feature, an improvement, or a change to existing functionality?
Improvement
How would you describe the priority of this feature request?
Low (would be nice)
Please provide a clear description of the problem you would like to solve.
The problem arises in model-parallel settings where not all ranks hold valid tensors, mainly around the gather and scatter routines.
Scatter
scatter assumes a single tensor on a source rank that is distributed in chunks across the other ranks.
To be able to receive these chunks, however, the receiving ranks need to know the dtype and other meta-information such as `requires_grad`, so as not to break training pipelines.
Current solutions either require the user to specify this information on each rank, or assume empty "dummy" tensors on each rank that carry it; these dummies, however, might not be robust when registered in the compute graphs of autograd frameworks.
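As an illustration of the status quo, here is a minimal sketch of the meta-broadcast workaround; `scatter_with_meta` is a hypothetical helper (not an existing API), and it assumes an initialized process group, a floating-point tensor, and a first dimension divisible by the world size:

```python
import torch
import torch.distributed as dist

def scatter_with_meta(tensor, src=0):
    """Scatter `tensor` from rank `src`, broadcasting the meta-information
    (chunk shape, dtype, requires_grad) that the other ranks need in order
    to allocate their receive buffers. Hypothetical helper, sketch only."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Broadcast the meta-information as a Python object so that the
    # receiving ranks can build matching empty buffers themselves.
    if rank == src:
        chunk_shape = (tensor.shape[0] // world_size, *tensor.shape[1:])
        meta = [(chunk_shape, tensor.dtype, tensor.requires_grad)]
    else:
        meta = [None]
    dist.broadcast_object_list(meta, src=src)
    chunk_shape, dtype, requires_grad = meta[0]

    # Allocate the receive buffer from the broadcast meta-data instead of
    # requiring the user to specify it on every rank.
    out = torch.empty(chunk_shape, dtype=dtype)
    scatter_list = [c.contiguous() for c in tensor.chunk(world_size)] if rank == src else None
    dist.scatter(out, scatter_list, src=src)
    return out.requires_grad_(requires_grad)
```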
Gather
The backward pass of gather is a scatter call, so similar problems arise, although this case can be handled more easily, e.g. by storing the meta-data in the corresponding context of the `torch.autograd.Function` (see the sketch below).
The main issue rather arises in upstream layers when gather returns `None` on all participating ranks except the one holding the result: it would be more informative to have an object carrying the information that this `None` is just the null part of a distributed tensor which is currently valid on rank X.
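A rough sketch of that workaround, with the meta-data stored on `ctx` in the forward pass so the backward scatter can allocate its receive buffers; `_Gather` is hypothetical, equal local shapes across ranks are assumed, and the collective plumbing is reduced to the parts relevant here:

```python
import torch
import torch.distributed as dist

class _Gather(torch.autograd.Function):
    """Gather along dim 0 whose backward is a scatter. Sketch only."""

    @staticmethod
    def forward(ctx, tensor, dst):
        # Store the meta-data the backward scatter will need on the
        # context, so no rank has to re-specify it later.
        ctx.dst = dst
        ctx.local_shape = tensor.shape
        ctx.dtype = tensor.dtype

        world_size = dist.get_world_size()
        if dist.get_rank() == dst:
            chunks = [torch.empty_like(tensor) for _ in range(world_size)]
            dist.gather(tensor.contiguous(), chunks, dst=dst)
            return torch.cat(chunks)
        dist.gather(tensor.contiguous(), None, dst=dst)
        # Status quo criticized above: the other ranks get nothing useful
        # back (here an empty tensor so autograd can still reach backward),
        # with no hint that it is the null part of a distributed tensor.
        return tensor.new_empty(0)

    @staticmethod
    def backward(ctx, grad_output):
        # The backward of gather is a scatter; every rank can allocate its
        # receive buffer from the meta-data stored in forward.
        grad_input = torch.empty(ctx.local_shape, dtype=ctx.dtype)
        if dist.get_rank() == ctx.dst:
            parts = [c.contiguous() for c in grad_output.chunk(dist.get_world_size())]
            dist.scatter(grad_input, parts, src=ctx.dst)
        else:
            dist.scatter(grad_input, None, src=ctx.dst)
        return grad_input, None
```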
Potential Solution
In general, we should make this behaviour more consistent throughout.
A potential solution would be to define something like a `TensorPlaceholder`, which carries the meta-data on ranks where the tensor is currently not valid and is therefore more informative than a plain `None`.
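A minimal sketch of what such a placeholder could look like; the class name, fields, and `materialize` method are assumptions derived from the description above, not an existing API:

```python
from dataclasses import dataclass
import torch

@dataclass
class TensorPlaceholder:
    """Stand-in returned on ranks where a distributed tensor is not valid.
    Unlike a bare None, it tells upstream layers what the tensor looks
    like and on which rank it currently resides. Sketch only."""
    shape: torch.Size
    dtype: torch.dtype
    requires_grad: bool
    valid_rank: int  # rank on which the tensor is currently valid

    def materialize(self) -> torch.Tensor:
        # Build an empty buffer suitable as a receive buffer for a later
        # scatter/broadcast of the real data.
        return torch.empty(self.shape, dtype=self.dtype,
                           requires_grad=self.requires_grad)
```

gather could then return a `TensorPlaceholder` instead of `None` on the ranks that do not hold the result, and upstream layers could either treat it as the null part of the distributed tensor or call `materialize()` to obtain a matching receive buffer.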
Describe any alternatives you have considered
No response