Open naoyam opened 9 months ago
I think it probably makes sense to have the allocation size logic in one place, for all the memory types. As I understand it, the computed sizes will not be constant, is that correct? That is, we'll be computing scalar `Val*`s during lowering that the executor (or segmented fusion?) will be evaluating to allocate the actual memory. I think this differs from the current executor code, which computes concrete sizes for global buffers.
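A minimal sketch of that flow, using hypothetical stand-ins rather than the actual nvFuser `Val`/evaluator API: lowering builds a symbolic size expression, and the executor binds concrete input extents and evaluates it just before allocating.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for nvFuser's symbolic scalars: a size
// expression is a closure over named extents, built at lowering time
// and evaluated by the executor once concrete inputs are known.
using ExtentMap = std::unordered_map<std::string, int64_t>;
using SizeVal = std::function<int64_t(const ExtentMap&)>;

// Lowering-time: build "extent(dim0) * extent(dim1) * sizeof(element)"
// symbolically, without knowing the concrete extents yet.
SizeVal makeAllocSize(std::string dim0, std::string dim1, int64_t elemBytes) {
  return [=](const ExtentMap& extents) {
    return extents.at(dim0) * extents.at(dim1) * elemBytes;
  };
}

// Executor-time: bind concrete extents from the fusion inputs and evaluate.
int64_t evaluateAllocSize(const SizeVal& size, const ExtentMap& concreteExtents) {
  return size(concreteExtents);
}
```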
I think @wujingyue had also considered moving allocation out of the executor in order to handle aliasing at the complete fusion level, but I'm not sure what the verdict was in that case.
Thanks Naoya for creating this issue. Let me add a specific request from the multidevice side: we should have a standalone function to allocate intermediate tensors given the concrete shape of the Fusion's inputs.
> As I understand it, the computed sizes will not be constant, is that correct? That is, we'll be computing scalar `Val*`s during lowering that the executor (or segmented fusion?) will be evaluating to allocate the actual memory. I think this differs from the current executor code, which computes concrete sizes for global buffers.
Correct.
Minor point: since we map rfactor to root at segmentation edges and rewrite the extents there, doing the allocation size computation in lowering means we will have scalars referring to the new extents. So the segmenter will need to either track the remapped extents in order to substitute them back into the computed sizes, or evaluate the size scalars sequentially in execution order.
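One way to picture the substitution option (illustrative only; the real segmenter works on `Val*` graphs, not strings): the segmenter records the extent remapping it performed at the segmentation edge and rewrites the size expression's leaf extents before evaluating.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch: the leaf extents of a size expression, by name,
// and the remapping the segmenter records when it rewrites extents at
// a segmentation edge (rfactor mapped back to root).
std::vector<std::string> substituteExtents(
    const std::vector<std::string>& leaves,
    const std::unordered_map<std::string, std::string>& remap) {
  std::vector<std::string> out;
  out.reserve(leaves.size());
  for (const auto& e : leaves) {
    auto it = remap.find(e);
    // Substitute remapped extents; leave unmapped leaves untouched.
    out.push_back(it == remap.end() ? e : it->second);
  }
  return out;
}
```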
> I think @wujingyue had also considered moving allocation out of the executor in order to handle aliasing at the complete fusion level, but I'm not sure what the verdict was in that case.
Thanks for tagging me, @jacobhinkle. What #1502 option 2 needs is a way to decide whether to allocate a segment I/O (or compute it as an alias), and the order between allocating segment I/Os and executing these segments. I believe this requires our memory allocation to see the whole fusion rather than one segment at a time.

`csrc/device_lower/pass/allocation.cpp` sounds like a good potential platform for the above need. At least, it's a pass that can potentially reason about the whole fusion. The current API seems to be pretty tailored to intermediate tensors within a kernel IR (and therefore within a segment). In order to support #1502 option 2, I think it'll need quite an overhaul, e.g., to be made aware of segments.
> At least, it's a pass that can potentially reason about the whole fusion.
The `insertAllocations` pass is done at lowering, so it only sees one segment at a time. However, under this proposal it will be inspecting global allocations and recording them somewhere like the kernel summary, as `Val`s. That could then be used to build an allocation/aliasing mechanism at the `FusionKernelRuntime`/`SegmentedFusion` level that allocates outputs, since it is able to see all the executors/kernels and knows the execution order of the segments.
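A sketch of how that runtime-level mechanism could look (hypothetical structures; the real `FusionKernelRuntime`/`SegmentedFusion` interfaces differ): walk segments in execution order, and for each output either reuse an existing buffer as an alias or allocate from the recorded size.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-segment summary recorded by the allocation pass.
struct SegmentAlloc {
  std::string outputName;
  int64_t sizeBytes;    // in reality an evaluated symbolic Val
  std::string aliasOf;  // empty if this output is not an alias candidate
};

// Runtime-level: allocate segment outputs in execution order, reusing
// an already-allocated buffer when an output aliases it.
std::unordered_map<std::string, int64_t> planAllocations(
    const std::vector<SegmentAlloc>& segmentsInExecOrder) {
  std::unordered_map<std::string, int64_t> allocated;  // name -> bytes
  for (const auto& s : segmentsInExecOrder) {
    if (!s.aliasOf.empty() && allocated.count(s.aliasOf)) {
      continue;  // alias: no new allocation, reuse aliasOf's buffer
    }
    allocated[s.outputName] = s.sizeBytes;
  }
  return allocated;
}
```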
@jacobhinkle That makes sense. You are proposing to let `insertAllocations` collect enough local information so the upper level (e.g. `FusionKernelRuntime` and `SegmentedFusion`) can decide how to allocate for the whole fusion. That should work.
We will need to analyze the parallelization types of each IterDomain and the DeviceMesh, since the mesh determines which devices a tensor is sharded over. In the extreme case, a tensor will not be allocated on a device at all; pipeline parallelism, for example, will require this behavior.
> In the extreme case, a tensor will not be allocated on a device
Can you clarify when this can happen? I thought each device in the mesh allocates a slice of a multi-device tensor.
This happens when the TensorView's device mesh has fewer devices than the total number of devices. For example:

```cpp
TensorView* tv0 = ...;
tv0->setDeviceMesh({0});
TensorView* tv1 = ...;
tv1->setDeviceMesh({0, 1});
```

Here `tv0` is only allocated on device 0, while `tv1` is allocated on devices 0 and 1.
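Spelling that out with a toy mesh check (a hypothetical helper, not the actual `DeviceMesh` API): a device allocates a tensor only if it appears in that tensor's mesh, so in the example above device 1 allocates nothing for `tv0`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for DeviceMesh membership: bytes this device must
// allocate for a tensor sharded evenly over its mesh. A device outside
// the mesh allocates nothing at all.
int64_t localAllocBytes(
    const std::vector<int>& mesh, int deviceId, int64_t totalBytes) {
  if (std::find(mesh.begin(), mesh.end(), deviceId) == mesh.end()) {
    return 0;  // e.g. tv0 on device 1: device 1 is not in mesh {0}
  }
  return totalBytes / static_cast<int64_t>(mesh.size());
}
```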
Ah yes, for sure.
There are currently two places that calculate allocation sizes: the allocation lowering pass and the executor. Global allocation is mostly done by the executor (except for tensors with halo), which has been fine as we always allocate global memory tensors as a whole; however, that will no longer be the case with distributed-memory tensors.
I think that the allocation pass should be the one to calculate allocation sizes, since we would need to analyze the memory and parallelization types of each IterDomain even for global memory tensors.
Pinging: @samnordmann, @cowanmeg, @jacobhinkle, @zasdfgbnm