Open liqiangxl opened 1 week ago
> they can be re-calculated in segment-2 instead of being written out in segment-1 and read back in segment-2.
When you say "re-calculate", are you proposing to compute the three tensors for segment-1 as well as segment-2? If yes, it would be something like rematerialization that trades more compute for less I/O. Otherwise, it indeed sounds like a where-to-put-boundary-in-the-DAG kind of problem that's in the domain of segmentation.
It belongs to rematerialization.
I see -- that's a harder problem.
My gut feeling says we would want a segmentation-aware rematerialization pass. One possible implementation is to run segmentation first and then rematerialize only TensorViews across segments (because we know data copy via global mem is expensive). This reminds me of the existing rematerialization pass in Thunder. Any lessons we can learn from there? @IvanYashchuk and @jjsjann123
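To make the idea concrete, here is a minimal toy sketch in Python of such a post-segmentation rematerialization pass. The `Expr`/`Segment` representation and the `rematerialize` helper are made-up illustrations, not nvFuser's actual IR or API: after segmentation, each inter-segment TensorView produced by a pure pointwise chain whose inputs are already visible in the consumer segment gets its chain duplicated into the consumer, so the tensor no longer crosses the boundary through global memory.

```python
# Hypothetical sketch (not nvFuser's API): rematerialize cheap inter-segment
# tensors into their consumer segment after segmentation, trading a little
# recompute for less global-memory traffic.
from dataclasses import dataclass, field

@dataclass
class Expr:
    op: str          # e.g. "pointwise", "reduction"
    inputs: list     # names of producer tensors
    output: str      # name of the produced tensor

@dataclass
class Segment:
    exprs: list = field(default_factory=list)
    available: set = field(default_factory=set)  # inputs visible to this segment

def rematerialize(producer: Segment, consumer: Segment, inter: set) -> set:
    """Copy pointwise chains producing `inter` tensors into `consumer` when
    all their leaf inputs are already available there. Returns the tensors
    that no longer need to cross the segment boundary."""
    by_output = {e.output: e for e in producer.exprs}
    removed = set()
    for tv in inter:
        chain, ok, stack, seen = [], True, [tv], set()
        while stack:
            t = stack.pop()
            if t in consumer.available or t in seen:
                continue
            e = by_output.get(t)
            if e is None or e.op != "pointwise":
                ok = False       # not a cheap pointwise chain: keep the I/O
                break
            seen.add(t)
            chain.append(e)
            stack.extend(e.inputs)
        if ok:
            # prepend in (roughly) topological order; a real pass would
            # re-sort and re-check scheduler legality here
            consumer.exprs = chain[::-1] + consumer.exprs
            removed.add(tv)
    return removed
```

For example, a tensor produced pointwise from a fusion input would be rematerialized, while a reduction output would still be passed through global memory.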
Is this a case where there isn’t a better min-cut for the segmentation, so we need to rematerialize? Or is it that rematerializing a tensor seems to be the most straightforward implementation? If it’s the former, I would imagine a post-segmentation pass that tries to move tensors from one group to another to find a better min-cut (as long as it doesn’t change the heuristic choice) may be an effective strategy to start looking at memory-cost-aware segmentation.
Rematerialization could of course be another option (I’m not doubting that); I’m just wondering what that algorithm would look like. This is something that could be particularly valuable if we have a full forward-backward graph.
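The move-tensors-between-groups strategy can be pictured with a toy model (my own sketch, not nvFuser code): greedily re-assign movable nodes between two groups whenever that shrinks the total bytes crossing the cut, while nodes pinned by the heuristic choice (e.g. the reductions that define each segment) stay put.

```python
# Toy sketch of a post-segmentation, memory-cost-aware cut improvement pass.
# `edges` is a list of (tensor, producer_node, consumer_node); `group` maps
# node -> 0 or 1; `size` maps tensor -> bytes.

def cut_bytes(edges, size, group):
    """Total bytes of tensors whose producer and consumer land in
    different groups (i.e. tensors that must go through global memory)."""
    crossing = {t for (t, src, dst) in edges if group[src] != group[dst]}
    return sum(size[t] for t in crossing)

def improve_cut(edges, size, group, movable):
    """One greedy sweep over movable nodes; a real pass would also
    re-check scheduler legality (heuristic choice) for each move."""
    best = cut_bytes(edges, size, group)
    for n in movable:
        group[n] ^= 1                      # tentatively flip n's group (0 <-> 1)
        cand = cut_bytes(edges, size, group)
        if cand < best:
            best = cand                    # keep the move
        else:
            group[n] ^= 1                  # revert
    return best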
> This reminds me of the existing rematerialization pass in Thunder. Any lessons we can learn from there?
I'll let @IvanYashchuk cover that question :)
As @csarofeen pointed out, we could “find a better min-cut (as long as it doesn’t change the heuristic choice)”.
nvFuser segments need to comply with what the schedulers can handle, so there's another constraint on top of the min-cut.
> Is this a case where there isn’t a better min-cut for the segmentation, so we need to rematerialize? Or is it that rematerializing a tensor seems to be the most straightforward implementation? If it’s the former, I would imagine a post-segmentation pass that tries to move tensors from one group to another to find a better min-cut (as long as it doesn’t change the heuristic choice) may be an effective strategy to start looking at memory-cost-aware segmentation.
In this case, there doesn't seem to be a better min-cut reachable by moving tensors from one group to another. All these inter-segment tensors (12, 14, and 15) are needed to calculate the reductions in each segment. A simplified version of the fusion is as follows:
From https://github.com/NVIDIA/Fuser/issues/2473#issuecomment-2197631460:
> This reminds me of the existing rematerialization pass in Thunder. Any lessons we can learn from there?
Rematerialization in Thunder does the min-cut on the producer-consumer graph with the restriction that no node is allowed to be moved from consumer to producer. Currently, Thunder doesn't use any shape information on the tensors because I initially thought symbolic shapes would be more important in Thunder. There are only two preferences encoded as weights:
> My gut feeling says we would want a segmentation-aware rematerialization pass.
However, if you decide to use min-cut for this, note that the resulting "multiway cut" is an NP-hard problem. In Thunder, we go through each producer-consumer pair sequentially, in the order the producers appear in the trace, and each min-cut computation sees the updated producers and consumers.
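To make the restriction concrete, here is a hedged toy illustration (brute force over partitions, exponential, usable only for tiny graphs, and not Thunder's actual implementation): edge weights model tensor bytes, and an infinite-weight edge from a node to the sink pins that node to the consumer side, encoding "no node is allowed to be moved from consumer to producer".

```python
# Toy s-t min-cut by enumerating source-side subsets; illustrative only.
from itertools import combinations

INF = float("inf")

def min_cut(nodes, edges, s, t):
    """edges: {(u, v): weight}. Returns (cut_value, producer_side_set)."""
    others = [n for n in nodes if n not in (s, t)]
    best, best_side = INF, {s}
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            side = {s, *extra}
            # cost of this partition = weights of edges leaving the s-side
            val = sum(w for (u, v), w in edges.items()
                      if u in side and v not in side)
            if val < best:
                best, best_side = val, side
    return best, best_side
```

Without the restriction, the cheapest cut may place a node on the producer side; pinning it with an `INF` edge to the sink forces the cut to pay the (larger) boundary cost instead.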
For Thunder, it would be useful if there was an ability to query nvFuser's FusionDefinition object for current segmentation boundaries and what intermediates would result in global tensors. Ideally, it should be possible without seeing real strided Tensor inputs. This information could be used in Thunder's rematerialization and memory usage estimation.
Motivation: it seems we need an IO-aware segmenter to reduce the number of tensors passed between different segments. A real example is from #2146, where the fusion is segmented into 2 kernels with 3 inter-segment tensors.
These tensors can be calculated pointwise from other inputs in segment-2. In other words, they can be re-calculated in segment-2 instead of being written out in segment-1 and read back in segment-2.
Potential fix: instead of greedily merging as many exprs as possible, we may also check the influence on IO bytes. In other words, we may change the target from minimizing the number of segments to minimizing total IO bytes.
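A minimal sketch of that objective, under an assumed (not nvFuser's) cost model: passing a tensor between segments costs one global-memory write plus one read, while rematerializing it costs re-reading whatever extra inputs segment-2 doesn't already load.

```python
# Hedged cost-model sketch: pass a tensor through global memory, or
# re-calculate it in the consumer segment?

def io_bytes_if_passed(tensor_bytes):
    # written out by segment-1, then read back by segment-2
    return 2 * tensor_bytes

def io_bytes_if_rematerialized(extra_input_bytes):
    # segment-2 re-reads the pointwise-chain inputs it doesn't already load;
    # inputs segment-2 already reads for other reasons should cost ~0 here
    return sum(extra_input_bytes)

def should_rematerialize(tensor_bytes, extra_input_bytes):
    return io_bytes_if_rematerialized(extra_input_bytes) < io_bytes_if_passed(tensor_bytes)
```

So a large tensor computed pointwise from a small input is worth recomputing, while a small tensor derived from large inputs is cheaper to pass through.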
Example from #2146
Reproduce:
```shell
NVFUSER_DUMP=segmented_fusion python v0_2146.py 2>&1 | tee 1.log
```