jjsjann123 opened 5 months ago
I don’t really understand this issue. Could you please explain in more detail. I don’t think it’s good practice to reference a comment in an issue. Please summarize the problem, and include examples of current behavior and what behavior we’d like to see.
Sorry about the vague issue in the first place. I've attempted a rewrite; hopefully it's more concrete this time :crossed_fingers:.
Sounds like what we are seeing here is that the allocation order pass, which is a preseg pass, does something, but that may need to be changed by a scheduler because it's not optimal. I thought preseg passes were mostly just for straightforward optimization that should always be applied, but looks like it's getting more complex than that. Am I understanding correctly?
Am I understanding correctly?
Yeah you got it right.
I thought preseg passes were mostly just for straightforward optimization that should always be applied, but looks like it's getting more complex than that
That could be true. But then we just don't have a good place for the allocation order pass. Arguably, the allocation order pass should have done a better job and not produced a non-optimal allocation domain, but I don't know how realistic that expectation is.
I'm more interested in what would happen if this inference is done entirely by the schedulers. Presumably, that would solve the issue, right? Why should it be done as a preseg pass?
Re: "this inference is done entirely by the schedulers". We had discussed that possibility before and we agreed that's something worth exploring. Going in that direction, I think we would still want something similar to the change proposed in this issue.
Scheduler-run allocation domain inference is going to need work done at segmentation, i.e. if a scheduler takes a segment and decides to change the allocation domain on boundary tensors (be it input or output), that change needs to be propagated and communicated to the corresponding consumer (or producer) segments. As in the existing problem, we'll need to be able to tell the difference between a scheduler-imposed allocation domain and a computation-defined allocation domain. Then, when different schedulers make different decisions about an intermediate tensor, we will know which TensorView's allocation domain can be altered.
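To make the distinction concrete, here is a small sketch (all class, enum, and function names are hypothetical, not actual nvFuser code) of tagging an allocation domain with its provenance so that segment boundaries can be reconciled:

```python
# Illustrative sketch: tag each allocation domain with its provenance so
# schedulers can tell an optimization hint apart from a hard requirement
# when segments disagree on a boundary tensor.
from dataclasses import dataclass
from enum import Enum, auto

class Provenance(Enum):
    COMPUTATION_DEFINED = auto()  # user-specified layout; must be respected
    SCHEDULER_IMPOSED = auto()    # an optimization choice; may be overridden

@dataclass
class AllocationDomain:
    order: tuple  # e.g. ("n", "b", "m")
    provenance: Provenance

def reconcile(producer: AllocationDomain, consumer: AllocationDomain) -> AllocationDomain:
    """Pick the allocation domain for a tensor on a segment boundary.
    A computation-defined layout always wins; between two scheduler
    hints we arbitrarily keep the producer's choice."""
    if consumer.provenance is Provenance.COMPUTATION_DEFINED:
        return consumer
    return producer  # producer requirement, or both are hints: either is legal

hint = AllocationDomain(("n", "b", "m"), Provenance.SCHEDULER_IMPOSED)
required = AllocationDomain(("b", "m", "n"), Provenance.COMPUTATION_DEFINED)
assert reconcile(hint, required).order == ("b", "m", "n")
```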
Restarting the conversation since this issue is scheduled for Q4.
I will draft a design doc describing the work needed for separating the `allocation_domain` inference from the user-specified stride order. Based on @jjsjann123's recommendation above, broadly:

- `fd.add_output(tv, stride_order)` will additionally be used to update the `stride_order_` map stored in the `Fusion` object (https://github.com/NVIDIA/Fuser/blob/b064b18e489ac019414975e00c90cb2592f3f91a/csrc/python_frontend/fusion_record.h#L1506).
- `ExprEvalScheduler` will only adhere to the `stride_order` map (currently `AllocationDomainPass` does not generate an optimal layout for `MatmulOp`/`LinearOp`, see comment) and will, at this point, ignore any inferred `allocation_domain` in cases where no layout was specified by the user.
- `AllocationDomainPass` should largely stay unmodified.
- For all other schedulers, we can continue querying `getMaybeAllocationDomain`.

This distinction will allow us to compare the `allocation_domain` set for a TensorView with the `stride_order` map in cases where a scheduler may need to reject the propagated allocation domain (currently `ExprEvalScheduler`).
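A minimal Python sketch of this split, assuming a hypothetical `Fusion`-like container (the real `stride_order_` map and `getMaybeAllocationDomain` live in the C++ codebase; all names here are illustrative):

```python
# Illustrative sketch only: mimics the proposed stride_order_ map vs.
# inferred allocation domain split; not the actual nvFuser API.
class Fusion:
    def __init__(self):
        self.stride_order = {}    # output -> user-specified order (hard requirement)
        self.inferred_alloc = {}  # tensor -> layout inferred by AllocationDomainPass

    def add_output(self, tv, stride_order=None):
        # mirrors fd.add_output(tv, stride_order): a user-specified stride
        # order is recorded as ground truth in the stride_order map
        if stride_order is not None:
            self.stride_order[tv] = tuple(stride_order)

    def maybe_allocation_domain(self, tv):
        # what getMaybeAllocationDomain would answer: the user requirement
        # when present, otherwise the inferred (optimization-hint) layout
        return self.stride_order.get(tv, self.inferred_alloc.get(tv))

def expr_eval_layout(fusion, tv, default):
    """ExprEvalScheduler under the proposal: honor only a user-specified
    stride order; a layout that was merely inferred is disregarded."""
    return fusion.stride_order.get(tv, default)

fusion = Fusion()
fusion.inferred_alloc["out"] = ("n", "b", "m")  # hint from the preseg pass
# no user layout: the hint is ignored and the default layout is kept
assert expr_eval_layout(fusion, "out", ("b", "m", "n")) == ("b", "m", "n")
fusion.add_output("out", stride_order=(0, 1, 2))
# user layout present: it is the ground truth for every scheduler
assert expr_eval_layout(fusion, "out", ("b", "m", "n")) == (0, 1, 2)
```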
Please let me know if this sounds reasonable/further discussion is needed on alternate approaches.
CC: @kevinstephano @naoyam
`ExprEvalScheduler` will only adhere to the `stride_order` map (currently `AllocationDomainPass` does not generate an optimal layout for `MatmulOp`/`LinearOp`, see https://github.com/NVIDIA/Fuser/issues/2354#issuecomment-2164165853) and ignore the `allocation_domain` if it differs from the `stride_order` map entry.
Given that `stride_order` is the strict requirement, why does `AllocationDomainPass` generate a different order?
I edited the original comment to be clearer.
You are correct: if `stride_order` is set for an output, that is the allocation domain to be adhered to.
For `ExprEvalScheduler`, we will only adhere to the layout specified by `stride_order_` when present. If no stride order was specified by the user but an allocation domain was inferred (currently disabled by https://github.com/NVIDIA/Fuser/blob/2424ad6a79b748df0f00d087809299dcbc22c50e/csrc/preseg_passes/allocation_order_inference.cpp#L347-L357), it will be disregarded.
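As a tiny sketch of that precedence (hypothetical helper, not the actual `ExprEvalScheduler` code):

```python
def output_layout(stride_order, inferred):
    """Layout ExprEvalScheduler would honor for an output: the
    user-specified stride order when present; an allocation domain
    that was merely inferred is disregarded."""
    if stride_order is not None:
        return stride_order  # hard requirement, always adhered to
    return None              # fall back to the default (non-permuted) layout

# user asked for an explicit order: adhered to
assert output_layout((0, 2, 3, 1), None) == (0, 2, 3, 1)
# only an inferred layout exists: dropped, default layout used
assert output_layout(None, ("n", "b", "m")) is None
```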
proposal
`AllocationDomainPass` modifies the allocation domain in the fusion IR directly to reflect the optimized stride order for outputs, which makes the allocation domain a requirement of the program that schedulers have to follow.
It's important to distinguish which layouts are requirements of the user program that we must respect (i.e. a computation definition that explicitly asks for an output in a certain memory layout) from what is an optimization by the codegen system, like the allocation order inference, which we could revert in a later phase.
The proposal here is to treat any `AllocationDomain` on output TVs as just an optimization hint. For computation-defined requirements, we put a separate stride order map in `Fusion`, similar to how we specify output aliases. Later transformations would just need to ensure that an allocation domain mutated on outputs remains consistent across the segmented fusion and does not violate the ground truth in `output_stride_order_`.
issue
One issue we hit was that the pass tries to modify the output stride order for `LinearOp`, which is not yet supported by `ExprEvaluator`. More importantly, the allocation domain inference result isn't optimal either, since we naively map the order of IterDomains in the allocation domain of a reference tensor.
For linear, we have an input with shape `b, m, k` and a weight with shape `n, k`, producing an output with shape `b, m, n`. The existing logic in the pass picks the higher-rank input's `b, m, k` as the reference, tries to order the mapped IDs on the output in the same order (giving `b, m`), and pushes unmapped IDs to the outside (giving `n, b, m`). This means the pass will try to produce a permuted output given non-permuted inputs for linear, and this might not be the optimal decision.
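The naive mapping described above can be sketched in a few lines of Python (illustrative only; the real pass operates on IterDomains in the fusion IR, see `csrc/preseg_passes/allocation_order_inference.cpp`):

```python
def infer_allocation_order(reference, output):
    """Order the output's IDs to match the reference where they map,
    and push the unmapped IDs to the outside (front)."""
    mapped = [d for d in reference if d in output]    # keeps reference order
    unmapped = [d for d in output if d not in reference]
    return unmapped + mapped

# linear: input (b, m, k) x weight (n, k) -> output (b, m, n)
ref = ["b", "m", "k"]  # higher-rank input chosen as the reference
out = ["b", "m", "n"]
assert infer_allocation_order(ref, out) == ["n", "b", "m"]  # permuted output
```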
The problem is that once we have modified the allocation domain on the output, it's impossible for the scheduler to tell whether it is just an optimization hint or a user requirement, so we cannot undo the change later.
The issue originated from the comment.
more context
We should remove all the old permutation support code in Fusion: https://github.com/NVIDIA/Fuser/blob/bad998ae277ffc2f43fdc28dca07d01d737a1623/csrc/fusion.h#L495-L499, since that is obsolete TorchScript-era code.