NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

`allocation order inference` should be a hint rather than a requirement. #2425

Open jjsjann123 opened 5 months ago

jjsjann123 commented 5 months ago

proposal

AllocationDomainPass directly modifies the allocation domain in the fusion IR to reflect the optimized stride order for outputs. This turns the allocation domain into a requirement of the program that schedulers have to follow.

It's important to distinguish layouts that are requirements of the user program and must be respected (i.e. a computation definition that explicitly asks for an output in a certain memory layout) from layouts that are optimizations made by the codegen system, like allocation order inference, which we could revert in a later phase.

The proposal here is to treat any allocation domain on output TVs as just an optimization hint. For computation-defined requirements, we keep a separate stride order map in Fusion, similar to how we specify output aliases.

  // Query the computation-defined stride order for an output, if any was set.
  std::optional<std::vector<int64_t>> getOutputStrideOrder(const TensorView* tv) const;
  // Record a computation-defined stride order for an output.
  void setOutputStrideOrder(const TensorView* tv, const std::vector<int64_t>& stride_order);

  // Ground truth: stride orders explicitly requested by the user program.
  std::unordered_map<const TensorView*, std::vector<int64_t>> output_stride_order_;

Later transformations would then only need to ensure that any allocation domain mutated on outputs remains consistent across the segmented fusion and does not violate the ground truth in output_stride_order_.
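
A minimal standalone sketch of the bookkeeping this implies (TensorViews are modeled as plain names; the class and method names here are illustrative, not existing nvFuser APIs):

  #include <cstdint>
  #include <optional>
  #include <string>
  #include <unordered_map>
  #include <utility>
  #include <vector>

  // Only layouts recorded here are hard requirements; anything a pass writes
  // into the allocation domain remains a revocable hint.
  class OutputStrideOrders {
   public:
    // Record a computation-defined stride order for an output.
    void set(const std::string& tv, std::vector<int64_t> order) {
      orders_[tv] = std::move(order);
    }

    // Query the computation-defined stride order, if any.
    std::optional<std::vector<int64_t>> get(const std::string& tv) const {
      auto it = orders_.find(tv);
      if (it == orders_.end()) {
        return std::nullopt;
      }
      return it->second;
    }

    // A pass or scheduler may install `proposed` as the allocation order only
    // when the user recorded no requirement, or when it matches the requirement.
    bool mayOverride(const std::string& tv,
                     const std::vector<int64_t>& proposed) const {
      auto required = get(tv);
      return !required.has_value() || *required == proposed;
    }

   private:
    std::unordered_map<std::string, std::vector<int64_t>> orders_;
  };

In this sketch, a pass would consult mayOverride before rewriting an output's allocation domain, and a scheduler could freely discard any inferred layout that is not recorded in the map.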

issue

One issue we hit is that the pass tries to modify the output stride order for Linear, which is not yet supported (by ExprEvaluator). More importantly, the allocation domain inference result isn't optimal either, since we naively map the order of IterDomains in the allocation domain of a reference tensor.

For linear, we have an input with shape [b, m, k] and a weight with shape [n, k], producing an output with shape [b, m, n]. The existing logic in the pass picks the higher-rank input's [b, m, k] as the reference, orders the mapped IDs on the output in the same order (giving b, m), and pushes unmapped IDs to the outside (giving n, b, m).

This means the pass will try to produce a permuted output for linear even when the inputs are not permuted, which might not be the optimal decision.
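
To make the heuristic concrete, here is a rough standalone model of the mapping described above (IterDomains modeled as plain strings; this is not the actual pass code, just an illustration of the ordering rule):

  #include <algorithm>
  #include <iostream>
  #include <string>
  #include <vector>

  // Keep output IDs that also appear in the reference, ordered as in the
  // reference, and push unmapped IDs to the outside.
  std::vector<std::string> inferAllocationOrder(
      const std::vector<std::string>& reference,
      const std::vector<std::string>& output) {
    std::vector<std::string> mapped;
    std::vector<std::string> unmapped;
    for (const auto& id : output) {
      const bool in_ref =
          std::find(reference.begin(), reference.end(), id) != reference.end();
      (in_ref ? mapped : unmapped).push_back(id);
    }
    // Order the mapped IDs by their position in the reference.
    std::sort(mapped.begin(), mapped.end(), [&](const auto& a, const auto& b) {
      return std::find(reference.begin(), reference.end(), a) <
             std::find(reference.begin(), reference.end(), b);
    });
    // Unmapped IDs go outermost, followed by the mapped IDs.
    std::vector<std::string> order = unmapped;
    order.insert(order.end(), mapped.begin(), mapped.end());
    return order;
  }

  int main() {
    // Linear: input {b, m, k}, weight {n, k}, output {b, m, n}.
    for (const auto& id : inferAllocationOrder({"b", "m", "k"}, {"b", "m", "n"})) {
      std::cout << id << " ";  // prints: n b m
    }
    std::cout << "\n";
    return 0;
  }

For the linear example this produces n, b, m, i.e. a permuted output layout even though both inputs are contiguous.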

The problem is that once we have modified the allocation domain on the output, the scheduler can no longer tell whether it is just an optimization hint or a user requirement, so we cannot undo the change later.

The issue originated from the comment.

more context

We should remove all the old permutation support code in Fusion (https://github.com/NVIDIA/Fuser/blob/bad998ae277ffc2f43fdc28dca07d01d737a1623/csrc/fusion.h#L495-L499), since it is obsolete TorchScript-era code.

csarofeen commented 4 months ago

I don’t really understand this issue. Could you please explain in more detail? I don’t think it’s good practice to reference a comment in an issue. Please summarize the problem, and include examples of current behavior and what behavior we’d like to see.

jjsjann123 commented 4 months ago

> I don’t really understand this issue. Could you please explain in more detail? I don’t think it’s good practice to reference a comment in an issue. Please summarize the problem, and include examples of current behavior and what behavior we’d like to see.

Sorry about the vague issue in the first place. I attempted a rewrite; hopefully it reads more concretely this time :crossed_fingers: .

naoyam commented 4 months ago

Sounds like what we are seeing here is that the allocation order pass, which is a preseg pass, does something that may need to be changed by a scheduler because it's not optimal. I thought preseg passes were mostly just for straightforward optimizations that should always be applied, but it looks like it's getting more complex than that. Am I understanding correctly?

jjsjann123 commented 4 months ago

> Am I understanding correctly?

Yeah you got it right.

> I thought preseg passes were mostly just for straightforward optimizations that should always be applied, but it looks like it's getting more complex than that.

That could be true. But then we just don't have a good place for the allocation order pass. Arguably the allocation order pass should have done a better job and not produced a non-optimal allocation domain, but I don't know how realistic that expectation is.

naoyam commented 4 months ago

I'm more interested in what would happen if this inference is done entirely by the schedulers. Presumably, that would solve the issue, right? Why should it be done as a preseg pass?

jjsjann123 commented 4 months ago

> I'm more interested in what would happen if this inference is done entirely by the schedulers. Presumably, that would solve the issue, right? Why should it be done as a preseg pass?

Re: doing this inference entirely in the schedulers: we had discussed that possibility before and agreed it's worth exploring. Going in that direction, I think we would still want something similar to the change proposed in this issue.

Scheduler-run allocation domain inference will require work at segmentation: if a scheduler takes a segment and decides to change the allocation domain on a boundary tensor (be it an input or an output), that change needs to be propagated and communicated to the corresponding consumer (or producer) segments. As in the existing problem, we'll need to be able to tell a scheduler-imposed allocation domain apart from a computation-defined one, so that when different schedulers make different decisions about an intermediate tensor, we know which TensorView's allocation domain can be altered.
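
A toy sketch of the per-boundary-tensor bookkeeping this would need (the names are made up for illustration, not existing code):

  #include <cstdint>
  #include <optional>
  #include <vector>

  // A computation-defined order that can never change, plus a scheduler-chosen
  // order that may be revised but must then be re-propagated to the segment on
  // the other side of the boundary.
  struct BoundaryLayout {
    std::optional<std::vector<int64_t>> computation_defined;  // ground truth
    std::optional<std::vector<int64_t>> scheduler_chosen;     // negotiable hint
  };

  // Returns true if `segment_choice` can be adopted for the boundary tensor,
  // recording it so the producer/consumer segment sees the same layout.
  bool tryAdoptLayout(
      BoundaryLayout& layout, const std::vector<int64_t>& segment_choice) {
    if (layout.computation_defined.has_value() &&
        *layout.computation_defined != segment_choice) {
      return false;  // computation-defined layouts always win
    }
    layout.scheduler_chosen = segment_choice;
    return true;
  }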

Priya2698 commented 2 months ago

Restarting the conversation since this issue is scheduled for Q4.

I will draft a design doc describing the work needed to separate allocation domain inference from the user-specified stride order. Based on @jjsjann123's recommendation above, broadly:

  1. The user-specified stride order from fd.add_output(tv, stride_order) will additionally be used to update the stride_order_ map stored in the Fusion object (https://github.com/NVIDIA/Fuser/blob/b064b18e489ac019414975e00c90cb2592f3f91a/csrc/python_frontend/fusion_record.h#L1506).
  2. ExprEvalScheduler will adhere only to the stride_order map (currently AllocationDomainPass does not generate an optimal layout for MatmulOp/LinearOp, see comment) and will ignore any inferred allocation_domain when no layout was specified by the user.
  3. AllocationDomainPass should largely stay unmodified. For all other schedulers, we can continue querying getMaybeAllocationDomain.

This distinction will allow us to compare the allocation_domain set on a TensorView with the stride_order map in cases where a scheduler may need to reject the propagated allocation domain (currently ExprEvalScheduler).
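
As a toy illustration of the decision rule in the plan above (not actual scheduler code; the function and parameter names are made up):

  #include <cstdint>
  #include <optional>
  #include <vector>

  // The layout the ExprEvalScheduler would produce for a given output.
  std::vector<int64_t> resolveOutputLayout(
      // Entry from the stride_order map, if the user specified one.
      const std::optional<std::vector<int64_t>>& user_stride_order,
      // Fallback when nothing was requested (e.g. the default row-major order
      // of the logical domain).
      const std::vector<int64_t>& default_order) {
    if (user_stride_order.has_value()) {
      return *user_stride_order;  // hard requirement: always honored
    }
    // No user requirement: any inferred allocation domain is only a hint and
    // is disregarded here.
    return default_order;
  }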

Please let me know if this sounds reasonable or whether further discussion on alternate approaches is needed.

CC: @kevinstephano @naoyam

naoyam commented 2 months ago

> ExprEvalScheduler will only adhere to the stride_order map (currently AllocationDomainPass does not generate optimal layout for MatmulOp/LinearOp, see https://github.com/NVIDIA/Fuser/issues/2354#issuecomment-2164165853) and ignore the allocation_domain if it is different from the stride_order map entry.

Given that stride_order is the strict requirement, why does AllocationDomainPass generate a different order?

Priya2698 commented 2 months ago

> Given that stride_order is the strict requirement, why does AllocationDomainPass generate a different order?

I edited the original comment to be clearer. You are correct: if stride_order is set for an output, that is the allocation domain to adhere to. For ExprEvalScheduler we will only adhere to the layout specified by stride_order_ when present. If no stride order was specified by the user but an allocation domain was inferred (currently disabled by https://github.com/NVIDIA/Fuser/blob/2424ad6a79b748df0f00d087809299dcbc22c50e/csrc/preseg_passes/allocation_order_inference.cpp#L347-L357), it will be disregarded.