jacobhinkle opened this issue 9 months ago
To be very clear, concretization is not a compute time decision. Concretization requires both the computation from the front end and the parallelization of the operations. Only at that point do we know where "just in time" concretization happens.
Are you talking about code around the following places?
Regarding the term "concretization", I think it is important to distinguish "logical concretization" vs "physical concretization".
As long as we have a logical one-to-many mapping in our tensor operation, we have a "logical concretization", but this may or may not be translated into a "physical concretization" depending on the schedule. For the case of "physical concretization", I totally agree with @csarofeen. But I think in @jacobhinkle's words, he is mostly referring to "logical concretization".
Thanks @csarofeen for mentioning this. I think we still need to figure out how we should analyze and generate code when a logical concretization is translated into a physical concretization.
Thanks a lot @jacobhinkle for summarizing this as an issue. I do not have a comment on this proposal yet, because I need to spend more time thinking about this topic.
So far, what I am confident about is my statement 1:
> What is important is the concretization of broadcasting. The history of broadcasting IterDomains (IDs) before concretization is not relevant, and I am not interested in inspecting it. Especially, I don't care which broadcasting ID is mapped to which broadcasting ID in the root domain map.
Note that in my above statement, I am referring to "logical concretization". I think we should be interested in every "logical concretization", and we should analyze it to figure out things like: (1) does it translate to a physical concretization? (2) how do we index it? etc.
On the other hand, it looks like my second statement:

> Broadcasting IterDomains are just auxiliary placeholders. They should be capable of being created and annihilated arbitrarily.

points in the opposite direction from the proposal @jacobhinkle is making here. So far, I do not have a strong opinion about which is better.
Leaving aside what we have not yet reached agreement on, am I correct that we all agree on the first principles of how broadcasting behaves, and that we are mostly trying to figure out how to mechanically encode what we agree on?
I think https://github.com/NVIDIA/Fuser/issues/1759, https://github.com/NVIDIA/Fuser/issues/1628 and https://github.com/NVIDIA/Fuser/issues/1757 are all failing for mechanical reasons. If we refactor and mechanically encode things differently, we need to make sure we can handle these cases naturally.
> The biggest downside might be that upstream integrations would need to be aware that nvFuser expects a description of its graph with all the tensors squeezed. They would also need to specify axes for operations like `sum` relative to the squeezed tensors, which could introduce some bugs.
Not necessarily. I remember that, at some point, @jjsjann123 brought up an idea in https://github.com/NVIDIA/Fuser/issues/900 about marking the entire fusion "symbolic" so that our integration does not have to special-case things like `shape == 1`, `stride == 0`, and `shape == 0`. I further recommended that, if the entire fusion is "symbolic", why not just split our IR into two levels: a higher-level IR for easier integration, and a lower-level IR for easier lowering, with a conversion from one to the other? I think this direction is interesting, and if we do decide to proceed with the proposal here, we might want the higher-level IR to contain broadcasting and then remove it when converting to the lower-level IR.
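As a toy sketch of that two-level idea (entirely hypothetical; none of these types exist in nvFuser), the high-level op could keep numpy-style implicit broadcasting, and the lowering step could squeeze size-1 dims and record the concretization explicitly:

```cpp
#include <cstdint>
#include <vector>

// Toy high-level node: numpy-style broadcasting, shapes may contain 1s.
struct HiAdd {
  std::vector<int64_t> lhs_shape;  // e.g. {5, 1}
  std::vector<int64_t> rhs_shape;  // e.g. {5, 7}
};

// Toy low-level node: inputs squeezed; concretized axes made explicit.
struct LoAdd {
  std::vector<int64_t> lhs_shape;                // e.g. {5}
  std::vector<int64_t> rhs_shape;                // e.g. {5, 7}
  std::vector<std::vector<int64_t>> bcast_axes;  // e.g. {{1}, {}}
};

// Lowering: drop size-1 dims from each input and record where they were.
inline LoAdd lower(const HiAdd& hi) {
  LoAdd lo;
  lo.bcast_axes.resize(2);
  const std::vector<int64_t>* in[2] = {&hi.lhs_shape, &hi.rhs_shape};
  std::vector<int64_t>* out[2] = {&lo.lhs_shape, &lo.rhs_shape};
  for (int i = 0; i < 2; i++) {
    for (int64_t ax = 0; ax < (int64_t)in[i]->size(); ax++) {
      if ((*in[i])[ax] == 1) {
        lo.bcast_axes[i].push_back(ax);  // this axis is concretized by the add
      } else {
        out[i]->push_back((*in[i])[ax]);
      }
    }
  }
  return lo;
}
```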
I don't understand the difference between logical and physical concretization.
For example, if we do

```cpp
T1[5, 1] = T0[5, 1]
T3[5, 7] = T1[5, 1] + T2[5, 7]
```

then the `1` is logically concretized to `7`, regardless of the schedule. But if we parallelize the fusion as `[TIDx, Serial]`, then the `1` will not be physically concretized, because the generated code is just:

```cpp
// T1 keeps its size-1 allocation; the broadcast never materializes in memory.
float T1[1];
T1[0] = T0[threadIdx.x];
for (int i = 0; i < 7; i++) {
  T3[threadIdx.x * 7 + i] = T1[0] + T2[threadIdx.x * 7 + i];
}
```

We never allocate a `7` for `T1` or `T0`.
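For contrast, a physical concretization would be a schedule where the broadcast dimension actually gets allocated at its concretized extent. A minimal hypothetical sketch (not actual nvFuser output; the exact form depends on the schedule), e.g. if `T1` had to be materialized before the consumer loop:

```cpp
// Hypothetical: T1's broadcast dimension is physically allocated with
// extent 7, so the same value is replicated 7 times in memory.
float T1[7];
for (int i = 0; i < 7; i++) {
  T1[i] = T0[threadIdx.x];
}
for (int i = 0; i < 7; i++) {
  T3[threadIdx.x * 7 + i] = T1[i] + T2[threadIdx.x * 7 + i];
}
```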
Could you also give an example of a "physical" broadcast concretization? Does that mean only a broadcast across threads? Because I thought broadcast was basically equivalent to "multiple loads at each position from the same consumer", and I'm still having trouble seeing the distinction.
(This issue might be updated as our discussion evolves)
## Context
Broadcast IterDomains signify that a future operation might perform broadcasting involving this dimension. However, broadcasting is fundamentally a behavior of an operation, not an attribute of a piece of data. In fact it is the concretization of a broadcast that matters. The reason it is attached to tensors is inherited from numpy and PyTorch, which need a convenient way to specify that an op should do broadcasting. In an eager system, implementing broadcast semantics can be done locally to each op. However, shadowing that system in nvFuser means that we carry around Broadcast IterDomains that require a lot of special handling. Furthermore, some operations like `broadcast`, `squeeze`, and `expand` are really no-ops, but representing them in our IR can lead to unintended data transfer.

## Summary of Proposal
In a side conversation, @zasdfgbnm said this (see #1628):
> Indeed, we should be able to move all broadcasts to the point just before the broadcast is concretized.

This proposal goes slightly further and just attaches the broadcast concretization info to the concretizing op.
This proposal seeks to remove `IterType::Broadcast` as well as `BroadcastOp`, `SqueezeOp`, `ExpandOp`, and their related C++ and Python frontend operations. They will be replaced by an attribute in `Expr` that tracks which axes in each input are broadcasts concretized by that `Expr`. Broadcast concretization will need to be explicitly declared during fusion definition.

To work an actual example from the recent issue #1757:
Following the principle that only concretization of broadcasts matters, in this example we see that mapping IterDomains is simplified and the gap introduced by the squeeze+broadcast sequence is avoided, since it is unrelated to concretization.
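As a rough sketch of the frontend contrast (the `broadcast_axes` overload of `add` is the hypothetical API from the tentative plan below; the shapes are illustrative, not the exact #1757 reproducer):

```cpp
// Today: squeeze and broadcast insert Broadcast IterDomains that are carried
// through the IR until some downstream op concretizes them.
TensorView* tv1 = squeeze(tv0, {1});              // [5, 1] -> [5]
TensorView* tv2 = broadcast(tv1, {false, true});  // [5] -> [5, b1]
TensorView* tv3 = add(tv2, tv4);                  // b1 concretized to 7 here

// Proposed: no Broadcast IterDomains; concretization is declared on the
// concretizing op itself (hypothetical overload).
TensorView* tv5 = add(tv1, tv4, /*broadcast_axes=*/{{1}, {}});
```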
## Details of Approach

## Questions

## Tentative Plan
Here is a sketch of how we could proceed. Each stage is likely somewhat involved, but any conceptual misunderstandings should be encountered in the first stages. The later stages will involve updating our frontend so we can coordinate with integration for a safe transition to the new interface.
1. Add `std::vector<std::vector<bool>> Expr::broadcastAxes() const`. `PairwiseRootDomainMap` and friends will need updating, and tests will explicitly mark these in `Expr` nodes instead of using ops like `add` (a sketch of the proposed attribute follows this list).
2. Allow declaring broadcast concretization during fusion definition, e.g. `TensorView* tv2 = add(tv0, tv1, /*broadcast_axes=*/{ {1}, {} })`.
3. Add a `RemoveBroadcastIterDomains` pass that removes all broadcasts from `Fusion` and places explicit broadcast concretizations on `Expr`s. This can start as a pre-lowering pass then become a preseg pass as we build out support.
4. Update the frontend and integrations so fusions are defined without broadcast IterDomains, at which point we can drop the `RemoveBroadcastIterDomains` pass and remove `IterType::Broadcast`.
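To make step 1 concrete, here is a minimal sketch of what the attribute could look like. Only the `broadcastAxes` signature comes from the plan above; everything else is assumed for illustration and is not the actual nvFuser `Expr` class:

```cpp
#include <vector>

// Hypothetical sketch only; the real nvFuser Expr looks very different.
class Expr {
 public:
  // broadcastAxes()[i][j] == true means that axis j of input i is a broadcast
  // axis concretized by this Expr against the matching output axis.
  // An Expr that concretizes nothing returns all-false entries.
  std::vector<std::vector<bool>> broadcastAxes() const {
    return broadcast_axes_;
  }

 private:
  // One entry per input TensorView, sized to that input's rank.
  std::vector<std::vector<bool>> broadcast_axes_;
};
```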
## Alternatives

### Making Broadcasting a "TensorDomain op"

As proposed by @zasdfgbnm, we could keep `IterType::Broadcast` but remove `BroadcastOp` and `SqueezeOp`. Instead, broadcasts and squeezes would be inferred from the relation between the root and rfactor domains within a `TensorDomain`.

## Foreseen Downsides
The biggest downside might be that upstream integrations would need to be aware that nvFuser expects a description of its graph with all the tensors squeezed. They would also need to specify axes for operations like `sum` relative to the squeezed tensors, which could introduce some bugs.
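To illustrate that hazard with a hypothetical snippet (written in the usual nvFuser C++ test idiom; not a real reproducer): suppose an upstream tensor has shape `[5, 1, 7]` and the user requests a sum over axis 2.

```cpp
// Under this proposal, nvFuser receives the tensor already squeezed to [5, 7],
// so the integration must remap upstream axis 2 to squeezed axis 1.
Fusion fusion;
FusionGuard fg(&fusion);

TensorView* tv0 = makeSymbolicTensor(2);  // squeezed logical shape [5, 7]
fusion.addInput(tv0);

// Forgetting this remap, or getting it off by one, is exactly the kind of
// bug this downside is about.
TensorView* tv1 = sum(tv0, {1});
fusion.addOutput(tv1);
```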