jacobhinkle opened this issue 9 months ago
To be very clear, concretization is not a compute time decision. Concretization requires both the computation from the front end and the parallelization of the operations. Only at that point do we know where "just in time" concretization happens.
Are you talking about code around the following places?
Regarding the term "concretization", I think it is important to distinguish "logical concretization" vs "physical concretization".
As long as we have a logical one-to-many mapping in our tensor operation, we have a "logical concretization", but this may or may not be translated into a "physical concretization" depending on the schedule. For the case of "physical concretization", I totally agree with @csarofeen. But I think in @jacobhinkle's words, he is mostly referring to "logical concretization".
Thanks @csarofeen for mentioning this. I think we still need to figure out how we should analyze and generate code when a logical concretization is translated into a physical concretization.
Thanks a lot @jacobhinkle for summarizing this as an issue. I do not have a comment on this proposal yet, because I need to spend more time thinking about this topic.
So far, what I am confident about is my statement 1:
> What is important is the concretization of broadcasting. The history of broadcasting IterDomains (IDs) before concretization is not relevant, and I am not interested in inspecting it. Especially, I don't care which broadcasting ID is mapped to which broadcasting ID in the root domain map.
Note that in my above statement, I am referring to "logical concretization". I think we should be interested in every "logical concretization", and we should analyze it to figure out things like: (1) does it translate to a physical concretization? (2) how do we index it? etc.
On the other hand, it looks like my second statement:

> Broadcasting IterDomains are just auxiliary placeholders. They should be capable of being created and annihilated arbitrarily.

points in the opposite direction from the proposal @jacobhinkle is making here. So far, I do not have a strong opinion about which is better.
Leaving aside what we have not yet reached agreement on, am I correct that we all agree on the first principles of how broadcasting behaves, and that we are mostly trying to figure out how to mechanically encode what we agree on?
I think https://github.com/NVIDIA/Fuser/issues/1759, https://github.com/NVIDIA/Fuser/issues/1628 and https://github.com/NVIDIA/Fuser/issues/1757 are all failing for mechanical reasons. If we refactor and mechanically encode things differently, we need to make sure we can handle these cases naturally.
> The biggest downside might be that upstream integrations would need to be aware that nvFuser expects a description of its graph with all the tensors squeezed. They would also need to specify axes for operations like `sum` relative to the squeezed tensors, which could introduce some bugs.
Not necessarily. I remember that, at some point, @jjsjann123 brought up an idea in https://github.com/NVIDIA/Fuser/issues/900 about marking the entire fusion "symbolic" so that our integration does not have to special-case things like `shape == 1`, `stride == 0`, and `shape == 0`. I further recommended that, if the entire fusion is "symbolic", why not just split our IR into two levels: a higher-level IR for easier integration, and a lower-level IR for easier lowering, with a conversion from one to the other? I think this direction is interesting, and if we do decide to proceed with the proposal here, we might want the higher-level IR to contain broadcasting and then remove it when converting to the lower-level IR.
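As a toy sketch of that two-level idea (entirely hypothetical; none of these types exist in nvFuser), the high-level op could keep numpy-style implicit broadcasting, and the lowering step could squeeze size-1 dims and record the concretization explicitly:

```cpp
#include <cstdint>
#include <vector>

// Toy high-level node: numpy-style broadcasting, shapes may contain 1s.
struct HiAdd {
  std::vector<int64_t> lhs_shape;  // e.g. {5, 1}
  std::vector<int64_t> rhs_shape;  // e.g. {5, 7}
};

// Toy low-level node: inputs squeezed; concretized axes made explicit.
struct LoAdd {
  std::vector<int64_t> lhs_shape;                // e.g. {5}
  std::vector<int64_t> rhs_shape;                // e.g. {5, 7}
  std::vector<std::vector<int64_t>> bcast_axes;  // e.g. {{1}, {}}
};

// Lowering: drop size-1 dims from each input and record where they were.
inline LoAdd lower(const HiAdd& hi) {
  LoAdd lo;
  lo.bcast_axes.resize(2);
  const std::vector<int64_t>* in[2] = {&hi.lhs_shape, &hi.rhs_shape};
  std::vector<int64_t>* out[2] = {&lo.lhs_shape, &lo.rhs_shape};
  for (int i = 0; i < 2; i++) {
    for (int64_t ax = 0; ax < (int64_t)in[i]->size(); ax++) {
      if ((*in[i])[ax] == 1) {
        lo.bcast_axes[i].push_back(ax);  // this axis is concretized by the add
      } else {
        out[i]->push_back((*in[i])[ax]);
      }
    }
  }
  return lo;
}
```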
I don't understand the difference between logical and physical concretization.
For example, if we do

```cpp
T1[5, 1] = T0[5, 1]
T3[5, 7] = T1[5, 1] + T2[5, 7]
```

then the `1` is logically concretized to `7`, regardless of the schedule. But if we parallelize the fusion as `[TIDx, Serial]`, then the `1` will not be physically concretized, because the generated code is just:

```cpp
// T1 keeps its size-1 allocation; the broadcast never materializes in memory.
float T1[1];
T1[0] = T0[threadIdx.x];
for (int i = 0; i < 7; i++) {
  T3[threadIdx.x * 7 + i] = T1[0] + T2[threadIdx.x * 7 + i];
}
```

We never allocate a `7` for `T1` or `T0`.
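For contrast, a physical concretization would be a schedule where the broadcast dimension actually gets allocated at its concretized extent. A minimal hypothetical sketch (not actual nvFuser output; the exact form depends on the schedule), e.g. if `T1` had to be materialized before the consumer loop:

```cpp
// Hypothetical: T1's broadcast dimension is physically allocated with
// extent 7, so the same value is replicated 7 times in memory.
float T1[7];
for (int i = 0; i < 7; i++) {
  T1[i] = T0[threadIdx.x];
}
for (int i = 0; i < 7; i++) {
  T3[threadIdx.x * 7 + i] = T1[i] + T2[threadIdx.x * 7 + i];
}
```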
Could you also give an example of a "physical" broadcast concretization? Does that mean only a broadcast across threads? Because I thought broadcast was basically equivalent to "multiple loads at each position from the same consumer", and I'm still having trouble seeing the distinction.
(This issue might be updated as our discussion evolves)
## Context
Broadcast IterDomains signify that a future operation might perform broadcasting involving this dimension. However, broadcasting is fundamentally a behavior of an operation, not an attribute of a piece of data. In fact it is the concretization of a broadcast that matters. The reason it is attached to tensors is inherited from numpy and PyTorch, which need a convenient way to specify that an op should do broadcasting. In an eager system, implementing broadcast semantics can be done locally to each op. However, shadowing that system in nvFuser means that we carry around Broadcast IterDomains that require a lot of special handling. Furthermore, some operations like `broadcast`, `squeeze`, and `expand` are really no-ops, but representing them in our IR can lead to unintended data transfer.

## Summary of Proposal
In a side conversation, @zasdfgbnm said this (see #1628):
> Indeed, we should be able to move all broadcasts to the point just before the broadcast is concretized.

This proposal goes slightly further and just attaches the broadcast concretization info to the concretizing op.
This proposal seeks to remove `IterType::Broadcast` as well as `BroadcastOp`, `SqueezeOp`, `ExpandOp`, and their related C++ and Python frontend operations. They will be replaced by an attribute in `Expr` that tracks which axes in each input are broadcasts concretized by that `Expr`. Broadcast concretization will need to be explicitly declared during fusion definition.

To work an actual example from the recent issue #1757:
Following the principle that only concretization of broadcasts matters, in this example we see that mapping IterDomains is simplified and the gap introduced by the squeeze+broadcast sequence is avoided, since it is unrelated to concretization.
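As a rough sketch of the frontend contrast (the `broadcast_axes` overload of `add` is the hypothetical API from the tentative plan below; the shapes are illustrative, not the exact #1757 reproducer):

```cpp
// Today: squeeze and broadcast insert Broadcast IterDomains that are carried
// through the IR until some downstream op concretizes them.
TensorView* tv1 = squeeze(tv0, {1});              // [5, 1] -> [5]
TensorView* tv2 = broadcast(tv1, {false, true});  // [5] -> [5, b1]
TensorView* tv3 = add(tv2, tv4);                  // b1 concretized to 7 here

// Proposed: no Broadcast IterDomains; concretization is declared on the
// concretizing op itself (hypothetical overload).
TensorView* tv5 = add(tv1, tv4, /*broadcast_axes=*/{{1}, {}});
```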
## Details of Approach

## Questions

## Tentative Plan
Here is a sketch of how we could proceed. Each stage is likely somewhat involved, but any conceptual misunderstandings should be encountered in the first stages. The later stages will involve updating our frontend so we can coordinate with integration for a safe transition to the new interface.
1. Add `std::vector<std::vector<bool>> Expr::broadcastAxes() const`. `PairwiseRootDomainMap` and friends will need updating, and tests will explicitly mark these in `Expr` nodes instead of using ops like `add` (a sketch of the proposed attribute follows this list).
2. Allow declaring broadcast concretization during fusion definition, e.g. `TensorView* tv2 = add(tv0, tv1, /*broadcast_axes=*/{ {1}, {} })`.
3. Add a `RemoveBroadcastIterDomains` pass that removes all broadcasts from `Fusion` and places explicit broadcast concretizations on `Expr`s. This can start as a pre-lowering pass then become a preseg pass as we build out support.
4. Update the frontend and integrations so fusions are defined without broadcast IterDomains, at which point we can drop the `RemoveBroadcastIterDomains` pass and remove `IterType::Broadcast`.
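To make step 1 concrete, here is a minimal sketch of what the attribute could look like. Only the `broadcastAxes` signature comes from the plan above; everything else is assumed for illustration and is not the actual nvFuser `Expr` class:

```cpp
#include <vector>

// Hypothetical sketch only; the real nvFuser Expr looks very different.
class Expr {
 public:
  // broadcastAxes()[i][j] == true means that axis j of input i is a broadcast
  // axis concretized by this Expr against the matching output axis.
  // An Expr that concretizes nothing returns all-false entries.
  std::vector<std::vector<bool>> broadcastAxes() const {
    return broadcast_axes_;
  }

 private:
  // One entry per input TensorView, sized to that input's rank.
  std::vector<std::vector<bool>> broadcast_axes_;
};
```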
## Alternatives

### Making Broadcasting a "TensorDomain op"

As proposed by @zasdfgbnm, we could keep `IterType::Broadcast` but remove `BroadcastOp` and `SqueezeOp`. Instead, broadcasts and squeezes would be inferred from the relation between the root and rfactor domains within a `TensorDomain`.

## Foreseen Downsides
The biggest downside might be that upstream integrations would need to be aware that nvFuser expects a description of its graph with all the tensors squeezed. They would also need to specify axes for operations like `sum` relative to the squeezed tensors, which could introduce some bugs.
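To illustrate that hazard with a hypothetical snippet (written in the usual nvFuser C++ test idiom; not a real reproducer): suppose an upstream tensor has shape `[5, 1, 7]` and the user requests a sum over axis 2.

```cpp
// Under this proposal, nvFuser receives the tensor already squeezed to [5, 7],
// so the integration must remap upstream axis 2 to squeezed axis 1.
Fusion fusion;
FusionGuard fg(&fusion);

TensorView* tv0 = makeSymbolicTensor(2);  // squeezed logical shape [5, 7]
fusion.addInput(tv0);

// Forgetting this remap, or getting it off by one, is exactly the kind of
// bug this downside is about.
TensorView* tv1 = sum(tv0, {1});
fusion.addOutput(tv1);
```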