Open Priya2698 opened 11 months ago
cc @zasdfgbnm
Is this call to `commitLeafToRfactor` intended, or a workaround for another problem? It would make sense if the commit were done to the allocation domain, but making the rfactor domain of a `LoadStoreOp` anything beyond a root-domain permutation seems unnecessary.
> making the rfactor domain of a `LoadStoreOp` beyond a root-domain permutation seems unnecessary.
I actually have the opposite feeling. I feel that having `ViewOp`, `BroadcastOp`, and `SqueezeOp` is redundant. They could just be a `LoadStoreOp` with a non-trivial rFactor domain. But I haven't brought this idea up with the team to discuss, so I am not sure whether other people agree with it or not. But at least, in the past, there was a `TransposeOp`, and changing it to `LoadStoreOp` reduced the complexity of our system.
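To illustrate the idea that these ops could collapse into one, here is a hedged NumPy sketch (not nvFuser code): view, broadcast, and squeeze are all metadata-only relabelings of the same underlying buffer, which is why one could model all three as a single data-movement op whose output domain differs from its input domain.

```python
import numpy as np

base = np.arange(6)

viewed = base.reshape(2, 3)                      # ViewOp analogue
broadcast = np.broadcast_to(viewed, (4, 2, 3))   # BroadcastOp analogue
squeezed = viewed[:, None, :].squeeze(1)         # SqueezeOp analogue

# None of these copied data; only the logical domain (shape/strides) changed.
assert viewed.base is base
assert not broadcast.flags.owndata
assert not squeezed.flags.owndata
assert np.shares_memory(base, squeezed)
```

The analogy is loose (NumPy strides are not iteration domains), but it shows the common structure: one data source, several logical reinterpretations.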
For today, this `commitLeafToRfactor` is mostly a convenience utility for defining a fusion: "hey, look, we could do this, and it will just work".

Regarding the test cleanup task for @Priya2698, I would recommend just leaving this issue open and working on something else.
> But at least, in the past, there was a `TransposeOp`, and changing it to `LoadStoreOp` reduces the complexity of our system.
BTW, this change not only makes our system cleaner, but also makes it possible to support the NN memory format of matmul, thanks to the added flexibility.
> I am feeling that having `ViewOp`, `BroadcastOp` and `SqueezeOp` is redundant. They should be just a `LoadStoreOp` with non-trivial rFactor domain.
Interesting -- I made the exact opposite move in XLA :) We used to have a `Reshape` HLO that optionally does a transpose. Splitting that into a view-only reshape and an explicit transpose simplified analysis, optimization, and codegen, because per-op semantics got simpler, and the added combination effect (having to deal with a chain of reshape/transpose ops) was something we needed to worry about anyway.
> BTW, this change not only makes our system cleaner, but also make it possible to support NN memory format of matmul thanks to the added flexibility.
I'm very curious about that. Why would the other way make it impossible to support NN?
> Interesting -- I made the exactly opposite move in XLA :) We used to have a `Reshape` HLO that optionally does a transpose. Splitting that to a view-only reshape and an explicit transpose simplified analysis, optimization and codegen, because per-op semantics got simpler and the added combination effect (having to deal with a chain of reshape/transpose ops) was something we needed to worry about anyway.
It's great to know that! Thanks for sharing this information!
> I'm very curious about that. Why would the other way make it impossible to support NN?
Very good question!
For Ampere matmul, our hardware supports the TN memory format only, which means the input shapes are `[M, K]` and `[N, K]` and the result shape is `[M, N]`. In order to load a matrix from smem into registers, we need to use `ldmatrix` and `ldmatrix.trans`, which are both `LoadStoreOp`. Originally, a `LoadStoreOp` could not have a fused transpose, which means we had to define our fusion in one of the following ways:

- For inputs `[M, K]` and `[K, N]`, we broadcast the inputs into `[M, K, 1]` and `[1, K, N]`; after multiplication, we have `[M, K, N]`, and after reduction, we get `[M, N]`.
- For inputs `[M, K]` and `[N, K]`, we broadcast the inputs into `[M, 1, K]` and `[1, N, K]`; after multiplication, we have `[M, N, K]`, and after reduction, we get `[M, N]`.
- For inputs `[K, M]` and `[K, N]`, we broadcast the inputs into `[K, M, 1]` and `[K, 1, N]`; after multiplication, we have `[K, M, N]`, and after reduction, we get `[M, N]`.

But for NN, the input shapes are `[K, M]` and `[N, K]`, and there is no such way to get an output of `[M, N]`. It is only possible to get `[N, M]`.
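The broadcast-mul-sum patterns above can be checked numerically. Here is a hedged NumPy sketch (illustrating the iteration-domain algebra, not nvFuser's actual code) of the three feasible layouts, plus the NN case where broadcasting alone can only yield `[N, M]`:

```python
import numpy as np

M, N, K = 2, 3, 4
A = np.random.rand(M, K)   # [M, K]
B = np.random.rand(K, N)   # [K, N]

# [M, K] and [K, N]: [M, K, 1] * [1, K, N] -> [M, K, N], reduce K -> [M, N]
tt = (A[:, :, None] * B[None, :, :]).sum(axis=1)
assert np.allclose(tt, A @ B)

# [M, K] and [N, K]: [M, 1, K] * [1, N, K] -> [M, N, K], reduce K -> [M, N]
tn = (A[:, None, :] * B.T[None, :, :]).sum(axis=2)
assert np.allclose(tn, A @ B)

# [K, M] and [K, N]: [K, M, 1] * [K, 1, N] -> [K, M, N], reduce K -> [M, N]
nt = (A.T[:, :, None] * B[:, None, :]).sum(axis=0)
assert np.allclose(nt, A @ B)

# NN: [K, M] and [N, K]. The only way to align K by broadcasting (no
# transpose allowed) is [1, K, M] * [N, K, 1] -> [N, K, M], reduce K -> [N, M].
nn = (A.T[None, :, :] * B.T[:, :, None]).sum(axis=1)
assert nn.shape == (N, M)
assert np.allclose(nn, (A @ B).T)
```

Broadcasting can insert size-1 axes but never reorder existing ones, which is exactly why NN gets stuck at `[N, M]`.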
In order to support NN, I removed `TransposeOp` and changed `LoadStoreOp` to allow a fused permutation. This way, we are able to let `ldmatrix.trans` do a transpose `[K, M] -> [M, K]`, so that we can again use broadcast-mul-sum to get `[M, N]`.
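The fix can be sketched the same way. Assuming a fused transpose on the load (the role `ldmatrix.trans` plays), the NN operand `[K, M]` becomes `[M, K]`, and the ordinary broadcast-mul-sum pattern then yields `[M, N]` directly (NumPy stand-in, not nvFuser code):

```python
import numpy as np

M, N, K = 2, 3, 4
a_nn = np.random.rand(K, M)   # NN-layout operand [K, M]
b_nn = np.random.rand(N, K)   # NN-layout operand [N, K]

a_t = a_nn.T                  # fused transpose on load: [K, M] -> [M, K]
# [M, 1, K] * [1, N, K] -> [M, N, K], reduce K -> [M, N]
out = (a_t[:, None, :] * b_nn[None, :, :]).sum(axis=2)

assert out.shape == (M, N)
assert np.allclose(out, a_nn.T @ b_nn.T)
```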
This issue https://github.com/NVIDIA/Fuser/issues/203 contains more information, but be warned that a great portion of it is obsolete, so don't be confused :P
Gotcha, thank you! To summarize my understanding: we need a fused `ldmatrix.trans` to make codegen easy, so we combined `TransposeOp` into `LoadStoreOp`. In an alternative world, we could have kept `TransposeOp` and `LoadStoreOp` separate in the high-level IR to benefit high-level analysis, optimization, and interpretation (e.g. ExpressionEvaluator), and have `ldmatrix.trans` in the low-level IR to benefit low-level codegen. I believe `kir` is sort of that low-level IR, but in practice it inherits many (or most?) ops from the fusion IR.
The current implementation of the `LoadStoreOp::evaluate` method fails for two tests in `test_allocation_domain.cpp`. In the tests that fail, there are 2 cases: the current implementation checks for a permutation of the root domain if the output has an rfactor domain.
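For intuition, here is a hedged Python sketch (illustrative only, not the actual C++ implementation) of what evaluating a permuting `LoadStoreOp` means: if the rfactor domain is a permutation of the root domain, evaluation amounts to permuting the axes of the input; the function name and `perm` parameter are hypothetical.

```python
import numpy as np

def evaluate_load_store(inp, perm=None):
    """perm is a hypothetical root->rfactor axis permutation; None means
    the rfactor domain is trivial and the op is a plain copy."""
    if perm is None:
        return inp.copy()
    return np.transpose(inp, perm)

x = np.arange(6).reshape(2, 3)
y = evaluate_load_store(x, (1, 0))  # root [I0, I1] -> rfactor [I1, I0]
assert y.shape == (3, 2)
assert np.array_equal(y, x.T)
```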
CC: @wujingyue