NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Additional cases in `LoadStoreOp::evaluate` #1343

Open Priya2698 opened 11 months ago

Priya2698 commented 11 months ago

The current implementation of the `LoadStoreOp::evaluate` method fails for two tests in `test_allocation_domain.cpp`. The failing tests cover two cases:

  1. The output rfactor is a reshape of the root domain (input root domain = [N1, N2, HWC], output rfactor = [N1*N2, C, H, W]): https://github.com/NVIDIA/Fuser/blob/342aafc83afd91e2be01c568f57540dae19b20f3/test/test_allocation_domain.cpp#L252C30-L252C55
  2. The output rfactor is a permutation of a reshape of the root domain (input root domain = [N1, N2, HWC], output rfactor = [N1*N2, H, W, C]): https://github.com/NVIDIA/Fuser/blob/342aafc83afd91e2be01c568f57540dae19b20f3/test/test_allocation_domain.cpp#L196C30-L196C48

The current implementation, when the output has an rfactor domain, only checks whether it is a permutation of the root domain, so neither case above is handled (a rough sketch of such a case follows).
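For concreteness, here is a rough, hypothetical sketch (not the exact failing test) of how a LoadStoreOp output can end up with an rfactor domain that is a reshape of its root. It assumes nvFuser's C++ test utilities (`makeSymbolicTensor`, `set`) and an illustrative constant split factor:

```cpp
// Hypothetical sketch only; names follow nvFuser's C++ test harness.
Fusion fusion;
FusionGuard fg(&fusion);

TensorView* tv0 = makeSymbolicTensor(3);  // root: [N1, N2, H*W*C]
fusion.addInput(tv0);

TensorView* tv1 = set(tv0);  // LoadStoreOp
tv1->merge(0);               // [N1*N2, H*W*C]
tv1->split(1, 4);            // [N1*N2, (H*W*C)/4, 4]  (illustrative factor)
tv1->commitLeafToRfactor();  // the rfactor domain is now a reshape of the
                             // root, not merely a permutation, so
                             // LoadStoreOp::evaluate cannot evaluate it
fusion.addOutput(tv1);
```

Adding a `reorder` before `commitLeafToRfactor` would give a permutation of a reshape, as in the second case.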

CC: @wujingyue

wujingyue commented 11 months ago

cc @zasdfgbnm

Is this call to commitLeafToRfactor intended, or a workaround for another problem? It would make sense if the commit were done to the allocation domain, but making the rfactor domain of a LoadStoreOp anything beyond a root-domain permutation seems unnecessary.

zasdfgbnm commented 11 months ago

making the rfactor domain of a LoadStoreOp anything beyond a root-domain permutation seems unnecessary.

I actually have the opposite feeling. I feel that having ViewOp, BroadcastOp, and SqueezeOp is redundant; they should just be a LoadStoreOp with a non-trivial rFactor domain. I haven't brought this idea up with the team yet, so I'm not sure whether others agree. But at least, in the past, there was a TransposeOp, and changing it to LoadStoreOp reduced the complexity of our system.

For today, this commitLeafToRfactor is mostly a convenience utility for defining a fusion: "hey, look, we could do this, and it will just work".

Regarding the test cleanup task for @Priya2698, I would recommend just leaving this issue open and working on something else.

zasdfgbnm commented 11 months ago

But at least, in the past, there was a TransposeOp, and changing it to LoadStoreOp reduced the complexity of our system.

BTW, this change not only makes our system cleaner, but also makes it possible to support the NN memory format for matmul, thanks to the added flexibility.

wujingyue commented 11 months ago

I feel that having ViewOp, BroadcastOp, and SqueezeOp is redundant; they should just be a LoadStoreOp with a non-trivial rFactor domain.

Interesting -- I made exactly the opposite move in XLA :) We used to have a Reshape HLO that optionally did a transpose. Splitting it into a view-only reshape and an explicit transpose simplified analysis, optimization, and codegen, because per-op semantics got simpler, and the added combinatorial effect (having to deal with a chain of reshape/transpose ops) was something we needed to worry about anyway.

BTW, this change not only makes our system cleaner, but also makes it possible to support the NN memory format for matmul, thanks to the added flexibility.

I'm very curious about that. Why would the other way make it impossible to support NN?

zasdfgbnm commented 11 months ago

Interesting -- I made exactly the opposite move in XLA :) We used to have a Reshape HLO that optionally did a transpose. Splitting it into a view-only reshape and an explicit transpose simplified analysis, optimization, and codegen, because per-op semantics got simpler, and the added combinatorial effect (having to deal with a chain of reshape/transpose ops) was something we needed to worry about anyway.

It's great to know that! Thanks for sharing this information!

I'm very curious about that. Why would the other way make it impossible to support NN?

Very good question!

For Ampere matmul, our hardware supports the TN memory format only, which means the input shapes are [M, K] and [N, K] and the result shape is [M, N]. In order to load a matrix from smem into registers, we need to use ldmatrix and ldmatrix.trans, which are both LoadStoreOps. Originally, LoadStoreOp could not have a fused transpose, which meant we had to define our fusion like this:
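Something along these lines (a hypothetical sketch using nvFuser's C++ test utilities and the standard broadcast/mul/sum ops, not the exact fusion definition from the matmul work):

```cpp
// Hypothetical TN matmul sketch: [M, K] x [N, K] -> [M, N]
Fusion fusion;
FusionGuard fg(&fusion);

TensorView* a = makeSymbolicTensor(2);  // [M, K]
TensorView* b = makeSymbolicTensor(2);  // [N, K]
fusion.addInput(a);
fusion.addInput(b);

TensorView* a3 = broadcast(a, {false, true, false});  // [M, 1, K]
TensorView* b3 = broadcast(b, {true, false, false});  // [1, N, K]
TensorView* ab = mul(a3, b3);                         // [M, N, K]
TensorView* c = sum(ab, {2});                         // reduce K -> [M, N]
fusion.addOutput(c);
```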

But for NN, the input shapes are [K, M] and [N, K], and there is no way to get an output of [M, N] with this pattern; it is only possible to get [N, M].

In order to support NN, I removed TransposeOp and changed LoadStoreOp to allow a fused permutation. This way, we can let ldmatrix.trans do a transpose [K, M] -> [M, K], so that we can again use broadcast-mul-sum to get [M, N] (sketched below).
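A hypothetical sketch of that NN variant, reusing the TN pattern above; after the change, the transpose is just a LoadStoreOp with a fused permutation, which is what lets the scheduler map it to ldmatrix.trans:

```cpp
// Hypothetical NN matmul sketch: [K, M] x [N, K] -> [M, N]
Fusion fusion;
FusionGuard fg(&fusion);

TensorView* a_km = makeSymbolicTensor(2);  // [K, M]
TensorView* b = makeSymbolicTensor(2);     // [N, K]
fusion.addInput(a_km);
fusion.addInput(b);

// transpose() now creates a LoadStoreOp with a fused permutation: [K, M] -> [M, K]
TensorView* a = transpose(a_km, 0, 1);
TensorView* a3 = broadcast(a, {false, true, false});  // [M, 1, K]
TensorView* b3 = broadcast(b, {true, false, false});  // [1, N, K]
TensorView* c = sum(mul(a3, b3), {2});                // [M, N]
fusion.addOutput(c);
```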

This issue https://github.com/NVIDIA/Fuser/issues/203 contains more information, but be warned that a large portion of it is obsolete, so don't be confused :P

wujingyue commented 11 months ago

Gotcha, thank you! To summarize my understanding: we need a fused ldmatrix.trans to make codegen easy, so we combined TransposeOp into LoadStoreOp. In an alternative world, we could keep TransposeOp and LoadStoreOp separate in the high-level IR to benefit high-level analysis, optimization, and interpretation (e.g. ExpressionEvaluator), and have ldmatrix.trans in the low-level IR to benefit low-level codegen. I believe kir is sort of that low-level IR, but in practice it inherits many (or most?) ops from the fusion IR.