crcrpar opened 1 week ago
@Priya2698, it looks like this K=1 case with a batch dimension is creating a 4D TensorView output that should be 3D:
```
Inputs:
  T0_g[ iS0{i0}, iS1{i1}, bS2{1} ], float
  T1_g[ iS18{i0}, bS4{1}, iS5{i6} ], float
Outputs:
  T3_g[ rS20{i0}, iS9{i6} ], float
  T5_g[ rS14{i0}, iS15{i1}, iS16{i6}, bS17{1} ], float

%kernel_math {
T2_l[ iS19{i0}, iS7{i6} ]
   = squeeze( T1_g[ iS18{i0}, bS4{1}, iS5{i6} ] )
T3_g[ rS20{i0}, iS9{i6} ]
   = reduction( T2_l[ iS19{i0}, iS7{i6} ], op = add, initial value = float(0), allreduce = false )
T4_l[ iS10{i0}, iS11{i1}, iS12{i6}, bS13{1} ]
   = matmul(T0_g[ iS0{i0}, iS1{i1}, bS2{1} ],
            T1_g[ iS18{i0}, bS4{1}, iS5{i6} ])
T5_g[ rS14{i0}, iS15{i1}, iS16{i6}, bS17{1} ]
   = reduction( T4_l[ iS10{i0}, iS11{i1}, iS12{i6}, bS13{1} ], op = add, initial value = float(0), allreduce = false )
}
```
That last axis in `T4_l` should be `Reduction`, not `Broadcast`.
```
In [1]: import torch

In [2]: x = torch.randn([5, 5, 1])

In [3]: y = torch.randn([5, 1, 7])

In [4]: z = torch.matmul(x, y)

In [5]: z.shape
Out[5]: torch.Size([5, 5, 7])
```
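For context on the dump: `T4_l` is laid out as `[B, M, N, K]` with the contraction axis last, so the matmul amounts to a broadcast-multiply followed by a sum over that last axis. Here is a minimal PyTorch sketch of that decomposition (my own illustration of the semantics, not nvfuser's actual lowering):

```python
import torch

B, M, K, N = 5, 5, 1, 7
x = torch.randn(B, M, K)
y = torch.randn(B, K, N)

# Broadcast-multiply with K moved to the last axis, mirroring T4_l's
# [B, M, N, K] layout in the dump above.
prod = x.unsqueeze(2) * y.transpose(1, 2).unsqueeze(1)  # [B, M, N, K]

# The last axis must be summed out even when K == 1; if it survives as a
# size-1 Broadcast axis instead of a Reduction axis, the output stays 4D.
out = prod.sum(dim=-1)  # [B, M, N]

assert out.shape == torch.Size([B, M, N])
assert torch.allclose(out, torch.matmul(x, y), atol=1e-6)
```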
> That last axis in `T4_l` should be `Reduction`, not `Broadcast`.
Actually, I think we will hit errors with Reductions mapping to Iteration domains. Instead, we should just not include this Reduction dimension when K=1.
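In PyTorch terms, the K=1 case collapses to a plain broadcast-multiply, so no trailing Reduction dimension is needed at all (again a sketch of the semantics, not nvfuser's lowering):

```python
import torch

B, M, N = 5, 5, 7
x = torch.randn(B, M, 1)  # K == 1
y = torch.randn(B, 1, N)

# With K == 1 the contraction is trivial: x and y broadcast directly to
# [B, M, N], so the lowering can skip creating a Reduction dimension.
out_k1 = x * y

assert out_k1.shape == torch.Size([B, M, N])
assert torch.allclose(out_k1, torch.matmul(x, y), atol=1e-6)
```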
There is a related issue here that `ops::newOutputIterDomain` is not respecting `force_iter_type` when all the inputs are Broadcast.
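For illustration, a rough Python sketch of the resolution rule being described; the enum, function name, and signature here are hypothetical stand-ins, not nvfuser's actual C++ API:

```python
from enum import Enum, auto
from typing import Optional

# Hypothetical stand-in for nvfuser's iter types.
class IterType(Enum):
    ITERATION = auto()
    BROADCAST = auto()
    REDUCTION = auto()

def resolve_output_iter_type(
    input_types: list[IterType],
    force_iter_type: Optional[IterType] = None,
) -> IterType:
    # force_iter_type should win unconditionally, even when every
    # input axis is Broadcast (the case this thread says is mishandled).
    if force_iter_type is not None:
        return force_iter_type
    # Otherwise, any non-Broadcast input promotes the output axis.
    for t in input_types:
        if t is not IterType.BROADCAST:
            return t
    return IterType.BROADCAST
```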
> That last axis in `T4_l` should be `Reduction`, not `Broadcast`.
>
> Actually, I think we will hit errors with Reductions mapping to Iteration domains. Instead, we should just not include this Reduction dimension when K=1.
I prefer keeping the reduction axis uniform across all cases. Can you elaborate more on what issues we may run into?
> There is a related issue here that `ops::newOutputIterDomain` is not respecting `force_iter_type` when all the inputs are Broadcast.
That's right, will fix that. `force_iter_type` should be respected regardless of whether it is a broadcast axis.
> I prefer keeping the reduction axis uniform across all cases. Can you elaborate more on what issues we may run into?
When you do that in the K=1 case, you will have a mapping from a reduction domain to broadcast domains, which caused an IdModel error in my brief testing. You could possibly avoid this by not mapping the K domain, but that would likely lead to other problems.
> I prefer keeping the reduction axis uniform across all cases. Can you elaborate more on what issues we may run into?
>
> When you do that in the K=1 case, you will have a mapping from a reduction domain to broadcast domains, which caused an IdModel error in my brief testing. You could possibly avoid this by not mapping the K domain, but that would likely lead to other problems.
Yes, not mapping them will probably lead to other errors downstream. But iteration domains are mapped to reduction IterDomains in the consumer, and ideally the same should be allowed for broadcast as well. I will look into the IdModel errors and see if we need to revise our constraints.
nvfuser commit: d75fc93, CUDA: 12.5
Encountered this error when running